Entropy: Difference between revisions
Walterpachl (talk | contribs) m (→version 1: typo corrected) |
(→version 1: re-flagged as incorrect. -- ~~~~) |
||
Line 243: | Line 243: | ||
=={{header|REXX}}== |
=={{header|REXX}}== |
||
===version 1=== |
===version 1=== |
||
{{incorrect|REXX|-- REXX version 1 program is incorrect: |
|||
<br> program won't handle number of characters ≥ 5 (try 1223334444555555555), |
|||
<br> program won't handle number of characters ≤ 3 (try 122333), |
|||
<br> program won't handle characters not contiquous (try 1227774444). |
|||
<br> program won't handle characters that aren't numeric (try aaBBcccDDDD). |
|||
<br>}} |
|||
<lang rexx>/* Rexx *************************************************************** |
<lang rexx>/* Rexx *************************************************************** |
||
* 28.02.2013 Walter Pachl |
* 28.02.2013 Walter Pachl |
Revision as of 19:15, 12 March 2013
You are encouraged to solve this task according to the task description, using any language you may know.
Calculate the information entropy (Shannon entropy) of a given input string.
Entropy is the expected value of the measure of information content in a system. In general, the Shannon entropy of a variable is defined as:
where the information content . If the base of the logarithm , the result is expressed in bits, a unit of information. Therefore, given a string of length where is the relative frequency of each character, the entropy of a string in bits is:
For this task, use "1223334444" as an example. The result should be around 1.84644 bits.
Ada
Uses Ada 2012. <lang Ada>with Ada.Text_IO, Ada.Float_Text_IO, Ada.Numerics.Elementary_Functions;
procedure Count_Entropy is
package TIO renames Ada.Text_IO;
Count: array(Character) of Natural := (others => 0); Sum: Natural := 0; Line: String := "1223334444";
begin
for I in Line'Range loop -- count the characters Count(Line(I)) := Count(Line(I))+1; Sum := Sum + 1; end loop;
declare -- compute the entropy and print it function P(C: Character) return Float is (Float(Count(C)) / Float(Sum)); use Ada.Numerics.Elementary_Functions, Ada.Float_Text_IO; Result: Float := 0.0; begin for Ch in Character loop Result := Result - (if P(Ch)=0.0 then 0.0 else P(Ch) * Log(P(Ch), Base => 2.0)); end loop; Put(Result, Fore => 1, Aft => 5, Exp => 0); end;
end Count_Entropy;</lang>
Burlesque
<lang burlesque>blsq ) "1223334444"F:u[vv^^{1\/?/2\/LG}m[?*++ 1.8464393446710157</lang>
Euler Math Toolbox
<lang EulerMathToolbox>>function entropy (s) ... $ v=strtochar(s); $ m=getmultiplicities(unique(v),v); $ m=m/sum(m); $ return sum(-m*logbase(m,2)) $endfunction >entropy("1223334444")
1.84643934467</lang>
D
<lang d>import std.stdio, std.algorithm, std.math;
double entropy(T)(T[] s) /*pure nothrow*/ if (__traits(compiles, sort(s))) {
return s .sort() .group .map!(g => g[1] / cast(double)s.length) .map!(p => -p * log2(p)) .reduce!q{a + b};
}
void main() {
"1223334444"d.dup.entropy.writeln;
}</lang>
- Output:
1.84644
friendly interactive shell
Sort of hacky, but friendly interactive shell isn't really optimized for mathematic tasks (in fact, it doesn't even have associative arrays).
<lang fishshell>function entropy
for arg in $argv set name count_$arg if not count $$name > /dev/null set $name 0 set values $values $arg end set $name (math $$name + 1) end set entropy 0 for value in $values set name count_$value set entropy (echo " scale = 50 p = "$$name" / "(count $argv)" $entropy - p * l(p) " | bc -l) end echo "$entropy / l(2)" | bc -l
end entropy (echo 1223334444 | fold -w1)</lang>
- Output:
1.84643934467101549345
Haskell
<lang haskell>import Data.List
main = print $ entropy "1223334444"
entropy s =
sum . map lg' . fq' . map (fromIntegral.length) . group . sort $ s where lg' c = (c * ) . logBase 2 $ 1.0 / c fq' c = map (\x -> x / (sum c)) c </lang>
J
Solution:<lang j> entropy=: +/@:-@(* 2&^.)@(#/.~ % #)</lang>
- Example:
<lang j> entropy '1223334444' 1.84644</lang>
Lang5
<lang lang5>: -rot rot rot ; [] '__A set : dip swap __A swap 1 compress append '__A set execute __A -1 extract nip ; : nip swap drop ; : sum '+ reduce ;
- 2array 2 compress ; : comb "" split ; : lensize length nip ;
- <group> #( a -- 'a )
grade subscript dup 's dress distinct strip length 1 2array reshape swap 'A set : `filter(*) A in A swap select ; '`filter apply ;
- elements(*) lensize ;
- entropy #( s -- n )
length "<group> 'elements apply" dip / dup neg swap log * 2 log / sum ;
"1223334444" comb entropy . # 1.84643934467102</lang>
Mathematica
<lang Mathematica>shE[s_String] := -Plus @@ ((# Log[2., #]) & /@ ((Length /@ Gather[#])/
Length[#]) &[Characters[s]])</lang>
- Example:
<lang Mathematica> shE["1223334444"]
1.84644 shE["Rosetta Code"] 3.08496</lang>
Perl
<lang Perl>sub entropy {
my %count; $count{$_}++ for @_; my @p = map $_/@_, values %count; my $entropy = 0; $entropy += - $_ * log $_ for @p; $entropy / log 2
}
print entropy split //, "1223334444";</lang>
Perl 6
<lang Perl 6>sub entropy(@a) {
[+] map -> \p { p * -log p }, @a.bag.values »/» +@a;
}
say log(2) R/ entropy '1223334444'.comb;</lang>
- Output:
1.84643934467102
Python
<lang python>from __future__ import division import math
def hist(source):
hist = {}; l = 0; for e in source: l += 1 if e not in hist: hist[e] = 0 hist[e] += 1 return (l,hist)
def entropy(hist,l):
elist = [] for v in hist.values(): c = v / l elist.append(-c * math.log(c ,2)) return sum(elist)
def printHist(h):
flip = lambda (k,v) : (v,k) h = sorted(h.iteritems(), key = flip) print 'Sym\thi\tfi\tInf' for (k,v) in h: print '%s\t%f\t%f\t%f'%(k,v,v/l,-math.log(v/l, 2))
source = "1223334444" (l,h) = hist(source); print '.[Results].' print 'Length',l print 'Entropy:', entropy(h, l) printHist(h)</lang>
Output:
.[Results]. Length 10 Entropy: 1.84643934467 Sym hi fi Inf 1 1.000000 0.100000 3.321928 2 2.000000 0.200000 2.321928 3 3.000000 0.300000 1.736966 4 4.000000 0.400000 1.321928
R
<lang r>entropy = function(s)
{freq = prop.table(table(strsplit(s, )[1])) -sum(freq * log(freq, base = 2))}
print(entropy("1223334444")) # 1.846439</lang>
Racket
<lang racket>
- lang racket
(require math)
(define (log2 x)
(/ (log x) (log 2)))
(define (digits x)
(for/list ([c (number->string x)]) (- (char->integer c) (char->integer #\0))))
(define (entropy x)
(define ds (digits x)) (define n (length ds)) (- (for/sum ([(d c) (in-hash (samples->hash ds))]) (* (/ d n) (log2 (/ d n))))))
(entropy 1223334444) </lang> Output: 1.8464393446710154
REXX
version 1
<lang rexx>/* Rexx ***************************************************************
- 28.02.2013 Walter Pachl
- 12.03.2013 Walter Pachl typo in log corrected. thanx for testing
- /
s="1223334444" occ.=0 n=0 Do i=1 To length(s)
c=substr(s,i,1) occ.c=occ.c+1 n=n+1 End
do c=1 To 4
p.c=occ.c/n say c p.c End
e=0 Do c=1 To 4
e=e+p.c*log(p.c,,2) End
Say 'Entropy of' s 'is' (-e) Exit
log: Procedure /***********************************************************************
- Return log(x) -- with specified precision and a specified base
- Three different series are used for the ranges 0 to 0.5
- 0.5 to 1.5
- 1.5 to infinity
- 03.09.1992 Walter Pachl
- /
Parse Arg x,prec,b If prec= Then prec=9 Numeric Digits (2*prec) Numeric Fuzz 3 Select When x<=0 Then r='*** invalid argument ***' When x<0.5 Then Do z=(x-1)/(x+1) o=z r=z k=1 Do i=3 By 2 ra=r k=k+1 o=o*z*z r=r+o/i If r=ra Then Leave End r=2*r End When x<1.5 Then Do z=(x-1) o=z r=z k=1 Do i=2 By 1 ra=r k=k+1 o=-o*z r=r+o/i If r=ra Then Leave End End Otherwise /* 1.5<=x */ Do z=(x+1)/(x-1) o=1/z r=o k=1 Do i=3 By 2 ra=r k=k+1 o=o/(z*z) r=r+o/i If r=ra Then Leave End r=2*r End End If b<> Then r=r/log(b) Numeric Digits (prec) r=r+0 Return r
</lang> Output:
1 0.1 2 0.2 3 0.3 4 0.4 Entropy of 1223334444 is 1.84643934
version 2
REXX doesn't have a BIF for LOG or LN, so the subroutine (function) LOG2 is included herein. <lang rexx>/*REXX program calculates the information entropy for a given char str.*/ numeric digits 30 /*use thirty digits for precision*/ parse arg $; if $== then $=1223334444 /*obtain optional input*/ n=0; @.=0; L=length($); $$=
do j=1 for L; _=substr($,j,1) /*process each character in $ str*/ if @._==0 then do; n=n+1 /*if unique, bump char counter. */ $$=$$ || _ /*add this character to the list.*/ end @._ = @._+1 /*keep track of this char count. */ end /*j*/
sum=0 /*calc info entropy for each char*/
do i=1 for n; _=substr($$,i,1) /*obtain a char from unique list.*/ sum=sum - @._/L * log2(@._/L) /*add (negatively) the entropies.*/ end /*i*/
say ' input string: ' $ say 'string length: ' L say ' unique chars: ' n ; say say 'the information entropy of the string ──► ' format(sum,,12) exit /*stick a fork in it, we're done.*/ /*──────────────────────────────────LOG2 subroutine─────────────────────*/ log2: procedure; parse arg x 1 xx; ig= x>1.5; is=1-2*(ig\==1); ii=0 numeric digits digits()+5 /* [↓] precision of E must be > digits().*/ e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535
do while ig & xx>1.5 | \ig&xx<.5; _=e; do k=-1; iz=xx* _**-is if k>=0 & (ig & iz<1 | \ig&iz>.5) then leave; _=_*_; izz=iz; end xx=izz; ii=ii+is*2**k; end; x=x* e**-ii-1; z=0; _=-1; p=z do k=1; _=-_*x; z=z+_/k; if z=p then leave; p=z; end /*k*/ r=z+ii; if arg()==2 then return r; return r/log2(2,0)</lang>
output when using the default input of: 1223334444
input string: 1223334444 string length: 10 unique chars: 4 the information entropy of the string ──► 1.846439344671
output when using the input of: Rosetta Code
input string: Rosetta Code string length: 12 unique chars: 9 the information entropy of the string ──► 3.084962500721
Ruby
<lang ruby>def entropy(s)
counts = Hash.new(0) s.each_char { |c| counts[c] += 1 }
counts.values.reduce(0) do |entropy, count| freq = count / s.length.to_f entropy - freq * Math.log2(freq) end
end</lang> One-liner, same performance (or better): <lang ruby>def entropy2(s)
s.each_char.group_by(&:to_s).values.map { |x| x.length / s.length.to_f }.reduce(0) { |e, x| e - x*Math.log2(x) }
end</lang>
Scala
<lang scala>import scala.math._
def entropy( v:String ) = { v
.groupBy (a => a) .values .map( i => i.length.toDouble / v.length ) .map( p => -p * log10(p) / log10(2)) .sum
}
// Confirm that "1223334444" has an entropy of about 1.84644 assert( math.round( entropy("1223334444") * 100000 ) * 0.00001 == 1.84644 )</lang>
Tcl
<lang tcl>proc entropy {str} {
set log2 [expr log(2)] foreach char [split $str ""] {dict incr counts $char} set entropy 0.0 foreach count [dict values $counts] {
set freq [expr {$count / double([string length $str])}] set entropy [expr {$entropy - $freq * log($freq)/$log2}]
} return $entropy
}</lang> Demonstration: <lang tcl>puts [format "entropy = %.5f" [entropy "1223334444"]] puts [format "entropy = %.5f" [entropy "Rosetta Code"]]</lang>
- Output:
entropy = 1.84644 entropy = 3.08496
XPL0
<lang XPL0>code real RlOut=48, Ln=54; \intrinsic routines string 0; \use zero-terminated strings
func StrLen(A); \Return number of characters in an ASCIIZ string char A; int I; for I:= 0, -1>>1-1 do
if A(I) = 0 then return I;
func real Entropy(Str); \Return Shannon entropy of string char Str; int Len, I, Count(128); real Sum, Prob; [Len:= StrLen(Str); for I:= 0 to 127 do Count(I):= 0; for I:= 0 to Len-1 do \count number of each character in string
Count(Str(I)):= Count(Str(I)) + 1;
Sum:= 0.0; for I:= 0 to 127 do
if Count(I) # 0 then \(avoid Ln(0.0) error) [Prob:= float(Count(I)) / float(Len); \probability of char in string Sum:= Sum + Prob*Ln(Prob); ];
return -Sum/Ln(2.0); ];
RlOut(0, Entropy("1223334444"))</lang>
- Output:
1.84644