Entropy
Calculate the information entropy (Shannon entropy) of a given input string.
You are encouraged to solve this task according to the task description, using any language you may know.
Entropy is the expected value of the measure of information content in a system. In general, the Shannon entropy of a variable $X$ is defined as:

$$H(X) = -\sum_{x} P(x) \log_b P(x)$$

where the information content is $I(x) = -\log_b P(x)$. If the base of the logarithm is $b = 2$, the result is expressed in bits, a unit of information. Therefore, given a string $S$ of length $n$ where $P(s_i) = \frac{\mathrm{count}(s_i)}{n}$ is the relative frequency of each character, the entropy of the string in bits is:

$$H(S) = -\sum_{i=1}^{k} P(s_i) \log_2 P(s_i)$$

where $k$ is the number of distinct characters in $S$.
For this task, use "1223334444" as an example. The result should be around 1.84644 bits.
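As a sanity check, the result can be worked out by hand from the definition above. In "1223334444" the characters 1, 2, 3 and 4 occur 1, 2, 3 and 4 times respectively over a length of 10, giving relative frequencies 0.1, 0.2, 0.3 and 0.4:

$$H = -(0.1\log_2 0.1 + 0.2\log_2 0.2 + 0.3\log_2 0.3 + 0.4\log_2 0.4) \approx 0.33219 + 0.46439 + 0.52109 + 0.52877 \approx 1.84644 \text{ bits.}$$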
Ada
Uses Ada 2012.
<lang Ada>with Ada.Text_IO, Ada.Float_Text_IO, Ada.Numerics.Elementary_Functions;

procedure Count_Entropy is
   package TIO renames Ada.Text_IO;
   Count: array(Character) of Natural := (others => 0);
   Sum:   Natural := 0;
   Line:  String  := "1223334444";
begin
   for I in Line'Range loop   -- count the characters
      Count(Line(I)) := Count(Line(I)) + 1;
      Sum := Sum + 1;
   end loop;
   declare   -- compute the entropy and print it
      function P(C: Character) return Float is (Float(Count(C)) / Float(Sum));
      use Ada.Numerics.Elementary_Functions, Ada.Float_Text_IO;
      Result: Float := 0.0;
   begin
      for Ch in Character loop
         Result := Result -
           (if P(Ch) = 0.0 then 0.0 else P(Ch) * Log(P(Ch), Base => 2.0));
      end loop;
      Put(Result, Fore => 1, Aft => 5, Exp => 0);
   end;
end Count_Entropy;</lang>
Burlesque
<lang burlesque>blsq ) "1223334444"F:u[vv^^{1\/?/2\/LG}m[?*++
1.8464393446710157</lang>
Euler Math Toolbox
<lang EulerMathToolbox>>function entropy (s) ...
$  v=strtochar(s);
$  m=getmultiplicities(unique(v),v);
$  m=m/sum(m);
$  return sum(-m*logbase(m,2))
$endfunction
>entropy("1223334444")
 1.84643934467</lang>
D
<lang d>import std.stdio, std.algorithm, std.math;
double entropy(T)(T[] s) /*pure nothrow*/ if (__traits(compiles, sort(s))) {
    return s
           .sort()
           .group
           .map!(g => g[1] / cast(double)s.length)
           .map!(p => -p * log2(p))
           .reduce!q{a + b};
}
void main() {
"1223334444"d.dup.entropy.writeln;
}</lang>
- Output:
1.84644
friendly interactive shell
Somewhat hacky, as the friendly interactive shell isn't really optimized for mathematical tasks (in fact, it doesn't even have associative arrays): the character counts are emulated with dynamically named variables, and the floating-point arithmetic is handed off to bc.
<lang fishshell>function entropy
    for arg in $argv
        set name count_$arg
        if not count $$name > /dev/null
            set $name 0
            set values $values $arg
        end
        set $name (math $$name + 1)
    end
    set entropy 0
    for value in $values
        set name count_$value
        set entropy (echo "
            scale = 50
            p = "$$name" / "(count $argv)"
            $entropy - p * l(p)
        " | bc -l)
    end
    echo "$entropy / l(2)" | bc -l
end
entropy (echo 1223334444 | fold -w1)</lang>
- Output:
1.84643934467101549345
Haskell
<lang haskell>import Data.List
main = print $ entropy "1223334444"
entropy s =
  sum . map lg' . fq' . map (fromIntegral . length) . group . sort $ s
  where lg' c = (c *) . logBase 2 $ 1.0 / c
        fq' c = map (\x -> x / sum c) c</lang>
J
Solution:<lang j> entropy=: +/@:-@(* 2&^.)@(#/.~ % #)</lang>
- Example:
<lang j>   entropy '1223334444'
1.84644</lang>
Lang5
<lang lang5>: -rot rot rot ; [] '__A set
: dip swap __A swap 1 compress append '__A set execute __A -1 extract nip ;
: nip swap drop ; : sum '+ reduce ;
: 2array 2 compress ; : comb "" split ; : lensize length nip ;

: <group> #( a -- 'a )
    grade subscript dup 's dress distinct strip
    length 1 2array reshape swap 'A set
    : `filter(*) A in A swap select ; '`filter apply ;
: elements(*) lensize ;

: entropy #( s -- n )
    length "<group> 'elements apply" dip /
    dup neg swap log * 2 log / sum ;

"1223334444" comb entropy . # 1.84643934467102</lang>
Mathematica
<lang Mathematica>shE[s_String] := -Plus @@ ((# Log[2., #]) & /@ ((Length /@ Gather[#])/Length[#]) &[Characters[s]])</lang>
- Example:
<lang Mathematica>shE["1223334444"]
1.84644

shE["Rosetta Code"]
3.08496</lang>
Perl
<lang Perl>sub entropy {
    my %count;
    $count{$_}++ for @_;
    my @p = map $_/@_, values %count;
    my $entropy = 0;
    $entropy += -$_ * log $_ for @p;
    $entropy / log 2
}
print entropy split //, "1223334444";</lang>
Perl 6
<lang Perl 6>sub entropy(@a) {
[+] map -> \p { p * -log p }, @a.bag.values »/» +@a;
}
say log(2) R/ entropy '1223334444'.comb;</lang>
- Output:
1.84643934467102
Python
<lang python>from __future__ import division
import math

def hist(source):
    hist = {}
    l = 0
    for e in source:
        l += 1
        if e not in hist:
            hist[e] = 0
        hist[e] += 1
    return (l, hist)

def entropy(hist, l):
    elist = []
    for v in hist.values():
        c = v / l
        elist.append(-c * math.log(c, 2))
    return sum(elist)

def printHist(h):
    flip = lambda (k, v): (v, k)
    h = sorted(h.iteritems(), key=flip)
    print 'Sym\thi\tfi\tInf'
    for (k, v) in h:
        print '%s\t%f\t%f\t%f' % (k, v, v/l, -math.log(v/l, 2))

source = "1223334444"

(l, h) = hist(source)
print '.[Results].'
print 'Length', l
print 'Entropy:', entropy(h, l)
printHist(h)</lang>
- Output:
.[Results].
Length 10
Entropy: 1.84643934467
Sym	hi	fi	Inf
1	1.000000	0.100000	3.321928
2	2.000000	0.200000	2.321928
3	3.000000	0.300000	1.736966
4	4.000000	0.400000	1.321928
R
<lang r>entropy = function(s) {
    freq = prop.table(table(strsplit(s, "")[[1]]))
    -sum(freq * log(freq, base = 2))
}

print(entropy("1223334444"))  # 1.846439</lang>
Racket
<lang racket>
#lang racket
(require math)

(define (log2 x)
  (/ (log x) (log 2)))

(define (digits x)
  (for/list ([c (number->string x)])
    (- (char->integer c) (char->integer #\0))))

(define (entropy x)
  (define ds (digits x))
  (define n (length ds))
  (- (for/sum ([(d c) (in-hash (samples->hash ds))])
       (* (/ c n) (log2 (/ c n))))))

(entropy 1223334444)
</lang>
- Output:
1.8464393446710154
REXX
version 1
<lang rexx>/* Rexx ***************************************************************
* 28.02.2013 Walter Pachl
***********************************************************************/
s="1223334444"
occ.=0
n=0
Do i=1 To length(s)
  c=substr(s,i,1)
  occ.c=occ.c+1
  n=n+1
  End
Do c=1 To 4
  p.c=occ.c/n
  say c p.c
  End
e=0
Do c=1 To 4
  e=e+p.c*log(p.c,,2)
  End
Say 'Entropy of' s 'is' (-e)
Exit

log: Procedure
/***********************************************************************
* Return log(x) -- with specified precision and a specified base
* Three different series are used for the ranges  0   to 0.5
*                                                 0.5 to 1.5
*                                                 1.5 to infinity
* 03.09.1992 Walter Pachl
***********************************************************************/
  Parse Arg x,prec,b
  If prec='' Then prec=9
  Numeric Digits (2*prec)
  Numeric Fuzz 3
  Select
    When x<=0 Then r='*** invalid argument ***'
    When x<0.5 Then Do
      z=(x-1)/(x+1); o=z; r=z; k=1
      Do i=3 By 2
        ra=r; k=k+1; o=o*z*z; r=r+o/i
        If r=ra Then Leave
        End
      r=2*r
      End
    When x<1.5 Then Do
      z=(x-1); o=z; r=z; k=1
      Do i=2 By 1
        ra=r; k=k+1; o=-o*z; r=r+o/i
        If r=ra Then Leave
        End
      End
    Otherwise /* 1.5<=x */ Do
      z=(x+1)/(x-1); o=1/z; r=o; k=1
      Do i=3 By 2
        ra=r; k=k+1; o=o/(z*z); r=r+o/i
        If r=ra Then Leave
        End
      r=2*r
      End
    End
  If b<>'' Then r=r/log(b)  /* divide by the natural log of the base */
  Numeric Digits (prec)
  r=r+0
  Return r
</lang>
- Output:
1 0.1
2 0.2
3 0.3
4 0.4
Entropy of 1223334444 is 1.84643934
version 2
REXX doesn't have a BIF for LOG or LN, so the subroutine (function) LOG2 is included herein.
<lang rexx>/*REXX program calculates the information entropy for a given char str.*/
numeric digits 30;  pad=left('',9)           /*PAD is used for showing blanks.*/
parse arg $ .;  if $=='' then $=1223334444   /*obtain optional input.         */
n=0;  @.=0;  L=length($);  $$=''
  do j=1 for L;  _=substr($,j,1)             /*process each character in $ str*/
  if @._==0 then do;  n=n+1                  /*if unique, bump char counter.  */
                 $$=$$ || _                  /*add this character to the list.*/
                 end
  @._ = @._+1                                /*keep track of this char count. */
  end   /*j*/
sum=0                                        /*calc info entropy for each char*/
  do i=1 for n;  _=substr($$,i,1)            /*obtain a char from unique list.*/
  sum=sum - @._/L * log2(@._/L)              /*add (negatively) the entropies.*/
  end   /*i*/
say 'input string: '  $  pad  'string length:' L  pad  'unique chars:' n;  say
say 'the information entropy of the string ──► '  format(sum,,12)
exit                                         /*stick a fork in it, we're done.*/
/*──────────────────────────────────LOG2 subroutine─────────────────────*/
log2: procedure;  parse arg x 1 xx;  ig= x>1.5;  is=1-2*(ig\==1);  ii=0
      numeric digits digits()+5      /* [↓]  precision of E must be > digits()*/
      e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535
        do while ig & xx>1.5 | \ig&xx<.5;  _=e
          do k=-1;  iz=xx* _**-is
          if k>=0 & (ig & iz<1 | \ig&iz>.5) then leave;  _=_*_;  izz=iz
          end   /*k*/
        xx=izz;  ii=ii+is*2**k
        end     /*while*/
      x=x* e**-ii -1;  z=0;  _=-1;  p=z
        do k=1;  _=-_*x;  z=z+_/k;  if z=p then leave;  p=z;  end  /*k*/
      r=z+ii;  if arg()==2 then return r;  return r/log2(2,0)</lang>
output when using the default input:
input string:  1223334444           string length: 10           unique chars: 4

the information entropy of the string ──►  1.846439344671
Ruby
<lang ruby>def entropy(s)
  counts = Hash.new(0)
  s.each_char { |c| counts[c] += 1 }
  counts.values.reduce(0) do |entropy, count|
    freq = count / s.length.to_f
    entropy - freq * Math.log2(freq)
  end
end</lang>
One-liner, same performance (or better):
<lang ruby>def entropy2(s)
  s.each_char.group_by(&:to_s).values.map { |x| x.length / s.length.to_f }.reduce(0) { |e, x| e - x * Math.log2(x) }
end</lang>
Scala
<lang scala>import scala.math._
def entropy(v: String) = {
  v.groupBy(a => a)
   .values
   .map(i => i.length.toDouble / v.length)
   .map(p => -p * log10(p) / log10(2))
   .sum
}

// Confirm that "1223334444" has an entropy of about 1.84644
assert(math.round(entropy("1223334444") * 100000) * 0.00001 == 1.84644)</lang>
Tcl
<lang tcl>proc entropy {str} {
    set log2 [expr log(2)]
    foreach char [split $str ""] {dict incr counts $char}
    set entropy 0.0
    foreach count [dict values $counts] {
        set freq [expr {$count / double([string length $str])}]
        set entropy [expr {$entropy - $freq * log($freq)/$log2}]
    }
    return $entropy
}</lang>
Demonstration:
<lang tcl>puts [format "entropy = %.5f" [entropy "1223334444"]]
puts [format "entropy = %.5f" [entropy "Rosetta Code"]]</lang>
- Output:
entropy = 1.84644
entropy = 3.08496
XPL0
<lang XPL0>code real RlOut=48, Ln=54;      \intrinsic routines
string 0;                       \use zero-terminated strings

func StrLen(A);                 \Return number of characters in an ASCIIZ string
char A;
int  I;
for I:= 0, -1>>1-1 do
    if A(I) = 0 then return I;

func real Entropy(Str);         \Return Shannon entropy of string
char Str;
int  Len, I, Count(128);
real Sum, Prob;
[Len:= StrLen(Str);
for I:= 0 to 127 do Count(I):= 0;
for I:= 0 to Len-1 do           \count number of each character in string
    Count(Str(I)):= Count(Str(I)) + 1;
Sum:= 0.0;
for I:= 0 to 127 do
    if Count(I) # 0 then        \(avoid Ln(0.0) error)
        [Prob:= float(Count(I)) / float(Len);   \probability of char in string
        Sum:= Sum + Prob*Ln(Prob);
        ];
return -Sum/Ln(2.0);
];

RlOut(0, Entropy("1223334444"))</lang>
- Output:
1.84644