Entropy

From Rosetta Code
Task
Entropy
You are encouraged to solve this task according to the task description, using any language you may know.

Calculate the information entropy (Shannon entropy) of a given input string.

Entropy is the expected value of the measure of information content in a system. In general, the Shannon entropy of a variable is defined as:

where the information content . If the base of the logarithm , the result is expressed in bits, a unit of information. Therefore, given a string of length where is the relative frequency of each character, the entropy of a string in bits is:

For this task, use "1223334444" as an example. The result should be around 1.84644 bits.

Ada

Uses Ada 2012. <lang Ada>with Ada.Text_IO, Ada.Float_Text_IO, Ada.Numerics.Elementary_Functions;

procedure Count_Entropy is

  package TIO renames Ada.Text_IO;
  Count: array(Character) of Natural := (others => 0);
  Sum:   Natural := 0;
  Line: String := "1223334444";

begin

  for I in Line'Range loop   -- count the characters
     Count(Line(I)) := Count(Line(I))+1;
     Sum := Sum + 1;
  end loop;
  declare   -- compute the entropy and print it
     function P(C: Character) return Float is (Float(Count(C)) / Float(Sum));
     use Ada.Numerics.Elementary_Functions, Ada.Float_Text_IO;
     Result: Float := 0.0;
  begin
     for Ch in Character loop
        Result := Result -
         (if P(Ch)=0.0 then 0.0 else P(Ch) * Log(P(Ch), Base => 2.0));
     end loop;
     Put(Result, Fore => 1, Aft => 5, Exp => 0);
  end;

end Count_Entropy;</lang>

Burlesque

<lang burlesque>blsq ) "1223334444"F:u[vv^^{1\/?/2\/LG}m[?*++ 1.8464393446710157</lang>

C

<lang c>#include <stdio.h>

  1. include <stdlib.h>
  2. include <stdbool.h>
  3. include <string.h>
  4. include <math.h>
  1. define MAXLEN 100 //maximum string length

int makehist(char *S,int *hist,int len){ int wherechar[256]; int i,histlen; histlen=0; for(i=0;i<256;i++)wherechar[i]=-1; for(i=0;i<len;i++){ if(wherechar[(int)S[i]]==-1){ wherechar[(int)S[i]]=histlen; histlen++; } hist[wherechar[(int)S[i]]]++; } return histlen; }

double entropy(int *hist,int histlen,int len){ int i; double H; H=0; for(i=0;i<histlen;i++){ H-=(double)hist[i]/len*log2((double)hist[i]/len); } return H; }

int main(void){ char S[MAXLEN]; int len,*hist,histlen; double H; scanf("%[^\n]",S); len=strlen(S); hist=(int*)calloc(len,sizeof(int)); histlen=makehist(S,hist,len); //hist now has no order (known to the program) but that doesn't matter H=entropy(hist,histlen,len); printf("%lf\n",H); return 0; }</lang> Examples: <lang>$ ./entropy 1223334444 1.846439 $ ./entropy Rosetta Code is the best site in the world! 3.646513</lang>

Common Lisp

<lang lisp>(defun entropy(input-string)

   (let ((frequency-table (make-hash-table :test 'equal))
         (entropy 0))
        (map 'nil #'(lambda(c) (setf (gethash c frequency-table) (if (gethash c frequency-table) (+ (gethash c frequency-table) 1) 1))) (coerce input-string 'list))
        (maphash #'(lambda(k v) (setf entropy (+ entropy (* -1 (/ v (length input-string)) (log (/ v (length input-string)) 2))))) frequency-table)
        entropy))

</lang>

D

<lang d>import std.stdio, std.algorithm, std.math;

double entropy(T)(T[] s) /*pure nothrow*/ if (__traits(compiles, sort(s))) {

   return s
          .sort()
          .group
          .map!(g => g[1] / cast(double)s.length)
          .map!(p => -p * log2(p))
          .reduce!q{a + b};

}

void main() {

   "1223334444"d.dup.entropy.writeln;

}</lang>

Output:
1.84644

Euler Math Toolbox

<lang EulerMathToolbox>>function entropy (s) ... $ v=strtochar(s); $ m=getmultiplicities(unique(v),v); $ m=m/sum(m); $ return sum(-m*logbase(m,2)) $endfunction >entropy("1223334444")

1.84643934467</lang>

friendly interactive shell

Sort of hacky, but friendly interactive shell isn't really optimized for mathematic tasks (in fact, it doesn't even have associative arrays).

<lang fishshell>function entropy

   for arg in $argv
       set name count_$arg
       if not count $$name > /dev/null
           set $name 0
           set values $values $arg
       end
       set $name (math $$name + 1)
   end
   set entropy 0
   for value in $values
       set name count_$value
       set entropy (echo "
           scale = 50
           p = "$$name" / "(count $argv)"
           $entropy - p * l(p)
       " | bc -l)
   end
   echo "$entropy / l(2)" | bc -l

end entropy (echo 1223334444 | fold -w1)</lang>

Output:
1.84643934467101549345

FORTRAN

Please find the GNU/linux compilation instructions along with sample run among the comments at the start of the FORTRAN 2008 source. This program acquires input from the command line argument, thereby demonstrating the fairly new get_command_argument intrinsic subroutine. The expression of the algorithm is a rough translated of the j solution. Thank you. <lang FORTRAN> !-*- mode: compilation; default-directory: "/tmp/" -*- !Compilation started at Tue May 21 21:43:12 ! !a=./f && make $a && OMP_NUM_THREADS=2 $a 1223334444 !gfortran -std=f2008 -Wall -ffree-form -fall-intrinsics f.f08 -o f ! Shannon entropy of 1223334444 is 1.84643936 ! !Compilation finished at Tue May 21 21:43:12

program shannonEntropy

 implicit none
 integer :: num, L, status
 character(len=2048) :: s
 num = 1
 call get_command_argument(num, s, L, status)
 if ((0 /= status) .or. (L .eq. 0)) then
   write(0,*)'Expected a command line argument with some length.'
 else
   write(6,*)'Shannon entropy of '//(s(1:L))//' is ', se(s(1:L))
 endif

contains

 !     algebra
 !
 ! 2**x = y
 ! x*log(2) = log(y)
 ! x = log(y)/log(2)
 !   NB. The j solution
 !   entropy=:  +/@:-@(* 2&^.)@(#/.~ % #)
 !   entropy '1223334444'
 !1.84644
 
 real function se(s)
   implicit none
   character(len=*), intent(in) :: s
   integer, dimension(256) :: tallies
   real, dimension(256) :: norm
   tallies = 0
   call TallyKey(s, tallies)
   ! J's #/. works with the set of items in the input.
   ! TallyKey is sufficiently close that, with the merge, gets the correct result.
   norm = tallies / real(len(s))
   se = sum(-(norm*log(merge(1.0, norm, norm .eq. 0))/log(2.0)))
 end function se
 subroutine TallyKey(s, counts)
   character(len=*), intent(in) :: s
   integer, dimension(256), intent(out) :: counts
   integer :: i, j
   counts = 0
   do i=1,len(s)
     j = iachar(s(i:i))
     counts(j) = counts(j) + 1
   end do
 end subroutine TallyKey

end program shannonEntropy </lang>

Haskell

<lang haskell>import Data.List

main = print $ entropy "1223334444"

entropy s =

sum . map lg' . fq' . map (fromIntegral.length) . group . sort $ s
 where lg' c = (c * ) . logBase 2 $ 1.0 / c
       fq' c = map (\x -> x / (sum c)) c </lang>

J

Solution:<lang j> entropy=: +/@:-@(* 2&^.)@(#/.~ % #)</lang>

Example:

<lang j> entropy '1223334444' 1.84644</lang>

Lang5

<lang lang5>: -rot rot rot ; [] '__A set : dip swap __A swap 1 compress append '__A set execute __A -1 extract nip ; : nip swap drop ; : sum '+ reduce ;

2array 2 compress ; : comb "" split ; : lensize length nip ;
<group> #( a -- 'a )
   grade subscript dup 's dress distinct strip
   length 1 2array reshape swap
   'A set
   : `filter(*)  A in A swap select ;
   '`filter apply
   ;
elements(*) lensize ;
entropy #( s -- n )
   length "<group> 'elements apply" dip /
   dup neg swap log * 2 log / sum ;

"1223334444" comb entropy . # 1.84643934467102</lang>

Mathematica

<lang Mathematica>shE[s_String] := -Plus @@ ((# Log[2., #]) & /@ ((Length /@ Gather[#])/

        Length[#]) &[Characters[s]])</lang>
Example:

<lang Mathematica> shE["1223334444"]

1.84644 shE["Rosetta Code"] 3.08496</lang>

PARI/GP

<lang parigp>entropy(s)=s=Vec(s);my(v=vecsort(s,,8));-sum(i=1,#v,(x->x*log(x))(sum(j=1,#s,v[i]==s[j])/#s))/log(2)</lang>

>entropy("1223334444")
%1 = 1.8464393446710154934341977463050452232

Perl

<lang Perl>sub entropy {

   my %count; $count{$_}++ for @_;
   my @p = map $_/@_, values %count;
   my $entropy = 0;
   $entropy += - $_ * log $_ for @p;
   $entropy / log 2

}

print entropy split //, "1223334444";</lang>

Perl 6

<lang Perl 6>sub entropy(@a) {

   [+] map -> \p { p * -log p }, @a.bag.values »/» +@a;

}

say log(2) R/ entropy '1223334444'.comb;</lang>

Output:
1.84643934467102

Python

<lang python>from __future__ import division import math

def hist(source):

   hist = {}; l = 0;
   for e in source:
       l += 1
       if e not in hist:
           hist[e] = 0
       hist[e] += 1
   return (l,hist)

def entropy(hist,l):

   elist = []
   for v in hist.values():
       c = v / l
       elist.append(-c * math.log(c ,2))
   return sum(elist)

def printHist(h):

   flip = lambda (k,v) : (v,k)
   h = sorted(h.iteritems(), key = flip)
   print 'Sym\thi\tfi\tInf'
   for (k,v) in h:
       print '%s\t%f\t%f\t%f'%(k,v,v/l,-math.log(v/l, 2))
   
   

source = "1223334444" (l,h) = hist(source); print '.[Results].' print 'Length',l print 'Entropy:', entropy(h, l) printHist(h)</lang>

Output:

.[Results].
Length 10
Entropy: 1.84643934467
Sym	hi	fi	Inf
1	1.000000	0.100000	3.321928
2	2.000000	0.200000	2.321928
3	3.000000	0.300000	1.736966
4	4.000000	0.400000	1.321928

R

<lang r>entropy = function(s)

  {freq = prop.table(table(strsplit(s, )[1]))
   -sum(freq * log(freq, base = 2))}

print(entropy("1223334444")) # 1.846439</lang>

Racket

<lang racket>

  1. lang racket

(require math)

(define (log2 x)

 (/ (log x) (log 2)))

(define (digits x)

 (for/list ([c (number->string x)])
   (- (char->integer c) (char->integer #\0))))

(define (entropy x)

 (define ds (digits x))
 (define n (length ds))
 (- (for/sum ([(d c) (in-hash (samples->hash ds))])
      (* (/ d n) (log2 (/ d n))))))

(entropy 1223334444) </lang> Output: 1.8464393446710154

REXX

version 1

This example is incorrect. Please fix the code and remove this message.

Details: -- REXX version 1 program is incorrect:


program won't handle number of characters ≥ 5 (try 1223334444555555555),
program won't handle number of characters ≤ 3 (try 122333),
program won't handle characters not contiquous (try 1227774444).
program won't handle characters that aren't numeric (try aaBBcccDDDD).


<lang rexx>/* Rexx ***************************************************************

  • 28.02.2013 Walter Pachl
  • 12.03.2013 Walter Pachl typo in log corrected. thanx for testing
                                                                                                                                            • /

s="1223334444" occ.=0 n=0 Do i=1 To length(s)

 c=substr(s,i,1)
 occ.c=occ.c+1
 n=n+1
 End

do c=1 To 4

 p.c=occ.c/n
 say c p.c
 End

e=0 Do c=1 To 4

 e=e+p.c*log(p.c,,2)
 End

Say 'Entropy of' s 'is' (-e) Exit

log: Procedure /***********************************************************************

  • Return log(x) -- with specified precision and a specified base
  • Three different series are used for the ranges 0 to 0.5
  • 0.5 to 1.5
  • 1.5 to infinity
  • 03.09.1992 Walter Pachl
                                                                                                                                              • /
 Parse Arg x,prec,b
 If prec= Then prec=9
 Numeric Digits (2*prec)
 Numeric Fuzz   3
 Select
   When x<=0 Then r='*** invalid argument ***'
   When x<0.5 Then Do
     z=(x-1)/(x+1)
     o=z
     r=z
     k=1
     Do i=3 By 2
       ra=r
       k=k+1
       o=o*z*z
       r=r+o/i
       If r=ra Then Leave
       End
     r=2*r
     End
   When x<1.5 Then Do
     z=(x-1)
     o=z
     r=z
     k=1
     Do i=2 By 1
       ra=r
       k=k+1
       o=-o*z
       r=r+o/i
       If r=ra Then Leave
       End
     End
   Otherwise /* 1.5<=x */ Do
     z=(x+1)/(x-1)
     o=1/z
     r=o
     k=1
     Do i=3 By 2
       ra=r
       k=k+1
       o=o/(z*z)
       r=r+o/i
       If r=ra Then Leave
       End
     r=2*r
     End
   End
 If b<> Then
   r=r/log(b)
 Numeric Digits (prec)
 r=r+0
 Return r

</lang> Output:

1 0.1
2 0.2
3 0.3
4 0.4
Entropy of 1223334444 is 1.84643934     

version 2

REXX doesn't have a BIF for LOG or LN,   so the subroutine (function) LOG2 is included herein. <lang rexx>/*REXX program calculates the information entropy for a given char str.*/ numeric digits 30 /*use thirty digits for precision*/ parse arg $; if $== then $=1223334444 /*obtain optional input*/ n=0; @.=0; L=length($); $$=

     do j=1  for L;  _=substr($,j,1)  /*process each character in $ str*/
     if @._==0  then do;   n=n+1      /*if unique,  bump char counter. */
                     $$=$$ || _       /*add this character to the list.*/
                     end
     @._ = @._+1                      /*keep track of this char count. */
     end   /*j*/

sum=0 /*calc info entropy for each char*/

     do i=1  for n;  _=substr($$,i,1) /*obtain a char from unique list.*/
     sum=sum  -  @._/L  * log2(@._/L) /*add (negatively) the entropies.*/
     end   /*i*/

say ' input string: ' $ say 'string length: ' L say ' unique chars: ' n ; say say 'the information entropy of the string ──► ' format(sum,,12) " bits." exit /*stick a fork in it, we're done.*/ /*──────────────────────────────────LOG2 subroutine─────────────────────*/ log2: procedure; parse arg x 1 xx; ig= x>1.5; is=1-2*(ig\==1); ii=0 numeric digits digits()+5 /* [↓] precision of E must be > digits().*/ e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535

   do  while  ig & xx>1.5 | \ig&xx<.5;   _=e;     do k=-1;  iz=xx* _**-is
   if k>=0 & (ig & iz<1 | \ig&iz>.5)     then leave;  _=_*_;  izz=iz; end
   xx=izz;  ii=ii+is*2**k;  end;         x=x* e**-ii-1;  z=0;  _=-1;  p=z
     do k=1;  _=-_*x;  z=z+_/k;  if z=p  then leave;  p=z;  end  /*k*/
   r=z+ii;  if arg()==2  then return r;  return r/log2(2,0)</lang>

output when using the default input of: 1223334444

 input string:  1223334444
string length:  10
 unique chars:  4

the information entropy of the string ──►  1.846439344671  bits.

output when using the input of: Rosetta Code

 input string:  Rosetta Code
string length:  12
 unique chars:  9

the information entropy of the string ──►  3.084962500721  bits.

Ruby

Works with: Ruby version 1.9

<lang ruby>def entropy(s)

 counts = Hash.new(0)
 s.each_char { |c| counts[c] += 1 }
 counts.values.reduce(0) do |entropy, count|
   freq = count / s.length.to_f
   entropy - freq * Math.log2(freq)
 end

end</lang> One-liner, same performance (or better): <lang ruby>def entropy2(s)

 s.each_char.group_by(&:to_s).values.map { |x| x.length / s.length.to_f }.reduce(0) { |e, x| e - x*Math.log2(x) }

end</lang>

Scala

<lang scala>import scala.math._

def entropy( v:String ) = { v

 .groupBy (a => a)
 .values
 .map( i => i.length.toDouble / v.length )
 .map( p => -p * log10(p) / log10(2))
 .sum

}

// Confirm that "1223334444" has an entropy of about 1.84644 assert( math.round( entropy("1223334444") * 100000 ) * 0.00001 == 1.84644 )</lang>

Tcl

<lang tcl>proc entropy {str} {

   set log2 [expr log(2)]
   foreach char [split $str ""] {dict incr counts $char}
   set entropy 0.0
   foreach count [dict values $counts] {

set freq [expr {$count / double([string length $str])}] set entropy [expr {$entropy - $freq * log($freq)/$log2}]

   }
   return $entropy

}</lang> Demonstration: <lang tcl>puts [format "entropy = %.5f" [entropy "1223334444"]] puts [format "entropy = %.5f" [entropy "Rosetta Code"]]</lang>

Output:
entropy = 1.84644
entropy = 3.08496

XPL0

<lang XPL0>code real RlOut=48, Ln=54; \intrinsic routines string 0; \use zero-terminated strings

func StrLen(A); \Return number of characters in an ASCIIZ string char A; int I; for I:= 0, -1>>1-1 do

   if A(I) = 0 then return I;

func real Entropy(Str); \Return Shannon entropy of string char Str; int Len, I, Count(128); real Sum, Prob; [Len:= StrLen(Str); for I:= 0 to 127 do Count(I):= 0; for I:= 0 to Len-1 do \count number of each character in string

   Count(Str(I)):= Count(Str(I)) + 1;

Sum:= 0.0; for I:= 0 to 127 do

   if Count(I) # 0 then        \(avoid Ln(0.0) error)
       [Prob:= float(Count(I)) / float(Len);   \probability of char in string
       Sum:= Sum + Prob*Ln(Prob);
       ];

return -Sum/Ln(2.0); ];

RlOut(0, Entropy("1223334444"))</lang>

Output:
    1.84644