Entropy: Difference between revisions

From Rosetta Code
Content added Content deleted
m (→‎version 2: changed the format of the OUTPUT statement to show the default input character string. -- ~~~~)
m (→‎version 2: removed the used of the PAD variable. -- ~~~~)
Line 365: Line 365:
say ' input string: ' $
say ' input string: ' $
say 'string length: ' L
say 'string length: ' L
say ' unique chars: ' n ; say; pad=left('',8) /*used for padding.*/
say ' unique chars: ' n ; say
say 'the information entropy of the string ──► ' format(sum,,12)
say 'the information entropy of the string ──► ' format(sum,,12)
exit /*stick a fork in it, we're done.*/
exit /*stick a fork in it, we're done.*/

Revision as of 04:55, 12 March 2013

Task
Entropy
You are encouraged to solve this task according to the task description, using any language you may know.

Calculate the information entropy (Shannon entropy) of a given input string.

Entropy is the expected value of the measure of information content in a system. In general, the Shannon entropy of a variable is defined as:

where the information content . If the base of the logarithm , the result is expressed in bits, a unit of information. Therefore, given a string of length where is the relative frequency of each character, the entropy of a string in bits is:

For this task, use "1223334444" as an example. The result should be around 1.84644 bits.

Ada

Uses Ada 2012. <lang Ada>with Ada.Text_IO, Ada.Float_Text_IO, Ada.Numerics.Elementary_Functions;

procedure Count_Entropy is

  package TIO renames Ada.Text_IO;
  Count: array(Character) of Natural := (others => 0);
  Sum:   Natural := 0;
  Line: String := "1223334444";

begin

  for I in Line'Range loop   -- count the characters
     Count(Line(I)) := Count(Line(I))+1;
     Sum := Sum + 1;
  end loop;
  declare   -- compute the entropy and print it
     function P(C: Character) return Float is (Float(Count(C)) / Float(Sum));
     use Ada.Numerics.Elementary_Functions, Ada.Float_Text_IO;
     Result: Float := 0.0;
  begin
     for Ch in Character loop
        Result := Result -
         (if P(Ch)=0.0 then 0.0 else P(Ch) * Log(P(Ch), Base => 2.0));
     end loop;
     Put(Result, Fore => 1, Aft => 5, Exp => 0);
  end;

end Count_Entropy;</lang>

Burlesque

<lang burlesque>blsq ) "1223334444"F:u[vv^^{1\/?/2\/LG}m[?*++ 1.8464393446710157</lang>

Euler Math Toolbox

<lang EulerMathToolbox>>function entropy (s) ... $ v=strtochar(s); $ m=getmultiplicities(unique(v),v); $ m=m/sum(m); $ return sum(-m*logbase(m,2)) $endfunction >entropy("1223334444")

1.84643934467</lang>

D

<lang d>import std.stdio, std.algorithm, std.math;

double entropy(T)(T[] s) /*pure nothrow*/ if (__traits(compiles, sort(s))) {

   return s
          .sort()
          .group
          .map!(g => g[1] / cast(double)s.length)
          .map!(p => -p * log2(p))
          .reduce!q{a + b};

}

void main() {

   "1223334444"d.dup.entropy.writeln;

}</lang>

Output:
1.84644

friendly interactive shell

Sort of hacky, but friendly interactive shell isn't really optimized for mathematic tasks (in fact, it doesn't even have associative arrays).

<lang fishshell>function entropy

   for arg in $argv
       set name count_$arg
       if not count $$name > /dev/null
           set $name 0
           set values $values $arg
       end
       set $name (math $$name + 1)
   end
   set entropy 0
   for value in $values
       set name count_$value
       set entropy (echo "
           scale = 50
           p = "$$name" / "(count $argv)"
           $entropy - p * l(p)
       " | bc -l)
   end
   echo "$entropy / l(2)" | bc -l

end entropy (echo 1223334444 | fold -w1)</lang>

Output:
1.84643934467101549345

Haskell

<lang haskell>import Data.List

main = print $ entropy "1223334444"

entropy s =

sum . map lg' . fq' . map (fromIntegral.length) . group . sort $ s
 where lg' c = (c * ) . logBase 2 $ 1.0 / c
       fq' c = map (\x -> x / (sum c)) c </lang>

J

Solution:<lang j> entropy=: +/@:-@(* 2&^.)@(#/.~ % #)</lang>

Example:

<lang j> entropy '1223334444' 1.84644</lang>

Lang5

<lang lang5>: -rot rot rot ; [] '__A set : dip swap __A swap 1 compress append '__A set execute __A -1 extract nip ; : nip swap drop ; : sum '+ reduce ;

2array 2 compress ; : comb "" split ; : lensize length nip ;
<group> #( a -- 'a )
   grade subscript dup 's dress distinct strip
   length 1 2array reshape swap
   'A set
   : `filter(*)  A in A swap select ;
   '`filter apply
   ;
elements(*) lensize ;
entropy #( s -- n )
   length "<group> 'elements apply" dip /
   dup neg swap log * 2 log / sum ;

"1223334444" comb entropy . # 1.84643934467102</lang>

Mathematica

<lang Mathematica>shE[s_String] := -Plus @@ ((# Log[2., #]) & /@ ((Length /@ Gather[#])/

        Length[#]) &[Characters[s]])</lang>
Example:

<lang Mathematica> shE["1223334444"]

1.84644 shE["Rosetta Code"] 3.08496</lang>

Perl

<lang Perl>sub entropy {

   my %count; $count{$_}++ for @_;
   my @p = map $_/@_, values %count;
   my $entropy = 0;
   $entropy += - $_ * log $_ for @p;
   $entropy / log 2

}

print entropy split //, "1223334444";</lang>

Perl 6

<lang Perl 6>sub entropy(@a) {

   [+] map -> \p { p * -log p }, @a.bag.values »/» +@a;

}

say log(2) R/ entropy '1223334444'.comb;</lang>

Output:
1.84643934467102

Python

<lang python>from __future__ import division import math

def hist(source):

   hist = {}; l = 0;
   for e in source:
       l += 1
       if e not in hist:
           hist[e] = 0
       hist[e] += 1
   return (l,hist)

def entropy(hist,l):

   elist = []
   for v in hist.values():
       c = v / l
       elist.append(-c * math.log(c ,2))
   return sum(elist)

def printHist(h):

   flip = lambda (k,v) : (v,k)
   h = sorted(h.iteritems(), key = flip)
   print 'Sym\thi\tfi\tInf'
   for (k,v) in h:
       print '%s\t%f\t%f\t%f'%(k,v,v/l,-math.log(v/l, 2))
   
   

source = "1223334444" (l,h) = hist(source); print '.[Results].' print 'Length',l print 'Entropy:', entropy(h, l) printHist(h)</lang>

Output:

.[Results].
Length 10
Entropy: 1.84643934467
Sym	hi	fi	Inf
1	1.000000	0.100000	3.321928
2	2.000000	0.200000	2.321928
3	3.000000	0.300000	1.736966
4	4.000000	0.400000	1.321928

R

<lang r>entropy = function(s)

  {freq = prop.table(table(strsplit(s, )[1]))
   -sum(freq * log(freq, base = 2))}

print(entropy("1223334444")) # 1.846439</lang>

Racket

<lang racket>

  1. lang racket

(require math)

(define (log2 x)

 (/ (log x) (log 2)))

(define (digits x)

 (for/list ([c (number->string x)])
   (- (char->integer c) (char->integer #\0))))

(define (entropy x)

 (define ds (digits x))
 (define n (length ds))
 (- (for/sum ([(d c) (in-hash (samples->hash ds))])
      (* (/ d n) (log2 (/ d n))))))

(entropy 1223334444) </lang> Output: 1.8464393446710154

REXX

version 1

This example is incorrect. Please fix the code and remove this message.

Details: -- REXX version 1 program is incorrect:


function/subroutine LN is undefined,
program won't handle characters not in order,
program won't handle number of characters ≥ 5,
program won't handle number of characters ≤ 3,
program won't handle characters not contiquous.


<lang rexx>/* Rexx ***************************************************************

  • 28.02.2013 Walter Pachl
                                                                                                                                            • /

s="1223334444" occ.=0 n=0 Do i=1 To length(s)

 c=substr(s,i,1)
 occ.c=occ.c+1
 n=n+1
 End

do c=1 To 4

 p.c=occ.c/n
 say c p.c
 End

e=0 Do c=1 To 4

 e=e+p.c*log(p.c,,2)
 End

Say 'Entropy of' s 'is' (-e) Exit

log: Procedure /***********************************************************************

  • Return log(x) -- with specified precision and a specified base
  • Three different series are used for the ranges 0 to 0.5
  • 0.5 to 1.5
  • 1.5 to infinity
  • 03.09.1992 Walter Pachl
                                                                                                                                              • /
 Parse Arg x,prec,b
 If prec= Then prec=9
 Numeric Digits (2*prec)
 Numeric Fuzz   3
 Select
   When x<=0 Then r='*** invalid argument ***'
   When x<0.5 Then Do
     z=(x-1)/(x+1)
     o=z
     r=z
     k=1
     Do i=3 By 2
       ra=r
       k=k+1
       o=o*z*z
       r=r+o/i
       If r=ra Then Leave
       End
     r=2*r
     End
   When x<1.5 Then Do
     z=(x-1)
     o=z
     r=z
     k=1
     Do i=2 By 1
       ra=r
       k=k+1
       o=-o*z
       r=r+o/i
       If r=ra Then Leave
       End
     End
   Otherwise /* 1.5<=x */ Do
     z=(x+1)/(x-1)
     o=1/z
     r=o
     k=1
     Do i=3 By 2
       ra=r
       k=k+1
       o=o/(z*z)
       r=r+o/i
       If r=ra Then Leave
       End
     r=2*r
     End
   End
 If b<> Then
   r=r/ln(b)
 Numeric Digits (prec)
 r=r+0
 Return r

</lang> Output:

1 0.1
2 0.2
3 0.3
4 0.4
Entropy of 1223334444 is 1.84643934     

version 2

REXX doesn't have a BIF for LOG or LN,   so the subroutine (function) LOG2 is included herein. <lang rexx>/*REXX program calculates the information entropy for a given char str.*/ numeric digits 30 /*use thirty digits for precision*/ parse arg $; if $== then $=1223334444 /*obtain optional input*/ n=0; @.=0; L=length($); $$=

     do j=1  for L;  _=substr($,j,1)  /*process each character in $ str*/
     if @._==0  then do;   n=n+1      /*if unique,  bump char counter. */
                     $$=$$ || _       /*add this character to the list.*/
                     end
     @._ = @._+1                      /*keep track of this char count. */
     end   /*j*/

sum=0 /*calc info entropy for each char*/

     do i=1  for n;  _=substr($$,i,1) /*obtain a char from unique list.*/
     sum=sum  -  @._/L  * log2(@._/L) /*add (negatively) the entropies.*/
     end   /*i*/

say ' input string: ' $ say 'string length: ' L say ' unique chars: ' n ; say say 'the information entropy of the string ──► ' format(sum,,12) exit /*stick a fork in it, we're done.*/ /*──────────────────────────────────LOG2 subroutine─────────────────────*/ log2: procedure; parse arg x 1 xx; ig= x>1.5; is=1-2*(ig\==1); ii=0 numeric digits digits()+5 /* [↓] precision of E must be > digits().*/ e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535

   do  while  ig & xx>1.5 | \ig&xx<.5;   _=e;     do k=-1;  iz=xx* _**-is
   if k>=0 & (ig & iz<1 | \ig&iz>.5)     then leave;  _=_*_;  izz=iz; end
   xx=izz;  ii=ii+is*2**k;  end;         x=x* e**-ii-1;  z=0;  _=-1;  p=z
     do k=1;  _=-_*x;  z=z+_/k;  if z=p  then leave;  p=z;  end  /*k*/
   r=z+ii;  if arg()==2  then return r;  return r/log2(2,0)</lang>

output when using the default input of: 1223334444

 input string:  1223334444
string length:  10
 unique chars:  4

the information entropy of the string ──►  1.846439344671

output when using the input of: Rosetta Code

 input string:  Rosetta Code
string length:  12
 unique chars:  9

the information entropy of the string ──►  3.084962500721

Ruby

Works with: Ruby version 1.9

<lang ruby>def entropy(s)

 counts = Hash.new(0)
 s.each_char { |c| counts[c] += 1 }
 counts.values.reduce(0) do |entropy, count|
   freq = count / s.length.to_f
   entropy - freq * Math.log2(freq)
 end

end</lang> One-liner, same performance (or better): <lang ruby>def entropy2(s)

 s.each_char.group_by(&:to_s).values.map { |x| x.length / s.length.to_f }.reduce(0) { |e, x| e - x*Math.log2(x) }

end</lang>

Scala

<lang scala>import scala.math._

def entropy( v:String ) = { v

 .groupBy (a => a)
 .values
 .map( i => i.length.toDouble / v.length )
 .map( p => -p * log10(p) / log10(2))
 .sum

}

// Confirm that "1223334444" has an entropy of about 1.84644 assert( math.round( entropy("1223334444") * 100000 ) * 0.00001 == 1.84644 )</lang>

Tcl

<lang tcl>proc entropy {str} {

   set log2 [expr log(2)]
   foreach char [split $str ""] {dict incr counts $char}
   set entropy 0.0
   foreach count [dict values $counts] {

set freq [expr {$count / double([string length $str])}] set entropy [expr {$entropy - $freq * log($freq)/$log2}]

   }
   return $entropy

}</lang> Demonstration: <lang tcl>puts [format "entropy = %.5f" [entropy "1223334444"]] puts [format "entropy = %.5f" [entropy "Rosetta Code"]]</lang>

Output:
entropy = 1.84644
entropy = 3.08496

XPL0

<lang XPL0>code real RlOut=48, Ln=54; \intrinsic routines string 0; \use zero-terminated strings

func StrLen(A); \Return number of characters in an ASCIIZ string char A; int I; for I:= 0, -1>>1-1 do

   if A(I) = 0 then return I;

func real Entropy(Str); \Return Shannon entropy of string char Str; int Len, I, Count(128); real Sum, Prob; [Len:= StrLen(Str); for I:= 0 to 127 do Count(I):= 0; for I:= 0 to Len-1 do \count number of each character in string

   Count(Str(I)):= Count(Str(I)) + 1;

Sum:= 0.0; for I:= 0 to 127 do

   if Count(I) # 0 then        \(avoid Ln(0.0) error)
       [Prob:= float(Count(I)) / float(Len);   \probability of char in string
       Sum:= Sum + Prob*Ln(Prob);
       ];

return -Sum/Ln(2.0); ];

RlOut(0, Entropy("1223334444"))</lang>

Output:
    1.84644