Talk:Entropy

From Rosetta Code

What about a less dull example?

What about using "Rosetta code" as an example instead of the dull "1223334444"?--Grondilu 17:53, 22 February 2013 (UTC)

Or even funnier: write a program that computes its own entropy :) --Grondilu 12:40, 25 February 2013 (UTC)
Better yet, a bonus task of computing the entropy of each solution on the page. :) --TimToady 19:23, 25 February 2013 (UTC)
I like computing the entropy of “Rosetta Code” (it's about 3.08496, assuming my code is right); a more self-referential one is fine too, except it involves features that might block some languages from participating. (The draft child task is a better place for that.) –Donal Fellows 09:31, 26 February 2013 (UTC)
I have added a task Fibonacci_word which I think is an interesting application of Entropy--Nigel Galloway (talk) 15:53, 28 December 2013 (UTC)
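
For reference, Donal's 3.08496 figure for "Rosetta Code" is easy to double-check with a minimal Python sketch (not one of the task solutions; the function name is just for illustration):

<lang python>from collections import Counter
from math import log2

def entropy(s):
    """Shannon entropy H of the characters of s, in bits/symbol."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

print(entropy("Rosetta Code"))   # about 3.08496, matching the figure above</lang>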

Alternate form

Not sure this is very useful, but I was wondering if one could not find a more concise way of writing this.

If we call N the length of the string and n<sub>c</sub> the number of occurrences of the character c, we have: '''H = log(N) - (1/N) * Σ<sub>c</sub> n<sub>c</sub> log(n<sub>c</sub>)'''

In Perl 6, this allows a slightly simpler formula, i.e. one not using hyperoperators:

<lang Perl 6>sub entropy(@a) {
    log(@a) - @a R/ [+] map -> \n { n * log n }, @a.bag.values
}</lang>

For what it's worth.--Grondilu 18:14, 4 March 2013 (UTC)
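
A quick numerical check that the rearranged formula agrees with the direct definition (a Python sketch using natural logs, as the Perl 6 version does; the function names are just for illustration):

<lang python>from collections import Counter
from math import log

def entropy_direct(s):
    """H = -sum(p * log(p)), in nats/symbol."""
    n = len(s)
    return -sum(c / n * log(c / n) for c in Counter(s).values())

def entropy_rearranged(s):
    """H = log(N) - (1/N) * sum(n_c * log(n_c)), in nats/symbol."""
    n = len(s)
    return log(n) - sum(c * log(c) for c in Counter(s).values()) / n

s = "1223334444"
print(entropy_direct(s), entropy_rearranged(s))   # both about 1.2799 nats/symbol</lang>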

Description

I think the task is not described correctly. It calculates the average entropy per character, not the entropy of the message as a whole.

For example, I generated this 100-digit string with 0 through 7 from random.org. It has 300 bits of entropy (more conservatively, 299 to 300 bits of entropy, in case the generation is slightly flawed). The task gives it just under 3: <lang parigp>entropy("1652410507230105455225274644011734652475143261241037401534561367707447375435416503021223072520334062") %1 = 2.9758111700170733830744745234131842224</lang>

CRGreathouse (talk) 02:08, 17 April 2013 (UTC)

I don't know. It's not simple. I was a bit embarrassed with this task as I struggled to understand what the entropy of a string is. Normally entropy is defined for a system whose internal state is only known probabilistically. A string is known exactly, so stricto sensu the entropy of any string is zero. As I understood it, the task defines the entropy of a string as the entropy of whatever system "generated" the string. The string is thus seen as a representative sample of the probabilities of each possible character, seen as a random variable.
In your example, you say the entropy is 300 bits, but I'm not sure it makes much sense to say that. 300 bits is the amount of information in this string, if you consider it as the result of picking one of the 8^100 (about 2^300) equally likely 100-digit strings. Yet the entropy is still zero, for your string is exactly defined.
The entropy as currently defined in the task does not directly depend on the "size" of the string in memory, because it merely represents the probability table of a stochastic process.
--Grondilu (talk) 15:42, 17 April 2013 (UTC)
What I mean when I say 299-300 bits is that if we knew I was going to pick 100 digits from random.org as described, we wouldn't be able to design a scheme that would average fewer than 299 bits to tell you which one I picked, but we could design one which would do it in 300. This task takes a slightly different model (as if I were going to generate a 100-character string with a given number of each digit), but the two are similar. Either way the string has about 300 bits and each character has about 3.
CRGreathouse (talk) 00:55, 18 April 2013 (UTC)
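
To make the per-symbol versus total distinction concrete, here is a small Python sketch (not one of the task solutions) applied to the digit string above:

<lang python>from collections import Counter
from math import log2

def H(s):
    """Average entropy per symbol, in bits/symbol -- what the task computes."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

digits = ("1652410507230105455225274644011734652475143261241037401534561367"
          "707447375435416503021223072520334062")
print(H(digits))                # about 2.9758 bits/symbol (the task's answer)
print(len(digits) * H(digits))  # about 297.6 bits in total, close to the 300 bits discussed above</lang>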

REXX (log2)

"The LOG2 subroutine in only included here for functionality, not to document how to calculate LOG2 using REXX"

What's the difference and where is it / should it be documented?
Difference between what?
The LOG2 function should/could be documented in a Rosetta Code task that asks for a user-written version of LN/LOG2/LOG (as opposed to one that is a BIF).   Documenting a subroutine that is a BIF (built-in function) for most languages would just make the LOG2 function much too bulky and detract from the task's requirement:   solve the entropy problem.   The REXX LOG2 function (which is a modified LN function) was originally written to be compact and written as a one-liner function, intended to be placed at the end of the program in which it was needed (it, and another hundred or so subroutines/functions --- that's no exaggeration).   Strictly speaking, the   E   value was originally a function, but it was abbreviated here (about 80 decimal digits as a constant rather than a function call) to save even more space and complexity.   Also omitted were error checks that were in the original LN function, as well as   SIGNAL ON NOVALUE;   SIGNAL ON SYNTAX,   SIGNAL ON HALT, and the mechanism of issuing/displaying the error message(s) and the offending REXX statements.   A lot of crafting went into the writing of the original LN function (this was under CMS REXX and then ported to PC/REXX in 1989, as I recall).   PC/REXX has some restrictions on the number of symbols, the aggregate size of the values of the variables, and the size of the REXX program. -- Gerard Schildberger (talk) 23:47, 28 May 2013 (UTC)
Take 'my' (now corrected - thanks) log as an example --Walterpachl
If this Rosetta Code task was to explain how LN (or LOG2) was calculated, that would be appropriate. -- Gerard Schildberger (talk) 23:47, 28 May 2013 (UTC)
Here's a version of the (REXX) LOG2 function, unrolled and expatiated: -- Gerard Schildberger (talk) 23:47, 28 May 2013 (UTC)

<lang rexx>/*──────────────────────────────────LOG2 subroutine───────────────────────────*/
log2: procedure;  parse arg x 1 xx
ig= x>1.5
is= 1 - 2*(ig\==1)
numeric digits digits()+5               /* [↓]  precision of E must be > digits().*/
e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535
ii=0
                 do  while  ig & xx>1.5  |  \ig & xx<.5
                 _=e
                        do j=-1
                        iz=xx* _**-is
                        if j>=0  then  if  ig & iz<1  |  \ig&iz>.5  then leave
                        _=_*_
                        izz=iz
                        end   /*j*/
                 xx=izz
                 ii=ii + is* 2**j
                 end   /*while*/
x=x * e**-ii -1
z=0
_=-1
p=z
                 do k=1
                 _=-_*x
                 z=z+_/k
                 if z=p  then leave
                 p=z
                 end   /*k*/
r=z+ii
if arg()==2  then return r
             return r / log2(2,0)</lang>

This is not exactly "entropy". H is in bits/symbol.

This article is confusing Shannon entropy with information entropy and incorrectly states Shannon entropy H has units of bits.

There are many problems in applying H = -1*sum(p*log(p)) to a string and calling it the entropy of that string. H is called entropy but its units are bits/symbol, or entropy/symbol if the correct log base is chosen. For example, H of 01 and of 011100101010101000011110 is exactly the same "entropy", H = 1 bit/symbol, even though the 2nd one obviously carries more information entropy than the 1st. Another problem is that if you simply re-express the same data in hexadecimal, H gives a different answer for the same information entropy. The best and real information entropy of a string is 4) below. Applying 4) to binary data gives the entropy in "bits", but since the data was binary, its result is also a true statistical "entropy" without having to specify "bits" as a unit.
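
Both effects are easy to reproduce with a minimal Python sketch (the hex re-encoding at the end is just one way of re-expressing the same 24 bits):

<lang python>from collections import Counter
from math import log2

def H(s):
    """Shannon H in bits/symbol."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

print(H("01"))                         # 1.0 bit/symbol
print(H("011100101010101000011110"))   # also 1.0 bit/symbol, despite carrying more total information
# The same 24 bits re-expressed as 6 hexadecimal digits gives a different H:
print(H(format(int("011100101010101000011110", 2), "06x")))   # about 2.25 bits/symbol</lang>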

Total entropy (in an arbitrarily chosen log base, which is not the best type of "entropy") for a file is S=N*H where N is the length of the file. Many times in Shannon's book he says H is in units of "bits/symbol", "entropy/symbol", and "information/symbol". Some people don't believe Shannon, so here's a modern respected researcher's home page that tries to clear the confusion by stating the units out in the open.

Shannon called H "entropy" when he should have said "specific entropy", which is analogous to physics' S<sub>0</sub> that is on a per-kg or per-mole basis instead of S. On page 13 of Shannon's book, you can easily see Shannon's horrendous error that has resulted in so much confusion. On that page he says H, his "entropy", is in units of "entropy per symbol". This is like saying some function "s" is called "meters" and its results are in "meters/second". He named H after Boltzmann's H-theorem, where H is a specific entropy on a per-molecule basis. Boltzmann's entropy S = k*N*H = k*ln(states).

There are 4 types of entropy of a file N symbols long with n unique types of symbols (a short code sketch computing 1 through 4 appears after the list):

1) Shannon (specific) entropy H = sum(count_i / N * log(N / count_i)) where count_i is the number of times symbol i occurred in N. Units are bits/symbol if log is base 2, nats/symbol if natural log.

2) Normalized specific entropy: Hn = H / log(n). The division converts the logarithm base of H to n. Units are entropy/symbol or 1/symbol. Ranges from 0 to 1. When it is 1, each symbol occurred equally often, N/n times each. Near 0 means all symbols except one occurred only once, and the rest of a very long file was that one other symbol. "Log" is in the same base as H.

3) Total entropy S' = N * H. Units are bits if log is base 2, nats if natural log is used.

4) Normalized total entropy Sn' = N * H / log(n). See "gotcha" below in choosing n. It has no units, in the same way a ratio has no units. Notice the formula uses a ratio derived from the data itself instead of a logarithm base chosen by a person. This is the total entropy of the symbols. It varies from 0 to N.

5) Physical entropy S of a binary file when the data is stored perfectly efficiently (using Landauer's limit): S = S' * kB / log(e)

6) Macroscopic information entropy of an ideal gas of N identical molecules in its most likely random state (n=1 and N is known a priori): S' = S / kB / ln(1) = kB*[ln(states^N/N!)] = kB*N* [ln(states/N)+1].

  • Gotcha: a data generator may have the option of using say 256 symbols but only use 200 of those symbols for a set of data. So it becomes a matter of semantics whether you choose n=256 or n=200, and neither may work (i.e., give the same entropy when the same data is expressed in a different symbol set) because an implicit compression has been applied.
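
A minimal Python sketch of quantities 1) through 4), with n taken to be the number of symbols actually present in the data (which is exactly the semantic choice the gotcha warns about):

<lang python>from collections import Counter
from math import log2

def entropies(s):
    """Return (H, Hn, S_total, Sn_total) for the string s, using logs in base 2."""
    N = len(s)
    counts = Counter(s)
    n = len(counts)                           # unique symbols actually used (see the gotcha above)
    H  = sum(c / N * log2(N / c) for c in counts.values())   # 1) bits/symbol
    Hn = H / log2(n) if n > 1 else 0.0        # 2) normalized specific entropy, 0..1
    S  = N * H                                # 3) total entropy, bits
    Sn = S / log2(n) if n > 1 else 0.0        # 4) normalized total entropy, 0..N
    return H, Hn, S, Sn

print(entropies("1223334444"))   # H is about 1.8464 bits/symbol, Hn about 0.923</lang>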