Talk:Entropy

== This is not exactly "entropy". H is in bits/symbol. ==
This article confuses Shannon entropy with information entropy, and it incorrectly states that Shannon entropy H has units of bits.
 
There are many problems in applying H = -sum(p<sub>i</sub>*log(p<sub>i</sub>)) to a string and calling the result the entropy of that string. H is called "entropy", but its units are bits/symbol (or entropy/symbol if the appropriate log base is chosen). For example, H of 01 and of 011100101010101000011110 is exactly the same, H = 1 bit/symbol, even though the second string obviously carries more total information than the first. Another problem is that if you simply re-express the same data in hexadecimal, H gives a different answer for the same underlying information. The best measure of the real information entropy of a string is 4) below.
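
For concreteness, here is a quick Python check of those two claims (the helper H() below is just the count-based formula given in 1) further down; the names and sample strings are only for illustration):

<lang python>from collections import Counter
from math import log2

def H(s):
    """Shannon (specific) entropy of a string, in bits per symbol."""
    counts = Counter(s)
    N = len(s)
    return sum(c / N * log2(N / c) for c in counts.values())

# Both binary strings give H = 1 bit/symbol, even though the second
# carries far more total information (S = N*H = 2 bits vs. 24 bits).
print(H("01"))                          # 1.0
print(H("011100101010101000011110"))    # 1.0

# The same 24 bits re-expressed as hexadecimal give a different H.
print(H("72AA1E"))                      # about 2.25 bits/symbol</lang>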
 
Total entropy for a file (in an arbitrarily chosen log base, which is not the best type of "entropy") is S = N*H, where N is the length of the file. Many times in [http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf Shannon's book] he says H is in units of "bits/symbol", "entropy/symbol", and "information/symbol". Some people don't believe Shannon, so [https://schneider.ncifcrf.gov/ here's a modern, respected researcher's home page] that tries to clear up the confusion by stating the units explicitly.
 
Shannon called H "entropy" when he should have said "specific entropy", which is analogous to physics' S<sup>0</sup> that is on a per-kg or per-mole basis instead of S. On page 13 of [http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf Shannon's book] you can easily see Shannon's horrendous error that has resulted in so much confusion. On that page he says H, his "entropy", is in units of "entropy per symbol". This is like calling some function "s" "meters" when its results are in meters/second. He named H after Boltzmann's H-theorem, where H is a specific entropy on a per-molecule basis. Boltzmann's entropy is S = k*N*H = k*ln(states).
 
There are 4 types of entropy of a file N symbols long with n unique types of symbols; items 5) and 6) below extend these to physical entropy, and a rough Python sketch of items 1) through 5) follows the list:
 
1) Shannon (specific) entropy '''H = sum(count_i / N * log(N / count_i))'''
where count_i is the number of times symbol i occurred in the N symbols.
Units are bits/symbol if log is base 2, nats/symbol if natural log.
 
2) Normalized specific entropy: '''H<sub>n</sub> = H / log(n).'''
The division converts the logarithm base of H to n. Units are entropy/symbol. It ranges from 0 to 1. When it is 1, each symbol occurred equally often, N/n times each. Near 0 means all symbols except one occurred only once, and the remaining symbol made up the rest of a very long file. "Log" is in the same base as H.
 
3) Total entropy '''S' = N * H.'''
Units are bits if log is base 2, nats if natural log.
 
4) Normalized total entropy '''S<sub>n</sub>' = N * H / log(n).''' See the "gotcha" below about choosing n.
Unit is "entropy". It varies from 0 to N.
 
5) Physical entropy S of a binary file when the data is stored perfectly efficiently (i.e., at Landauer's limit): '''S = S' * k<sub>B</sub> / log(e)''', where log is in the same base used for S' (so for S' in bits, S = S' * k<sub>B</sub> * ln(2)).
 
6) Macroscopic information entropy of an ideal gas of N identical molecules in its most likely random state (n=1 and N is known a priori): '''S = k<sub>B</sub>*ln(states<sup>N</sup>/N!) = k<sub>B</sub>*N*[ln(states/N)+1]''', so the corresponding information entropy in nats is S' = S / k<sub>B</sub>.
 
*Gotcha: a data generator may have the option of using, say, 256 symbols but use only 200 of them for a given set of data. So it becomes a matter of semantics whether you choose n=256 or n=200, and neither choice may give a consistent result (i.e., the same entropy when the data is re-expressed in a different symbol set) because an implicit compression has been applied.
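
Here is the rough Python sketch of items 1) through 5) mentioned above. The sample string is arbitrary, and for item 5) the data is assumed to be in bits (log base 2); this is only an illustration of the formulas, not a reference implementation:

<lang python>from collections import Counter
from math import log, log2

def entropies(s):
    """Return items 1)-5) for a string s of N symbols with n unique symbols."""
    counts = Counter(s)
    N = len(s)
    n = len(counts)
    H  = sum(c / N * log2(N / c) for c in counts.values())   # 1) bits/symbol
    Hn = H / log2(n) if n > 1 else 0.0                        # 2) normalized, 0 to 1
    S  = N * H                                                # 3) total entropy, bits
    Sn = N * Hn                                               # 4) normalized total entropy
    k_B = 1.380649e-23                                        # Boltzmann constant, J/K
    S_phys = S * k_B * log(2)                                 # 5) J/K at Landauer's limit
    return H, Hn, S, Sn, S_phys

print(entropies("1223334444"))   # H is about 1.846 bits/symbol for this string</lang>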