Talk:Arithmetic coding/As a generalized change of radix

Goof?

The J entry gives different values than Go/Perl/Perl 6/Ruby/Sidef for most of these examples.

The difference seems to be related to the difference between "frequency" and "cumulative frequency".

The wikipedia writeup uses frequency when describing the "As a generalized change of radix" mechanism, and then introduces "cumulative frequency" when describing how to reverse the process.

Anyways, it looks like Go/Perl/Perl 6/Ruby/Sidef are using a different algorithm than the one described in wikipedia, as they all use "cumulative frequency" rather than "frequency" when encoding.

Or perhaps the wikipedia page is in error? Do we have anyone with sufficient background in the history of this algorithm to make that determination? --Rdm (talk) 17:44, 27 January 2016 (UTC)

I am not familiar with J. But the number generated could be different if a different cumulative frequency dictionary is used. One commonality of the Go/Perl/Perl 6/Ruby/Sidef solutions is that they generate the cumulative frequencies by accumulating in ascending order of the character codes (because they check keys from 0 to 255), so they always generate the same cumulative frequency dictionary. However, the order of characters when accumulating is not relevant to the correctness of the algorithm. If you accumulate in a different order, you will get a different cumulative frequency dictionary, but as long as decoding uses the same cumulative frequency dictionary (i.e. it generates it by iterating through the frequency dictionary in the same order), it will produce the correct result. I can't read J, but if the J solution simply iterates through the frequency dictionary in an unspecified but consistent (not necessarily ascending) order of the keys, then it could generate a different cumulative frequency dictionary, and hence a different number. I am not sure if this is really what is happening. --Spoon! (talk) 10:40, 28 January 2016 (UTC)
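A small sketch of that point (a hypothetical helper, not taken from any of the listed solutions): the cumulative-frequency table depends on the order in which the frequency dictionary is accumulated, so the encoder and decoder merely have to agree on that order.

<lang python># Cumulative frequencies accumulated in a caller-chosen order.
def cumulative(freq, order):
    cum, total = {}, 0
    for s in order:
        cum[s] = total
        total += freq[s]
    return cum

freq = {"A": 1, "B": 2, "D": 3}                      # frequencies for "DABDDB"
print(cumulative(freq, sorted(freq)))                # {'A': 0, 'B': 1, 'D': 3}
print(cumulative(freq, sorted(freq, reverse=True)))  # {'D': 0, 'B': 3, 'A': 5}</lang>

Both tables are internally consistent; they just lead to different encoded numbers, which is why the ascending-byte-value convention shared by those solutions matters.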
The J implementation does not use cumulative frequency for this task. The wikipedia page clearly specifies that frequency and not cumulative frequency should be used for this task. Instead, it introduces cumulative frequency for the decoding operation.
So why are people using cumulative frequency for this task? --Rdm (talk) 16:24, 28 January 2016 (UTC)
I'm not sure if I understood your point correctly, but maybe this will clarify something. As Wikipedia describes it, in the encoding process, the formula to calculate the lower bound <math>\scriptstyle L</math> is:

<math>L = \sum_{i=1}^{n} n^{n-i} \, C_i \prod_{k=1}^{i-1} f_k</math>

where <math>\scriptstyle C_i</math> are the cumulative frequencies and <math>\scriptstyle f_k</math> are the frequencies of occurrences. After that, we compute the upper bound <math>\scriptstyle U</math> as follows:

<math>U = L + \prod_{i=1}^{n} f_i</math>

After this point, it doesn't matter which number we return, as long as <math>\scriptstyle L \le x < U</math>. --Trizen (talk) 17:28, 28 January 2016 (UTC)
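As a check on the two formulas above, here is a minimal sketch (not taken from any of the task entries) that evaluates them, assuming the cumulative frequencies are accumulated in ascending symbol order; for the string "DABDDB" discussed in this thread it prints (25002, 25110).

<lang python>from collections import Counter
from itertools import accumulate
from math import prod

def bounds(msg):
    n = len(msg)
    freq = Counter(msg)
    syms = sorted(freq)
    # cumulative frequencies, accumulated in ascending symbol order
    cum = dict(zip(syms, accumulate([0] + [freq[s] for s in syms[:-1]])))
    L, running = 0, 1      # running = product of frequencies of the symbols seen so far
    for i, s in enumerate(msg):
        L += n ** (n - 1 - i) * cum[s] * running
        running *= freq[s]
    U = L + prod(freq[s] for s in msg)
    return L, U

print(bounds("DABDDB"))    # (25002, 25110)</lang>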
For some reason, math markup isn't working right now, so that's a bit hard to read. But basically you seem to be restating the fact that you are using cumulative frequencies? But my question is why are you using cumulative frequencies? -- the use of cumulative frequencies conflicts with the example and text at wikipedia. --Rdm (talk) 19:35, 28 January 2016 (UTC)
The Wikipedia section uses cumulative frequencies. In the example, "DABDDB" gets mapped to 3, 0, 1, 3, 3, 1, where 0, 1, 3 are the cumulative frequencies for A, B, D, respectively. You can follow how those are used throughout the calculations.
As another example, consider a situation where each character has frequency 1. Then the frequency dictionary is just every key mapped to 1. What you need for encoding is the cumulative frequencies 0, 1, 2, ..., which tells you what value to assign to each character. --Spoon! (talk) 19:43, 28 January 2016 (UTC)
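A tiny illustration of that remark (a throwaway sketch, not from any entry): with every frequency equal to 1, accumulating in key order gives exactly 0, 1, 2, ... as the per-symbol values.

<lang python>freq = {c: 1 for c in "ABCDE"}    # every symbol occurs once
cum, total = {}, 0
for c in sorted(freq):
    cum[c] = total
    total += freq[c]
print(cum)                        # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}</lang>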
No.
Here's the wikipedia expression for L. It won't render properly right now, but you can go to the wikipedia page to read it:

<math>L = 6^5 \cdot 3 \,+\, 3 \cdot (6^4 \cdot 0) \,+\, (3 \cdot 1) \cdot (6^3 \cdot 1) \,+\, (3 \cdot 1 \cdot 2) \cdot (6^2 \cdot 3) \,+\, (3 \cdot 1 \cdot 2 \cdot 3) \cdot (6^1 \cdot 3) \,+\, (3 \cdot 1 \cdot 2 \cdot 3 \cdot 3) \cdot (6^0 \cdot 1) = 25002</math>
The frequencies used here are 1 (A), 2 (B) and 3 (D). The values 0, 1 and 3 correspond to the letters A (0), B (1) and D (2) - and you can see that in the wikipedia page where it states "The string is first mapped into the digit string 301331" - so that 0 has nothing to do with the concept of cumulative frequencies.
If I were to criticize the wikipedia page, I'd ding the author for choosing DABDDB as an example (because the letter encodings coincide exactly with the cumulative frequency values - so that's confusing).
But if I were being critical, I'd also criticize whoever designed this "as a generalized change of radix" algorithm for choosing the length of the string as the base for the encoding scheme. If anyone were to try to actually use this encoding scheme, they'd encode THE in base 3, using the encoding values T=19, H=7, E=4 (worked out just after this comment).
But, I guess that answers my question about why people are using cumulative frequencies in the encoding implementations: (a) the algorithm itself is basically useless, so no one should care about using it, and (b) the description in wikipedia uses an example where cumulative frequencies are easily confused with the letter encoding values.
All in all, it's a mess. --Rdm (talk) 19:57, 28 January 2016 (UTC)
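For what it's worth, the "THE" aside above can be made concrete (this is only an illustration of Rdm's positional reading, not a claim about what any implementation does): with all three frequencies equal to 1, the products of earlier frequencies are all 1, so the "base 3" value would be

<math>19 \cdot 3^2 + 7 \cdot 3 + 4 = 196,</math>

with "digit" values far larger than the radix 3, which is the oddity being pointed out.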
The example used on Wikipedia is indeed very confusing. Now I understand your point; you're suggesting using the corresponding byte values in encoding (or the alphabetical positions for alphabetical strings), instead of the cumulative frequencies. I don't know if this is better or worse, but if you choose to do that in the encoding process, you also have to do the same in the decoding process, otherwise you will get different results. For example, "ABRACADABRA" encoded with positional values results in "4860830 * 10^4" (same output as J). If we decode it back, using cumulative frequency instead, we'll get "AARAAAARAAA", which is incorrect (see the sketch after this comment).
After all, the Wikipedia page seems to be contradictory, as it mentions both mapping the string to digits (not saying which digits), and below referring to <math>\scriptstyle C_i</math> as the cumulative frequency of the current symbol. I agree, it's a mess...
(Off-topic: to enable math markup, go to Special:Preferences#mw-prefsection-rendering and select MathML with SVG or PNG fallback at the bottom of the page) --Trizen (talk) 21:04, 28 January 2016 (UTC)
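To make the consistency point above concrete, here is a minimal, self-contained sketch of the cumulative-frequency variant (the scheme described by the formulas earlier in this thread, not any particular task entry): the decoder rebuilds exactly the same tables the encoder used, and the round trip works.

<lang python>from collections import Counter

def cumulative(freq):
    # cumulative frequencies, accumulated in ascending symbol order
    cum, total = {}, 0
    for s in sorted(freq):
        cum[s] = total
        total += freq[s]
    return cum

def encode(msg):
    n, freq = len(msg), Counter(msg)
    cum = cumulative(freq)
    L, running = 0, 1
    for i, s in enumerate(msg):
        L += n ** (n - 1 - i) * cum[s] * running
        running *= freq[s]
    return L, freq               # any value in [L, L + running) decodes to the same string

def decode(code, freq):
    n = sum(freq.values())
    cum = cumulative(freq)       # must be built the same way as in encode()
    out, x = [], code
    for i in range(n):
        d = x // n ** (n - 1 - i)
        # the symbol whose [cum, cum + freq) interval contains d
        s = next(s for s in freq if cum[s] <= d < cum[s] + freq[s])
        out.append(s)
        x = (x - n ** (n - 1 - i) * cum[s]) // freq[s]
    return "".join(out)

code, freq = encode("DABDDB")
print(code)                      # 25002
print(decode(code, freq))        # DABDDB</lang>

Swapping in a different symbol-value table on the decode side only (as in the "AARAAAARAAA" example above) produces a different string, which is the consistency requirement being discussed.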