Talk:Arithmetic coding/As a generalized change of radix: Difference between revisions

m
Updated links to M.Nelson articles
No edit summary
m (Updated links to M.Nelson articles)
 
(27 intermediate revisions by 3 users not shown)
Line 42:
:::::::: After all, the Wikipedia page seems to be contradictory, as it mentions both mapping the string to digits (not saying which digits), and bellow referring to <math>\scriptstyle C_i</math> as the cumulative frequency of the current symbol. I agree, it's a mess...
:::::::: (Off-topic: to enable math markup, go to [[Special:Preferences#mw-prefsection-rendering]] and select ''MathML with SVG or PNG fallback'' at the bottom of the page) --[[User:Trizen|Trizen]] ([[User talk:Trizen|talk]]) 21:04, 28 January 2016 (UTC)
::::::::: Thank you, (and actually someone else adjusted that setting on my behalf), that worked around the problem with PNG rendering. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 18:45, 30 January 2016 (UTC)
:::::::: I don't understand what your confusion is. 0, 1, and 3 are the cumulative frequencies for A, B, and D respectively. The cumulative frequency = the sum of the frequencies of the letters accumulated before it. So if we choose to accumulate in the order A -> B -> D (in order of increasing character code), then for A, the cumulative frequency is 0 (since there's nothing before it); for B, it's the sum of the frequencies for A = 1; for D, it's the sum of the frequencies for A and B = 1 + 2 = 3. That is cumulative frequency. The letters themselves are irrelevant; it's just the frequencies of each and order that we accumulate them that is relevant. It has *nothing* to do with "letter encoding values". --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 01:15, 29 January 2016 (UTC)
 
:::::::: I don't understand what your confusion is. 0, 1, and 3 are the cumulative frequencies for A, B, and D respectively. The cumulative frequency = the sum of the frequencies of the letters accumulated before it. So if we choose to accumulate in the order A -> B -> D (in order of increasing character code), then for A, the cumulative frequency is 0 (since there's nothing before it); for B, it's the sum of the frequencies for A = 1; for D, it's the sum of the frequencies for A and B = 1 + 2 = 3. If there were another one after that, say Q, its cumulative frequency would be the sum of the frequencies for A, B, and D = 1 + 2 + 3 = 6. That is cumulative frequency. The letters themselves are irrelevant; it's just the frequencies of each and order that we accumulate them that is relevant. It has *''nothing*'' to do with "letter encoding values". --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 01:15, 29 January 2016 (UTC)
::::::::: I do not think you read what I wrote. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:28, 30 January 2016 (UTC)
:::::::: The purpose of all this is to divide the interval from 0 to 1 (or in this case, the interval from 0 to (length of string) since they want to use all integers) fully into intervals for each of the characters, where the length of the interval for each character is proportional to its frequency. So we want an interval of length 1 for A, length 2 for B, and length 3 for D. The implementations here accomplish that by assigning them from 0 upwards to the characters in ascending character order. So A gets the interval 0-1, B gets the interval 1-3, and D gets the interval 3-6. So each character gets an interval represented by the start being its "cumulative frequency", and the length of the interval being its frequency. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 01:15, 29 January 2016 (UTC)
:::::::::: I considered that interpretation, but for it to work you'd need a dictionary to figure out which letters were in use, and that's not part of the definition. Worse, though, this conflicts with the writeup and example at [[wp:Arithmetic_coding#Arithmetic_coding_as_a_generalized_change_of_radix|wikipedia]]. I mean, yes, the choice of the length of the string as the number base sort of implies something like this, but using cumulative frequency to identify the letters is a worse choice than fixed letter values because you do not know which letters have which frequencies. Also, if you read the wikipedia page, you will see that it says "A = 0, B = 1, C = 2, D = 3" and you can't get C = 2 for that example if you are pretending that these numbers represent cumulative frequencies. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:28, 30 January 2016 (UTC)
::::::::::: "Fixed letter values" doesn't make sense. A coding needs to divide the interval 0-1 into non-overlapping intervals for each character. Huffman coding is the optimal length coding where intervals are all restricted to the size 1/(some power of 2). Arithmetic coding is the optimal coding where the intervals can be any size, and mathematically, the optimal length coding is one where each character's interval is the same size as its probability of occurring, so if "DABDDB" were representative of the letter frequencies, then A should be given an interval of size 1/6, B given an interval of size 1/3, and D given an interval of size 1/3. (The "generalized change of radix" representation simply assumes a denominator of "base" (6), so the interval sizes are 1, 2, and 3, respectively.) How does T=19, H=7, E=4 (as you suggest) help us divide up 0-1 into intervals? It doesn't. And this is not just for letters; this is for all unit symbols, so for bytes there are 256 characters. How are you going to divide up 0-1 into intervals? If you are just going to give an interval of size 1/256 for each character, that's not a (remotely optimal) coding -- that's just the original plaintext binary. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 18:20, 31 January 2016 (UTC)
::::::::: That said, using cumulative frequencies in place of letter values for your encoding means words like THE, ONE, TWO, DOG, BAD, etc. etc. all encode to the same number. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 18:45, 30 January 2016 (UTC)
:::::::::: Yup, they SHOULD all encode to the same number. That's the point, because they have the same set of frequencies. It's the dictionary that allows you to decode them. In Huffman Coding, all of those should also similarly encode to the same binary code, and it's the dictionary that allows you to decode. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 21:35, 30 January 2016 (UTC)
::::::::::: Well, I guess it's true that huffman coding implies a shared dictionary between the encoder and the decoder. And, there is an explicit analogy drawn here with huffman coding. But if that's the case, we either need to change the wikipedia page or we need to spell out this issue in the task requirements or both. And if this is the "correct definition" but we "cannot change the wikipedia page" then we should also address the conflict in the task description. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:34, 30 January 2016 (UTC)
:::::::::::: I don't see anything wrong with the Wikipedia page. Can you point to any statement on the Wikipedia page you think is incorrect? --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 18:20, 31 January 2016 (UTC)
:::::::::::::Ok, I will make these points again. Given how many times I've already mentioned the C = 2 issue in the context of the DABDDB example, I do not hold much hope that this will be the last time you ask me to state them, but I don't know what else to tell you. The wikipedia page currently states:
:::::::::::::: we may look at any sequence of symbols:
::::::::::::::
:::::::::::::::<math>DABDDB</math>
::::::::::::::
:::::::::::::: as a number in a certain base presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer ''A''&nbsp;=&nbsp;0, ''B''&nbsp;=&nbsp;1, ''C''&nbsp;=&nbsp;2, ''D''&nbsp;=&nbsp;3, and so on.
::::::::::::: and this conflicts with (or, perhaps, illustrates) your above statement that ''"Fixed letter values" doesn't make sense.'' If you do not see that, please read that again, and notice the '''C = 2''' part as well as the descriptive text which supports C = 2.
::::::::::::: But, also: '''"The cumulative frequency is the total of all frequencies below it in a frequency distribution (a running total of frequencies)"''' on the wikipedia page conflicts with a cumulative frequency encoding of letters and 1150764267498783364 as the encoding for 'TOBEORNOTTOBEORTOBEORNOT'. (To get 1150764267498783364, I needed cumulative frequencies in alphabetic order rather than in frequency order.)
:::::::::::::That said, I am not maintaining that the wikipedia page is incorrect - merely that it conflicts with some parts of a number of the implementations of this task. (Well, that, and mostly useless. But I don't see much use for any implementation of this technique, so I'm ignoring that issue.) --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:53, 31 January 2016 (UTC)
:::::::::::::: "involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer" means that the symbols come from some set with an ordering. In this case, they are using the set of letters and using ascending alphabetic order (and in the Rosetta Code implementation we are using the set of all bytes and using ascending character byte value order). All that matters is that A < B < C < D < ... The actual *values* are COMPLETELY IRRELEVANT to the algorithm -- they are just used in the encoding and decoding dictionaries. The reason that they are requiring that there is an ordering of the symbols is because "cumulative" requires some order of accumulation; that's the only reason why these "sequential integers" is even mentioned. Your confusion comes from the fact that you are wrongfully thinking that these "sequential integers" are somehow used in the algorithm. But nowhere in the Wikipedia description does it say anything to that effect. You are completely imagining that, possibly because the cumulative frequency in this example is coincidentally similar to the integer values they assign. The Wikipedia description shows very clearly where the cumulative frequencies come from and the "sequential integers" never factor into any of that. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 04:14, 1 February 2016 (UTC)
::::::::::::::: Why do you claim that some statements in the wikipedia page are irrelevant? If they are irrelevant, how is that different from them being in error? --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 05:40, 1 February 2016 (UTC)
:::::::::::::::: I did not say any statement in the Wikipedia page is not relevant. The statement showing the "sequential integers" of the symbols is relevant in demonstrating an ordering. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 05:59, 1 February 2016 (UTC)
::::::::::::::::: Ok, so how is the C = 2 statement that I quoted from the wikipedia relevant? --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 06:06, 1 February 2016 (UTC)
:::::::::::::::::: They are defining an ordering on all the symbols in the letter space, without regard to a particular message, instead of defining an ordering on only the symbols that happen to appear in this message. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 06:31, 1 February 2016 (UTC)
::::::::::::::::::: Actually, no. Wait.
::::::::::::::::::: If things are like you say, a problem is that the wikipedia page uses a poorly choosen example. '''DABDDB''' has cumulative frequency values which match the letter values which are stated on the page.
::::::::::::::::::: Your argument, in essence, is that using those letter values is not useful, so we should ignore them and use cumulative frequency values instead.
::::::::::::::::::: But the algorithm itself is not very useful - it's more of an exercise.
::::::::::::::::::: Also, the wikipedia page specifies C=2 and you claim that's only for ordering, but the wikipedia page specifies "each symbol in the ordered set denotes a sequential integer", but cumulative frequencies skip integers in the sequence for all examples used in this task.
::::::::::::::::::: So, as it currently stands, the wikipedia writeup is ambiguous, and uses a poorly choosen example which does not help resolve this ambiguity.
::::::::::::::::::: So... I'm going to have to study some other treatments of this algorithm (http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/ looks promising), and I imagine once I've resolved these issues to my satisfaction I'll have to update the task description. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 09:58, 1 February 2016 (UTC)
 
== Goof? (part 2) ==
 
I have expanded the J entry to illustrate some of my concerns. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 00:39, 31 January 2016 (UTC)
 
== Usefulness of approach? ==
 
So, ok... at least a part of the "goof" discussed in the above section might have been my own.
 
Still, after having resolved that, I am dissatisfied with this algorithm.
 
First, note that the results generated in this task are not sufficient to recover the original character sequences. To recover the original character sequence we probably need to know which characters were used and the number of times each character appeared.
 
That's something of a problem because this algorithm, and the frequencies used, are both explicitly tied to the length of the character sequence being compressed. It doesn't work with pre-shared parameters unless you limit yourself to a fixed collection of characters. If you are working with D D E G G H O O O T as your pre-shared dictionary (or, [[Run-length_encoding|more compactly]]: 2DE2GH3OT), you cannot represent THEGOODCAT --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 07:59, 1 February 2016 (UTC)
 
== Next steps? ==
 
After reading http://marknelson.us/2014/10/19/data-compression-with-arithmetic-coding/ I think the important thing is that there are many different arithmetic coding algorithms.
So, right now, from my point of view, the implementations in this task have deviated in a minor way from my reading of the wikipedia entry. And, this deviation is motivated by "usefulness", but does not actually go far enough to be useful.
So probably what should happen is the task description should go into more detail about those ambiguities. Or, if we want a useful approach, we should also revise the task (to separate model generation from encryption/decryption using that model - frequencies should probably be normalized to sum to one, also, if we want a generically useful model).
But it would also be possible (or at least less immediate effort, though also probably less useful so perhaps more effort and pain in the long term) to just spell out in the task description that each symbol represents (or is represented by) its cumulative frequency (accumulated in alphabetic order).
Still, that's a choice to be made, and obviously other people have different perspectives from mine. So I guess I'd like some feedback on what other people think we should do about this draft task. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:28, 1 February 2016 (UTC)
 
: If it's of any help, I can add a link in the task description to [http://trizenx.blogspot.ro/2015/05/arithmetic-coding.html a little bit more detailed tutorial] which explains the encoding and the decoding process, using an unambiguous example. [[User:Trizen|Trizen]] ([[User talk:Trizen|talk]]) 23:47, 1 February 2016 (UTC)
 
:'''DEAD LINKS UPDATE''': The links to the articles (1991 and 2014) by Mark Nelson are no longer valid. Here are the the updated links:
:* https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html
:* https://marknelson.us/posts/1991/02/01/arithmetic-coding-statistical-modeling-data-compression.html
1

edit