Talk:Arithmetic coding/As a generalized change of radix: Difference between revisions

:::::::::::::That said, I am not maintaining that the wikipedia page is incorrect - merely that it conflicts with some parts of a number of the implementations of this task. (Well, that, and that it's mostly useless. But I don't see much use for any implementation of this technique, so I'm ignoring that issue.) --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:53, 31 January 2016 (UTC)
:::::::::::::: "involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer" means that the symbols come from some set with an ordering. In this case, they are using the set of letters and using ascending alphabetic order (and in the Rosetta Code implementation we are using the set of all bytes and using ascending character byte value order). All that matters is that A < B < C < D < ... The actual *values* are COMPLETELY IRRELEVANT to the algorithm -- they are just used in the encoding and decoding dictionaries. The reason that they are requiring that there is an ordering of the symbols is because "cumulative" requires some order of accumulation; that's the only reason why these "sequential integers" is even mentioned. Your confusion comes from the fact that you are wrongfully thinking that these "sequential integers" are somehow used in the algorithm. But nowhere in the Wikipedia description does it say anything to that effect. You are completely imagining that, possibly because the cumulative frequency in this example is coincidentally similar to the integer values they assign. The Wikipedia description shows very clearly where the cumulative frequencies come from and the "sequential integers" never factor into any of that. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 04:14, 1 February 2016 (UTC)
::::::::::::::: Why do you claim that some statements in the wikipedia page are irrelevant? If they are irrelevant, how is that different from them being in error? --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 05:40, 1 February 2016 (UTC)
:::::::::::::::: I did not say any statement in the Wikipedia page is not relevant. The statement showing the "sequential integers" of the symbols is relevant in demonstrating an ordering. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 05:59, 1 February 2016 (UTC)
::::::::::::::::: Ok, so how is the C = 2 statement that I quoted from the wikipedia relevant? --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 06:06, 1 February 2016 (UTC)
:::::::::::::::::: They are defining an ordering on all the symbols in the letter space, without regard to a particular message, instead of defining an ordering on only the symbols that happen to appear in this message. --[[User:Spoon!|Spoon!]] ([[User talk:Spoon!|talk]]) 06:31, 1 February 2016 (UTC)
::::::::::::::::::: Actually, no. Wait.
::::::::::::::::::: If things are as you say, a problem is that the wikipedia page uses a poorly chosen example. '''DABDDB''' has cumulative frequency values which match the letter values which are stated on the page.
::::::::::::::::::: Your argument, in essence, is that using those letter values is not useful, so we should ignore them and use cumulative frequency values instead.
::::::::::::::::::: But the algorithm itself is not very useful - it's more of an exercise.
::::::::::::::::::: Also, the wikipedia page specifies C=2 and you claim that's only for ordering, but it also specifies that "each symbol in the ordered set denotes a sequential integer", while the cumulative frequencies skip integers in the sequence for all examples used in this task.
::::::::::::::::::: So, as it currently stands, the wikipedia writeup is ambiguous, and uses a poorly chosen example which does not help resolve this ambiguity.
::::::::::::::::::: So... I'm going to have to study some other treatments of this algorithm (http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/ looks promising), and I imagine once I've resolved these issues to my satisfaction I'll have to update the task description. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 09:58, 1 February 2016 (UTC)
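The coincidence at the center of this thread can be checked directly. A minimal Python sketch (an illustration of the point under discussion, not code from any of the task's implementations): for '''DABDDB''', the cumulative frequencies, accumulated in alphabetic order, happen to equal the letter values A=0, B=1, C=2, D=3 assigned on the Wikipedia page.

```python
# Illustration only: compute cumulative frequencies for DABDDB and
# compare them with the "sequential integer" letter values
# (A=0, B=1, C=2, D=3) assigned on the Wikipedia page.
from collections import Counter

def cumulative_frequencies(message):
    """Cumulative frequency of each symbol, accumulated in sorted order."""
    freq = Counter(message)
    cum, running = {}, 0
    for sym in sorted(freq):   # only the ordering of the symbols matters
        cum[sym] = running
        running += freq[sym]
    return freq, cum

freq, cum = cumulative_frequencies("DABDDB")
print(cum)                                   # {'A': 0, 'B': 1, 'D': 3}
letter_values = {s: ord(s) - ord('A') for s in cum}
print(cum == letter_values)                  # True: the coincidence
```

For this particular message the two dictionaries are identical, which is exactly why the example does not disambiguate the two readings debated above.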
 
== Goof? (part 2) ==
 
I have expanded the J entry to illustrate some of my concerns. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 00:39, 31 January 2016 (UTC)
 
== Usefulness of approach? ==
 
So, ok... at least a part of the "goof" discussed in the above section might have been my own.
 
Still, after having resolved that, I am dissatisfied with this algorithm.
 
First, note that the results generated in this task are not sufficient to recover the original character sequences. To recover the original character sequence we probably need to know which characters were used and the number of times each character appeared.
 
That's something of a problem because this algorithm, and the frequencies used, are both explicitly tied to the length of the character sequence being compressed. It doesn't work with pre-shared parameters unless you limit yourself to a fixed collection of characters. If you are working with D D E G G H O O O T as your pre-shared dictionary (or, [[Run-length_encoding|more compactly]]: 2DE2GH3OT), you cannot represent THEGOODCAT --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 07:59, 1 February 2016 (UTC)
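To make the side-information point concrete, here is a hedged Python sketch of the change-of-radix encoding and decoding as I read the Wikipedia description (function and variable names are illustrative, not from any task implementation). Note that decode() needs the frequency table as a second argument: the encoded number alone is not sufficient.

```python
# Sketch of the Wikipedia "change of radix" scheme (one reading of it;
# names are my own). The decoder requires the frequency table, which
# also fixes the message length -- the point made above.
from collections import Counter

def encode(message):
    n = len(message)                  # radix = message length
    freq = Counter(message)
    cum, running = {}, 0
    for s in sorted(freq):            # accumulate in sorted symbol order
        cum[s] = running
        running += freq[s]
    low, width = 0, 1
    for i, s in enumerate(message):
        low += width * cum[s] * n ** (n - 1 - i)
        width *= freq[s]
    # any integer in [low, low + width) identifies the message
    return low, width, freq

def decode(value, freq):
    n = sum(freq.values())            # message length from the table
    order = sorted(freq)
    cum, running = {}, 0
    for s in order:
        cum[s] = running
        running += freq[s]
    out = []
    for i in range(n):
        power = n ** (n - 1 - i)
        d = value // power            # leading "digit" in radix n
        sym = max(s for s in order if cum[s] <= d)
        out.append(sym)
        value = (value - cum[sym] * power) // freq[sym]
    return "".join(out)

low, width, freq = encode("DABDDB")
print(low, width)                     # 25002 108
print(decode(low, freq))              # DABDDB
```

With the Wikipedia example this reproduces the interval [25002, 25110), and any value in it (including the padded 25100) decodes back to DABDDB -- but only given the frequency table.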
 
== Next steps? ==
 
After reading http://marknelson.us/2014/10/19/data-compression-with-arithmetic-coding/ I think the important thing is that there are many different arithmetic coding algorithms.
So, right now, from my point of view, the implementations in this task have deviated in a minor way from my reading of the wikipedia entry. And, this deviation is motivated by "usefulness", but does not actually go far enough to be useful.
So probably what should happen is the task description should go into more detail about those ambiguities. Or, if we want a useful approach, we should also revise the task (to separate model generation from encoding/decoding using that model - frequencies should probably also be normalized to sum to one, if we want a generically useful model).
But it would also be possible (less immediate effort, though probably less useful, so perhaps more effort and pain in the long term) to just spell out in the task description that each symbol represents (or is represented by) its cumulative frequency (accumulated in alphabetic order).
Still, that's a choice to be made, and obviously other people have different perspectives from mine. So I guess I'd like some feedback on what other people think we should do about this draft task. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 23:28, 1 February 2016 (UTC)
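The separation suggested above could be sketched like this (a hypothetical interface, assuming Python and exact rationals; none of this comes from the task's current implementations): build the model once, normalized to sum to one, independently of any particular message to be coded.

```python
# Hypothetical sketch of separating model generation from coding, with
# frequencies normalized to sum to one (exact rationals avoid rounding).
from collections import Counter
from fractions import Fraction

def build_model(sample_text):
    """A reusable probability model, independent of any one message."""
    freq = Counter(sample_text)
    total = sum(freq.values())
    return {s: Fraction(f, total) for s, f in sorted(freq.items())}

model = build_model("THEGOODCAT")
print(sum(model.values()) == 1)       # True: a proper distribution
```

An encoder/decoder pair would then take such a model as a parameter instead of deriving frequencies from the message itself, which is what would make pre-shared parameters workable.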
 
: If it's of any help, I can add a link in the task description to [http://trizenx.blogspot.ro/2015/05/arithmetic-coding.html a little bit more detailed tutorial] which explains the encoding and the decoding process, using an unambiguous example. [[User:Trizen|Trizen]] ([[User talk:Trizen|talk]]) 23:47, 1 February 2016 (UTC)
 
:'''DEAD LINKS UPDATE''': The links to the articles (1991 and 2014) by Mark Nelson are no longer valid. Here are the updated links:
:* https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html
:* https://marknelson.us/posts/1991/02/01/arithmetic-coding-statistical-modeling-data-compression.html