Talk:Vigenère cipher/Cryptanalysis

From Rosetta Code
Revision as of 18:22, 21 June 2011 by MikeMol (talk | contribs) (→‎Faster?: Should have read it more thoroughly.)

I'm not completely sure how algorithmically feasible this task is, but it at least sounds interesting. (Well, I'm not sure how feasible it is to do this completely automatically.) I'll try to add an example soon. MagiMaster 05:37, 31 May 2011 (UTC)

C++ code added, but I'm not sure about how robust it is. There seems like a decent chance that the first part will pass a too-short key, and the second part doesn't have any checks on the final output. Another chi-squared check doesn't work, which also makes me suspect that the chi-squared check in the first part isn't the best solution either, even if it does ok. MagiMaster 11:28, 31 May 2011 (UTC)
There. I changed the chi-squared test to something that maximizes the correlation instead. Smaller pieces increase the correlation artificially, so I added a weight against them. It seems to work on nearly all the ciphertext I've tried. Long keys are more likely to cause errors, but it still gets the length and most of the characters. A second pass to try and correct any errors by looking for words or common trigrams might not be a bad idea, but I don't think I'll do that myself. MagiMaster 20:26, 31 May 2011 (UTC)

Task specification

I think that there should be a requirement to provide some output. Specifically each solution should output the key length determined, the discovered key, and some of the decrypted text (say a line). --Dgamey 14:14, 3 June 2011 (UTC)

Well, outputting the key gives the length of the key too, but yeah I could add some details to the output specification. MagiMaster 04:48, 4 June 2011 (UTC)

Difficulty and suggested improvements

This is a huge amount of text which makes the task much easier. As a hobby I used to solve these by hand and we would typically get cipher texts of 15-20 5 letter groups. This cipher text has 165 groups!

I suggest to make it more interesting, that we try our algorithms on successively shorter texts (halving the text each time) and see how low we can drive it.--Dgamey 12:28, 3 June 2011 (UTC)

Well, I put more text since this is supposed to be fully autonomous. I think if you can look at the output, look for word pieces, and use that to improve your guess, you wouldn't need as much text. That said, I did test the algorithm I posted on shorter inputs and it did fine. It was only when the key was fairly long compared to the text that it started to make mistakes. Also, the text I gave is a specific chunk of a certain work. I didn't put the output of my program on the page in case anyone wanted to view this as a puzzle. (Adding a second, shorter piece of ciphertext would be fine though.) MagiMaster 04:49, 4 June 2011 (UTC)

Fully autonomous

When trying to generate a fully autonomous version, my program generated a wrong (but "almost right") key, E.g., if ABCDE is the real key, the ciphertext decryption under ABxDEABCDEABCDEyBCDz may have a "better" distribution of letters then decryption under the correct key ABCDE. Perhaps, the length of the key should be provided as an additional input.

No, the key length is one of the critical factors in deciphering a message. If ABCDE is a correct key, then ABCDEABCDE is of course also a correct key; but statictical fluctuaion may make a slight variation of the longer key appear "more correct" because it has a chance of giving higher (but incidental) correlations. This is an inevitable artifact, and it is normal. --Ledrug 17:33, 21 June 2011 (UTC)

Faster?

I'm just curious how the C++ changes that were made make it faster? MagiMaster 02:22, 21 June 2011 (UTC)

*shrug* I can't tell. C++ people are optimization freaks. But on a side note, g++ gives quite a few warnings about ">>" at the end of nested templates (it's ok here, but can be confused with ">>" operator at times), maybe someone should take a look at it. --Ledrug 02:30, 21 June 2011 (UTC)
The >> for templates is a C++0x fix. Before that, it was a syntax error, but I guess g++ does it half way even without the -std flag. MagiMaster 09:55, 21 June 2011 (UTC)
In this program frequency() is a critical function. The length of the vector of pairs 'result' is known at compile time. Avoiding this push_back speeds up this program more than 20%. This optimization doesn't make this program longer or more complex.
It's a good optimization, but a better one might be to avoid allocation of the vector on every function call. Making 'results' function static, passed in by reference, or even file static would all be reasonable options. There's no threading in this program, so thread safety isn't a concern. 'results' is completely recalculated on each call, so there's no need to clear out its contents between calls, either. Finally, it's probably also perfectly appropriate for it to be a pair<char,double> array, rather than a vector, though that's less significant as an optimization as long as per-call reallocation is avoided. --Michael Mol 18:15, 21 June 2011 (UTC)
I should have read the code more thoroughly. I didn't notice that the entire vector was being returned out. Pass-by-reference would be useful, there. The data is used by correlation() with the sort() algorithm, so a simple array is inappropriate. std::vector is the better option. However, the data is only used in frequency() and correlation(), so a class-instance std::vector as a buffer still strikes me as a good optimization. --Michael Mol 18:22, 21 June 2011 (UTC)