Talk:Word frequency

From Rosetta Code

why entered as a task instead of draft task?

Why was this entry entered as a   task   instead of a   draft task?   -- Gerard Schildberger (talk) 03:08, 16 August 2017 (UTC)

... ahhh ...   I see that this task was demoted to a draft task by   Paddy3118.   -- Gerard Schildberger (talk) 08:34, 16 August 2017 (UTC)

task clarification

I assume we are to code programs to handle the general case, not just the file specified/mandated to be used as a test case.

What is a "word"?

A single distinct meaningful element of speech. I speak words. How speech is written is very much language, time and individual dependent. Linguistically Speech, Speach, or even Speych have been used for the same word. Don't mention Donaudampfschifffahrtselektrizitätenhauptbetriebs. For the purpose of this task I would suggest using the concept of 'orthographic word', which works well for English but not so well for Ancient Greek and Egyptian.--Nigel Galloway (talk) 12:46, 17 August 2017 (UTC)

Is 1997 a word?   How about 20?   How about twenty?

What letters can be included in a word?
There are a lot of French accented letters in the prescribed text, but are we to be limited to   just   the French accented letters?
German?     Czech?     Which dialects of Greek?     Logographic kanji?     Kana?

What other characters can be included in a word?

Characters not included in the alphabet are called logograms. They include numbers, foreign letters and emojis. I💖NY is 1 orthographic word. The sentence '1 orthographic word' contains 3 orthographic words.--Nigel Galloway (talk) 12:58, 17 August 2017 (UTC)

Are words that are hyphenated one word or two?

1 orthographic word--Nigel Galloway (talk) 12:52, 17 August 2017 (UTC)

What about words like:     jack-o'-lantern

1 orthographic word--Nigel Galloway (talk) 12:51, 17 August 2017 (UTC)
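
A minimal Python sketch of one way the 'orthographic word' answers above could be turned into a tokenizer. The pattern and the function name orthographic_words are my own assumptions, not part of the task description: a word is taken to be a run of letters or digits, optionally joined by a single hyphen, a single ASCII apostrophe, or an apostrophe followed by a hyphen. It does not cover logograms such as the heart in I💖NY, typographic apostrophes, or words split across lines.

```python
import re

# Hypothetical pattern (an assumption, not taken from the task description):
# runs of letters/digits, optionally joined by "-", "'" or "'-".
# A double hyphen ("century--the") is treated as a separator, not a joiner.
WORD = re.compile(r"[^\W_]+(?:(?:'-|['-])[^\W_]+)*")

def orthographic_words(text):
    """Return lowercased tokens, keeping internal hyphens and apostrophes."""
    return [m.group(0).lower() for m in WORD.finditer(text)]

sample = "Let's carve a jack-o'-lantern; the century--the degradation."
print(orthographic_words(sample))
# ["let's", 'carve', 'a', "jack-o'-lantern", 'the', 'century', 'the', 'degradation']
```

With this definition let's and jack-o'-lantern each count as one word, while the two words around the double hyphen in "century--the" stay separate.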

What about split words across lines   (if there are possi-
bly present)?

Are words that contain an apostrophe to be included   (such as let's)?

What about words that contain non-Latin (Roman) letters?
As it happens, those non-Latin letters don't show up in the   top ten.

What exactly is the text   (start and stop)   that is contained in the web-page to be used?

Should we also use the prologue and epilogue of the   Project Gutenberg   along with the book's text?

Wouldn't it be a lot simpler to have a simple (and complete) text file to download   [with no (de-)assembly, editing, or text massaging required]?

-- Gerard Schildberger (talk) 03:08, 16 August 2017 (UTC)


It seems the original task author used the regexp \w+ in the Clojure and first Python examples. Maybe he should expand on what \w+ matches and define it as the meaning of a word for the purposes of the task? --Paddy3118 (talk) 20:09, 17 August 2017 (UTC)
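
To make the question concrete, here is a small Python 3 sketch of what \w+ picks out. The sample string is made up for illustration and is not from the task's text. In Python 3, \w on a str matches Unicode letters, digits and the underscore; the re.ASCII flag restricts it to ASCII.

```python
import re

# Made-up sample string, for illustration only.
sample = "Les Misérables, 1997: let's visit jack-o'-lantern_street"

# Default (Unicode) matching: accented letters are word characters.
unicode_words = re.findall(r"\w+", sample)
print(unicode_words)
# ['Les', 'Misérables', '1997', 'let', 's', 'visit', 'jack', 'o', 'lantern_street']

# With re.ASCII, accented letters no longer count as word characters.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)
print(ascii_words)
# ['Les', 'Mis', 'rables', '1997', 'let', 's', 'visit', 'jack', 'o', 'lantern_street']
```

So under \w+ numbers count as words, apostrophes and hyphens split words, and the underscore joins them.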

In its ASCII form, \w means [A-Za-z0-9_] (the range A-z would also match punctuation such as [ and ^). This could be extended to include accented Latin characters: [A-Za-z0-9_À-ÿ]. But this would not change that the answers are wrong. There are 41082 occurrences of the word 'the', not 41036. The text contains for instance "BOOK SECOND--THE FALL". I suspect that the Python and Clojure solutions miss this.--Nigel Galloway (talk) 11:51, 18 August 2017 (UTC)
They are probably missing two of the three occurrences of 'the' in:
"The beds," pursued the director, "are very much crowded against each
other."

"That is what I observed."

"The halls are nothing but rooms, and it is with difficulty that the air
can be changed in them."

--Nigel Galloway (talk) 12:01, 18 August 2017 (UTC)

So long as there shall exist, by virtue of law and custom, decrees of
damnation pronounced by society, artificially creating hells amid the
civilization of earth, and adding the element of human fate to divine
destiny; so long as the three great problems of the century--the
degradation of man through pauperism, the corruption of woman through
hunger, the crippling of children through lack of light--are unsolved;
so long as social asphyxia is possible in any part of the world;--in
other words, and with a still wider significance, so long as ignorance
and poverty exist on earth, books of the nature of Les Misérables cannot
fail to be of use.
They also need to catch the 'the' that follows "century--" in the passage above.--Nigel Galloway (talk) 12:06, 18 August 2017 (UTC)
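
A quick way to check fragments like these is a small Python sketch along the following lines. The helper count_words and its fold_case parameter are hypothetical names of mine; whether the posted solutions fold case is not stated here. The passage is assembled from lines quoted above.

```python
import re
from collections import Counter

# Fragments copied from the passages quoted above.
passage = (
    '"The beds," pursued the director, "are very much crowded against each\n'
    'other."\n'
    'so long as the three great problems of the century--the\n'
    'degradation of man through pauperism'
)

def count_words(text, fold_case=True):
    """Count \\w+ tokens, optionally lowercasing the text first."""
    if fold_case:
        text = text.lower()
    return Counter(re.findall(r"\w+", text))

print(count_words(passage)["the"])                   # 5
print(count_words(passage, fold_case=False)["the"])  # 4
```

On these fragments \w+ does find the 'the' after the opening quote and after "century--"; the count only drops when case is not folded. This does not explain the gap between 41036 and 41082 on the full text, it only narrows down where to look.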

Using Microsoft Word 2010 to count words

I opened the text for this task in MS Word 2010 and asked it to count the occurrences of 'the' (all word forms, English). It reports 41082.--Nigel Galloway (talk) 13:17, 17 August 2017 (UTC)