Talk:Word frequency

why entered as a task instead of draft task?

Why was this entry entered as a task instead of a draft task? -- Gerard Schildberger (talk) 03:08, 16 August 2017 (UTC)

... ahhh ... I see that this task was demoted to a draft task by Paddy3118. -- Gerard Schildberger (talk) 08:34, 16 August 2017 (UTC)

task clarification

I assume we are to code programs to handle the general case, not just the file specified/mandated to be used as a test case.

What is a "word"?

Is 1997 a word? How about 20? How about twenty?

What letters can be included in a word?
There are a lot of French accented letters in the prescribed text, but are we to be limited to just the French accented letters?
German? Czech? Which dialects of Greek? Logographic kanji? Kana?

What other characters can be included in a word?

Are words that are hyphenated one word or two?

What about words like: jack-o'-lantern

What about words split across lines (if any are possi-
bly present)?
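To make the split-word question concrete, here is a minimal Python sketch of one possible treatment. It assumes that a hyphen immediately followed by a line break always marks a word split across lines; the function name join_split_words is made up for illustration and is not part of the task.

```python
import re

def join_split_words(text: str) -> str:
    """Rejoin words that were hyphenated across a line break.

    Assumption (not from the task description): a hyphen directly
    followed by a newline means the word continues on the next line.
    """
    # Drop the hyphen, the line break, and any leading indentation
    # on the continuation line, keeping the letters on both sides.
    return re.sub(r"(\w)-\r?\n[ \t]*(\w)", r"\1\2", text)

if __name__ == "__main__":
    print(join_split_words("possi-\nbly present"))
```

The drawback is that a genuinely hyphenated word that happens to break at its hyphen would lose the hyphen too, so this only shifts the ambiguity rather than resolving it.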

Are words that contain an apostrophe to be included (such as let's)?

What about words that contain non-Latin (Roman) letters?
As it happens, those non-Latin letters don't show up in the top ten.
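The questions above can be made concrete with a short Python sketch that counts the same sentence under three candidate definitions of a "word". All three patterns and the count_words helper are hypothetical choices for illustration; the task description does not mandate any of them.

```python
import re
from collections import Counter

# Three hypothetical definitions of a "word" (none comes from the task):
# 1. runs of unaccented Latin letters only
ASCII_LETTERS = re.compile(r"[a-z]+")
# 2. runs of characters that Python's Unicode-aware \w accepts,
#    minus decimal digits and the underscore (roughly "any letter")
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")
# 3. as (2), but letter runs may be joined by apostrophes and hyphens
WITH_JOINERS = re.compile(r"[^\W\d_]+(?:['-]+[^\W\d_]+)*")

def count_words(text: str, pattern: re.Pattern) -> Counter:
    """Lower-case the text and count every match of the pattern."""
    return Counter(pattern.findall(text.lower()))

if __name__ == "__main__":
    sample = "Let's carve a jack-o'-lantern in 1997, said the abbé."
    for name, pattern in [("ascii", ASCII_LETTERS),
                          ("unicode", UNICODE_LETTERS),
                          ("joiners", WITH_JOINERS)]:
        print(name, sorted(count_words(sample, pattern)))
```

On this sample the three definitions disagree about let's, jack-o'-lantern and abbé, and none of them counts 1997, which is why the task needs to say which definition is intended.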

Exactly which part of the text on the web page (where does it start and stop) is to be used?

Should we also include Project Gutenberg's prologue and epilogue along with the book's text?
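If the prologue and epilogue are to be excluded, one possible approach is sketched below in Python. It assumes the file marks the book's text with lines beginning "*** START OF" and "*** END OF"; those marker strings and the function name strip_gutenberg_boilerplate are assumptions and should be checked against the actual file.

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the lines between the assumed start and end markers.

    Assumption: the book's text sits between a line starting with
    "*** START OF" and a line starting with "*** END OF". If a marker
    is missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # body ends before the end marker
            break
    return "\n".join(lines[start:end])
```

Whether the counts change depends on how much license text surrounds the book, which is another reason to state the intended start and stop points explicitly.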

Wouldn't it be a lot simpler to have a simple (and complete) text file to download [with no (de-)assembly, editing, or text massaging required]?

-- Gerard Schildberger (talk) 03:08, 16 August 2017 (UTC)