Word frequency: Difference between revisions

This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>wordList[''str'']</CODE>, which intelligently enumerates the words in a string, correctly handling compound words, hyphenated words, accented characters, and so on. It returns words, spaces, and punctuation marks as separate items. For the purposes of this program, "words" that contain no alphanumeric characters (as defined by the Unicode standard) are filtered out, since these are likely punctuation and spaces. There is also a two-argument form, <CODE>wordList[''str'', ''lang'']</CODE>, which takes a language code (''e.g.'' <CODE>"fr"</CODE> for French) so that word-breaking follows the rules of that language.
* The file fetched from Project Gutenberg is supposed to be encoded in UTF-8, but their servers incorrectly report it as Windows-1252 or send no character encoding at all, so this program works around that.
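
For readers unfamiliar with Frink, the two points above can be roughly illustrated in Python. This is only a sketch under stated assumptions, not the Frink program and not its algorithm: the helper names <CODE>word_list</CODE>, <CODE>has_alnum</CODE>, and <CODE>word_frequencies</CODE> are hypothetical, and the regular expression is a much simpler stand-in that does not perform the language-aware word-breaking Frink does (for instance, it splits hyphenated words and contractions). It also assumes that the encoding problem is handled by decoding the raw bytes explicitly as UTF-8 instead of trusting the charset the server declares.

```python
import re
from collections import Counter

# Simplified stand-in for a word enumerator (an assumption, not Frink's rules):
# returns runs of word characters, runs of whitespace, and runs of anything
# else (punctuation) as separate tokens.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]+")


def word_list(text):
    """Split text into word, whitespace, and punctuation tokens."""
    return TOKEN.findall(text)


def has_alnum(token):
    """True if the token contains at least one Unicode alphanumeric character."""
    return any(ch.isalnum() for ch in token)


def word_frequencies(raw_bytes):
    """Count words in raw bytes that are known to be UTF-8."""
    # Decode explicitly as UTF-8 rather than relying on a declared charset.
    text = raw_bytes.decode("utf-8")
    words = [t for t in word_list(text.lower()) if has_alnum(t)]
    return Counter(words)


# Small in-memory sample instead of a network fetch.
sample = "Les misérables, les hommes; l'été.".encode("utf-8")
print(word_frequencies(sample).most_common(3))
```

Filtering on <CODE>str.isalnum</CODE> mirrors the idea of discarding tokens with no alphanumeric characters, so pure punctuation and whitespace never reach the counter, while accented words such as "misérables" are kept whole.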