Word frequency

 
=={{header|Frink}}==
This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>wordList[''str'']</CODE>, which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.). It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out; these are likely punctuation and spaces.
* The file fetched from Project Gutenberg is supposed to be encoded in the UTF-8 character encoding, but their servers incorrectly report it as Windows-1252, so this program fixes that.
* Frink has a Unicode-aware lowercase function, <CODE>lc[''str'']</CODE>, which correctly handles accented characters and may even make a string longer.
* Frink can normalize Unicode characters with its <CODE>normalizeUnicode</CODE> function, so the same word encoded two different ways in Unicode is treated consistently. For example, the character <CODE>ô</CODE> can be represented as either <CODE>"\u00F4"</CODE> or <CODE>"\u006F\u0302"</CODE>. The former is a "precomposed" character, <CODE>LATIN SMALL LETTER O WITH CIRCUMFLEX</CODE>; the latter is two codepoints, an <CODE>o</CODE> (<CODE>LATIN SMALL LETTER O</CODE>) followed by <CODE>COMBINING CIRCUMFLEX ACCENT</CODE>, usually called a "decomposed" representation. Unicode normalization converts these equivalent encodings into a canonical representation, so two strings that look identical to a human but differ in their codepoints are treated as the same by the computer, and these programs count them the same. Even if the Project Gutenberg document mixes precomposed and decomposed representations of the same word, this program will count them together! (A short sketch of both behaviors follows this list.)
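To make the last two points concrete, here is a minimal sketch. The exact behaviors shown are assumptions based on the descriptions above: that <CODE>normalizeUnicode</CODE> produces a composed canonical form by default, and that <CODE>lc</CODE> applies full Unicode case mappings:

<lang frink>a = "\u00F4"         // precomposed LATIN SMALL LETTER O WITH CIRCUMFLEX
b = "\u006F\u0302"   // decomposed: o followed by COMBINING CIRCUMFLEX ACCENT
println[a == b]                                       // false: different codepoints
println[normalizeUnicode[a] == normalizeUnicode[b]]   // true: same canonical form

// lc may lengthen a string: U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
// lowercases to two codepoints, i followed by COMBINING DOT ABOVE.
println[length[lc["\u0130"]]]                         // 2</lang>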
 
 
How many other languages on this page do all this correctly?
 
First, a simple but powerful method that works in old versions of Frink:
<lang frink>d = new dict
for w = select[wordList[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
   d.increment[lc[w], 1]
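
// A sketch of one way to print the ten most common [word, count] pairs;
// it assumes array[d] converts the dictionary into [key, value] pairs,
// which are sorted ascending by count and then reversed.
println[join["\n", first[reverse[sort[array[d], {|a, b| a@1 <=> b@1}]], 10]]]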
</lang>

{{out}}
<pre>
...
[that, 7901]
[it, 6641]
</pre>
 
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and performs full Unicode-aware normalization, capitalization, and word-breaking:
 
<lang frink>formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</lang>
 
{{out}}
<pre>
the 36629
of 19602
and 14063
a 13447
to 13345
in 10259
was 8541
that 7303
he 6812
had 6133
</pre>
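
As a toy illustration of <CODE>countToArray</CODE> and <CODE>formatTable</CODE> on a small array (a sketch; it assumes <CODE>countToArray</CODE> returns <CODE>[item, count]</CODE> pairs ordered by descending count, as described above):

<lang frink>words = ["to", "be", "or", "not", "to", "be"]
// countToArray[words] should yield something like
// [["to", 2], ["be", 2], ["or", 1], ["not", 1]],
// and formatTable lines the pairs up in right-aligned columns.
formatTable[countToArray[words], "right"]</lang>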
 