Word frequency

 
=={{header|Frink}}==
This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>wordList[''str'']</CODE>, which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.). It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out; these are likely punctuation and spaces.
* The file fetched from Project Gutenberg is supposed to be encoded in the UTF-8 character encoding, but their servers incorrectly report it as Windows-1252, so this program fixes that.
* Frink has a Unicode-aware lowercase function, <CODE>lc[''str'']</CODE>, which correctly handles accented characters and may even make a string longer.
* Frink can normalize Unicode characters with its <CODE>normalizeUnicode</CODE> function, so the same word encoded two different ways in Unicode is treated consistently. For example, the character <CODE>ô</CODE> can be represented as either <CODE>"\u00F4"</CODE> or <CODE>"\u006F\u0302"</CODE>. The former is a "precomposed" character, <CODE>LATIN SMALL LETTER O WITH CIRCUMFLEX</CODE>; the latter is two codepoints, an <CODE>o</CODE> (<CODE>LATIN SMALL LETTER O</CODE>) followed by <CODE>COMBINING CIRCUMFLEX ACCENT</CODE>, usually called a "decomposed" representation. Unicode normalization converts these equivalent encodings into a canonical representation, so two strings that look identical to a human but differ in their codepoints are treated as the same by the computer, and these programs count them the same. Even if the Project Gutenberg document mixes precomposed and decomposed representations of the same word, this program will count them together! (A short sketch of both behaviors follows this list.)
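To make the last two points concrete, here is a minimal sketch. The exact behaviors shown are assumptions based on the descriptions above: that <CODE>normalizeUnicode</CODE> produces a composed canonical form by default, and that <CODE>lc</CODE> applies full Unicode case mappings:

<lang frink>a = "\u00F4"         // precomposed LATIN SMALL LETTER O WITH CIRCUMFLEX
b = "\u006F\u0302"   // decomposed: o followed by COMBINING CIRCUMFLEX ACCENT
println[a == b]                                       // false: different codepoints
println[normalizeUnicode[a] == normalizeUnicode[b]]   // true: same canonical form

// lc may lengthen a string: U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
// lowercases to two codepoints, i followed by COMBINING DOT ABOVE.
println[length[lc["\u0130"]]]                         // 2</lang>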
 
 
How many other languages on this page do all this correctly?
 
First, a simple but powerful method that works in old versions of Frink:
<lang frink>d = new dict
for w = select[wordList[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
   d.increment[lc[w], 1]
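
// A sketch of one way to print the ten most common [word, count] pairs;
// it assumes array[d] converts the dictionary into [key, value] pairs,
// which are sorted ascending by count and then reversed.
println[join["\n", first[reverse[sort[array[d], {|a, b| a@1 <=> b@1}]], 10]]]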
</lang>

{{out}}
<pre>
...
[that, 7901]
[it, 6641]
</pre>
 
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and performs full Unicode-aware normalization, capitalization, and word-breaking:
 
<lang frink>formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</lang>
 
{{out}}
<pre>
the 36629
of 19602
and 14063
a 13447
to 13345
in 10259
was 8541
that 7303
he 6812
had 6133
</pre>
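
As a toy illustration of <CODE>countToArray</CODE> and <CODE>formatTable</CODE> on a small array (a sketch; it assumes <CODE>countToArray</CODE> returns <CODE>[item, count]</CODE> pairs ordered by descending count, as described above):

<lang frink>words = ["to", "be", "or", "not", "to", "be"]
// countToArray[words] should yield something like
// [["to", 2], ["be", 2], ["or", 1], ["not", 1]],
// and formatTable lines the pairs up in right-aligned columns.
formatTable[countToArray[words], "right"]</lang>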
 