for the “yearly clean up,” Linux might be the perfect platform for you. Linux has evolved into one of the most reliable computer
ecosystems on the planet. Combine that reliability with zero cost of entry and you have the perfect solution for a desktop platform.
</pre>
 
=={{header|Frink}}==
This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>graphemeList[''str'']</CODE>, which intelligently enumerates through what a human would consider to be a single visible character, including "characters" composed of multiple Unicode codepoints.
* The file fetched from Project Gutenberg is supposed to be UTF-8 encoded, but the servers either incorrectly report it as Windows-1252 or send no character encoding at all, so this program specifies the encoding explicitly when reading.
* Frink has a Unicode-aware lowercase function, <CODE>lc[''str'']</CODE>, which correctly handles accented characters and may even make a string longer.
* The program uses full Unicode tables to determine what counts as a "letter."
* It works with high Unicode characters, that is, codepoints above \uFFFF.
* Frink can normalize Unicode characters with its <CODE>normalizeUnicode</CODE> function, so the same grapheme encoded two different ways in Unicode can be treated consistently. A Unicode string can use different methods to encode what is essentially the same character/glyph: for example, the character <CODE>ô</CODE> can be represented as either <CODE>"\u00F4"</CODE> or <CODE>"\u006F\u0302"</CODE>. The former is a "precomposed" character, <CODE>"LATIN SMALL LETTER O WITH CIRCUMFLEX"</CODE>, and the latter is two Unicode codepoints, an <CODE>o</CODE> (<CODE>LATIN SMALL LETTER O</CODE>) followed by <CODE>"COMBINING CIRCUMFLEX ACCENT"</CODE>. (This is usually referred to as a "decomposed" representation.) Unicode normalization rules can convert these "equivalent" encodings into a canonical representation, so two strings that look identical to a human but differ in their codepoints are treated as the same by the computer, and this program counts them the same. Even if the Project Gutenberg document mixed precomposed and decomposed representations of the same characters, this program would still count them together (see the short demonstration after this list). See the [http://unicode.org/reports/tr15/ Unicode Normal Forms] specification for more about these normalization rules. Frink implements all of them (NFC, NFD, NFKC, NFKD); NFC is the default in <CODE>normalizeUnicode[''str'', ''encoding=NFC'']</CODE>. They're interesting!
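Below is a minimal standalone sketch, not part of the task solution, showing how these behaviors can be observed; the literal strings are purely illustrative and the expected results are noted in comments:
<lang frink>// Illustrative sketch only: demonstrates the normalization, grapheme
// and lowercasing behaviors described above.
precomposed = "\u00F4"          // "ô" as LATIN SMALL LETTER O WITH CIRCUMFLEX
decomposed  = "\u006F\u0302"    // "o" followed by COMBINING CIRCUMFLEX ACCENT

println[precomposed == decomposed]          // false: the codepoints differ
println[normalizeUnicode[precomposed] == normalizeUnicode[decomposed]]   // true: same NFC form

// graphemeList treats the two-codepoint sequence as one visible character.
count = 0
for g = graphemeList[decomposed]
   count = count + 1
println[count]                              // 1

// lc is Unicode-aware, so accented uppercase letters lowercase correctly.
println[lc["É"]]                            // é</lang>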
 
How many other languages on this page do all or any of this correctly?
<lang frink>print[formatTable[countToArray[select[graphemeList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"],"NFC"]]], %r/[[:alpha:]]/ ]] , "right"]]</lang>
{{out}}
<pre>
e 330603
t 235571
a 207101
o 184385
h 176823
i 175320
n 169922
s 162043
r 148632
d 108724
l 99567
u 68295
c 67332
m 62212
w 56507
f 56187
g 48543
p 43366
y 39183
b 37461
v 26258
k 14427
j 5838
x 4026
q 2533
z 1905
é 1473
è 299
æ 116
ê 74
à 64
â 56
ç 50
ü 39
î 39
œ 38
ô 34
ù 18
ï 18
û 9
ë 5
ñ 2
</pre>
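For readability, the one-liner above can also be written in an equivalent step-by-step form. This is an illustrative sketch only; the intermediate variable names are invented here and are not part of the original solution:
<lang frink>// Step-by-step equivalent of the one-liner above (illustrative only).
text    = read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]  // force UTF-8 decoding
text    = lc[normalizeUnicode[text, "NFC"]]              // canonicalize, then Unicode-aware lowercase
letters = select[graphemeList[text], %r/[[:alpha:]]/]    // keep graphemes that are letters in any script
counts  = countToArray[letters]                          // tally each distinct grapheme
print[formatTable[counts, "right"]]                      // right-aligned frequency table</lang>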
 