Word frequency

{{draft task}}
[[Category:Text processing]]
 
;Task:
Given a text file and an integer   '''n''',   print/display the   '''n'''   most
common words in the file   (and the number of their occurrences)   in decreasing frequency.


For the purposes of this task:
*   A word is a sequence of one or more contiguous letters.
*   You are free to define what a   ''letter''   is.
*   Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
*   You may treat a compound word like   '''well-dressed'''   as either one word or two.
*   The word   '''it's'''   could also be one or two words as you see fit.
*   You may also choose not to support non US-ASCII characters.
*   Assume words will not span multiple lines.
*   Don't worry about normalization of word spelling differences.
*   Treat   '''color'''   and   '''colour'''   as two distinct words.
*   Uppercase letters are considered equivalent to their lowercase counterparts.
*   Words of equal frequency can be listed in any order.
*   Feel free to explicitly state the thoughts behind the program decisions.
 
 
Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top   '''10'''   most used words.
 
 
<br>
;History:
This task was originally taken from the ''Programming Pearls'' column in [https://doi.org/10.1145/5948.315654 Communications of the ACM, June 1986, Volume 29, Number 6],
where the problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy,
who demonstrates a solution in a six-line Unix shell script (provided as an example below).
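
McIlroy's pipeline is short enough to quote here; the sketch below is the commonly reproduced form of his script (<tt>$1</tt> is the number of words to display, and the text is read from standard input; older versions of <tt>tr</tt> required a literal quoted newline rather than <tt>\n</tt>):
<syntaxhighlight lang="sh">tr -cs A-Za-z '\n' |   # squeeze every run of non-letters into one newline
tr A-Z a-z |           # fold uppercase to lowercase
sort |                 # bring identical words together
uniq -c |              # collapse each group to "count word"
sort -rn |             # sort numerically, most frequent first
sed ${1}q              # quit after printing $1 lines
</syntaxhighlight>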
 
<br>
 
;References:
*[http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ McIlroy's program]
 
 
 
{{Template:Strings}}
<br><br>
 
=={{header|11l}}==
<syntaxhighlight lang="11l">DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])</syntaxhighlight>
 
{{out}}
<pre>
[(the, 41045), (of, 19953), (and, 14939), (a, 14527), (to, 13942), (in, 11210), (he, 9646), (was, 8620), (that, 7922), (it, 6659)]
</pre>
 
=={{header|Ada}}==
 
This version uses a character set to match valid characters in a token. Another version could use a pointer to a function returning a Boolean to match valid characters (allowing the use of functions such as Is_Alphanumeric), but AFAIK there is no "Find_Token" variant that accepts one; the sketch that follows shows how such a predicate could be used directly instead.
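
Purely as a sketch of that alternative (this is not part of the entry below, and <tt>Scan_Tokens</tt> is an illustrative name), tokens can be scanned by hand with a predicate such as <tt>Is_Alphanumeric</tt>:
<syntaxhighlight lang="ada">--  Sketch only: predicate-driven scanning instead of Find_Token with a set.
with Ada.Characters.Handling; use Ada.Characters.Handling;

procedure Scan_Tokens (Line : String) is
   First : Positive := Line'First;
begin
   while First <= Line'Last loop
      --  skip characters that cannot belong to a token
      while First <= Line'Last and then not Is_Alphanumeric (Line (First)) loop
         First := First + 1;
      end loop;
      exit when First > Line'Last;
      declare
         Last : Positive := First;
      begin
         --  extend the token while the predicate holds
         while Last < Line'Last and then Is_Alphanumeric (Line (Last + 1)) loop
            Last := Last + 1;
         end loop;
         --  Line (First .. Last) is one token; count it here
         First := Last + 1;
      end;
   end loop;
end Scan_Tokens;</syntaxhighlight>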
 
{{works with|Ada|Ada|2012}}
 
<syntaxhighlight lang="ada">with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
with Ada.Strings.Maps;
with Ada.Strings.Fixed;
with Ada.Characters.Handling;
with Ada.Containers.Indefinite_Ordered_Maps;
with Ada.Containers.Indefinite_Ordered_Sets;
with Ada.Containers.Ordered_Maps;
 
procedure Word_Frequency is
package TIO renames Ada.Text_IO;
 
package String_Counters is new Ada.Containers.Indefinite_Ordered_Maps(String, Natural);
package String_Sets is new Ada.Containers.Indefinite_Ordered_Sets(String);
package Sorted_Counters is new Ada.Containers.Ordered_Maps
(Natural,
String_Sets.Set,
"=" => String_Sets."=",
"<" => ">");
-- for sorting by decreasing number of occurrences and ascending lexical order
 
procedure Increment(Key : in String; Element : in out Natural) is
begin
Element := Element + 1;
end Increment;
 
path : constant String := Ada.Command_Line.Argument(1);
how_many : Natural := 10;
set : constant Ada.Strings.Maps.Character_Set := Ada.Strings.Maps.To_Set(ranges => (('a', 'z'), ('0', '9')));
F : TIO.File_Type;
first : Positive;
last : Natural;
from : Positive;
counter : String_Counters.Map;
sorted_counts : Sorted_Counters.Map;
C1 : String_Counters.Cursor;
C2 : Sorted_Counters.Cursor;
tmp_set : String_Sets.Set;
begin
-- read file and count words
TIO.Open(F, name => path, mode => TIO.In_File);
while not TIO.End_Of_File(F) loop
declare
line : constant String := Ada.Characters.Handling.To_Lower(TIO.Get_Line(F));
begin
from := line'First;
loop
Ada.Strings.Fixed.Find_Token(line(from .. line'Last), set, Ada.Strings.Inside, first, last);
exit when last < First;
C1 := counter.Find(line(first .. last));
if String_Counters.Has_Element(C1) then
counter.Update_Element(C1, Increment'Access);
else
counter.Insert(line(first .. last), 1);
end if;
from := last + 1;
end loop;
end;
end loop;
TIO.Close(F);
 
-- fill Natural -> StringSet Map
C1 := counter.First;
while String_Counters.Has_Element(C1) loop
if sorted_counts.Contains(String_Counters.Element(C1)) then
tmp_set := sorted_counts.Element(String_Counters.Element(C1));
tmp_set.Include(String_Counters.Key(C1));
else
sorted_counts.Include(String_Counters.Element(C1), String_Sets.To_Set(String_Counters.Key(C1)));
end if;
String_Counters.Next(C1);
end loop;
 
-- output
C2 := sorted_counts.First;
while Sorted_Counters.Has_Element(C2) loop
for Item of Sorted_Counters.Element(C2) loop
Ada.Integer_Text_IO.Put(TIO.Standard_Output, Sorted_Counters.Key(C2), width => 9);
TIO.Put(TIO.Standard_Output, " ");
TIO.Put_Line(Item);
end loop;
Sorted_Counters.Next(C2);
how_many := how_many - 1;
exit when how_many = 0;
end loop;
end Word_Frequency;
</syntaxhighlight>
{{out}}
<pre>
$ ./word_frequency 135-0.txt
41093 the
19954 of
14943 and
14558 a
13953 to
11219 in
9649 he
8622 was
7924 that
6661 it
</pre>
 
=={{header|ALGOL 68}}==
{{works with|ALGOL 68G|Any - tested with release 2.8.3.win32}}
Uses the associative array implementations in [[ALGOL_68/prelude]].
<syntaxhighlight lang="algol68"># find the n most common words in a file #
# use the associative array in the Associate array/iteration task #
# but with integer values #
PR read "aArrayBase.a68" PR
MODE AAKEY = STRING;
MODE AAVALUE = INT;
AAVALUE init element value = 0;
# returns text converted to upper case #
OP TOUPPER = ( STRING text )STRING:
BEGIN
STRING result := text;
FOR ch pos FROM LWB result TO UPB result DO
IF is lower( result[ ch pos ] ) THEN result[ ch pos ] := to upper( result[ ch pos ] ) FI
OD;
result
END # TOUPPER # ;
# returns text converted to an INT or -1 if text is not a number #
OP TOINT = ( STRING text )INT:
BEGIN
INT result := 0;
BOOL is numeric := TRUE;
FOR ch pos FROM UPB text BY -1 TO LWB text WHILE is numeric DO
CHAR c = text[ ch pos ];
is numeric := is numeric AND c >= "0" AND c <= "9";
IF is numeric THEN ( result *:= 10 ) +:= ABS c - ABS "0" FI
OD;
IF is numeric THEN result ELSE -1 FI
END # TOINT # ;
# returns TRUE if c is a letter, FALSE otherwise #
OP ISLETTER = ( CHAR c )BOOL:
IF ( c >= "a" AND c <= "z" )
OR ( c >= "A" AND c <= "Z" )
THEN TRUE
ELSE char in string( c, NIL, "ÇåçêëÆôöÿÖØáóÔ" )
FI # ISLETTER # ;
# get the file name and number of words from the command line #
STRING file name := "pg-les-misrables.txt";
INT number of words := 10;
FOR arg pos TO argc - 1 DO
STRING arg upper = TOUPPER argv( arg pos );
IF arg upper = "FILE" THEN
file name := argv( arg pos + 1 )
ELIF arg upper = "NUMBER" THEN
number of words := TOINT argv( arg pos + 1 )
FI
OD;
IF FILE input file;
open( input file, file name, stand in channel ) /= 0
THEN
# failed to open the file #
print( ( "Unable to open """ + file name + """", newline ) )
ELSE
# file opened OK #
print( ( "Processing: ", file name, newline ) );
BOOL at eof := FALSE;
BOOL at eol := FALSE;
# set the EOF handler for the file #
on logical file end( input file, ( REF FILE f )BOOL:
BEGIN
# note that we reached EOF on the #
# latest read #
at eof := TRUE;
# return TRUE so processing can continue #
TRUE
END
);
# set the end-of-line handler for the file so get word can see line boundaries #
on line end( input file
, ( REF FILE f )BOOL:
BEGIN
# note we reached end-of-line #
at eol := TRUE;
# return FALSE to use the default eol handling #
# i.e. just get the next character #
FALSE
END
);
# get the words from the file and store the counts in an associative array #
REF AARRAY words := INIT LOC AARRAY;
INT word count := 0;
CHAR c := " ";
WHILE get( input file, ( c ) );
NOT at eof
DO
WHILE NOT ISLETTER c AND NOT at eof DO get( input file, ( c ) ) OD;
STRING word := "";
at eol := FALSE;
WHILE ISLETTER c AND NOT at eol AND NOT at eof DO word +:= c; get( input file, ( c ) ) OD;
word count +:= 1;
words // TOUPPER word +:= 1
OD;
close( input file );
print( ( file name, " contains ", whole( word count, 0 ), " words", newline ) );
# find the most used words #
[ number of words ]STRING top words;
[ number of words ]INT top counts;
FOR i TO number of words DO top words[ i ] := ""; top counts[ i ] := 0 OD;
REF AAELEMENT w := FIRST words;
WHILE w ISNT nil element DO
INT count = value OF w;
STRING word = key OF w;
BOOL found := FALSE;
FOR i TO number of words WHILE NOT found DO
IF count > top counts[ i ] THEN
# found a word that is used more than a current #
# most used word #
found := TRUE;
# move the other words down one place #
FOR move pos FROM number of words BY - 1 TO i + 1 DO
top counts[ move pos ] := top counts[ move pos - 1 ];
top words [ move pos ] := top words [ move pos - 1 ]
OD;
# install the new word #
top counts[ i ] := count;
top words [ i ] := word
FI
OD;
w := NEXT words
OD;
print( ( whole( number of words, 0 ), " most used words:", newline ) );
print( ( " count word", newline ) );
FOR i TO number of words DO
print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
OD
FI</syntaxhighlight>
{{out}}
<pre>
Processing: pg-les-misrables.txt
pg-les-misrables.txt contains 578381 words
10 most used words:
count word
39333: THE
19154: OF
14628: AND
14229: A
13431: TO
11275: HE
10879: IN
8236: WAS
7527: THAT
6491: IT
</pre>
 
=={{header|APL}}==
{{works with|GNU APL}}
 
<syntaxhighlight lang="apl">
⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
⍝⍝ Project Gutenberg is in UTF-8.)
⍝⍝
⍝⍝ Use Unix 'iconv' if required
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← lowerAndStrip s;stripped;mixedCase
⍝⍝ Convert text to lowercase, punctuation and newlines to spaces
stripped ← ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
r ← stripped[mixedCase ⍳ s]
 
⍝⍝ Return the _n_ most frequent words and a count of their occurrences
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← n wordCount fname ;D;wl;sidx;swv;pv;wc;uw;sortOrder
D ← lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text with newlines
wl ← (~ D ∊ ' ') ⊂ D
sidx ← ⍒wl
swv ← wl[sidx]
pv ← +\ 1,~2 ≡/ swv
wc ← ∊ ⍴¨ pv ⊂ pv
uw ← 1 ⊃¨ pv ⊂ swv
sortOrder ← ⍒wc
r ← n↑[2] uw[sortOrder],[0.5]wc[sortOrder]
 
5 wordCount '135-0.txt'
the of and a to
41042 19952 14938 14526 13942
</syntaxhighlight>
 
=={{header|AppleScript}}==
 
<syntaxhighlight lang="applescript">(*
For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
The set text is too messy to warrant faffing around with anything more sophisticated.
The first letter in each word is upper-cased and the rest lower-cased for case equivalence and presentation.
Where more than n words qualify for the top n or fewer places, all are included in the result.
*)
 
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions
 
on wordFrequency(filePath, n)
set |⌘| to current application
-- Get the text and "capitalize" it (lower-case except for the first letters in words).
set theText to |⌘|'s class "NSString"'s stringWithContentsOfFile:(filePath) usedEncoding:(missing value) |error|:(missing value)
set theText to theText's capitalizedStringWithLocale:(|⌘|'s class "NSLocale"'s currentLocale()) -- Yosemite compatible.
-- Split it at the non-word characters.
set nonWordCharacters to |⌘|'s class "NSCharacterSet"'s alphanumericCharacterSet()'s invertedSet()
set theWords to theText's componentsSeparatedByCharactersInSet:(nonWordCharacters)
-- Use a counted set to count the individual words' occurrences.
set countedSet to |⌘|'s class "NSCountedSet"'s alloc()'s initWithArray:(theWords)
-- Build a list of word/frequency records, excluding any empty strings left over from the splitting above.
set mutableSet to |⌘|'s class "NSMutableSet"'s setWithSet:(countedSet)
tell mutableSet to removeObject:("")
script o
property discreteWords : mutableSet's allObjects() as list
property wordsAndFrequencies : {}
end script
set discreteWordCount to (count o's discreteWords)
repeat with i from 1 to discreteWordCount
set thisWord to item i of o's discreteWords
set end of o's wordsAndFrequencies to {thisWord:thisWord, frequency:(countedSet's countForObject:(thisWord)) as integer}
end repeat
-- Convert to NSMutableArray, reverse-sort the result on the frequencies, and convert back to list.
set wordsAndFrequencies to |⌘|'s class "NSMutableArray"'s arrayWithArray:(o's wordsAndFrequencies)
set descendingByFrequency to |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("frequency") ascending:(false)
tell wordsAndFrequencies to sortUsingDescriptors:({descendingByFrequency})
set o's wordsAndFrequencies to wordsAndFrequencies as list
if (discreteWordCount > n) then
-- If there are more than n records, check for any immediately following the nth which may have the same frequency as it.
set nthHighestFrequency to frequency of item n of o's wordsAndFrequencies
set qualifierCount to n
repeat with i from (n + 1) to discreteWordCount
if (frequency of item i of o's wordsAndFrequencies = nthHighestFrequency) then
set qualifierCount to i
else
exit repeat
end if
end repeat
else
-- Otherwise reduce n to the actual number of discrete words.
set n to discreteWordCount
set qualifierCount to discreteWordCount
end if
-- Compose a text report from the qualifying words and frequencies.
if (qualifierCount = n) then
set output to {"The " & n & " most frequently occurring words in the file are:"}
else
set output to {(qualifierCount as text) & " words share the " & ((n as text) & " highest frequencies in the file:")}
end if
repeat with i from 1 to qualifierCount
set {thisWord:thisWord, frequency:frequency} to item i of o's wordsAndFrequencies
set end of output to thisWord & ": " & (tab & frequency)
end repeat
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set output to output as text
set AppleScript's text item delimiters to astid
return output
end wordFrequency
 
-- Test code:
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)</syntaxhighlight>
 
{{output}}
<syntaxhighlight lang="applescript">"The 10 most frequently occurring words in the file are:
The: 41092
Of: 19954
And: 14943
A: 14545
To: 13953
In: 11219
He: 9649
Was: 8622
That: 7924
It: 6661"</syntaxhighlight>
 
=={{header|Arturo}}==
 
<syntaxhighlight lang="rebol">findFrequency: function [file, count][
freqs: #[]
r: {/[[:alpha:]]+/}
loop flatten map split.lines read file 'l -> match lower l r 'word [
if not? key? freqs word -> freqs\[word]: 0
freqs\[word]: freqs\[word] + 1
]
freqs: sort.values.descending freqs
result: new []
loop 0..dec count 'x [
'result ++ @[@[get keys freqs x, get values freqs x]]
]
return result
]
 
loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
print pair
]</syntaxhighlight>
 
{{out}}
 
<pre>the 41096
of 19955
and 14939
a 14558
to 13954
in 11218
he 9649
was 8622
that 7924
it 6661</pre>
 
=={{header|AutoHotkey}}==
<syntaxhighlight lang="autohotkey">URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete, % A_temp "\tempfile.txt"
words := []
while pos := RegExMatch(H, "\b[[:alpha:]]+\b", m, A_Index=1?1:pos+StrLen(m))
words[m] := words[m] ? words[m] + 1 : 1
for word, count in words
list .= count "`t" word "`r`n"
Sort, list, RN
loop, parse, list, `n, `r
{
result .= A_LoopField "`r`n"
if A_Index = 10
break
}
MsgBox % "Freq`tWord`n" result
return</syntaxhighlight>
Outputs:<pre>Freq Word
41036 The
19946 of
14940 and
14589 A
13939 TO
11204 in
9645 HE
8619 WAS
7922 THAT
6659 it</pre>
 
=={{header|AWK}}==
<syntaxhighlight lang="awk">
# syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
#
# A word is anything separated by white space.
# Therefore "this" and "this." are different.
# But "This" and "this" are identical.
# As I am "free to define what a letter is" I have chosen to allow
# numerics and all special characters as they are usually considered
# parts of words in text processing applications.
#
{ nbytes += length($0) + 2 # +2 for CR/LF
nfields += NF
$0 = tolower($0)
for (i=1; i<=NF; i++) {
arr[$i]++
}
}
END {
show = (show == "") ? 10 : show
width1 = length(show)
PROCINFO["sorted_in"] = "@val_num_desc"
for (i in arr) {
if (width2 == 0) { width2 = length(arr[i]) }
if (n++ >= show) { break }
printf("%*d %*d %s\n",width1,n,width2,arr[i],i)
}
printf("input: %d records, %d bytes, %d words of which %d are unique\n",NR,nbytes,nfields,length(arr))
exit(0)
}
</syntaxhighlight>
{{out}}
<pre>
1 40372 the
2 19868 of
3 14472 and
4 14278 a
5 13589 to
6 11024 in
7 9213 he
8 8347 was
9 7250 that
10 6414 his
input: 73829 records, 3369772 bytes, 568744 words of which 50394 are unique
</pre>
 
=={{header|BASIC}}==
==={{header|QB64}}===
This is rather long code. I fulfilled the requirement with QB64. It "cleans" each word, so anything that begins and ends with a letter is taken as a word. It works with arrays. The speed with which QB64 does this job on a file as big as Les Miserables.txt is impressive.
<syntaxhighlight lang="qbasic">
OPTION _EXPLICIT
 
' SUBs and FUNCTIONs
DECLARE SUB CountWords (FromString AS STRING)
DECLARE SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
DECLARE SUB ShowResults ()
DECLARE SUB ShowCompletion ()
DECLARE SUB TopCounted ()
DECLARE FUNCTION InsertWord& (WhichWord AS STRING)
DECLARE FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
DECLARE FUNCTION CleanWord$ (WhichWord AS STRING)
 
' Var
DIM iFile AS INTEGER
DIM iCol AS INTEGER
DIM iFil AS INTEGER
DIM iStep AS INTEGER
DIM iBar AS INTEGER
DIM iBlock AS INTEGER
DIM lIni AS LONG
DIM lEnd AS LONG
DIM lLines AS LONG
DIM lLine AS LONG
DIM lLenF AS LONG
DIM iRuns AS INTEGER
DIM lMaxWords AS LONG
DIM sTimer AS SINGLE
DIM strFile AS STRING
DIM strKey AS STRING
DIM strText AS STRING
DIM strDate AS STRING
DIM strTime AS STRING
DIM strBar AS STRING
DIM lWords AS LONG
DIM strWords AS STRING
CONST AddWords = 100
CONST TopCount = 10
CONST FALSE = 0, TRUE = NOT FALSE
 
' Initialize
iFile = FREEFILE
lIni = 1
strDate = DATE$
strTime = TIME$
lEnd = 0
lMaxWords = 1000
REDIM strWords(lIni TO lMaxWords) AS STRING
REDIM lWords(lIni TO lMaxWords) AS LONG
REDIM lTopWords(1) AS LONG
REDIM strTopWords(1) AS STRING
 
' ---Main program loop
$RESIZE:SMOOTH
DO
DO
CLS
PRINT "This program will count how many words are in a text file and shows the 10"
PRINT "most used of them."
PRINT
INPUT "Document to open (TXT file) (f=see files): ", strFile
IF UCASE$(strFile) = "F" THEN
strFile = ""
FILES
DO: LOOP UNTIL INKEY$ <> ""
END IF
LOOP UNTIL strFile <> ""
OPEN strFile FOR BINARY AS #iFile
IF LOF(iFile) > 0 THEN
iRuns = iRuns + 1
CLOSE #iFile
 
' Opens the document file to analyze it
sTimer = TIMER
ON TIMER(1) GOSUB ShowAdvance
OPEN strFile FOR INPUT AS #iFile
lLenF = LOF(iFile)
PRINT "Looking for words in "; strFile; ". File size:"; STR$(lLenF); ". ";: iCol = POS(0): PRINT "Initializing";
COLOR 23: PRINT "...";: COLOR 7
 
' Count how many lines has the file
lLines = 0
DO WHILE NOT EOF(iFile)
LINE INPUT #iFile, strText
lLines = lLines + 1
LOOP
CLOSE #iFile
 
' Shows the bar
LOCATE , iCol: PRINT "Initialization complete."
PRINT
PRINT "Processing"; lLines; "lines";: COLOR 23: PRINT "...": COLOR 7
iFil = CSRLIN
iCol = POS(0)
iBar = 80
iBlock = 80 / lLines
IF iBlock = 0 THEN iBlock = 1
PRINT STRING$(iBar, 176)
lLine = 0
iStep = lLines * iBlock / iBar
IF iStep = 0 THEN iStep = 1
IF iStep > 20 THEN
TIMER ON
END IF
OPEN strFile FOR INPUT AS #iFile
DO WHILE NOT EOF(iFile)
lLine = lLine + 1
IF (lLine MOD iStep) = 0 THEN
strBar = STRING$(iBlock * (lLine / iStep), 219)
LOCATE iFil, 1
PRINT strBar
ShowCompletion
END IF
LINE INPUT #iFile, strText
CountWords strText
strKey = INKEY$
LOOP
ShowCompletion
CLOSE #iFile
TIMER OFF
LOCATE iFil - 1, 1
PRINT "Done!" + SPACE$(70)
strBar = STRING$(iBar, 219)
LOCATE iFil, 1
PRINT strBar
LOCATE iFil + 5, 1
PRINT "Finishing";: COLOR 23: PRINT "...";: COLOR 7
ShowResults
 
' Frees the RAM
lMaxWords = 1000
lEnd = 0
REDIM strWords(lIni TO lMaxWords) AS STRING
REDIM lWords(lIni TO lMaxWords) AS LONG
 
ELSE
CLOSE #iFile
KILL strFile
PRINT
PRINT "Document does not exist."
END IF
PRINT
PRINT "Try again? (Y/n)"
DO
strKey = UCASE$(INKEY$)
LOOP UNTIL strKey = "Y" OR strKey = "N" OR strKey = CHR$(13) OR strKey = CHR$(27)
LOOP UNTIL strKey = "N" OR strKey = CHR$(27) OR iRuns >= 99
 
CLS
IF iRuns >= 99 THEN
PRINT "Maximum runs reached for this session."
END IF
 
PRINT "End of program"
PRINT "Start date/time: "; strDate; " "; strTime
PRINT "End date/time..: "; DATE$; " "; TIME$
END
' ---End main program
 
ShowAdvance:
ShowCompletion
RETURN
 
FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
' Var
DIM lFound AS LONG
DIM lLow AS LONG
DIM lHigh AS LONG
DIM lMid AS LONG
DIM strLookFor AS STRING
SHARED lIni AS LONG
SHARED lEnd AS LONG
SHARED lMaxWords AS LONG
SHARED strWords() AS STRING
SHARED lWords() AS LONG
 
' Binary search for stated word in the list
lLow = lIni
lHigh = lEnd
lFound = 0
strLookFor = UCASE$(LookFor)
DO WHILE (lFound < 1) AND (lLow <= lHigh)
lMid = (lHigh + lLow) / 2
IF strWords(lMid) = strLookFor THEN
lFound = lMid
ELSEIF strWords(lMid) > strLookFor THEN
lHigh = lMid - 1
ELSE
lLow = lMid + 1
END IF
LOOP
 
' Should I return the position if not found?
IF lFound = 0 AND RetPos THEN
IF lEnd < 1 THEN
lFound = 1
ELSEIF strWords(lMid) > strLookFor THEN
lFound = lMid
ELSE
lFound = lMid + 1
END IF
END IF
 
' Return the value
BinarySearch = lFound
END FUNCTION
 
FUNCTION CleanWord$ (WhichWord AS STRING)
' Var
DIM iSeek AS INTEGER
DIM iStep AS INTEGER
DIM bOK AS INTEGER
DIM strWord AS STRING
DIM strChar AS STRING
 
strWord = WhichWord
 
' Look for trailing wrong characters
strWord = LTRIM$(RTRIM$(strWord))
IF LEN(strWord) > 0 THEN
iStep = 0
DO
' Determines if step will be forward or backwards
IF iStep = 0 THEN
iStep = -1
ELSE
iStep = 1
END IF
 
' Sets the initial value of iSeek
IF iStep = -1 THEN
iSeek = LEN(strWord)
ELSE
iSeek = 1
END IF
 
bOK = FALSE
DO
strChar = MID$(strWord, iSeek, 1)
SELECT CASE strChar
CASE "A" TO "Z"
bOK = TRUE
CASE CHR$(129) TO CHR$(154)
bOK = TRUE
CASE CHR$(160) TO CHR$(165)
bOK = TRUE
END SELECT
 
' If it is not a character valid as a letter, please move one space
IF NOT bOK THEN
iSeek = iSeek + iStep
END IF
 
' If no letter was recognized, then exit the loop
IF iSeek < 1 OR iSeek > LEN(strWord) THEN
bOK = TRUE
END IF
LOOP UNTIL bOK
 
IF iStep = -1 THEN
' Reviews if a word was encountered
IF iSeek > 0 THEN
strWord = LEFT$(strWord, iSeek)
ELSE
strWord = ""
END IF
ELSEIF iStep = 1 THEN
IF iSeek <= LEN(strWord) THEN
strWord = MID$(strWord, iSeek)
ELSE
strWord = ""
END IF
END IF
LOOP UNTIL iStep = 1 OR strWord = ""
END IF
 
' Return the result
CleanWord = strWord
 
END FUNCTION
 
SUB CountWords (FromString AS STRING)
' Var
DIM iStart AS INTEGER
DIM iLenW AS INTEGER
DIM iLenS AS INTEGER
DIM iLenD AS INTEGER
DIM strString AS STRING
DIM strWord AS STRING
DIM lWhichWord AS LONG
SHARED lEnd AS LONG
SHARED lMaxWords AS LONG
SHARED strWords() AS STRING
SHARED lWords() AS LONG
' Converts to Upper Case and cleans leading and trailing spaces
strString = UCASE$(FromString)
strString = LTRIM$(RTRIM$(strString))
 
' Get words from string
iStart = 1
iLenW = 0
iLenS = LEN(strString)
DO WHILE iStart <= iLenS
iLenW = INSTR(iStart, strString, " ")
IF iLenW = 0 AND iStart <= iLenS THEN
iLenW = iLenS + 1
END IF
strWord = MID$(strString, iStart, iLenW - iStart)
 
' Will remove mid dashes or apostrophe or "â€"
iLenD = INSTR(strWord, "-")
IF iLenD < 1 THEN
iLenD = INSTR(strWord, "'")
IF iLenD < 1 THEN
iLenD = INSTR(strWord, "â€")
END IF
END IF
IF iLenD >= 2 THEN
strWord = LEFT$(strWord, iLenD - 1)
iLenW = iStart + (iLenD - 1)
END IF
strWord = CleanWord(strWord)
 
IF strWord <> "" THEN
' Look for the word to be counted
lWhichWord = BinarySearch(strWord, FALSE)
 
' If the word doesn't exist in the list, add it
IF lWhichWord = 0 THEN
lWhichWord = InsertWord(strWord)
END IF
 
' Count the word
IF lWhichWord > 0 THEN
lWords(lWhichWord) = lWords(lWhichWord) + 1
END IF
END IF
iStart = iLenW + 1
LOOP
 
END SUB
 
' Here a word will be inserted in the list
FUNCTION InsertWord& (WhichWord AS STRING)
' Var
DIM lFound AS LONG
DIM lWord AS LONG
DIM strWord AS STRING
SHARED lIni AS LONG
SHARED lEnd AS LONG
SHARED lMaxWords AS LONG
SHARED strWords() AS STRING
SHARED lWords() AS LONG
 
' Look for the word in the list
strWord = UCASE$(WhichWord)
lFound = BinarySearch(WhichWord, TRUE)
IF lFound > 0 THEN
' Add one word
lEnd = lEnd + 1
 
' Verifies if there is still room for a new word
IF lEnd > lMaxWords THEN
lMaxWords = lMaxWords + AddWords ' Other words
IF lMaxWords > 32767 THEN
IF lEnd <= 32767 THEN
lMaxWords = 32767
ELSE
lFound = -1
END IF
END IF
 
IF lFound > 0 THEN
REDIM _PRESERVE strWords(lIni TO lMaxWords) AS STRING
REDIM _PRESERVE lWords(lIni TO lMaxWords) AS LONG
END IF
END IF
 
IF lFound > 0 THEN
' Move the words below this
IF lEnd > 1 THEN
FOR lWord = lEnd TO lFound + 1 STEP -1
strWords(lWord) = strWords(lWord - 1)
lWords(lWord) = lWords(lWord - 1)
NEXT lWord
END IF
' Insert the word in the position
strWords(lFound) = strWord
lWords(lFound) = 0
END IF
END IF
 
InsertWord = lFound
END FUNCTION
 
SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
' Var
DIM lPivot AS LONG
DIM lLeftNIdx AS LONG
DIM lRightNIdx AS LONG
SHARED lWords() AS LONG
SHARED strWords() AS STRING
 
' Classifies from highest to lowest
lLeftNIdx = lLeftN
lRightNIdx = lRightN
IF (lRightN - lLeftN) > 0 THEN
lPivot = (lLeftN + lRightN) / 2
DO WHILE (lLeftNIdx <= lPivot) AND (lRightNIdx >= lPivot)
IF iMode = 0 THEN ' Ascending
DO WHILE (lWords(lLeftNIdx) < lWords(lPivot)) AND (lLeftNIdx <= lPivot)
lLeftNIdx = lLeftNIdx + 1
LOOP
DO WHILE (lWords(lRightNIdx) > lWords(lPivot)) AND (lRightNIdx >= lPivot)
lRightNIdx = lRightNIdx - 1
LOOP
ELSE ' Descending
DO WHILE (lWords(lLeftNIdx) > lWords(lPivot)) AND (lLeftNIdx <= lPivot)
lLeftNIdx = lLeftNIdx + 1
LOOP
DO WHILE (lWords(lRightNIdx) < lWords(lPivot)) AND (lRightNIdx >= lPivot)
lRightNIdx = lRightNIdx - 1
LOOP
END IF
SWAP lWords(lLeftNIdx), lWords(lRightNIdx)
SWAP strWords(lLeftNIdx), strWords(lRightNIdx)
lLeftNIdx = lLeftNIdx + 1
lRightNIdx = lRightNIdx - 1
IF (lLeftNIdx - 1) = lPivot THEN
lRightNIdx = lRightNIdx + 1
lPivot = lRightNIdx
ELSEIF (lRightNIdx + 1) = lPivot THEN
lLeftNIdx = lLeftNIdx - 1
lPivot = lLeftNIdx
END IF
LOOP
QuickSort lLeftN, lPivot - 1, iMode
QuickSort lPivot + 1, lRightN, iMode
END IF
END SUB
 
SUB ShowCompletion ()
' Var
SHARED iFil AS INTEGER
SHARED lLine AS LONG
SHARED lLines AS LONG
SHARED lEnd AS LONG
 
LOCATE iFil + 1, 1
PRINT "Lines analyzed :"; lLine
PRINT USING "% of completion: ###%"; (lLine / lLines) * 100
PRINT "Words found....:"; lEnd
END SUB
 
SUB ShowResults ()
' Var
DIM iMaxL AS INTEGER
DIM iMaxW AS INTEGER
DIM lWord AS LONG
DIM lHowManyWords AS LONG
DIM strString AS STRING
DIM strFileR AS STRING
SHARED lIni AS LONG
SHARED lEnd AS LONG
SHARED lLenF AS LONG
SHARED lMaxWords AS LONG
SHARED sTimer AS SINGLE
SHARED strFile AS STRING
SHARED strWords() AS STRING
SHARED lWords() AS LONG
SHARED strTopWords() AS STRING
SHARED lTopWords() AS LONG
SHARED iRuns AS INTEGER
 
' Show results
 
' Creates file name
lWord = INSTR(strFile, ".")
IF lWord = 0 THEN lWord = LEN(strFile)
strFileR = LEFT$(strFile, lWord)
IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."
 
' Retrieves the longest word found and the highest count
FOR lWord = lIni TO lEnd
' Gets the longest word found
IF LEN(strWords(lWord)) > iMaxL THEN
iMaxL = LEN(strWords(lWord))
END IF
 
lHowManyWords = lHowManyWords + lWords(lWord)
NEXT lWord
IF iMaxL > 60 THEN iMaxW = 60 ELSE iMaxW = iMaxL
 
' Gets top counted
TopCounted
 
' Shows the results
CLS
PRINT "File analyzed : "; strFile
PRINT "Length of file:"; lLenF
PRINT "Time lapse....:"; TIMER - sTimer;"seconds"
PRINT "Words found...:"; lHowManyWords; "(Unique:"; STR$(lEnd); ")"
PRINT "Longest word..:"; iMaxL
PRINT
PRINT "The"; TopCount; "most used are:"
PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
PRINT " Word"; SPACE$(iMaxW - 5); "| Count"
PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
strString = "\" + SPACE$(iMaxW - 2) + "\| #########,"
FOR lWord = lIni TO TopCount
PRINT USING strString; strTopWords(lWord); lTopWords(lWord)
NEXT lWord
PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
PRINT "See files "; strFileR + "S" + LTRIM$(STR$(iRuns)); " and "; strFileR + "C" + LTRIM$(STR$(iRuns)); " to see the full list."
END SUB
 
SUB TopCounted ()
' Var
DIM lWord AS LONG
DIM strFileR AS STRING
DIM iFile AS INTEGER
CONST Descending = 1
SHARED lIni AS LONG
SHARED lEnd AS LONG
SHARED lMaxWords AS LONG
SHARED strWords() AS STRING
SHARED lWords() AS LONG
SHARED strTopWords() AS STRING
SHARED lTopWords() AS LONG
SHARED iRuns AS INTEGER
SHARED strFile AS STRING
 
' Assigns new dimensions
REDIM strTopWords(lIni TO TopCount) AS STRING
REDIM lTopWords(lIni TO TopCount) AS LONG
 
' Saves the current values
lWord = INSTR(strFile, ".")
IF lWord = 0 THEN lWord = LEN(strFile)
strFileR = LEFT$(strFile, lWord)
IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."
iFile = FREEFILE
OPEN strFileR + "S" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
FOR lWord = lIni TO lEnd
WRITE #iFile, strWords(lWord), lWords(lWord)
NEXT lWord
CLOSE #iFile
 
' Classifies the counted in descending order
QuickSort lIni, lEnd, Descending
 
' Now, saves the required values in the arrays
FOR lWord = lIni TO TopCount
strTopWords(lWord) = strWords(lWord)
lTopWords(lWord) = lWords(lWord)
NEXT lWord
 
' Now, saves the order from the file
OPEN strFileR + "C" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
FOR lWord = lIni TO lEnd
WRITE #iFile, strWords(lWord), lWords(lWord)
NEXT lWord
CLOSE #iFile
 
END SUB
</syntaxhighlight>
 
{{output}}
<pre>
This program will count how many words are in a text file and shows the 10
most used of them.
 
Document to open (TXT file) (f=see files): miserabl.txt
Looking for words in miserabl.txt. File size: 3369775. Initialization complete.
 
Processing... Done!
Lines analyzed : 72917
% of completion: 100%
Words found....: 23288
 
Finishing...
 
Length of file: 3369775
Time lapse....: 35 seconds
Words found...: 578614 (Unique: 23538)
Longest word..: 25
 
The 10 most used are:
---------------------------+------------------------------------------------------------------------
Word | Count
---------------------------+------------------------------------------------------------------------
THE | 40,751
OF | 19,949
AND | 14,891
A | 14,430
TO | 13,923
IN | 11,189
HE | 9,605
WAS | 8,617
THAT | 7,833
IT | 6,579
---------------------------+------------------------------------------------------------------------
See files miserabl.S1 and miserabl.C1 to see the full list.
 
Try again? (Y/n)
</pre>
 
==={{header|BaCon}}===
All punctuation, digits, tabs and carriage returns are removed, so "This", "this" and "this." are the same. There is full support for UTF-8 characters in words. The code itself could be smaller, but for the sake of clarity everything has been written out explicitly.
<syntaxhighlight lang="bacon">' We do not count superfluous spaces as words
OPTION COLLAPSE TRUE
 
' Optional: use TRE regex library to speed up the program
PRAGMA RE tre INCLUDE <tre/regex.h> LDFLAGS -ltre
 
' We're using associative arrays
DECLARE frequency ASSOC NUMBER
 
' Load the text and remove all punctuation, digits, tabs and cr
book$ = EXTRACT$(LOAD$("miserables.txt"), "[[:punct:]]|[[:digit:]]|[\t\r]", TRUE)
 
' Count each word in lowercase
FOR word$ IN REPLACE$(book$, NL$, CHR$(32))
INCR frequency(LCASE$(word$))
NEXT
 
' Sort the associative array and then map the index to a string array
LOOKUP frequency TO term$ SIZE x SORT DOWN
 
' Show results
FOR i = 0 TO 9
PRINT term$[i], " : ", frequency(term$[i])
NEXT</syntaxhighlight>
{{output}}
<pre>
the : 40440
of : 19903
and : 14738
a : 14306
to : 13630
in : 11083
he : 9452
was : 8605
that : 7535
his : 6434
</pre>
 
=={{header|Batch File}}==
 
This takes a very long time per word, so I have chosen to feed it a 200-line sample and go from there.

You could cut the length of this down drastically if you didn't need to be able to recall the word at the nth position and wished only to display the top 10 words.
 
<syntaxhighlight lang="dos">
@echo off
 
call:wordCount 1 2 3 4 5 6 7 8 9 10 42 101
 
pause>nul
exit
 
:wordCount
setlocal enabledelayedexpansion
 
set word=100000
set line=0
for /f "delims=" %%i in (input.txt) do (
set /a line+=1
for %%j in (%%i) do (
if not !skip%%j!==true (
echo line !line! ^| word !word:~-5! - "%%~j"
type input.txt | find /i /c "%%~j" > count.tmp
set /p tmpvar=<count.tmp
set tmpvar=000000000!tmpvar!
set tmpvar=!tmpvar:~-10!
set count[!word!]=!tmpvar! %%~j
set "skip%%j=true"
set /a word+=1
)
)
)
del count.tmp
 
set wordcount=0
for /f "tokens=1,2 delims= " %%i in ('set count ^| sort /+14 /r') do (
set /a wordcount+=1
for /f "tokens=2 delims==" %%k in ("%%i") do (
set word[!wordcount!]=!wordcount!. %%j - %%k
)
)
 
cls
for %%i in (%*) do echo !word[%%i]!
endlocal
goto:eof
</syntaxhighlight>
 
 
'''Output'''
 
<pre>
1. - 0000000140 I
2. - 0000000140 a
3. - 0000000118 He
4. - 0000000100 the
5. - 0000000080 an
6. - 0000000075 in
7. - 0000000066 at
8. - 0000000062 is
9. - 0000000058 on
10. - 0000000058 as
42. - 0000000010 with
101. - 0000000004 ears
</pre>
 
=={{header|Bracmat}}==
This solution assumes that words consist of characters that exist in both a lowercase and an uppercase version, so it won't work with many non-European alphabets.
 
The built-in <code>vap</code> function can take either two or three arguments. The first argument must be the name of a function or a function definition. The second argument must be a string. The two-argument version maps the function to each character in the string. The three-argument version splits the string at each occurrence of the third argument, which must be a single character, and applies the function to the intervening substrings. The output of <code>vap</code> is a space-separated list of results from the function argument.
 
The expression <code>!('($arg:?A [($pivot) ?Z))</code> must be read as follows:
 
The subexpression <code>'($arg:?A [($pivot) ?Z)</code> is a macro expression. The symbols <code>arg</code> and <code>pivot</code>, which are the right hand sides of <code>$</code> operators with empty left hand side, are replaced by the actual values of <code>!arg</code> and <code>!pivot</code>. The whole subexpression is made the right hand side of a <code>=</code> operator with empty left hand side, e.g.
<code>=a b c d e:?A [2 ?Z</code>. The <code>=</code> operator protects the subexpression against evaluation. By prefixing the expression with the <code>!</code> unary operator (which normally is used to obtain the value of a variable), the pattern match operation <code>a b c d e:?A [2 ?Z</code> is executed, assigning <code>a b</code> to <code>A</code> and assigning <code>c d e</code> to <code>Z</code>.
 
The reason for using a macro expression is that evaluating a pattern match operation containing a pattern variable, as in <code>!arg:?A [!pivot ?Z</code>, is unnecessarily slow, since <code>!pivot</code> is evaluated up to <code>!pivot+1</code> times.
 
 
<syntaxhighlight lang="bracmat"> ( 10-most-frequent-words
= MergeSort { Local variable declarations. }
types
sorted-words
frequency
type
most-frequent-words
. ( MergeSort { Definition of function MergeSort. }
= A N Z pivot
. !arg:? [?N { [?N is a subpattern that counts the number of preceding elements }
& ( !N:>1 { if N at least 2 ... }
& div$(!N.2):?pivot { divide N by 2 ... }
& !('($arg:?A [($pivot) ?Z)) { split list in two halves A and Z ... }
& MergeSort$!A+MergeSort$!Z { sort each of A and Z and return sum }
| !arg { else just return a single element}
)
)
& MergeSort { Sort }
$ ( vap { Split second argument at each occurrence of third character and apply first argument to each chunk. }
$ ( (=.low$!arg) { Return input, lowercased. }
. str
$ ( vap { Vaporize second argument in UTF-8 or Latin-1 characters and apply first argument to each of them. }
$ ( (
=
. upp$!arg:low$!arg&\n { Return newline instead of non-alphabetic character. }
| !arg { Return (Euro-centric) alphabetic character.}
)
. get$(!arg,NEW STR) { Read input text as a single string. }
)
)
. \n { Split at newlines }
)
)
: ?sorted-words { Assign sum of (frequency*lowercasedword) terms to sorted-words. }
& :?types { Initialize types as an empty list. }
& whl { Loop until right hand side fails. }
' ( !sorted-words:#?frequency*%@?type+?sorted-words { Extract first frequency*type term from sum. }
& (!frequency.!type) !types:?types { Prepend (frequency.type) pair to types list}
)
& MergeSort$!types { Sort the list of (frequency.type) pairs. }
: (?+[-11+?most-frequent-words|?most-frequent-words) { Pick the last 10 terms from the sum returned by MergeSort. }
& !most-frequent-words { Return the last 10 terms. }
)
& out$(10-most-frequent-words$"135-0.txt") { Call 10-most-frequent-words with name of inout file and print result to screen. }</syntaxhighlight>
'''Output'''
<pre> (6661.it)
+ (7924.that)
+ (8622.was)
+ (9649.he)
+ (11219.in)
+ (13953.to)
+ (14546.a)
+ (14943.and)
+ (19954.of)
+ (41092.the)</pre>
 
=={{header|C}}==
{{libheader|GLib}}
Words are defined by the regular expression "\w+".
<syntaxhighlight lang="c">#include <stdbool.h>
#include <stdio.h>
#include <glib.h>
 
typedef struct word_count_tag {
const char* word;
size_t count;
} word_count;
 
int compare_word_count(const void* p1, const void* p2) {
const word_count* w1 = p1;
const word_count* w2 = p2;
if (w1->count > w2->count)
return -1;
if (w1->count < w2->count)
return 1;
return 0;
}
 
bool get_top_words(const char* filename, size_t count) {
GError* error = NULL;
GMappedFile* mapped_file = g_mapped_file_new(filename, FALSE, &error);
if (mapped_file == NULL) {
fprintf(stderr, "%s\n", error->message);
g_error_free(error);
return false;
}
const char* text = g_mapped_file_get_contents(mapped_file);
if (text == NULL) {
fprintf(stderr, "File %s is empty\n", filename);
g_mapped_file_unref(mapped_file);
return false;
}
gsize file_size = g_mapped_file_get_length(mapped_file);
// Store word counts in a hash table
GHashTable* ht = g_hash_table_new_full(g_str_hash, g_str_equal,
g_free, g_free);
GRegex* regex = g_regex_new("\\w+", 0, 0, NULL);
GMatchInfo* match_info;
g_regex_match_full(regex, text, file_size, 0, 0, &match_info, NULL);
while (g_match_info_matches(match_info)) {
char* word = g_match_info_fetch(match_info, 0);
char* lower = g_utf8_strdown(word, -1);
g_free(word);
size_t* count = g_hash_table_lookup(ht, lower);
if (count != NULL) {
++*count;
g_free(lower);
} else {
count = g_new(size_t, 1);
*count = 1;
g_hash_table_insert(ht, lower, count);
}
g_match_info_next(match_info, NULL);
}
g_match_info_free(match_info);
g_regex_unref(regex);
g_mapped_file_unref(mapped_file);
 
// Sort words in decreasing order of frequency
size_t size = g_hash_table_size(ht);
word_count* words = g_new(word_count, size);
GHashTableIter iter;
gpointer key, value;
g_hash_table_iter_init(&iter, ht);
for (size_t i = 0; g_hash_table_iter_next(&iter, &key, &value); ++i) {
words[i].word = key;
words[i].count = *(size_t*)value;
}
qsort(words, size, sizeof(word_count), compare_word_count);
 
// Print the most common words
if (count > size)
count = size;
printf("Top %lu words\n", count);
printf("Rank\tCount\tWord\n");
for (size_t i = 0; i < count; ++i)
printf("%lu\t%lu\t%s\n", i + 1, words[i].count, words[i].word);
g_free(words);
g_hash_table_destroy(ht);
return true;
}
 
int main(int argc, char** argv) {
if (argc != 2) {
fprintf(stderr, "usage: %s file\n", argv[0]);
return EXIT_FAILURE;
}
if (!get_top_words(argv[1], 10))
return EXIT_FAILURE;
return EXIT_SUCCESS;
}</syntaxhighlight>
 
{{out}}
<pre>
Top 10 words
Rank Count Word
1 41039 the
2 19951 of
3 14942 and
4 14527 a
5 13941 to
6 11209 in
7 9646 he
8 8620 was
9 7922 that
10 6659 it
</pre>
 
=={{header|C sharp|C#}}==
{{trans|D}}
<syntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
 
namespace WordCount {
class Program {
static void Main(string[] args) {
var text = File.ReadAllText("135-0.txt").ToLower();
 
var match = Regex.Match(text, "\\w+");
Dictionary<string, int> freq = new Dictionary<string, int>();
while (match.Success) {
string word = match.Value;
if (freq.ContainsKey(word)) {
freq[word]++;
} else {
freq.Add(word, 1);
}
 
match = match.NextMatch();
}
 
Console.WriteLine("Rank Word Frequency");
Console.WriteLine("==== ==== =========");
int rank = 1;
foreach (var elem in freq.OrderByDescending(a => a.Value).Take(10)) {
Console.WriteLine("{0,2} {1,-4} {2,5}", rank++, elem.Key, elem.Value);
}
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 41035
2 of 19946
3 and 14940
4 a 14577
5 to 13939
6 in 11204
7 he 9645
8 was 8619
9 that 7922
10 it 6659</pre>
 
=={{header|C++}}==
<syntaxhighlight lang="cpp">#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
 
int main(int ac, char** av) {
std::ios::sync_with_stdio(false);
int head = (ac > 1) ? std::atoi(av[1]) : 10;
std::istreambuf_iterator<char> it(std::cin), eof;
std::filebuf file;
if (ac > 2) {
if (file.open(av[2], std::ios::in), file.is_open()) {
it = std::istreambuf_iterator<char>(&file);
} else return std::cerr << "file " << av[2] << " open failed\n", 1;
}
auto alpha = [](unsigned c) { return c-'A' < 26 || c-'a' < 26; };
auto lower = [](unsigned c) { return c | '\x20'; };
std::unordered_map<std::string, int> counts;
std::string word;
for (; it != eof; ++it) {
if (alpha(*it)) {
word.push_back(lower(*it));
} else if (!word.empty()) {
++counts[word];
word.clear();
}
}
if (!word.empty()) ++counts[word]; // if file ends w/o ws
std::vector<std::pair<const std::string,int> const*> out;
for (auto& count : counts) out.push_back(&count);
std::partial_sort(out.begin(),
out.size() < head ? out.end() : out.begin() + head,
out.end(), [](auto const* a, auto const* b) {
return a->second > b->second;
});
if (out.size() > head) out.resize(head);
for (auto const& count : out) {
std::cout << count->first << ' ' << count->second << '\n';
}
return 0;
}
</syntaxhighlight>
 
{{out}}
<pre>
$ ./a.out 10 135-0.txt
the 41093
of 19954
and 14943
a 14558
to 13953
in 11219
he 9649
was 8622
that 7924
it 6661
</pre>
===Alternative===
{{trans|C#}}
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <fstream>
#include <map>
#include <regex>
#include <string>
#include <vector>
 
int main() {
std::regex wordRgx("\\w+");
std::map<std::string, int> freq;
std::string line;
const int top = 10;
 
std::ifstream in("135-0.txt");
if (!in.is_open()) {
std::cerr << "Failed to open file\n";
return 1;
}
while (std::getline(in, line)) {
auto words_itr = std::sregex_iterator(
line.cbegin(), line.cend(), wordRgx);
auto words_end = std::sregex_iterator();
while (words_itr != words_end) {
auto match = *words_itr;
auto word = match.str();
if (word.size() > 0) {
transform (word.begin(), word.end(), word.begin(), ::tolower);
auto entry = freq.find(word);
if (entry != freq.end()) {
entry->second++;
} else {
freq.insert(std::make_pair(word, 1));
}
}
words_itr = std::next(words_itr);
}
}
in.close();
 
std::vector<std::pair<std::string, int>> pairs;
for (auto iter = freq.cbegin(); iter != freq.cend(); ++iter) {
pairs.push_back(*iter);
}
std::sort(pairs.begin(), pairs.end(), [](auto& a, auto& b) {
return a.second > b.second;
});
 
std::cout << "Rank Word Frequency\n"
"==== ==== =========\n";
int rank = 1;
for (auto iter = pairs.cbegin(); iter != pairs.cend() && rank <= top; ++iter) {
std::printf("%2d %4s %5d\n", rank++, iter->first.c_str(), iter->second);
}
 
return 0;
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 36636
2 of 19615
3 and 14079
4 to 13535
5 a 13527
6 in 10256
7 was 8543
8 that 7324
9 he 6814
10 had 6139</pre>
 
===C++20===
{{trans|C#}}
<syntaxhighlight lang="cpp">#include <algorithm>
#include <iostream>
#include <format>
#include <fstream>
#include <map>
#include <ranges>
#include <regex>
#include <string>
#include <vector>
 
int main() {
std::ifstream in("135-0.txt");
std::string text{
std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}
};
in.close();
 
std::regex word_rx("\\w+");
std::map<std::string, int> freq;
for (const auto& a : std::ranges::subrange(
std::sregex_iterator{ text.cbegin(),text.cend(), word_rx }, std::sregex_iterator{}
))
{
auto word = a.str();
transform(word.begin(), word.end(), word.begin(), ::tolower);
freq[word]++;
}
 
std::vector<std::pair<std::string, int>> pairs;
for (const auto& elem : freq)
{
pairs.push_back(elem);
}
 
std::ranges::sort(pairs, std::ranges::greater{}, &std::pair<std::string, int>::second);
 
std::cout << "Rank Word Frequency\n"
"==== ==== =========\n";
for (int rank=1; const auto& [word, count] : pairs | std::views::take(10))
{
std::cout << std::format("{:2} {:>4} {:5}\n", rank++, word, count);
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 41043
2 of 19952
3 and 14938
4 a 14539
5 to 13942
6 in 11208
7 he 9646
8 was 8620
9 that 7922
10 it 6659</pre>
 
=={{header|Clojure}}==
<langsyntaxhighlight lang="clojure">(defn count-words [file n]
(->> file
slurp
Line 35 ⟶ 1,706:
frequencies
(sort-by val >)
(take n)))</langsyntaxhighlight>
 
 
=={{header|COBOL}}==
<syntaxhighlight lang="cobol">
IDENTIFICATION DIVISION.
PROGRAM-ID. WordFrequency.
AUTHOR. Bill Gunshannon.
DATE-WRITTEN. 30 Jan 2020.
************************************************************
** Program Abstract:
** Given a text file and an integer n, print the n most
** common words in the file (and the number of their
** occurrences) in decreasing frequency.
**
** A file named Parameter.txt provides this information.
** Format is:
** 12345678901234567890123456789012345678901234567890
** |------------------|----|
** ^^^^^^^^^^^^^^^^ ^^^^
** | |
** Source Text File Number of words with count
** 20 Characters 5 digits with leading zeroes
**
**
************************************************************
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT Parameter-File ASSIGN TO "Parameter.txt"
ORGANIZATION IS LINE SEQUENTIAL.
SELECT Input-File ASSIGN TO Source-Text
ORGANIZATION IS LINE SEQUENTIAL.
SELECT Word-File ASSIGN TO "Word.txt"
ORGANIZATION IS LINE SEQUENTIAL.
SELECT Output-File ASSIGN TO "Output.txt"
ORGANIZATION IS LINE SEQUENTIAL.
SELECT Print-File ASSIGN TO "Printer.txt"
ORGANIZATION IS LINE SEQUENTIAL.
SELECT Sort-File ASSIGN TO DISK.
DATA DIVISION.
FILE SECTION.
FD Parameter-File
DATA RECORD IS Parameter-Record.
01 Parameter-Record.
05 Source-Text PIC X(20).
05 How-Many PIC 99999.
 
FD Input-File
DATA RECORD IS Input-Record.
01 Input-Record.
05 Input-Line PIC X(80).
 
FD Word-File
DATA RECORD IS Word-Record.
01 Word-Record.
05 Input-Word PIC X(20).
 
FD Output-File
DATA RECORD IS Output-Rec.
01 Output-Rec.
05 Output-Rec-Word PIC X(20).
05 Output-Rec-Word-Cnt PIC 9(5).
 
FD Print-File
DATA RECORD IS Print-Rec.
01 Print-Rec.
05 Print-Rec-Word PIC X(20).
05 Print-Rec-Word-Cnt PIC 9(5).
SD Sort-File.
01 Sort-Rec.
05 Sort-Word PIC X(20).
05 Sort-Word-Cnt PIC 9(5).
WORKING-STORAGE SECTION.
01 Eof PIC X VALUE 'F'.
01 InLine PIC X(80).
01 Word1 PIC X(20).
01 Current-Word PIC X(20).
01 Current-Word-Cnt PIC 9(5).
01 Pos PIC 99
VALUE 1.
01 Cnt PIC 99.
01 Report-Rank.
05 IRank PIC 99999
VALUE 1.
05 Rank PIC ZZZZ9.
PROCEDURE DIVISION.
Main-Program.
**
** Read the Parameters
**
OPEN INPUT Parameter-File.
READ Parameter-File.
CLOSE Parameter-File.
 
**
** Open Files for first stage
**
OPEN INPUT Input-File.
OPEN OUTPUT Word-File.
 
**
** Parse the Source Text into a file of individual words
**
PERFORM UNTIL Eof = 'T'
READ Input-File
AT END MOVE 'T' TO Eof
END-READ
 
PERFORM Parse-a-Words
 
MOVE SPACES TO Input-Record
MOVE 1 TO Pos
END-PERFORM.
**
** Cleanup from the first stage
**
CLOSE Input-File Word-File
 
**
** Sort the individual words in alphabetical order
**
SORT Sort-File
ON ASCENDING KEY Sort-Word
USING Word-File
GIVING Word-File.
 
**
** Count each time a word is used
**
PERFORM Collect-Totals.
 
**
** Sort data by number of usages per word
**
SORT Sort-File
ON DESCENDING KEY Sort-Word-Cnt
USING Output-File
GIVING Print-File.
 
**
** Show the work done
**
OPEN INPUT Print-File.
DISPLAY " Rank Word Frequency"
PERFORM How-Many TIMES
READ Print-File
MOVE IRank TO Rank
DISPLAY Rank " " Print-Rec
ADD 1 TO IRank
END-PERFORM.
 
**
** Cleanup
**
CLOSE Print-File.
CALL "C$DELETE" USING "Word.txt" ,0
CALL "C$DELETE" USING "Output.txt" ,0
 
STOP RUN.
 
Parse-a-Words.
INSPECT Input-Record CONVERTING '-.,"();:/[]{}!?|' TO SPACE
PERFORM UNTIL Pos > FUNCTION STORED-CHAR-LENGTH(Input-Record)
 
 
UNSTRING Input-Record DELIMITED BY SPACE INTO Word1
WITH POINTER Pos TALLYING IN Cnt
MOVE FUNCTION TRIM(FUNCTION LOWER-CASE(Word1)) TO Word-Record
IF Word-Record NOT EQUAL SPACES AND Word-Record IS ALPHABETIC
THEN WRITE Word-Record
END-IF
 
END-PERFORM.
 
Collect-Totals.
MOVE 'F' to Eof
OPEN INPUT Word-File
OPEN OUTPUT Output-File
READ Word-File
MOVE Input-Word TO Current-Word
MOVE 1 to Current-Word-Cnt
PERFORM UNTIL Eof = 'T'
READ Word-File
AT END MOVE 'T' TO Eof
END-READ
 
IF FUNCTION TRIM(Word-Record)
EQUAL
FUNCTION TRIM(Current-Word)
THEN
ADD 1 to Current-Word-Cnt
ELSE
MOVE Current-Word TO Output-Rec-Word
MOVE Current-Word-Cnt TO Output-Rec-Word-Cnt
WRITE Output-Rec
MOVE 1 to Current-Word-Cnt
MOVE Word-Record TO Current-Word
MOVE SPACES TO Input-Record
END-IF
END-PERFORM.
CLOSE Word-File Output-File.
END-PROGRAM.
</syntaxhighlight>
 
{{Out}}
<pre>
Rank Word Frequency
1 the 40551
2 of 19806
3 and 14730
4 a 14351
5 to 13775
6 in 11074
7 he 09480
8 was 08613
9 that 07632
10 his 06446
11 it 06335
12 had 06181
13 is 06097
14 which 05135
15 with 04469
</pre>
 
=={{header|Common Lisp}}==
<syntaxhighlight lang="lisp">
(defun count-word (n pathname)
(with-open-file (s pathname :direction :input)
(loop for line = (read-line s nil nil) while line
nconc (list-symb (drop-noise line)) into words
finally (return (subseq (sort (pair words)
#'> :key #'cdr)
0 n)))))
 
(defun list-symb (s)
(let ((*read-eval* nil))
(read-from-string (concatenate 'string "(" s ")"))))
 
(defun drop-noise (s)
(delete-if-not #'(lambda (x) (or (alpha-char-p x)
(equal x #\space)
(equal x #\-))) s))
 
(defun pair (words &aux (hash (make-hash-table)) ac)
(dolist (word words) (incf (gethash word hash 0)))
(maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)
</syntaxhighlight>
 
{{Out}}
<pre>
> (count-word 10 "c:/temp/135-0.txt")
((THE . 40738) (OF . 19922) (AND . 14878) (A . 14419) (TO . 13702) (IN . 11172)
(HE . 9577) (WAS . 8612) (THAT . 7768) (IT . 6467))
</pre>
 
=={{header|Crystal}}==
<syntaxhighlight lang="ruby">require "http/client"
require "regex"
 
# Get the text from the internet
response = HTTP::Client.get "https://www.gutenberg.org/files/135/135-0.txt"
text = response.body
 
text
.downcase
.scan(/[a-zA-ZáéíóúÁÉÍÓÚâêôäüöàèìòùñ']+/)
.reduce({} of String => Int32) { |hash, match|
word = match[0]
hash[word] = hash.fetch(word, 0) + 1 # fetch supplies a default count of 0 for a newly seen word, so its first count is 1
hash
}
.to_a # convert the returned hash to an array of tuples (String, Int32) -> {word, sum}
.sort { |a, b| b[1] <=> a[1] }[0..9] # sort and get the first 10 elements
.each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result
</syntaxhighlight>
 
{{out}}
<pre>
1 the 41092
2 of 19954
3 and 14943
4 a 14556
5 to 13953
6 in 11219
7 he 9649
8 was 8622
9 that 7924
10 it 6661
</pre>
 
=={{header|D}}==
<syntaxhighlight lang="d">import std.algorithm : sort;
import std.array : appender, split;
import std.range : take;
import std.stdio : File, writefln, writeln;
import std.typecons : Tuple;
import std.uni : toLower;
 
//Container for a word and how many times it has been seen
alias Pair = Tuple!(string, "k", int, "v");
 
void main() {
int[string] wcnt;
 
//Read the file line by line
foreach (line; File("135-0.txt").byLine) {
//Split the words on whitespace
foreach (word; line.split) {
//Increment the times the word has been seen
wcnt[word.toLower.idup]++;
}
}
 
//Associative arrays cannot be sorted, so put the key/value pairs in an array
auto wb = appender!(Pair[]);
foreach(k,v; wcnt) {
wb.put(Pair(k,v));
}
Pair[] sw = wb.data.dup;
 
//Sort the array, and display the top ten values
writeln("Rank Word Frequency");
int rank=1;
foreach (word; sw.sort!"a.v>b.v".take(10)) {
writefln("%4s %-10s %9s", rank++, word.k, word.v);
}
}</syntaxhighlight>
 
{{out}}
<pre>Rank Word Frequency
1 the 40368
2 of 19863
3 and 14470
4 a 14277
5 to 13587
6 in 11019
7 he 9212
8 was 8346
9 that 7251
10 his 6414</pre>
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| System.IOUtils}}
{{libheader| System.Generics.Collections}}
{{libheader| System.Generics.Defaults}}
{{libheader| System.RegularExpressions}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Word_frequency;
 
{$APPTYPE CONSOLE}
 
uses
System.SysUtils,
System.IOUtils,
System.Generics.Collections,
System.Generics.Defaults,
System.RegularExpressions;
 
type
TWords = TDictionary<string, Integer>;
 
TFreqPair = TPair<string, Integer>;
 
TFreq = TArray<TFreqPair>;
 
function CreateValueCompare: IComparer<TFreqPair>;
begin
Result := TComparer<TFreqPair>.Construct(
function(const Left, Right: TFreqPair): Integer
begin
Result := Right.Value - Left.Value;
end);
end;
 
function WordFrequency(const Text: string): TFreq;
var
words: TWords;
match: TMatch;
w: string;
begin
words := TWords.Create();
match := TRegEx.Match(Text, '\w+');
while match.Success do
begin
w := match.Value;
if words.ContainsKey(w) then
words[w] := words[w] + 1
else
words.Add(w, 1);
match := match.NextMatch();
end;
 
Result := words.ToArray;
words.Free;
TArray.Sort<TFreqPair>(Result, CreateValueCompare);
end;
 
var
Text: string;
rank: integer;
Freq: TFreq;
w: TFreqPair;
 
begin
Text := TFile.ReadAllText('135-0.txt').ToLower();
 
Freq := WordFrequency(Text);
 
Writeln('Rank Word Frequency');
Writeln('==== ==== =========');
 
for rank := 1 to 10 do
begin
w := Freq[rank - 1];
Writeln(format('%2d %6s %5d', [rank, w.Key, w.Value]));
end;
 
readln;
end.
</syntaxhighlight>
{{out}}
<pre>
Rank Word Frequency
==== ==== =========
1 the 41040
2 of 19951
3 and 14942
4 a 14539
5 to 13941
6 in 11209
7 he 9646
8 was 8620
9 that 7922
10 it 6659
</pre>
=={{header|F Sharp}}==
<langsyntaxhighlight lang="fsharp">
open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)
</syntaxhighlight>
 
=={{header|Factor}}==
This program expects stdin to read from a file via the command line (e.g. invoking the program in Windows: <tt>>factor word-count.factor < input.txt</tt>). The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.
<syntaxhighlight lang="factor">
USING: ascii io math.statistics prettyprint sequences
splitting ;
IN: rosetta-code.word-count
 
lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .
</syntaxhighlight>
{{out}}
<pre>
{
{ "the" 41021 }
{ "of" 19945 }
{ "and" 14938 }
{ "a" 14522 }
{ "to" 13938 }
{ "in" 11201 }
{ "he" 9600 }
{ "was" 8618 }
{ "that" 7822 }
{ "it" 6532 }
}
</pre>
 
=={{header|FreeBASIC}}==
<syntaxhighlight lang="freebasic">
#Include "file.bi"
type tally
as string s
as long l
end type
Sub quicksort(array() As String,begin As Long,Finish As Long)
Dim As Long i=begin,j=finish
Dim As String x =array(((I+J)\2))
While I <= J
While array(I) < X :I+=1:Wend
While array(J) > X :J-=1:Wend
If I<=J Then Swap array(I),array(J): I+=1:J-=1
Wend
If J >begin Then quicksort(array(),begin,J)
If I <Finish Then quicksort(array(),I,Finish)
End Sub
 
Sub tallysort(array() As tally,begin As Long,Finish As long)
Dim As Long i=begin,j=finish
Dim As tally x =array(((I+J)\2))
While I <= J
While array(I).l > X .l:I+=1:Wend
While array(J).l < X .l:J-=1:Wend
If I<=J Then Swap array(I),array(J): I+=1:J-=1
Wend
If J >begin Then tallysort(array(),begin,J)
If I <Finish Then tallysort(array(),I,Finish)
End Sub
 
 
Function loadfile(file As String) As String
If Fileexists(file)=0 Then Print file;" not found":Sleep:End
Dim As Long f=Freefile
Open file For Binary Access Read As #f
Dim As String text
If Lof(f) > 0 Then
text = String(Lof(f), 0)
Get #f, , text
End If
Close #f
Return text
End Function
 
Function String_Split(s_in As String,chars As String,result() As String) As Long
Dim As Long ctr,ctr2,k,n,LC=Len(chars)
Dim As boolean tally(Len(s_in))
#macro check_instring()
n=0
While n<Lc
If chars[n]=s_in[k] Then
tally(k)=true
If (ctr2-1) Then ctr+=1
ctr2=0
Exit While
End If
n+=1
Wend
#endmacro
#macro splice()
If tally(k) Then
If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)
ctr2=0
End If
#endmacro
'================== LOOP TWICE =======================
For k =0 To Len(s_in)-1
ctr2+=1:check_instring()
Next k
If ctr=0 Then
If Len(s_in) Andalso Instr(chars,Chr(s_in[0])) Then ctr=1':
End If
If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else Return 0
For k =0 To Len(s_in)-1
ctr2+=1:splice()
Next k
'===================== Last one ========================
If ctr2>0 Then
Redim Preserve result(1 To ctr+1)
result(ctr+1)=Mid(s_in,k+1-ctr2,ctr2)
End If
Return Ubound(result)
End Function
 
Redim As String s()
redim as tally t()
dim as string p1,p2,deliminators
dim as long count,jmp
dim as double tm=timer
 
Var L=loadfile("rosettalesmiserables.txt")
L=lcase(L)
'get deliminators
for n as long=1 to 96
p1+=chr(n)
next
for n as long=123 to 255
p2+=chr(n)
next
 
deliminators=p1+p2
 
string_split(L,deliminators,s())
 
quicksort(s(),lbound(s),ubound(s))
 
For n As Long=lbound(s) To ubound(s)-1
if s(n+1)=s(n) then jmp+=1
if s(n+1)<>s(n) then
count+=1
redim preserve t(1 to count)
t(count).s=s(n)
t(count).l=jmp
jmp=0
end if
Next
 
tallysort(t(),lbound(t),ubound(t))'sort by frequency
print "frequency","word"
print
for n as long=lbound(t) to lbound(t)+9
print t(n).l,t(n).s
next
 
Print
print "time for operation ";timer-tm;" seconds"
sleep
</syntaxhighlight>
{{out}}
<pre>
I saved and reloaded the file as ascii text.
frequency word
 
41098 the
19955 of
14939 and
14557 a
13953 to
11219 in
9648 he
8621 was
7923 that
6660 it
 
time for operation 1.099869600031525 seconds
 
</pre>
 
=={{header|Frink}}==
This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:
* Frink has a Unicode-aware function, <CODE>wordList[''str'']</CODE>, which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.) It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out. These are likely punctuation and spaces. There is also a two-argument function, <CODE>wordList[''str'', ''lang'']</CODE> which allows you to specify a language code ''e.g.'' <CODE>"fr"</CODE> to use the rules of French (or many other human languages) to perform correct word-breaking according to the rules of that language!
* The file fetched from Project Gutenberg is supposed to be encoded in UTF-8, but their servers incorrectly declare it to be Windows-1252 encoded or send no character encoding at all, so this program fixes that.
* Frink has a Unicode-aware lowercase function, <CODE>lc[''str'']</CODE> that correctly handles accented characters and may even make a string longer.
 
* Frink can normalize Unicode characters with its <CODE>normalizeUnicode</CODE> function so the same word encoded two different ways in Unicode can be treated consistently. For instance, a Unicode string can use various methods to encode what is essentially the same character/glyph. For example, the character <CODE>ô</CODE> can be represented as either <CODE>"\u00F4"</CODE> or <CODE>"\u006F\u0302"</CODE>. The former is a "precomposed" character, <CODE>"LATIN SMALL LETTER O WITH CIRCUMFLEX"</CODE>, and the latter is two Unicode codepoints, an <CODE>o</CODE> (<CODE>LATIN SMALL LETTER O</CODE>) followed by <CODE>"COMBINING CIRCUMFLEX ACCENT"</CODE>. (This is usually referred to as a "decomposed" representation.) Unicode normalization rules can convert these "equivalent" encodings into a canonical representation. This makes two different strings which look equivalent to a human (but are very different in their codepoints) be treated as the same to a computer, and these programs will count them the same. Even if the Project Gutenberg document uses precomposed and decomposed representations for the same words, this program will fix it and count them the same! See the [http://unicode.org/reports/tr15/ Unicode Normal Forms] specification for more about these normalization rules. Frink implements all of them (NFC, NFD, NFKC, NFKD). NFC is the default in <CODE>normalizeUnicode[''str'', ''encoding=NFC'']</CODE>. They're interesting! (A short sketch illustrating this point appears after this list.)
 
 
How many other languages on this page do all or any of this correctly?
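The normalization point above is easy to reproduce outside Frink. Here is a minimal Python sketch of the ô example (an illustration only, not Frink code), using the standard <CODE>unicodedata</CODE> module:

<syntaxhighlight lang="python">import unicodedata

precomposed = "\u00F4"       # LATIN SMALL LETTER O WITH CIRCUMFLEX
decomposed = "\u006F\u0302"  # 'o' followed by COMBINING CIRCUMFLEX ACCENT

# Same glyph, different codepoints: a naive word counter sees two words
print(precomposed == decomposed)  # False

# After NFC normalization both spellings compare equal
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))  # True</syntaxhighlight>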
 
There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:
<syntaxhighlight lang="frink">d = new dict
for w = select[wordList[read[normalizeUnicode["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
d.increment[lc[w], 1]
 
println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]</syntaxhighlight>
 
{{out}}
<pre>
[the, 40802]
[of, 19933]
[and, 14924]
[a, 14450]
[to, 13719]
[in, 11184]
[he, 9636]
[was, 8617]
[that, 7901]
[it, 6641]
</pre>
 
Next, a "showing off" one-liner that works in recent versions of Frink that uses the <CODE>countToArray</CODE> function which easily creates sorted frequency lists and the <CODE>formatTable</CODE> function that formats into a nice table with columns lined up, and still performs full Unicode-aware normalization, capitalization, and word-breaking:
 
<syntaxhighlight lang="frink">formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]</syntaxhighlight>
 
{{out}}
<pre>
the 36629
of 19602
and 14063
a 13447
to 13345
in 10259
was 8541
that 7303
he 6812
had 6133
</pre>
 
=={{header|FutureBasic}}==
Task said: "Feel free to explicitly state the thoughts behind the program decisions." Thus the heavy comments.
<syntaxhighlight lang="futurebasic">
include "NSLog.incl"
 
local fn WordFrequency( textStr as CFStringRef, caseSensitive as Boolean, ascendingOrder as Boolean ) as CFStringRef
'~'1
CFStringRef wrd
CFDictionaryRef dict
 
// Depending on the value of the caseSensitive Boolean function parameter above, lowercase incoming text
if caseSensitive == NO then textStr = fn StringLowercaseString( textStr )
 
// Trim non-alphabetic characters from string, and separate individual words with a space
CFStringRef tempStr = fn ArrayComponentsJoinedByString( fn StringComponentsSeparatedByCharactersInSet( textStr, fn CharacterSetInvertedSet( fn CharacterSetLetterSet ) ), @" " )
 
// Prepare separators to parse string into array
CFMutableCharacterSetRef separators = fn MutableCharacterSetInit
 
// Informally, this set is the set of all non-whitespace characters used to separate linguistic units in scripts, such as periods, dashes, parentheses, and so on.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetPunctuationSet )
 
// A character set containing all the whitespace and newline characters including characters in Unicode General Category Z*, U+000A U+000D, and U+0085.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetWhitespaceAndNewlineSet )
 
// Create array of separated words
CFArrayRef tempArr = fn StringComponentsSeparatedByCharactersInSet( tempStr, separators )
 
// Create a counted set with each word and its frequency
CountedSetRef freqencies = fn CountedSetWithArray( tempArr )
 
// Enumerate each word-frequency pair in the counted set...
EnumeratorRef enumRef = fn CountedSetObjectEnumerator( freqencies )
 
// .. and use it to create array of words in counted set
CFArrayRef array = fn EnumeratorAllObjects( enumRef )
 
// Create an empty mutable array
CFMutableArrayRef wordArr = fn MutableArrayWithCapacity( 0 )
 
// Create word counter
NSInteger totalWords = 0
// Enumerate each unique word, get its frequency, create its own key/value pair dictionary, add each dictionary into master array
for wrd in array
totalWords++
// Create dictionary with frequency and matching word
dict = @{ @"count":fn NumberWithUnsignedInteger( fn CountedSetCountForObject( freqencies, wrd ) ), @"object":wrd }
// Add each dictionary to the master mutable array, checking for a valid word by length
if ( fn StringLength( wrd ) != 0 )
MutableArrayAddObject( wordArr, dict )
end if
next
 
// Store the total words as a global application property
AppSetProperty( @"totalWords", fn StringWithFormat( @"%d", totalWords - 1 ) )
 
// Sort the array in ascending or descending order as determined by the ascendingOrder Boolean function input parameter
SortDescriptorRef descriptors = fn SortDescriptorWithKey( @"count", ascendingOrder )
CFArrayRef sortedArray = fn ArraySortedArrayUsingDescriptors( wordArr, @[descriptors] )
 
// Create an empty mutable string
CFMutableStringRef mutStr = fn MutableStringWithCapacity( 0 )
 
// Use each dictionary in sorted array to build the formatted output string
NSInteger count = 1
for dict in sortedArray
MutableStringAppendString( mutStr, fn StringWithFormat( @"%-7d %-7lu %@\n", count, fn StringIntegerValue( fn DictionaryValueForKey( dict, @"count" ) ), fn DictionaryValueForKey( dict, @"object" ) ) )
count++
next
 
// Create an immutable output string from mutable the string
CFStringRef resultStr = fn StringWithFormat( @"%@", mutStr )
end fn = resultStr
 
 
local fn ParseTextFromWebsite( webSite as CFStringRef )
// Convert incoming string to URL
CFURLRef textURL = fn URLWithString( webSite )
// Read contents of URL into a string
CFStringRef textStr = fn StringWithContentsOfURL( textURL, NSUTF8StringEncoding, NULL )
 
// Start timer
CFAbsoluteTime startTime = fn CFAbsoluteTimeGetCurrent
// Calculate frequency of words in text and sort by occurrence
CFStringRef frequencyStr = fn WordFrequency( textStr, NO, NO )
// Log results
NSLogClear
NSLog( @"%@", frequencyStr )
NSLog( @"Total unique words in document: %@", fn AppProperty( @"totalWords" ) )
// Stop timer and log elapsed processing time
NSLog( @"Elapsed time: %f milliseconds.", ( fn CFAbsoluteTimeGetCurrent - startTime ) * 1000.0 )
end fn
 
dispatchglobal
// Pass url for Les Misérables on Project Gutenberg and parse in background
fn ParseTextFromWebsite( @"https://www.gutenberg.org/files/135/135-0.txt" )
dispatchend
 
HandleEvents
</syntaxhighlight>
{{output}}
<pre>
1 41095 the
2 19955 of
3 14939 and
4 14546 a
5 13954 to
6 11218 in
7 9649 he
8 8622 was
9 7924 that
10 6661 it
11 6470 his
12 6193 is
 
//-------------------
 
22900 1 millstones
22901 1 fumbles
22902 1 shunned
22903 1 avoids
22904 1 poitevin
22905 1 muleteer
22906 1 idolizes
22907 1 lapsed
22908 1 reptitalmus
22909 1 bled
22910 1 isabella
 
Total unique words in document: 22910
Elapsed time: 595.407963 milliseconds.
</pre>
 
=={{header|Go}}==
{{trans|Kotlin}}
<syntaxhighlight lang="go">package main
 
import (
"fmt"
"io/ioutil"
"log"
"regexp"
"sort"
"strings"
)
 
type keyval struct {
key string
val int
}
 
func main() {
reg := regexp.MustCompile(`\p{Ll}+`)
bs, err := ioutil.ReadFile("135-0.txt")
if err != nil {
log.Fatal(err)
}
text := strings.ToLower(string(bs))
matches := reg.FindAllString(text, -1)
groups := make(map[string]int)
for _, match := range matches {
groups[match]++
}
var keyvals []keyval
for k, v := range groups {
keyvals = append(keyvals, keyval{k, v})
}
sort.Slice(keyvals, func(i, j int) bool {
return keyvals[i].val > keyvals[j].val
})
fmt.Println("Rank Word Frequency")
fmt.Println("==== ==== =========")
for rank := 1; rank <= 10; rank++ {
word := keyvals[rank-1].key
freq := keyvals[rank-1].val
fmt.Printf("%2d %-4s %5d\n", rank, word, freq)
}
}</syntaxhighlight>
 
{{out}}
<pre>
Rank Word Frequency
==== ==== =========
1 the 41088
2 of 19949
3 and 14942
4 a 14596
5 to 13951
6 in 11214
7 he 9648
8 was 8621
9 that 7924
10 it 6661
</pre>
 
=={{header|Groovy}}==
Solution:
<syntaxhighlight lang="groovy">def topWordCounts = { String content, int n ->
def mapCounts = [:]
content.toLowerCase().split(/\W+/).each {
mapCounts[it] = (mapCounts[it] ?: 0) + 1
}
def top = (mapCounts.sort { a, b -> b.value <=> a.value }.collect{ it })[0..<n]
println "Rank Word Frequency\n==== ==== ========="
(0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
}</syntaxhighlight>
 
Test:
<syntaxhighlight lang="groovy">def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)</syntaxhighlight>
 
Output:
<pre>Rank Word Frequency
==== ==== =========
1 the 41036
2 of 19946
3 and 14940
4 a 14589
5 to 13939
6 in 11204
7 he 9645
8 was 8619
9 that 7922
10 it 6659</pre>
 
=={{header|Haskell}}==
===Lazy IO with pure Map, arrows===
{{trans|Clojure}}
<syntaxhighlight lang="haskell">module Main where
 
import Control.Category -- (>>>)
import Control.Monad -- when
import Data.Char -- toLower, isSpace
import Data.List -- sortBy, (Foldable(foldl')), filter -- '
import Data.Ord -- Down
import System.IO -- stdin, ReadMode, openFile, hClose
import System.Environment -- getArgs
 
-- containers
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import qualified Data.IntMap.Strict as IM
 
-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
 
frequencies :: Ord a => [a] -> Map a Integer
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty -- '
{-# SPECIALIZE frequencies :: [Text] -> Map Text Integer #-}
 
main :: IO ()
main = do
args <- getArgs
(n,hand,filep) <- case length args of
0 -> return (10,stdin,False)
1 -> return (read $ head args,stdin,False)
_ -> let (ns:fp:_) = args
in fmap (\h -> (read ns,h,True)) (openFile fp ReadMode)
T.hGetContents hand >>=
(T.map toLower
>>> T.split isSpace
>>> filter (not <<< T.null)
>>> frequencies
>>> M.toList
>>> sortBy (comparing (Down <<< snd)) -- sort the opposite way
>>> take n
>>> print)
when filep (hClose hand)</syntaxhighlight>
{{Out}}
<pre>
$ ./word_count 10 < ~/doc/les_miserables*
[("the",40368),("of",19863),("and",14470),("a",14277),("to",13587),("in",11019),("he",9212),("was",8346),("that",7251),("his",6414)]
</pre>
 
===Lazy IO, map of IORefs===
Using IORefs as values in the map seems to give a ~2x speedup on large files. The below code is based on https://github.com/composewell/streamly-examples/blob/master/examples/WordFrequency.hs , but still using lazy IO to avoid the extra library dependency (in production you should [https://stackoverflow.com/questions/5892653/whats-so-bad-about-lazy-i-o use a streaming library] like streamly/conduit/io-streams):
<syntaxhighlight lang="haskell">
module Main where
 
import Control.Monad (foldM, when)
import Data.Char (isSpace, toLower)
import Data.List (sortOn, filter)
import Data.Ord (Down(..))
import System.IO (stdin, IOMode(..), openFile, hClose)
import System.Environment (getArgs)
import Data.IORef (IORef(..), newIORef, readIORef, modifyIORef') -- '
 
-- containers
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as M
 
-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
 
frequencies :: [Text] -> IO (HashMap Text (IORef Int))
frequencies = foldM (flip (M.alterF alter)) M.empty
where
alter Nothing = Just <$> newIORef (1 :: Int)
alter (Just ref) = modifyIORef' ref (+ 1) >> return (Just ref) -- '
 
main :: IO ()
main = do
args <- getArgs
when (length args /= 1) (error "expecting 1 arg (number of words to print)")
let maxw = read $ head args -- no error handling, to simplify the example
T.hGetContents stdin >>= \contents -> do
freqtable <- frequencies $ filter (not . T.null) $ T.split isSpace $ T.map toLower contents
counts <-
let readRef (w, ref) = do
cnt <- readIORef ref
return (w, cnt)
in mapM readRef $ M.toList freqtable
print $ take maxw $ sortOn (Down . snd) counts
</syntaxhighlight>
{{Out}}
<pre>
$ ./word_count 10 < ~/doc/les_miserables*
[("the",40378),("of",19869),("and",14468),("a",14278),("to",13590),("in",11025),("he",9213),("was",8347),("that",7249),("his",6414)]
</pre>
 
===Lazy IO, short code, but not streaming===
Or, perhaps a little more simply, though not streaming (will read everything into memory, don't use on big files):
<syntaxhighlight lang="haskell">import qualified Data.Text.IO as T
import qualified Data.Text as T
 
import Data.List (group, sort, sortBy)
import Data.Ord (comparing)
 
frequentWords :: T.Text -> [(Int, T.Text)]
frequentWords =
sortBy (flip $ comparing fst) .
fmap ((,) . length <*> head) . group . sort . T.words . T.toLower
 
main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)</syntaxhighlight>
{{Out}}
<pre>(40370,"the")
(19863,"of")
(14470,"and")
(14277,"a")
(13587,"to")
(11019,"in")
(9212,"he")
(8346,"was")
(7251,"that")
(6414,"his")</pre>
 
=={{header|J}}==
Text acquisition: store the entire text from the web page http://www.gutenberg.org/files/135/135-0.txt (the plain text UTF-8 link) into a file. This linux example uses ~/downloads/books/LesMis.txt .
 
Program:
Reading from left to right,
10 {. "ten take" from an array computed by words to the right.
\:~ "sort descending" by items of the array computed by whatever is to the right.
(#;{.)/.~ "tally linked with item" key
<nowiki>;: "words" parses the argument to its right as a j sentence.</nowiki>
tolower changes to a common case
 
Hence the remainder of the j sentence must clean after loading the file.
 
<nowiki>The parenthesized expression (a.-.Alpha_j_,' ') computes to a vector of the j alphabet excluding [a-zA-Z ]</nowiki>
<nowiki>((e.&(a.-.Alpha_j_,' '))`(,:&' '))} substitutes space character for the unwanted characters.</nowiki>
<nowiki>1!:1 reads the file named in the box <</nowiki>
 
<pre>
10{.\:~(#;{.)/.~;:tolower((e.&(a.-.Alpha_j_,' '))`(,:&' '))}1!:1<jpath'~/downloads/books/LesMis.txt'
┌─────┬────┐
│41093│the │
├─────┼────┤
│19954│of │
├─────┼────┤
│14943│and │
├─────┼────┤
│14558│a │
├─────┼────┤
│13953│to │
├─────┼────┤
│11219│in │
├─────┼────┤
│9649 │he │
├─────┼────┤
│8622 │was │
├─────┼────┤
│7924 │that│
├─────┼────┤
│6661 │it │
└─────┴────┘
</pre>
 
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to examine the text line-for-line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold to values.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
 
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
Pattern pattern = Pattern.compile("(\\w+)");
Matcher matcher;
String line;
String word;
Map<String, Integer> map = new HashMap<>();
while ((line = reader.readLine()) != null) {
matcher = pattern.matcher(line);
while (matcher.find()) {
word = matcher.group().toLowerCase();
if (map.containsKey(word)) {
map.put(word, map.get(word) + 1);
} else {
map.put(word, 1);
}
}
}
/* print out top 10 */
List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
list.sort(Map.Entry.comparingByValue());
Collections.reverse(list);
int count = 1;
for (Map.Entry<String, Integer> value : list) {
System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
if (count++ == 10) break;
}
}
}
</syntaxhighlight>
<pre>
the 41,043
of 19,952
and 14,938
a 14,539
to 13,942
in 11,208
he 9,646
was 8,620
that 7,922
it 6,659
</pre>
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
 
public class WordCount {
public static void main(String[] args) throws IOException {
Path path = Paths.get("135-0.txt");
byte[] bytes = Files.readAllBytes(path);
String text = new String(bytes);
text = text.toLowerCase();
 
Pattern r = Pattern.compile("\\p{javaLowerCase}+");
Matcher matcher = r.matcher(text);
Map<String, Integer> freq = new HashMap<>();
while (matcher.find()) {
String word = matcher.group();
Integer current = freq.getOrDefault(word, 0);
freq.put(word, current + 1);
}
 
List<Map.Entry<String, Integer>> entries = freq.entrySet()
.stream()
.sorted((i1, i2) -> Integer.compare(i2.getValue(), i1.getValue()))
.limit(10)
.collect(Collectors.toList());
 
System.out.println("Rank Word Frequency");
System.out.println("==== ==== =========");
int rank = 1;
for (Map.Entry<String, Integer> entry : entries) {
String word = entry.getKey();
Integer count = entry.getValue();
System.out.printf("%2d %-4s %5d\n", rank++, word, count);
}
}
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ==== =========
1 the 41088
2 of 19949
3 and 14942
4 a 14596
5 to 13951
6 in 11214
7 he 9648
8 was 8621
9 that 7924
10 it 6661</pre>
 
=={{header|jq}}==
The following solution uses the concept of a "bag of words" (bow), here realized as a JSON object
with the words as keys and the frequency of a word as the corresponding value.
 
To avoid issues with case folding, the "letters" here are just the unaccented alphabet and the hyphen, but a "word"
may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.
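As a quick cross-check of this word test outside jq, the same pattern can be exercised in Python (an illustration only; the jq filter itself follows):

<syntaxhighlight lang="python">import re

# Mirrors the select(test("^[a-z][-a-z]*$")) step in the jq program below
word = re.compile(r"^[a-z][-a-z]*$")

for candidate in ["the-the", "-the", "well-dressed", "it's"]:
    print(candidate, bool(word.match(candidate)))
# the-the True
# -the False
# well-dressed True
# it's False</syntaxhighlight>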
 
<syntaxhighlight lang="jq">
< 135-0.txt jq -nR --argjson n 10 '
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
 
bow(inputs | gsub("[^-a-zA-Z]"; " ") | splits(" *") | ascii_downcase | select(test("^[a-z][-a-z]*$")))
| to_entries
| sort_by(.value)
| .[- $n :]
| reverse
| from_entries
'
</syntaxhighlight>
====Output====
<syntaxhighlight lang="jq">
{
"the": 41087,
"of": 19937,
"and": 14932,
"a": 14552,
"to": 13738,
"in": 11209,
"he": 9649,
"was": 8621,
"that": 7923,
"it": 6661
}
</syntaxhighlight>
 
=={{header|Julia}}==
{{works with|Julia|1.0}}
<syntaxhighlight lang="julia">
using FreqTables
 
txt = read("les-mis.txt", String)
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])</syntaxhighlight>
 
{{out}}
<pre>
10-element Named Array{Int64,1}
⋮
"he" │ 6816
"had" │ 6140</pre>
 
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
("of";19898)
("and";14658)
(,"a";14517)
("to";13695)
("in";11134)
("he";9405)
("was";8361)
("that";7592)
("his";6446))</syntaxhighlight>
 
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
 
=={{header|KAP}}==
The below program defines the function 'stats' which accepts a filename containing the text.
 
<syntaxhighlight lang="kap">∇ stats (file) {
content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
sorted ← (⍋⊇⊢) content
selection ← 1,2≢/sorted
words ← selection / sorted
{⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
}</syntaxhighlight>
{{out}}
<pre>┏━━━━━━━━━━━━┓
┃ "the" 40387┃
┃ "of" 19913┃
┃ "and" 14742┃
┃ "a" 14289┃
┃ "to" 13819┃
┃ "in" 11088┃
┃ "he" 9430┃
┃ "was" 8597┃
┃"that" 7516┃
┃ "his" 6435┃
┗━━━━━━━━━━━━┛</pre>
 
=={{header|Kotlin}}==
The author of the Raku entry has given a good account of the difficulties with this task and, in the absence of any clarification on the various issues, I've followed a similar 'literal' approach.
 
So, after first converting the text to lower case, I've assumed that a word is any sequence of one or more lower-case Unicode letters and obtained the same results as the Raku version.
 
There is no change in the results if the numerals 0-9 are also regarded as letters.
<langsyntaxhighlight lang="scala">// version 1.1.3
 
import java.io.File

fun main(args: Array<String>) {
    val text = File("135-0.txt").readText().toLowerCase()
    val r = Regex("""\p{javaLowerCase}+""")
    val matches = r.findAll(text)
    val wordGroups = matches.map { it.value }
                            .groupingBy { it }
                            .eachCount()
                            .toList()
                            .sortedByDescending { it.second }
                            .take(10)
    println("Rank Word Frequency")
    println("==== ==== =========")
    var rank = 1
for ((word, freq) in wordGroups)
System.out.printf("%2d %-4s %5d\n", rank++, word, freq)
}</syntaxhighlight>
 
{{out}}
<pre>
Rank Word Frequency
==== ==== =========
 1   the  41088
 2   of   19949
 3   and  14942
 4   a    14596
 5   to   13951
 6   in   11214
 7   he    9648
 8   was   8621
 9   that  7924
10   it    6661
</pre>
 
=={{header|Liberty BASIC}}==
<syntaxhighlight lang="lb">dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
{{works with|Rakudo|2017.07}}
dim lines$(150000)
open "135-0.txt" for input as #txt
while EOF(#txt)=0 and total < 150000
input #txt, lines$(total)
total=total+1
wend
for a = 1 to total
token$ = "?"
index=0
new=0
while token$ <> ""
new=0
index = index + 1
token$ = lower$(word$(lines$(a),index))
token$=replstr$(token$,".","")
token$=replstr$(token$,",","")
token$=replstr$(token$,";","")
token$=replstr$(token$,"!","")
token$=replstr$(token$,"?","")
token$=replstr$(token$,"-","")
token$=replstr$(token$,"_","")
token$=replstr$(token$,"~","")
token$=replstr$(token$,"+","")
token$=replstr$(token$,"0","")
token$=replstr$(token$,"1","")
token$=replstr$(token$,"2","")
token$=replstr$(token$,"3","")
token$=replstr$(token$,"4","")
token$=replstr$(token$,"5","")
token$=replstr$(token$,"6","")
token$=replstr$(token$,"7","")
token$=replstr$(token$,"8","")
token$=replstr$(token$,"9","")
token$=replstr$(token$,"/","")
token$=replstr$(token$,"<","")
token$=replstr$(token$,">","")
token$=replstr$(token$,":","")
for b = 1 to newwordcount
if words$(b,1)=token$ then
num=val(words$(b,2))+1
num$=str$(num)
if len(num$)=1 then num$="0000"+num$
if len(num$)=2 then num$="000"+num$
if len(num$)=3 then num$="00"+num$
if len(num$)=4 then num$="0"+num$
words$(b,2)=num$
new=1
exit for
end if
next b
if new<>1 then newwordcount=newwordcount+1:words$(newwordcount,1)=token$:words$(newwordcount,2)="00001":print newwordcount;" ";token$
wend
next a
print
sort words$(), 1, newwordcount, 2
print "Count Word"
print "===== ================="
for a = newwordcount to newwordcount-10 step -1
print words$(a,2);" ";words$(a,1)
next a
print "-----------------------"
print newwordcount;" unique words found."
print "End of program"
close #txt
end
</syntaxhighlight>
{{out}}
<pre>Count Word
===== =================
40292 the
19825 of
14703 and
14249 a
13594 to
122613
11061 in
09436 he
08579 was
07530 that
06428 his
-----------------------
29109 unique words found.
</pre>
 
=={{header|Lua}}==
{{works with|lua|5.3}}
<syntaxhighlight lang="lua">
-- This program takes two optional command line arguments. The first (arg[1])
-- specifies the input file, or defaults to standard input. The second
-- (arg[2]) specifies the number of results to show, or defaults to 10.
 
-- in freq, each key is a word and each value is its count
local freq = {}
for line in io.lines(arg[1]) do
-- %a stands for any letter
for word in string.gmatch(string.lower(line), "%a+") do
if not freq[word] then
freq[word] = 1
else
freq[word] = freq[word] + 1
end
end
end
 
-- in array, each entry is an array whose first value is the count and whose
-- second value is the word
local array = {}
for word, count in pairs(freq) do
table.insert(array, {count, word})
end
table.sort(array, function (a, b) return a[1] > b[1] end)
 
for i = 1, arg[2] or 10 do
io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
end
</syntaxhighlight>
 
{{Out}}
<pre>
❯ ./wordcount.lua 135-0.txt
41093 the
19954 of
14943 and
14558 a
13953 to
11219 in
9649 he
8622 was
7924 that
6661 it
</pre>
 
Relevant documentation:
[https://www.lua.org/manual/5.3/manual.html#pdf-io.lines io.lines]
[https://www.lua.org/manual/5.3/manual.html#pdf-string.gmatch gmatch]
[https://www.lua.org/manual/5.3/manual.html#6.4.1 patterns like %a]
 
=={{header|Mathematica}} / {{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset</syntaxhighlight>
{{out}}
<pre>
the 41088
of 19936
and 14931
a 14536
to 13738
in 11208
he 9607
was 8621
that 7825
it 6535
</pre>
 
=={{header|MATLAB}} / {{header|Octave}}==
<syntaxhighlight lang="matlab">
function [result,count] = word_frequency()
  URL='https://www.gutenberg.org/files/135/135-0.txt';
  text=webread(URL);
  DELIMITER={' ', ',', ';', ':', '.', '/', '*', '!', '?', '<', '>', '(', ')', '[', ']','{', '}', '&','$','§','"','”','“','-','—','‘','\t','\n','\r'};
  words = sort(strsplit(lower(text),DELIMITER));
  flag = [find(~strcmp(words(1:end-1),words(2:end))),length(words)];
  dwords = words(flag);     % get distinct words, and ...
  count = diff([0,flag]);   % ... the corresponding occurrence frequency
  [tmp,idx] = sort(-count); % sort according to occurrence
  result = dwords(idx);
  count = count(idx);
  for k = 1:10,
    fprintf(1,'%d\t%s\n',count(k),result{k})
  end
</syntaxhighlight>
 
{{out}}
<pre>
41039 the
19950 of
14942 and
14523 a
13941 to
11208 in
9605 he
8620 was
7824 that
6533 it
</pre>
 
=={{header|Nim}}==
Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
<syntaxhighlight lang="nim">import tables, strutils, sequtils, httpclient
the 41081
 
of 19930
proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]
and 14934
 
a 14587
var client = newHttpClient()
to 13735
var text = client.getContent("https://www.gutenberg.org/files/135/135-0.txt")
in 11204
 
he 9607
var wordFrequencies = text.toLowerAscii.splitWhitespace.toCountTable
was 8620
wordFrequencies.sort
that 7825
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
it 6535</pre>
echo alignLeft($count, 8), word</syntaxhighlight>
{{out}}
<pre>40377 the
19870 of
14469 and
14278 a
13590 to
11025 in
9213 he
8347 was
7249 that
6414 his</pre>
 
=={{header|Objeck}}==
<syntaxhighlight lang="objeck">use System.IO.File;
use Collection;
use RegEx;
 
class Rosetta {
function : Main(args : String[]) ~ Nil {
if(args->Size() <> 1) {
return;
};
input := FileReader->ReadFile(args[0]);
filter := RegEx->New("\\w+");
words := filter->Find(input);
word_counts := StringMap->New();
each(i : words) {
word := words->Get(i)->As(String);
if(word <> Nil & word->Size() > 0) {
word := word->ToLower();
if(word_counts->Has(word)) {
count := word_counts->Find(word)->As(IntHolder);
count->Set(count->Get() + 1);
}
else {
word_counts->Insert(word, IntHolder->New(1));
};
};
};
count_words := IntMap->New();
words := word_counts->GetKeys();
each(i : words) {
word := words->Get(i)->As(String);
count := word_counts->Find(word)->As(IntHolder);
count_words->Insert(count->Get(), word);
};
counts := count_words->GetKeys();
counts->Sort();
index := 1;
"Rank\tWord\tFrequency"->PrintLine();
"====\t====\t===="->PrintLine();
for(i := count_words->Size() - 1; i >= 0; i -= 1;) {
if(count_words->Size() - 10 <= i) {
count := counts->Get(i);
word := count_words->Find(count)->As(String);
"{$index}\t{$word}\t{$count}"->PrintLine();
index += 1;
};
};
}
}</syntaxhighlight>
 
Output:
<pre>
Rank Word Frequency
==== ==== ====
1 the 41036
2 of 19946
3 and 14940
4 a 14589
5 to 13939
6 in 11204
7 he 9645
8 was 8619
9 that 7922
10 it 6659
</pre>
 
=={{header|OCaml}}==
 
<syntaxhighlight lang="ocaml">let () =
let n =
try int_of_string Sys.argv.(1)
with _ -> 10
in
let ic = open_in "135-0.txt" in
let h = Hashtbl.create 97 in
let w = Str.regexp "[^A-Za-zéèàêâôîûœ]+" in
try
while true do
let line = input_line ic in
let words = Str.split w line in
List.iter (fun word ->
let word = String.lowercase_ascii word in
match Hashtbl.find_opt h word with
| None -> Hashtbl.add h word 1
| Some x -> Hashtbl.replace h word (succ x)
) words
done
with End_of_file ->
close_in ic;
let l = Hashtbl.fold (fun word count acc -> (word, count)::acc) h [] in
let s = List.sort (fun (_, c1) (_, c2) -> compare c2 c1) l in
let r = List.init n (fun i -> List.nth s i) in
List.iter (fun (word, count) ->
Printf.printf "%d %s\n" count word
) r</syntaxhighlight>
 
{{out}}
<pre>
$ ocaml str.cma word_freq.ml
41092 the
19954 of
14943 and
14554 a
13953 to
11219 in
9649 he
8622 was
7924 that
6661 it
</pre>
 
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang="perl">use strict;
use warnings;
use utf8;
 
my $top = 10;
 
open my $fh, '<', 'ref/word-count.txt';
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;
 
my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
qr/[a-z0-9]+/, # word characters without underscore
);
 
for my $reg (@matcher) {
print "\nTop $top using regex: " . $reg\n";
my @matches = $text =~ /$reg/g;
my %words;
for my $w (@matches) { $words{$w}++ };
my $c = 0;
for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
printf "%-7s %6d\n", $w, $words{$w};
last if ++$c >= $top;
}
}</syntaxhighlight>
 
{{out}}
<pre>
Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
and 14942
a 14608
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
 
Top 10 using regex: (?^:\w+)
the 41036
of 19946
and 14940
a 14589
to 13939
in 11204
he 9645
was 8619
that 7922
it 6659
 
Top 10 using regex: (?^:[a-z0-9]+)
the 41089
of 19949
and 14942
a 14608
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
</pre>
 
=={{header|Phix}}==
<!--<syntaxhighlight lang="phix">(notonline)-->
<lang Phix>?"loading..."
<span style="color: #008080;">without</span> <span style="color: #008080;">javascript_semantics</span>
constant subs = "\t\r\n_.,\"\'!;:?][()|=<>#/*{}+@%&$",
<span style="color: #0000FF;">?</span><span style="color: #008000;">"loading..."</span>
reps = repeat(' ',length(subs)),
<span style="color: #008080;">constant</span> <span style="color: #000000;">subs</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">'\t'</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"\r\n_.,\"\'!;:?][()|=&lt;&gt;#/*{}+@%&$"</span><span style="color: #0000FF;">,</span>
fn = open("135-0.txt","r")
<span style="color: #000000;">reps</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">repeat</span><span style="color: #0000FF;">(</span><span style="color: #008000;">' '</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">subs</span><span style="color: #0000FF;">)),</span>
string text = lower(substitute_all(get_text(fn),subs,reps))
<span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"135-0.txt"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"r"</span><span style="color: #0000FF;">)</span>
close(fn)
<span style="color: #004080;">string</span> <span style="color: #000000;">text</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">lower</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">substitute_all</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">),</span><span style="color: #000000;">subs</span><span style="color: #0000FF;">,</span><span style="color: #000000;">reps</span><span style="color: #0000FF;">))</span>
sequence words = append(sort(split(text,no_empty:=true)),"")
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
constant wf = new_dict()
<span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">sort</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">split</span><span style="color: #0000FF;">(</span><span style="color: #000000;">text</span><span style="color: #0000FF;">,</span><span style="color: #000000;">no_empty</span><span style="color: #0000FF;">:=</span><span style="color: #004600;">true</span><span style="color: #0000FF;">)),</span><span style="color: #008000;">""</span><span style="color: #0000FF;">)</span>
string last = words[1]
<span style="color: #008080;">constant</span> <span style="color: #000000;">wf</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">new_dict</span><span style="color: #0000FF;">()</span>
integer count = 1
<span style="color: #004080;">string</span> <span style="color: #000000;">last</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span>
for i=2 to length(words) do
<span style="color: #004080;">integer</span> <span style="color: #000000;">count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span>
if words[i]!=last then
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">2</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
setd({count,last},0,wf)
<span style="color: #008080;">if</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]!=</span><span style="color: #000000;">last</span> <span style="color: #008080;">then</span>
count = 0
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">({</span><span style="color: #000000;">count</span><span style="color: #0000FF;">,</span><span style="color: #000000;">last</span><span style="color: #0000FF;">},</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #000000;">wf</span><span style="color: #0000FF;">)</span>
last = words[i]
<span style="color: #000000;">count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span>
end if
<span style="color: #000000;">last</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
count += 1
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
end for
<span style="color: #000000;">count</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
count = 10
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
function visitor(object key, object /*data*/, object /*user_data*/)
<span style="color: #000000;">count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">10</span>
?key
<span style="color: #008080;">function</span> <span style="color: #000000;">visitor</span><span style="color: #0000FF;">(</span><span style="color: #004080;">object</span> <span style="color: #000000;">key</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">object</span> <span style="color: #000080;font-style:italic;">/*data*/</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">object</span> <span style="color: #000080;font-style:italic;">/*user_data*/</span><span style="color: #0000FF;">)</span>
count -= 1
<span style="color: #0000FF;">?</span><span style="color: #000000;">key</span>
return count>0
<span style="color: #000000;">count</span> <span style="color: #0000FF;">-=</span> <span style="color: #000000;">1</span>
end function
<span style="color: #008080;">return</span> <span style="color: #000000;">count</span><span style="color: #0000FF;">></span><span style="color: #000000;">0</span>
traverse_dict(routine_id("visitor"),0,wf,true)</lang>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #7060A8;">traverse_dict</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">routine_id</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"visitor"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">0</span><span style="color: #0000FF;">,</span><span style="color: #000000;">wf</span><span style="color: #0000FF;">,</span><span style="color: #004600;">true</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
{{out}}
<pre>
{6612,"it"}
</pre>
 
=={{header|Phixmonti}}==
<syntaxhighlight lang="phixmonti">include ..\Utilitys.pmt
 
"loading..." ?
"135-0.txt" "r" fopen var fn
" "
true
while
fn fgets number? if drop fn fclose false else lower " " chain chain true endif
endwhile
 
"process..." ?
len for
var i
i get dup 96 > swap 123 < and not if 32 i set endif
endfor
split sort
 
"count..." ?
( ) var words
"" var prev
1 var n
len for
var i
i get dup prev ==
if
drop n 1 + var n
else
words ( n prev ) 0 put var words var prev 1 var n
endif
endfor
drop
words sort
10 for
-1 * get ?
endfor
drop</syntaxhighlight>
{{out}}
<pre>loading...
process...
count...
[41093, "the"]
[19954, "of"]
[14943, "and"]
[14558, "a"]
[13953, "to"]
[11219, "in"]
[9649, "he"]
[8622, "was"]
[7924, "that"]
[6661, "it"]
 
=== Press any key to exit ===</pre>
 
=={{header|PHP}}==
<syntaxhighlight lang="php">
<?php
 
preg_match_all('/\w+/', file_get_contents($argv[1]), $words);
$frecuency = array_count_values($words[0]);
arsort($frecuency);
 
echo "Rank\tWord\tFrequency\n====\t====\t=========\n";
$i = 1;
foreach ($frecuency as $word => $count) {
echo $i . "\t" . $word . "\t" . $count . "\n";
if ($i >= 10) {
break;
}
$i++;
}</syntaxhighlight>
{{out}}
<pre>
Rank Word Frequency
==== ==== =========
1 the 36636
2 of 19615
3 and 14079
4 to 13535
5 a 13527
6 in 10256
7 was 8543
8 that 7324
9 he 6814
10 had 6139
</pre>
 
=={{header|Picat}}==
To get the book proper, the header and footer are removed. Here are some tests with different sets of characters to split the words (<code>split_chars/2</code>).
<syntaxhighlight lang="picat">main =>
NTop = 10,
File = "les_miserables.txt",
Chars = read_file_chars(File),
 
% Remove the Project Gutenberg header/footer
find(Chars,"*** START OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",_,HeaderEnd),
find(Chars,"*** END OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",FooterStart,_),
 
Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],
 
% Split into words (different set of split characters)
member(SplitType,[all,space_punct,space]),
println(split_type=SplitType),
split_chars(SplitType,SplitChars),
Words = split(Book,SplitChars),
 
println(freq(Words).to_list.sort_down(2).take(NTop)),
nl,
fail.
 
freq(L) = Freq =>
Freq = new_map(),
foreach(E in L)
Freq.put(E,Freq.get(E,0)+1)
end.
 
% different set of split chars
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").</syntaxhighlight>
 
{{out}}
<pre>split_type = all
[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]
 
split_type = space_punct
[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]
 
split_type = space
[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]</pre>
 
It is a slightly different result if the the header/footer are not removed:
<pre>split_type = all
[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]
 
split_type = space_punct
[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]
 
split_type = space
[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]</pre>
 
 
=={{header|PicoLisp}}==
<syntaxhighlight lang="picolisp">(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))
 
Line 267 ⟶ 3,672:
(if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
(println L (val L)) )</syntaxhighlight>
{{out}}
<pre>
Line 280 ⟶ 3,685:
"that" 7924
"it" 6661
</pre>
 
=={{header|Prolog}}==
{{works with|SWI Prolog}}
<syntaxhighlight lang="prolog">print_top_words(File, N):-
read_file_to_string(File, String, [encoding(utf8)]),
re_split("\\w+", String, Words),
lower_case(Words, Lower),
sort(1, @=<, Lower, Sorted),
merge_words(Sorted, Counted),
sort(2, @>, Counted, Top_words),
writef("Top %w words:\nRank\tCount\tWord\n", [N]),
print_top_words(Top_words, N, 1).
 
lower_case([_], []):-!.
lower_case([_, Word|Words], [Lower - 1|Rest]):-
string_lower(Word, Lower),
lower_case(Words, Rest).
 
merge_words([], []):-!.
merge_words([Word - C1, Word - C2|Words], Result):-
!,
C is C1 + C2,
merge_words([Word - C|Words], Result).
merge_words([W|Words], [W|Rest]):-
merge_words(Words, Rest).
 
print_top_words([], _, _):-!.
print_top_words(_, 0, _):-!.
print_top_words([Word - Count|Rest], N, R):-
writef("%w\t%w\t%w\n", [R, Count, Word]),
N1 is N - 1,
R1 is R + 1,
print_top_words(Rest, N1, R1).
 
main:-
print_top_words("135-0.txt", 10).</syntaxhighlight>
 
{{out}}
<pre>
Top 10 words:
Rank Count Word
1 41040 the
2 19951 of
3 14942 and
4 14539 a
5 13941 to
6 11209 in
7 9646 he
8 8620 was
9 7922 that
10 6659 it
</pre>
 
=={{header|PureBasic}}==
<syntaxhighlight lang="purebasic">EnableExplicit
 
Structure wordcount
wkey$
count.i
EndStructure
 
Define token.c, word$, idx.i, start.i, arg$
NewMap wordmap.i()
NewList wordlist.wordcount()
 
If OpenConsole("")
arg$ = ProgramParameter(0)
If arg$ = "" : End 1 : EndIf
start = ElapsedMilliseconds()
If ReadFile(0, arg$, #PB_Ascii)
While Not Eof(0)
token = ReadCharacter(0, #PB_Ascii)
Select token
Case 'A' To 'Z', 'a' To 'z'
word$ + LCase(Chr(token))
Default
If word$
wordmap(word$) + 1
word$ = ""
EndIf
EndSelect
Wend
CloseFile(0)
ForEach wordmap()
AddElement(wordlist())
wordlist()\wkey$ = MapKey(wordmap())
wordlist()\count = wordmap()
Next
SortStructuredList(wordlist(), #PB_Sort_Descending, OffsetOf(wordcount\count), TypeOf(wordcount\count))
PrintN("Elapsed milliseconds: " + Str(ElapsedMilliseconds() - start))
PrintN("File: " + GetFilePart(arg$))
PrintN(~"Rank\tCount\t\t Word")
If FirstElement(wordlist())
For idx = 1 To 10
Print(RSet(Str(idx), 2))
Print(~"\t")
Print(wordlist()\wkey$)
Print(~"\t\t")
PrintN(RSet(Str(wordlist()\count), 6))
If NextElement(wordlist()) = 0
Break
EndIf
Next
EndIf
EndIf
Input()
EndIf
 
End</syntaxhighlight>
{{out}}
<pre>
Elapsed milliseconds: 462
File: 135-0.txt
Rank Count Word
1 the 41093
2 of 19954
3 and 14943
4 a 14558
5 to 13953
6 in 11219
7 he 9649
8 was 8622
9 that 7924
10 it 6661
</pre>
 
=={{header|Python}}==
===Collections===
====Python2.7====
<syntaxhighlight lang="python">import collections
import re
import string
Line 294 ⟶ 3,825:
 
if __name__ == "__main__":
main()</syntaxhighlight>
 
{{Out}}
Line 303 ⟶ 3,834:
</pre>
 
====Python3.6====
<langsyntaxhighlight lang="python">from collections import Counter
from re import findall
 
Line 323 ⟶ 3,854:
if __name__ == "__main__":
n = int(input('How many?: '))
most_common_words_in_file(les_mis_file, n)</syntaxhighlight>
 
{{Out}}
Line 338 ⟶ 3,869:
that 7922
it 6659</pre>
 
===Sorted and groupby===
{{Works with|Python|3.7}}
<syntaxhighlight lang="python">"""
Word count task from Rosetta Code
http://www.rosettacode.org/wiki/Word_count#Python
"""
from itertools import (groupby,
starmap)
from operator import itemgetter
from pathlib import Path
from typing import (Iterable,
List,
Tuple)
 
 
FILEPATH = Path('lesMiserables.txt')
COUNT = 10
 
 
def main():
words_and_counts = most_frequent_words(FILEPATH)
print(*words_and_counts[:COUNT], sep='\n')
 
 
def most_frequent_words(filepath: Path,
*,
encoding: str = 'utf-8') -> List[Tuple[str, int]]:
"""
A list of word-frequency pairs sorted by their occurrences.
The words are read from the given file.
"""
def word_and_frequency(word: str,
words_group: Iterable[str]) -> Tuple[str, int]:
return word, capacity(words_group)
 
file_contents = filepath.read_text(encoding=encoding)
words = file_contents.lower().split()
grouped_words = groupby(sorted(words))
words_and_frequencies = starmap(word_and_frequency, grouped_words)
return sorted(words_and_frequencies, key=itemgetter(1), reverse=True)
 
 
def capacity(iterable: Iterable) -> int:
"""Returns a number of elements in an iterable"""
return sum(1 for _ in iterable)
 
 
if __name__ == '__main__':
main()
</syntaxhighlight>
{{Out}}
<pre>('the', 40372)
('of', 19868)
('and', 14472)
('a', 14278)
('to', 13589)
('in', 11024)
('he', 9213)
('was', 8347)
('that', 7250)
('his', 6414)</pre>
 
===Collections, Sorted and Lambda===
<syntaxhighlight lang="python">
#!/usr/bin/python3
import collections
import re
 
count = 10
 
with open("135-0.txt") as f:
text = f.read()
 
word_freq = sorted(
collections.Counter(sorted(re.split(r"\W+", text.lower()))).items(),
key=lambda c: c[1],
reverse=True,
)
 
for i in range(len(word_freq)):
print("[{:2d}] {:>10} : {}".format(i + 1, word_freq[i][0], word_freq[i][1]))
if i == count - 1:
break
</syntaxhighlight>
{{Out}}
<pre>[ 1] the : 41039
[ 2] of : 19951
[ 3] and : 14942
[ 4] a : 14527
[ 5] to : 13941
[ 6] in : 11209
[ 7] he : 9646
[ 8] was : 8620
[ 9] that : 7922
[10] it : 6659</pre>
 
=={{header|R}}==
===Version 1===
I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.
<syntaxhighlight lang="r">
wordcount<-function(file,n){
punctuation=c("`","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
wordlist=scan(file,what=character())
wordlist=tolower(wordlist)
for(i in 1:length(punctuation)){
wordlist=gsub(punctuation[i],"",wordlist,fixed=T)
}
df=data.frame("Word"=sort(unique(wordlist)),"Count"=rep(0,length(unique(wordlist))))
for(i in 1:length(unique(wordlist))){
df[i,2]=length(which(wordlist==df[i,1]))
}
df=df[order(df[,2],decreasing = T),]
row.names(df)=1:nrow(df)
return(df[1:n,])
}
</syntaxhighlight>
{{Out}}
<pre>
> wordcount("MobyDick.txt",10)
Read 212793 items
Word Count
1 the 14346
2 of 6590
3 and 6340
4 a 4611
5 to 4572
6 in 4130
7 that 2903
8 his 2516
9 it 2308
10 i 1845
</pre>
 
===Version 2===
This version is purely functional using the native pipe operator in R 4.1+ and runs in less than a second.
<syntaxhighlight lang="r">
word_frequency_pipeline <- function(file=NULL, n=10) {
file |>
vroom::vroom_lines() |>
stringi::stri_split_boundaries(type="word", skip_word_none=T, skip_word_number=T) |>
unlist() |>
tolower() |>
table() |>
sort(decreasing = T) |>
(\(.) .[1:n])() |>
data.frame()
}
</syntaxhighlight>
{{Out}}
<pre>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
</pre>
 
=={{header|Racket}}==
<langsyntaxhighlight lang="racket">#lang racket
 
(define (all-words f (case-fold string-downcase))
Line 350 ⟶ 4,047:
 
(module+ main
(take (counts (all-words "data/les-mis.txt")) 10))</syntaxhighlight>
 
{{out}}
Line 363 ⟶ 4,060:
("that" . 7922)
("it" . 6659))</pre>
 
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2022.07}}
Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.
 
This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is '''full''' of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.
 
We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various [[wp:diacritic|diacritic]]s. Those '''are''' letters, so words will be incorrectly split up; (Misérables will be counted as 'mis' and 'rables', probably not what we want.)
 
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the".
The text has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça" so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /A-Za-z/, but it is mostly by accident.
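A quick one-liner shows that failure mode directly (an added illustration, not part of the program below):
<syntaxhighlight lang="raku"># the accented letter is excluded from <[a..z]>, so the word splits in two
say "alèthe".comb(/ <[a..z]>+ /);    # OUTPUT: (al the)</syntaxhighlight>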
 
A more accurate regex matcher would be some kind of Unicode aware /\w/ minus underscore. It may also be useful, depending on your requirements, to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.
 
Here is a sample that shows the result when using various different matchers.
<syntaxhighlight lang="raku" line>sub MAIN ($filename, UInt $top = 10) {
my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
my @matcher =
rx/ <[a..z]>+ /, # simple 7-bit ASCII
rx/ \w+ /, # word characters with underscore
rx/ <[\w]-[_]>+ /, # word characters without underscore
rx/ [<[\w]-[_]>+]+ % < ' - '- > / # word characters without underscore but with hyphens and contractions
;
for @matcher -> $reg {
say "\nTop $top using regex: ", $reg.raku;
my @words = $file.comb($reg).Bag.sort(-*.value)[^$top];
my $length = max @words».key».chars;
printf "%-{$length}s %d\n", .key, .value for @words;
}
}</syntaxhighlight>
 
{{out}}
Passing in the file name and 10:
<pre>Top 10 using regex: rx/ <[a..z]>+ /
the 41089
of 19949
and 14942
a 14608
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
 
Top 10 using regex: rx/ \w+ /
the 41035
of 19946
and 14940
a 14577
to 13939
in 11204
he 9645
was 8619
that 7922
it 6659
 
Top 10 using regex: rx/ <[\w]-[_]>+ /
the 41088
of 19949
and 14942
a 14596
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
 
Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the 41081
of 19930
and 14934
a 14587
to 13735
in 11204
he 9607
was 8620
that 7825
it 6535</pre>
 
It can be difficult to figure out which words the different regexes do or don't match. Here are the three more complex regexes, each along with a list of "words" that it treats differently from /a..z/; i.e., strings that are lumped into one of the top 10 word counts using /a..z/ but not with this regex.
 
<pre>Top 10 using regex: rx/ \w+ /
the 41035 alèthe _the _the_
of 19946 of_ _of_
and 14940 _and_ paternoster_and
a 14577 _ça aïe ça keksekça aérostiers _a poréa panathenæa
to 13939 to_ _to
in 11204 _in
he 9645 _he
was 8619 _was
that 7922 _that
it 6659 _it
 
Top 10 using regex: rx/ <[\w]-[_]>+ /
the 41088 alèthe
of 19949
and 14942
a 14596 poréa ça aérostiers panathenæa aïe keksekça
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
 
Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the 41081 will-o'-the-wisps alèthe skip-the-gutter police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change jean-the-screw will-o'-the-wisp
of 19930 chromate-of-lead-colored die-of-hunger die-of-cold-if-you-have-bread police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change unheard-of die-of-hunger-if-you-have-a-fire
and 14934 come-and-see so-and-so cock-and-bull hide-and-seek sambre-and-meuse
a 14587 keksekça l'a ça now-a-days vis-a-vis a-dreaming police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change poréa panathenæa aérostiers a-hunting aïe die-of-hunger-if-you-have-a-fire
to 13735 to-morrow to-day hand-to-hand to-night well-to-do face-to-face
in 11204 in-pace son-in-law father-in-law whippers-in general-in-chief sons-in-law
he 9607 he's he'll
was 8620 police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change
that 7825 that's pick-me-down-that
it 6535 it's it'll</pre>
 
One nice thing is this isn't special cased. It will work out of the box for any text / language.
 
[https://www.gutenberg.org/files/14741/14741-0.txt Russian]? No problem.
 
<pre>$ raku wf 14741-0.txt 5</pre>
<pre>Top 5 using regex: rx/ <[a..z]>+ /
the 176
of 119
gutenberg 93
project 87
to 80
 
Top 5 using regex: rx/ \w+ /
и 860
в 579
не 290
на 222
ты 195
 
Top 5 using regex: rx/ <[\w]-[_]>+ /
и 860
в 579
не 290
на 222
ты 195
 
Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
и 860
в 579
не 290
на 222
ты 195</pre>
 
[https://www.gutenberg.org/files/39963/39963-0.txt Greek]? Sure, why not.
<pre>$ raku wf 39963-0.txt 5</pre>
<pre>Top 5 using regex: rx/ <[a..z]>+ /
the 187
of 123
gutenberg 93
project 87
to 82
 
Top 5 using regex: rx/ \w+ /
και 1628
εις 986
δε 982
του 895
των 859
 
Top 5 using regex: rx/ <[\w]-[_]>+ /
και 1628
εις 986
δε 982
του 895
των 859
 
Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
και 1628
εις 986
δε 982
του 895
των 859</pre>
 
Of course, for the first matcher, we are asking specifically to match Latin ASCII, so we end up with... well... Latin ASCII; but the other 3 match any Unicode characters.
 
=={{header|REXX}}==
This REXX version doesn't need to sort the list of words.
 
Extra code was added to handle some foreign letters &nbsp; (non-Latin) &nbsp; and also handle most accented letters.
 
This version recognizes all the accented letters that are present in the required/specified text (file) &nbsp; (and some other non-Latin letters as well).
 
This means that the word &nbsp; &nbsp; <big><big> Alèthe </big></big> &nbsp; &nbsp; is treated as one word, &nbsp; <u>not</u> as two words &nbsp; &nbsp; <big><big> Al &nbsp; the </big></big> &nbsp; &nbsp; (and not thereby adding two separate words).
 
This version also supports words that contain embedded apostrophes (<b><big>''' ' '''</big></b>) <br><big>[</big>that is, within a word, &nbsp; but <u>not</u> those words that start or end with an apostrophe; &nbsp; for those encapsulated words, &nbsp; the apostrophe is elided<big>]</big>.
 
Thus, &nbsp; ''' it's ''' &nbsp; is counted separately from &nbsp; '''it''' &nbsp; and/or &nbsp; ''' its'''.
 
Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file.
<syntaxhighlight lang="rexx">/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/
parse arg fID top . /*obtain optional arguments from the CL*/
if fID=='' | fID=="," then fID= 'les_mes.txt' /*None specified? Then use the default.*/
if top=='' | top=="," then top= 10 /* " " " " " " */
call init /*initialize varied bunch of variables.*/
call rdr
say right('word', 40) " " center(' rank ', 6) " count " /*display title for output*/
say right('════', 40) " " center('══════', 6) " ═══════" /* " title separator.*/
 
do until otops==tops | tops>top /*process enough words to satisfy TOP. */
WL=; mk= 0; otops= tops /*initialize the word list (to a NULL).*/
 
do n=1 for c; z= !.n; k= @.z /*process the list of words in the file*/
if k==mk then WL= WL z /*handle cases of tied number of words.*/
if k> mk then do; mk=k; WL=z; end /*this word count is the current max. */
end /*n*/
 
wr= max( length(' rank '), length(top) ) /*find the maximum length of the rank #*/
 
do d=1 for words(WL); y= word(WL, d) /*process all words in the word list. */
if d==1 then w= max(10, length(@.y) ) /*use length of the first number used. */
say right(y, 40) right( commas(tops), wr) right(commas(@.y), w)
@.y= . /*nullify word count for next go 'round*/
end /*d*/ /* [↑] this allows a non-sorted list. */
 
tops= tops + words(WL) /*correctly handle any tied rankings. */
end /*until*/
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: procedure; parse arg ?; do jc=length(?)-3 to 1 by -3; ?=insert(',', ?, jc); end; return ?
16bit: do k=1 for xs; _=word(x,k); $=changestr('├'left(_,1),$,right(_,1)); end; return
/*──────────────────────────────────────────────────────────────────────────────────────*/
init: x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü'; xs= words(x)
abcL="abcdefghijklmnopqrstuvwxyz'" /*lowercase letters of Latin alphabet. */
abcU= abcL; upper abcU /*uppercase version of Latin alphabet. */
accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /*some lowercase accented characters. */
accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /* " uppercase " " */
accG= 'αßΓπΣσµτΦΘΩδφε' /* " upper/lowercase Greek letters. */
ll= abcL || abcL ||accL ||accL || accG /*chars of after letters. */
uu= abcL || abcU ||accL ||accU || accG || xrange() /* " " before " */
@.= 0; q= "'"; totW= 0; !.= @.; c= 0; tops= 1; return
/*──────────────────────────────────────────────────────────────────────────────────────*/
rdr: do #=0 while lines(fID)\==0; $=linein(fID) /*loop whilst there're lines in file.*/
if pos('├', $) \== 0 then call 16bit /*are there any 16-bit characters ?*/
$= translate( $, ll, uu) /*trans. uppercase letters to lower. */
do while $ \= ''; parse var $ z $ /*process each word in the $ line. */
parse var z z1 2 zr '' -1 zL /*obtain: first, middle, & last char.*/
if z1==q then do; z=zr; if z=='' then iterate; end /*starts with apostrophe?*/
if zL==q then z= strip(left(z, length(z) - 1)) /*ends " " ?*/
if z=='' then iterate /*if Z is now null, skip.*/
if @.z==0 then do; c=c+1; !.c=z; end /*bump word cnt; assign word to array*/
totW= totW + 1; @.z= @.z + 1 /*bump total words; bump a word count*/
end /*while*/
end /*#*/
say commas(totW) ' words found ('commas(c) "unique) in " commas(#),
' records read from file: ' fID; say; return</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}
<pre>
574,122 words found (23,414 unique) in 67,663 records read from file: les_mes.txt
 
word rank count
it 10 6,535
</pre>
To see a list of the top 1,000 words that show (among other things) words like &nbsp; '''it's''' &nbsp; and other accented words, see the &nbsp; ''discussion'' &nbsp; page for this task. <br><br>
 
===Version 2===
Inspired by version 1 and adapted for ooRexx.
It ignores all characters other than a-z and A-Z (which are translated to a-z).
<syntaxhighlight lang="text">/*REXX program reads and displays a count of words a file. Word case is ignored.*/
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
tops=tops+words(tl) /*correctly handle the tied rankings. */
end
Say time('E') 'seconds elapsed'</syntaxhighlight>
{{out}}
<pre>We found 22820 different words
Line 515 ⟶ 4,414:
it 10 6661
1.750000 seconds elapsed</pre>
 
=={{header|Ring}}==
<syntaxhighlight lang="ring">
# project : Word count
 
fp = fopen("Miserables.txt","r")
str = fread(fp, getFileSize(fp))
fclose(fp)
 
mis =substr(str, " ", nl)
mis = lower(mis)
mis = str2list(mis)
count = list(len(mis))
ready = []
for n = 1 to len(mis)
flag = 0
for m = 1 to len(mis)
if mis[n] = mis[m] and n != m
for p = 1 to len(ready)
if m = ready[p]
flag = 1
ok
next
if flag = 0
count[n] = count[n] + 1
ok
ok
next
if flag = 0
add(ready, n)
ok
next
for n = 1 to len(count)
for m = n + 1 to len(count)
if count[m] > count[n]
temp = count[n]
count[n] = count[m]
count[m] = temp
temp = mis[n]
mis[n] = mis[m]
mis[m] = temp
ok
next
next
for n = 1 to 10
see mis[n] + " " + (count[n] + 1) + nl
next
 
func getFileSize fp
c_filestart = 0
c_fileend = 2
fseek(fp,0,c_fileend)
nfilesize = ftell(fp)
fseek(fp,0,c_filestart)
return nfilesize
 
func swap(a, b)
temp = a
a = b
b = temp
return [a, b]
</syntaxhighlight>
Output:
<pre>
the 41089
of 19949
and 14942
a 14608
to 13951
in 11214
he 9648
was 8621
that 7924
it 6661
</pre>
 
=={{header|Ruby}}==
<langsyntaxhighlight lang="ruby">
class String
def wc
 
open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }
</syntaxhighlight>
{{out}}
<pre>
of->19949
the->41088
</pre>
===Tally and max_by===
{{Works with|Ruby|2.7}}
<syntaxhighlight lang="ruby">RE = /[[:alpha:]]+/
count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }
</syntaxhighlight>
{{out}}
<pre>the->41092
of->19954
and->14943
a->14546
to->13953
in->11219
he->9649
was->8622
that->7924
it->6661
</pre>
===Chain of Enumerables===
<syntaxhighlight lang="ruby">wf = File.read("135-0.txt", :encoding => "UTF-8")
.downcase
.scan(/\w+/)
.each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
.sort_by { |k, v| v }
.reverse
.take(10)
.each_with_index { |w, i|
printf "[%2d] %10s : %d\n",
i += 1,
w[0],
w[1]
}
</syntaxhighlight>
{{out}}
<pre>[ 1] the : 41040
[ 2] of : 19951
[ 3] and : 14942
[ 4] a : 14539
[ 5] to : 13941
[ 6] in : 11209
[ 7] he : 9646
[ 8] was : 8620
[ 9] that : 7922
[10] it : 6659
</pre>
 
=={{header|Rust}}==
<syntaxhighlight lang="rust">use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
 
extern crate regex;
use regex::Regex;
 
fn word_count(file: File, n: usize) {
let word_regex = Regex::new("(?i)[a-z']+").unwrap();
 
let mut words = HashMap::new();
for line in BufReader::new(file).lines() {
word_regex
.find_iter(&line.expect("Read error"))
.map(|m| m.as_str())
.for_each(|word| {
*words.entry(word.to_lowercase()).or_insert(0) += 1;
});
}
 
let mut words: Vec<_> = words.iter().collect();
words.sort_unstable_by_key(|&(word, count)| (Reverse(count), word));
 
for (word, count) in words.iter().take(n) {
println!("{:8} {:>8}", word, count);
}
}
 
fn main() {
word_count(File::open("135-0.txt").expect("File open error"), 10)
}</syntaxhighlight>
 
{{out}}
<pre>
the 41083
of 19948
and 14941
a 14604
to 13951
in 11212
he 9604
was 8621
that 7824
it 6534
</pre>
 
=={{header|Scala}}==
===Featuring online remote file as input===
{{Out}}
Best seen running in your browser [https://scastie.scala-lang.org/EP2Fm6HXQrC1DwtSNvnUzQ Scastie (remote JVM)].
<syntaxhighlight lang="scala">import scala.io.Source
 
object WordCount extends App {
 
val url = "http://www.gutenberg.org/files/135/135-0.txt"
val header = "Rank Word Frequency\n==== ======== ======"
 
def wordCnt =
Source.fromURL(url).getLines()
.filter(_.nonEmpty)
.flatMap(_.split("""\W+""")).toSeq
.groupBy(_.toLowerCase())
.mapValues(_.size).toSeq
.sortWith { case ((_, v0), (_, v1)) => v0 > v1 }
.take(10).zipWithIndex
 
println(header)
wordCnt.foreach {
case ((word, count), rank) => println(f"${rank + 1}%4d $word%-8s $count%6d")
}
 
println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")
 
}</syntaxhighlight>
{{out}}
<pre>Rank Word Frequency
==== ======== ======
1 the 41036
2 of 19946
3 and 14940
4 a 14589
5 to 13939
6 in 11204
7 he 9645
8 was 8619
9 that 7922
10 it 6659
 
Successfully completed without errors. [total 4528 ms]</pre>
 
=={{header|Seed7}}==
 
The Seed7 program uses the function [http://seed7.sourceforge.net/libraries/gethttp.htm#getHttp(in_string) getHttp],
to get the file 135-0.txt directly from Gutenberg. The library [http://seed7.sourceforge.net/libraries/scanfile.htm scanfile.s7i]
provides [http://seed7.sourceforge.net/libraries/scanfile.htm#getSimpleSymbol(inout_file) getSimpleSymbol],
to get words from a file. The words are [http://seed7.sourceforge.net/libraries/string.htm#lower(in_string) converted to lower case], to ensure that "The" and "the" are considered the same.
 
<syntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "gethttp.s7i";
include "strifile.s7i";
include "scanfile.s7i";
include "chartype.s7i";
include "console.s7i";
 
const type: wordHash is hash [string] integer;
const type: countHash is hash [integer] array string;
 
const proc: main is func
local
var file: inFile is STD_NULL;
var string: aWord is "";
var wordHash: numberOfWords is wordHash.EMPTY_HASH;
var countHash: countWords is countHash.EMPTY_HASH;
var array integer: countKeys is 0 times 0;
var integer: index is 0;
var integer: number is 0;
begin
OUT := STD_CONSOLE;
inFile := openStrifile(getHttp("www.gutenberg.org/files/135/135-0.txt"));
while hasNext(inFile) do
aWord := lower(getSimpleSymbol(inFile));
if aWord <> "" and aWord[1] in letter_char then
if aWord in numberOfWords then
incr(numberOfWords[aWord]);
else
numberOfWords @:= [aWord] 1;
end if;
end if;
end while;
countWords := flip(numberOfWords);
countKeys := sort(keys(countWords));
writeln("Word Frequency");
for index range length(countKeys) downto length(countKeys) - 9 do
number := countKeys[index];
for aWord range sort(countWords[number]) do
writeln(aWord rpad 8 <& number);
end for;
end for;
end func;</syntaxhighlight>
 
{{out}}
<pre>
Word Frequency
the 41036
of 19946
and 14940
a 14589
to 13939
in 11204
he 9645
was 8619
that 7922
it 6659
</pre>
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')
 
top.each { |pair|
say "#{pair.key}\t-> #{pair.value}"
}</syntaxhighlight>
 
=={{header|Simula}}==
<langsyntaxhighlight lang="simula">COMMENT COMPILE WITH
$ cim -m64 word-count.sim
;
 
END
</syntaxhighlight>
{{out}}
<pre>
 
6 garbage collection(s) in 0.2 seconds.
</pre>
 
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
 
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
 
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
 
=={{header|Swift}}==
<syntaxhighlight lang="swift">import Foundation
 
func printTopWords(path: String, count: Int) throws {
// load file contents into a string
let text = try String(contentsOfFile: path, encoding: String.Encoding.utf8)
var dict = Dictionary<String, Int>()
// split text into words, convert to lowercase and store word counts in dict
let regex = try NSRegularExpression(pattern: "\\w+")
regex.enumerateMatches(in: text, range: NSRange(text.startIndex..., in: text)) {
(match, _, _) in
guard let match = match else { return }
let word = String(text[Range(match.range, in: text)!]).lowercased()
dict[word, default: 0] += 1
}
// sort words by number of occurrences
let wordCounts = dict.sorted(by: {$0.1 > $1.1})
// print the top count words
print("Rank\tWord\tCount")
for (i, (word, n)) in wordCounts.prefix(count).enumerated() {
print("\(i + 1)\t\(word)\t\(n)")
}
}
 
do {
try printTopWords(path: "135-0.txt", count: 10)
} catch {
print(error.localizedDescription)
}</syntaxhighlight>
 
{{out}}
<pre>
Rank Word Count
1 the 41039
2 of 19951
3 and 14942
4 a 14527
5 to 13941
6 in 11209
7 he 9646
8 was 8620
9 that 7922
10 it 6659
</pre>
 
=={{header|Tcl}}==
<syntaxhighlight lang="tcl">lassign $argv head
while { [gets stdin line] >= 0 } {
foreach word [regexp -all -inline {[A-Za-z]+} $line] {
dict incr wordcount [string tolower $word]
}
}
 
set sorted [lsort -stride 2 -index 1 -int -decr $wordcount]
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
puts "$count\t$word"
}</syntaxhighlight>
 
./wordcount-di.tcl 10 < 135-0.txt
{{out}}
<pre>
41093 the
19954 of
14943 and
14558 a
13953 to
11219 in
9649 he
8622 was
7924 that
6661 it
</pre>
 
=={{header|TMG}}==
McIlroy's Unix TMG:
<syntaxhighlight lang="unixtmg">/* Input format: N text */
/* Only lowercase letters can constitute a word in text. */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence. */
 
progrm: readn/error
table(freq) table(chain) [firstword = ~0]
loop: not(!<<>>) output
| [j=777] batch/loop loop; /* Main loop */
 
/* To use less stack, divide input into batches. */
/* (Avoid interpreting entire input as a single "sentence".) */
batch: [j<=0?] succ
| word/skip [j--] skip batch;
skip: string(other);
not: params(1) (any($1) fail | ());
readn: string(!<<0123456789>>) readint(n) skip;
error: diag(( ={ <ERROR: input must start with a number> * } ));
 
/* Process a word */
word: smark any(letter) string(letter) scopy
locate/new
[freq[k]++] newmax;
locate: find(freq, k);
new: enter(freq, k)
[freq[k] = 1] newmax
[firstword = firstword==~0 ? k : firstword]
enter(chain, i) [chain[i]=prevword] [prevword=k];
newmax: [max = max<freq[k] ? freq[k] : max];
 
/* Output logic */
output: [next=max]
outmax: [max=next] [next=0] [max>0?] [j = prevword] cycle/outmax;
cycle: [i = j] [k = freq[i]] [n>0?]
( [max==freq[i]?] parse(wn)
| [(freq[i]<max) & (next<freq[i])?] [next = freq[i]]
| ())
[i != firstword?] [j = chain[i]] cycle;
wn: getnam(freq, i) [k = freq[i]] decimal(k) [n--]
= { 2 < > 1 * };
 
/* Reads decimal integer */
readint: proc(n;i) ignore(<<>>) [n=0] inta
int1: [n = n*12+i] inta\int1;
inta: char(i) [i<72?] [(i =- 60)>=0?];
 
/* Variables */
prevword: 0; /* Head of the linked list */
firstword: 0; /* First word's index to know where to stop output */
k: 0;
i: 0;
j: 0;
n: 0; /* Number of most frequent words to display */
max: 0; /* Current highest number of occurrences */
next: 0; /* Next highest number of occurrences */
 
/* Tables */
freq: 0;
chain: 0;
 
/* Character classes */
letter: <<abcdefghijklmnopqrstuvwxyz>>;
other: !<<abcdefghijklmnopqrstuvwxyz>>;</syntaxhighlight>
 
Unix TMG didn't have <tt>tolower</tt> builtin. Therefore, you would use it together with <tt>tr</tt>:
<syntaxhighlight lang="bash">cat file | tr A-Z a-z > file1; ./a.out file1</syntaxhighlight>
 
Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e):
<syntaxhighlight lang="bash">cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</syntaxhighlight>
 
=={{header|Transd}}==
<syntaxhighlight lang="Scheme">#lang transd
 
MainModule: {
_start: (λ locals: cnt 0
(with fs FileStream() words String()
(open-r fs "/mnt/text/Literature/Miserables.txt")
(textin fs words)
 
(with v ( -|
(split (tolower words))
(group-by)
(regroup-by (λ v Vector<String>() -> Int() (size v))))
 
(for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i))
(+= cnt 1) (if (> cnt 10) break))
)))
}</syntaxhighlight>
{{out}}
<pre>
the: 40379
of: 19869
and: 14468
a: 14278
to: 13590
in: 11025
he: 9213
was: 8347
that: 7249
his: 6414
had: 6051
</pre>
 
=={{header|UNIX Shell}}==
{{works with|zsh}}
This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.
<langsyntaxhighlight lang="bash">#!/bin/sh
cat <"${1} |" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | sedhead -n "${2}q"</langsyntaxhighlight>
 
 
{{Out}}
<pre>
6661 it
</pre>
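For readability, the same pipeline can be written one stage per line, echoing McIlroy's original six-line presentation (the comments are annotations added here and are not part of his note):

<syntaxhighlight lang="bash"><"${1}" tr -cs A-Za-z '\n' |   # squeeze every run of non-letters into a newline: one word per line
tr A-Z a-z |                   # fold uppercase to lowercase
LC_ALL=C sort |                # bring identical words together
uniq -c |                      # collapse each run into "count word"
sort -rn |                     # sort numerically by count, most frequent first
head -n "${2}"                 # keep the top n lines</syntaxhighlight>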
 
 
 
===Original + URL import===

This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.
 
<syntaxhighlight lang="zsh">curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q</syntaxhighlight>
 
{{Out}}
 
<pre>41096 the
19955 of
14939 and
14558 a
13954 to
11218 in
9649 he
8622 was
7924 that
6661 it</pre>
 
=={{header|VBA}}==
In order to use it, you have to adapt the PATHFILE Const.
 
<syntaxhighlight lang="vb">
Option Explicit
 
Private Const PATHFILE As String = "C:\HOME\VBA\ROSETTA"
 
Sub Main()
Dim arr
Dim Dict As Object
Dim Book As String, temp As String
Dim T#
T = Timer
Book = ExtractTxt(PATHFILE & "\les miserables.txt")
temp = RemovePunctuation(Book)
temp = UCase(temp)
arr = Split(temp, " ")
Set Dict = CreateObject("Scripting.Dictionary")
FillDictionary Dict, arr
Erase arr
SortDictByFreq Dict, arr
DisplayTheTopMostUsedWords arr, 10
 
Debug.Print "Words different in this book : " & Dict.Count
Debug.Print "-------------------------"
Debug.Print ""
Debug.Print "Optionally : "
Debug.Print "Frequency of the word MISERABLE : " & DisplayFrequencyOf("MISERABLE", Dict)
Debug.Print "Frequency of the word DISASTER : " & DisplayFrequencyOf("DISASTER", Dict)
Debug.Print "Frequency of the word ROSETTA_CODE : " & DisplayFrequencyOf("ROSETTA_CODE", Dict)
Debug.Print "-------------------------"
Debug.Print "Execution Time : " & Format(Timer - T, "0.000") & " sec."
End Sub
 
Private Function ExtractTxt(strFile As String) As String
'http://rosettacode.org/wiki/File_input/output#VBA
Dim i As Integer
i = FreeFile
Open strFile For Input As #i
ExtractTxt = Input(LOF(i), #i)
Close #i
End Function
 
Private Function RemovePunctuation(strBook As String) As String
Dim T, i As Integer, temp As String
Const PUNCT As String = """,;:!?."
T = Split(StrConv(PUNCT, vbUnicode), Chr(0))
temp = strBook
For i = LBound(T) To UBound(T) - 1
temp = Replace(temp, T(i), " ")
Next
temp = Replace(temp, "--", " ")
temp = Replace(temp, "...", " ")
temp = Replace(temp, vbCrLf, " ")
RemovePunctuation = Replace(temp, "  ", " ")
End Function
 
Private Sub FillDictionary(d As Object, a As Variant)
Dim L As Long
For L = LBound(a) To UBound(a)
If a(L) <> "" Then _
d(a(L)) = d(a(L)) + 1
Next
End Sub
 
Private Sub SortDictByFreq(d As Object, myArr As Variant)
Dim K
Dim L As Long
ReDim myArr(1 To d.Count, 1 To 2)
For Each K In d.keys
L = L + 1
myArr(L, 1) = K
myArr(L, 2) = CLng(d(K))
Next
SortArray myArr, LBound(myArr), UBound(myArr), 2
End Sub
 
Private Sub SortArray(a, Le As Long, Ri As Long, Col As Long)
Dim ref As Long, L As Long, r As Long, temp As Variant
ref = a((Le + Ri) \ 2, Col)
L = Le
r = Ri
Do
Do While a(L, Col) < ref
L = L + 1
Loop
Do While ref < a(r, Col)
r = r - 1
Loop
If L <= r Then
temp = a(L, 1)
a(L, 1) = a(r, 1)
a(r, 1) = temp
temp = a(L, 2)
a(L, 2) = a(r, 2)
a(r, 2) = temp
L = L + 1
r = r - 1
End If
Loop While L <= r
If L < Ri Then SortArray a, L, Ri, Col
If Le < r Then SortArray a, Le, r, Col
End Sub
 
Private Sub DisplayTheTopMostUsedWords(arr As Variant, Nb As Long)
Dim L As Long, i As Integer
i = 1
Debug.Print "Rank Word Frequency"
Debug.Print "==== ======= ========="
For L = UBound(arr) To UBound(arr) - Nb + 1 Step -1
Debug.Print Left(CStr(i) & " ", 5) & Left(arr(L, 1) & " ", 8) & " " & Format(arr(L, 2), "0 000")
i = i + 1
Next
End Sub
 
Private Function DisplayFrequencyOf(Word As String, d As Object) As Long
If d.Exists(Word) Then _
DisplayFrequencyOf = d(Word)
End Function</syntaxhighlight>
{{out}}
<pre>Words different in this book : 25884
-------------------------
Rank Word Frequency
==== ======= =========
1 THE 40 831
2 OF 19 807
3 AND 14 860
4 A 14 453
5 TO 13 641
6 IN 11 133
7 HE 9 598
8 WAS 8 617
9 THAT 7 807
10 IT 6 517
 
Optionally :
Frequency of the word MISERABLE : 35
Frequency of the word DISASTER : 12
Frequency of the word ROSETTA_CODE : 0
-------------------------
Execution Time : 7,785 sec.</pre>
 
=={{header|Wren}}==
{{trans|Go}}
{{libheader|Wren-str}}
{{libheader|Wren-sort}}
{{libheader|Wren-fmt}}
{{libheader|Wren-pattern}}
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.
 
Not very quick (runs in about 15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.
 
If the Go example is re-run today (17 February 2024), its output matches this Wren example precisely, though it appears that the text file has changed since the former was written more than 5 years ago.
<syntaxhighlight lang="wren">import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
 
var fileName = "135-0.txt"
var text = File.read(fileName).trimEnd()
var groups = {}
// match runs of A-z, a-z, 0-9 and any non-ASCII letters with code-points < 256
var p = Pattern.new("+1&w")
var lines = text.split("\n")
for (line in lines) {
var ms = p.findAll(line)
for (m in ms) {
var t = Str.lower(m.text)
groups[t] = groups.containsKey(t) ? groups[t] + 1 : 1
}
}
var keyVals = groups.toList
Sort.quick(keyVals, 0, keyVals.count - 1) { |i, j| (j.value - i.value).sign }
System.print("Rank Word Frequency")
System.print("==== ==== =========")
for (rank in 1..10) {
var word = keyVals[rank-1].key
var freq = keyVals[rank-1].value
Fmt.print("$2d $-4s $5d", rank, word, freq)
}</syntaxhighlight>
 
{{out}}
<pre>
Rank Word Frequency
==== ==== =========
1 the 41092
2 of 19954
3 and 14943
4 a 14546
5 to 13953
6 in 11219
7 he 9649
8 was 8622
9 that 7924
10 it 6661
</pre>
 
=={{header|XQuery}}==
 
<syntaxhighlight lang="xquery">let $maxentries := 10,
$uri := 'https://www.gutenberg.org/files/135/135-0.txt'
return
<words in="{$uri}" top="{$maxentries}"> {
(
let $doc := unparsed-text($uri),
$tokens := (
tokenize($doc, '\W+')[normalize-space()]
! lower-case(.)
! normalize-unicode(., 'NFC')
)
return
for $token in $tokens
let $key := $token
group by $key
let $count := count($token)
order by $count descending
return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
}</words></syntaxhighlight>
{{out}}
<syntaxhighlight lang="xml"><words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
<word key="the" count="41092"/>
<word key="of" count="19954"/>
<word key="and" count="14943"/>
<word key="a" count="14545"/>
<word key="to" count="13953"/>
<word key="in" count="11219"/>
<word key="he" count="9649"/>
<word key="was" count="8622"/>
<word key="that" count="7924"/>
<word key="it" count="6661"/>
</words></syntaxhighlight>
 
=={{header|zkl}}==
<langsyntaxhighlight lang="zkl">fname,count := vm.arglist; // grab cammand line args
 
// words may have leading or trailing "_", ie "the" and "_the"
RegExp("[a-z]+").pump.fp1(Dictionary().incV)) // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();</syntaxhighlight>
{{out}}
<pre>
it,6661
</pre>
 
{{omit from|6502 Assembly|The text file is much larger than the CPU's address space.}}
{{omit from|Z80 Assembly}}
{{omit from|8080 Assembly}}