# Word frequency


Given a text file and an integer n, print/display the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

•   A word is a sequence of one or more contiguous letters.
•   You are free to define what a letter is.
•   Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
•   You may treat a compound word like well-dressed as either one word or two.
•   The word it's could also be one or two words as you see fit.
•   You may also choose not to support non US-ASCII characters.
•   Assume words will not span multiple lines.
•   Don't worry about normalization of word spelling differences.
•   Treat color and colour as two distinct words.
•   Uppercase letters are considered equivalent to their lowercase counterparts.
•   Words of equal frequency can be listed in any order.
•   Feel free to explicitly state the thoughts behind the program decisions.

Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.
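The rules above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference solution: the names `WORD` and `top_words` are made up here, and it assumes that a letter is an ASCII `a` to `z` after lowercasing, so `it's` counts as two words and accented letters split words.

```python
import re
from collections import Counter

# Assumption: a "letter" is ASCII a-z once the text has been lowercased.
WORD = re.compile(r"[a-z]+")


def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs."""
    counts = Counter(WORD.findall(text.lower()))
    # most_common sorts by decreasing count; ties come out in first-seen order.
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The cat and the hat. It's the CAT's hat!"
    for word, count in top_words(sample, 3):
        print(f"{count:>6} {word}")
```

To run it on the task input, read the file (for example `open("135-0.txt", encoding="utf-8").read()`) and pass the text to `top_words` with `n = 10`.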

History

This task was originally taken from the Programming Pearls column in Communications of the ACM, June 1986, Volume 29, Number 6, where the problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy, who demonstrates a solution in a six-line Unix shell script (provided as an example below).

## 11l

DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
   cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])
Output:
[(the, 41045), (of, 19953), (and, 14939), (a, 14527), (to, 13942), (in, 11210), (he, 9646), (was, 8620), (that, 7922), (it, 6659)]


## Ada

This version uses a character set to match valid characters in a token. Another version could use a pointer to a function returning a Boolean to match valid characters (which would allow functions such as Is_Alphanumeric to be used), but as far as I know there is no Find_Token variant that accepts one.
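As an aside outside Ada, the predicate-based alternative mentioned above can be sketched in Python. This is a hedged illustration, not part of the Ada entry: `find_tokens` and its `is_valid` parameter are hypothetical names, and the sketch simply groups consecutive characters by whether the predicate accepts them.

```python
from itertools import groupby


def find_tokens(line, is_valid=str.isalnum):
    """Yield maximal runs of characters for which is_valid(ch) is true."""
    # groupby splits the line into consecutive runs sharing the same predicate result.
    for valid, run in groupby(line, key=is_valid):
        if valid:
            yield "".join(run)
```

Swapping the predicate changes the definition of a word without touching the scanning loop, for example `find_tokens(line, str.isalpha)` to exclude digits.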

with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
with Ada.Strings.Maps;
with Ada.Strings.Fixed;
with Ada.Characters.Handling;
with Ada.Containers.Indefinite_Ordered_Maps;
with Ada.Containers.Indefinite_Ordered_Sets;
with Ada.Containers.Ordered_Maps;

procedure Word_Frequency is
    package TIO renames Ada.Text_IO;

    package String_Counters is new Ada.Containers.Indefinite_Ordered_Maps(String, Natural);
    package String_Sets is new Ada.Containers.Indefinite_Ordered_Sets(String);
    package Sorted_Counters is new Ada.Containers.Ordered_Maps
      (Natural,
       String_Sets.Set,
       "=" => String_Sets."=",
       "<" => ">");
    -- for sorting by decreasing number of occurrences and ascending lexical order

    procedure Increment(Key : in String; Element : in out Natural) is
    begin
        Element := Element + 1;
    end Increment;

    path : constant String := Ada.Command_Line.Argument(1);
    how_many : Natural := 10;
    set : constant Ada.Strings.Maps.Character_Set := Ada.Strings.Maps.To_Set(ranges => (('a', 'z'), ('0', '9')));
    F : TIO.File_Type;
    first : Positive;
    last : Natural;
    from : Positive;
    counter : String_Counters.Map;
    sorted_counts : Sorted_Counters.Map;
    C1 : String_Counters.Cursor;
    C2 : Sorted_Counters.Cursor;
    tmp_set : String_Sets.Set;
begin
    -- read file and count words
    TIO.Open(F, name => path, mode => TIO.In_File);
    while not TIO.End_Of_File(F) loop
       declare
          line : constant String := Ada.Characters.Handling.To_Lower(TIO.Get_Line(F));
       begin
          from := line'First;
          loop
             Ada.Strings.Fixed.Find_Token(line(from .. line'Last), set, Ada.Strings.Inside, first, last);
             exit when last < First;
             C1 := counter.Find(line(first .. last));
             if String_Counters.Has_Element(C1) then
                counter.Update_Element(C1, Increment'Access);
             else
                counter.Insert(line(first .. last), 1);
             end if;
             from := last + 1;
          end loop;
       end;
    end loop;
    TIO.Close(F);

    -- fill Natural -> StringSet Map
    C1 := counter.First;
    while String_Counters.Has_Element(C1) loop
       if sorted_counts.Contains(String_Counters.Element(C1)) then
          tmp_set := sorted_counts.Element(String_Counters.Element(C1));
          tmp_set.Include(String_Counters.Key(C1));
          -- Element returns a copy, so write the updated set back into the map
          sorted_counts.Replace(String_Counters.Element(C1), tmp_set);
       else
          sorted_counts.Include(String_Counters.Element(C1), String_Sets.To_Set(String_Counters.Key(C1)));
       end if;
       String_Counters.Next(C1);
    end loop;

    -- output
    C2 := sorted_counts.First;
    while Sorted_Counters.Has_Element(C2) loop
       for Item of Sorted_Counters.Element(C2) loop
          Ada.Integer_Text_IO.Put(TIO.Standard_Output, Sorted_Counters.Key(C2), width => 9);
          TIO.Put(TIO.Standard_Output, " ");
          TIO.Put_Line(Item);
       end loop;
       Sorted_Counters.Next(C2);
       how_many := how_many - 1;
       exit when how_many = 0;
    end loop;
end Word_Frequency;
Output:
$ ./word_frequency 135-0.txt
    41093 the
    19954 of
    14943 and
    14558 a
    13953 to
    11219 in
     9649 he
     8622 was
     7924 that
     6661 it

## ALGOL 68

Works with: ALGOL 68G version Any - tested with release 2.8.3.win32

Uses the associative array implementations in ALGOL_68/prelude.

# find the n most common words in a file #
# use the associative array in the Associate array/iteration task #
# but with integer values #
PR read "aArrayBase.a68" PR
MODE AAKEY   = STRING;
MODE AAVALUE = INT;
AAVALUE init element value = 0;

# returns text converted to upper case #
OP   TOUPPER = ( STRING text )STRING:
     BEGIN
         STRING result := text;
         FOR ch pos FROM LWB result TO UPB result DO
             IF is lower( result[ ch pos ] ) THEN result[ ch pos ] := to upper( result[ ch pos ] ) FI
         OD;
         result
     END # TOUPPER # ;

# returns text converted to an INT or -1 if text is not a number #
OP   TOINT   = ( STRING text )INT:
     BEGIN
         INT  result     := 0;
         BOOL is numeric := TRUE;
         # scan left to right so the most significant digit is accumulated first #
         FOR ch pos FROM LWB text TO UPB text WHILE is numeric DO
             CHAR c = text[ ch pos ];
             is numeric := is numeric AND c >= "0" AND c <= "9";
             IF is numeric THEN ( result *:= 10 ) +:= ABS c - ABS "0" FI
         OD;
         IF is numeric THEN result ELSE -1 FI
     END # TOINT # ;

# returns TRUE if c is a letter, FALSE otherwise #
OP   ISLETTER = ( CHAR c )BOOL:
     IF ( c >= "a" AND c <= "z" )
     OR ( c >= "A" AND c <= "Z" )
     THEN TRUE
     ELSE char in string( c, NIL, "ÇåçêëÆôöÿÖØáóÔ" )
     FI # ISLETTER # ;

# get the file name and number of words from the command line #
STRING file name       := "pg-les-misrables.txt";
INT    number of words := 10;
FOR arg pos TO argc - 1 DO
    STRING arg upper = TOUPPER argv( arg pos );
    IF   arg upper = "FILE"   THEN file name       := argv( arg pos + 1 )
    ELIF arg upper = "NUMBER" THEN number of words := TOINT argv( arg pos + 1 )
    FI
OD;
IF  FILE input file;
    open( input file, file name, stand in channel ) /= 0
THEN
    # failed to open the file #
    print( ( "Unable to open """ + file name + """", newline ) )
ELSE
    # file opened OK #
    print( ( "Processing: ", file name, newline ) );
    BOOL at eof := FALSE;
    BOOL at eol := FALSE;
    # set the EOF handler for the file #
    on logical file end( input file
                       , ( REF FILE f )BOOL:
                         BEGIN
                             # note that we reached EOF on the #
                             # latest read #
                             at eof := TRUE;
                             # return TRUE so processing can continue #
                             TRUE
                         END
                       );
    # set the end-of-line handler for the file so get word can see line boundaries #
    on line end( input file
               , ( REF FILE f )BOOL:
                 BEGIN
                     # note we reached end-of-line #
                     at eol := TRUE;
                     # return FALSE to use the default eol handling #
                     # i.e. just get the next character #
                     FALSE
                 END
               );
    # get the words from the file and store the counts in an associative array #
    REF AARRAY words := INIT LOC AARRAY;
    INT word count := 0;
    CHAR c := " ";
    WHILE get( input file, ( c ) );
          NOT at eof
    DO
        WHILE NOT ISLETTER c AND NOT at eof DO get( input file, ( c ) ) OD;
        STRING word := "";
        at eol := FALSE;
        WHILE ISLETTER c AND NOT at eol AND NOT at eof DO word +:= c; get( input file, ( c ) ) OD;
        word count +:= 1;
        words // TOUPPER word +:= 1
    OD;
    close( input file );
    print( ( file name, " contains ", whole( word count, 0 ), " words", newline ) );
    # find the most used words #
    [ number of words ]STRING top words;
    [ number of words ]INT    top counts;
    FOR i TO number of words DO top words[ i ] := ""; top counts[ i ] := 0 OD;
    REF AAELEMENT w := FIRST words;
    WHILE w ISNT nil element DO
        INT    count = value OF w;
        STRING word  = key OF w;
        BOOL   found := FALSE;
        FOR i TO number of words WHILE NOT found DO
            IF count > top counts[ i ] THEN
                # found a word that is used more than a current #
                # most used word #
                found := TRUE;
                # move the other words down one place #
                FOR move pos FROM number of words BY - 1 TO i + 1 DO
                    top counts[ move pos ] := top counts[ move pos - 1 ];
                    top words [ move pos ] := top words [ move pos - 1 ]
                OD;
                # install the new word #
                top counts[ i ] := count;
                top words [ i ] := word
            FI
        OD;
        w := NEXT words
    OD;
    print( ( whole( number of words, 0 ), " most used words:", newline ) );
    print( ( " count word", newline ) );
    FOR i TO number of words DO
        print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
    OD
FI

Output:
Processing: pg-les-misrables.txt
pg-les-misrables.txt contains 578381 words
10 most used words:
 count word
 39333: THE
 19154: OF
 14628: AND
 14229: A
 13431: TO
 11275: HE
 10879: IN
  8236: WAS
  7527: THAT
  6491: IT

## APL

Works with: GNU APL

⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
⍝⍝ Project Gutenberg is in UTF-8.)
⍝⍝
⍝⍝ Use Unix 'iconv' if required
⍝⍝
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← lowerAndStrip s;stripped;mixedCase
 ⍝⍝ Convert text to lowercase, punctuation and newlines to spaces
 stripped ← '               abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
 mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
 r ← stripped[mixedCase ⍳ s]
∇

⍝⍝ Return the _n_ most frequent words and a count of their occurrences
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← n wordCount fname ;D;wl;sidx;swv;pv;wc;uw;sortOrder
 D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
 wl ← (~ D ∊ ' ') ⊂ D
 sidx ← ⍒wl
 swv ← wl[sidx]
 pv ← +\ 1,~2 ≡/ swv
 wc ← ∊ ⍴¨ pv ⊂ pv
 uw ← 1 ⊃¨ pv ⊂ swv
 sortOrder ← ⍒wc
 r ← n↑[2] uw[sortOrder],[0.5]wc[sortOrder]
∇

      5 wordCount '135-0.txt'
 the   of    and   a     to
 41042 19952 14938 14526 13942

## AppleScript

(*
    For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
    The set text is too messy to warrant faffing around with anything more sophisticated.
    The first letter in each word is upper-cased and the rest lower-cased for case equivalence and presentation.
    Where more than n words qualify for the top n or fewer places, all are included in the result.
*)

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

on wordFrequency(filePath, n)
    set |⌘| to current application

    -- Get the text and "capitalize" it (lower-case except for the first letters in words).
    set theText to |⌘|'s class "NSString"'s stringWithContentsOfFile:(filePath) usedEncoding:(missing value) |error|:(missing value)
    set theText to theText's capitalizedStringWithLocale:(|⌘|'s class "NSLocale"'s currentLocale()) -- Yosemite compatible.
    -- Split it at the non-word characters.
    set nonWordCharacters to |⌘|'s class "NSCharacterSet"'s alphanumericCharacterSet()'s invertedSet()
    set theWords to theText's componentsSeparatedByCharactersInSet:(nonWordCharacters)

    -- Use a counted set to count the individual words' occurrences.
    set countedSet to |⌘|'s class "NSCountedSet"'s alloc()'s initWithArray:(theWords)

    -- Build a list of word/frequency records, excluding any empty strings left over from the splitting above.
    set mutableSet to |⌘|'s class "NSMutableSet"'s setWithSet:(countedSet)
    tell mutableSet to removeObject:("")
    script o
        property discreteWords : mutableSet's allObjects() as list
        property wordsAndFrequencies : {}
    end script
    set discreteWordCount to (count o's discreteWords)
    repeat with i from 1 to discreteWordCount
        set thisWord to item i of o's discreteWords
        set end of o's wordsAndFrequencies to {thisWord:thisWord, frequency:(countedSet's countForObject:(thisWord)) as integer}
    end repeat

    -- Convert to NSMutableArray, reverse-sort the result on the frequencies, and convert back to list.
    set wordsAndFrequencies to |⌘|'s class "NSMutableArray"'s arrayWithArray:(o's wordsAndFrequencies)
    set descendingByFrequency to |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("frequency") ascending:(false)
    tell wordsAndFrequencies to sortUsingDescriptors:({descendingByFrequency})
    set o's wordsAndFrequencies to wordsAndFrequencies as list

    if (discreteWordCount > n) then
        -- If there are more than n records, check for any immediately following the nth which may have the same frequency as it.
        set nthHighestFrequency to frequency of item n of o's wordsAndFrequencies
        set qualifierCount to n
        repeat with i from (n + 1) to discreteWordCount
            if (frequency of item i of o's wordsAndFrequencies = nthHighestFrequency) then
                set qualifierCount to i
            else
                exit repeat
            end if
        end repeat
    else
        -- Otherwise reduce n to the actual number of discrete words.
        set n to discreteWordCount
        set qualifierCount to discreteWordCount
    end if

    -- Compose a text report from the qualifying words and frequencies.
    if (qualifierCount = n) then
        set output to {"The " & n & " most frequently occurring words in the file are:"}
    else
        set output to {(qualifierCount as text) & " words share the " & ((n as text) & " highest frequencies in the file:")}
    end if
    repeat with i from 1 to qualifierCount
        set {thisWord:thisWord, frequency:frequency} to item i of o's wordsAndFrequencies
        set end of output to thisWord & ": " & (tab & frequency)
    end repeat
    set astid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set output to output as text
    set AppleScript's text item delimiters to astid

    return output
end wordFrequency

-- Test code:
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)

Output:
"The 10 most frequently occurring words in the file are:
The:    41092
Of:     19954
And:    14943
A:      14545
To:     13953
In:     11219
He:     9649
Was:    8622
That:   7924
It:     6661"

## Arturo

findFrequency: function [file, count][
    freqs: #[]
    r: {/[[:alpha:]]+/}
    loop flatten map split.lines read file 'l -> match lower l r 'word [
        if not? key? freqs word -> freqs\[word]: 0
        freqs\[word]: freqs\[word] + 1
    ]
    freqs: sort.values.descending freqs
    result: new []
    loop 0..dec count 'x [
        'result ++ @[@[get keys freqs x, get values freqs x]]
    ]
    return result
]

loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
    print pair
]

Output:
the 41096
of 19955
and 14939
a 14558
to 13954
in 11218
he 9649
was 8622
that 7924
it 6661

## AutoHotkey

URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete, % A_temp "\tempfile.txt"
words := []
while pos := RegExMatch(H, "\b[[:alpha:]]+\b", m, A_Index=1?1:pos+StrLen(m))
    words[m] := words[m] ? words[m] + 1 : 1
for word, count in words
    list .= count "`t" word "`r`n"
Sort, list, RN
loop, parse, list, `n, `r
{
    result .= A_LoopField "`r`n"
    if A_Index = 10
        break
}
MsgBox % "Freq`tWord`n" result
return

Outputs:
Freq    Word
41036   The
19946   of
14940   and
14589   A
13939   TO
11204   in
9645    HE
8619    WAS
7922    THAT
6659    it

## AWK

# syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
#
# A word is anything separated by white space.
# Therefore "this" and "this." are different.
# But "This" and "this" are identical.
# As I am "free to define what a letter is" I have chosen to allow
# numerics and all special characters as they are usually considered
# parts of words in text processing applications.
#
{   nbytes += length($0) + 2 # +2 for CR/LF
    nfields += NF
    $0 = tolower($0)
    for (i=1; i<=NF; i++) {
      arr[$i]++
    }
}
END {
    show = (show == "") ? 10 : show
    width1 = length(show)
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in arr) {
      if (width2 == 0) {
        width2 = length(arr[i])
      }
      if (n++ >= show) {
        break
      }
      printf("%*d %*d %s\n",width1,n,width2,arr[i],i)
    }
    printf("input: %d records, %d bytes, %d words of which %d are unique\n",NR,nbytes,nfields,length(arr))
    exit(0)
}

Output:
 1 40372 the
 2 19868 of
 3 14472 and
 4 14278 a
 5 13589 to
 6 11024 in
 7  9213 he
 8  8347 was
 9  7250 that
10  6414 his
input: 73829 records, 3369772 bytes, 568744 words of which 50394 are unique

## BASIC

### QB64

This is a rather long program, written in QB64. It "cleans" each word, so that a word is anything that begins and ends with a letter. It works with arrays. QB64 does this job remarkably quickly for a file as big as Les Miserables.txt.

OPTION _EXPLICIT

' SUBs and FUNCTIONs
DECLARE SUB CountWords (FromString AS STRING)
DECLARE SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
DECLARE SUB ShowResults ()
DECLARE SUB ShowCompletion ()
DECLARE SUB TopCounted ()
DECLARE FUNCTION InsertWord& (WhichWord AS STRING)
DECLARE FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
DECLARE FUNCTION CleanWord$ (WhichWord AS STRING)

' Var
DIM iFile AS INTEGER
DIM iCol AS INTEGER
DIM iFil AS INTEGER
DIM iStep AS INTEGER
DIM iBar AS INTEGER
DIM iBlock AS INTEGER
DIM lIni AS LONG
DIM lEnd AS LONG
DIM lLines AS LONG
DIM lLine AS LONG
DIM lLenF AS LONG
DIM iRuns AS INTEGER
DIM lMaxWords AS LONG
DIM sTimer AS SINGLE
DIM strFile AS STRING
DIM strKey AS STRING
DIM strText AS STRING
DIM strDate AS STRING
DIM strTime AS STRING
DIM strBar AS STRING
DIM lWords AS LONG
DIM strWords AS STRING
CONST AddWords = 100
CONST TopCount = 10
CONST FALSE = 0, TRUE = NOT FALSE

' Initialize
iFile = FREEFILE
lIni = 1
strDate = DATE$
strTime = TIME$
lEnd = 0
lMaxWords = 1000
REDIM strWords(lIni TO lMaxWords) AS STRING
REDIM lWords(lIni TO lMaxWords) AS LONG
REDIM lTopWords(1) AS LONG
REDIM strTopWords(1) AS STRING

' ---Main program loop
$RESIZE:SMOOTH
DO
    DO
        CLS
        PRINT "This program will count how many words are in a text file and shows the 10"
        PRINT "most used of them."
        PRINT
        INPUT "Document to open (TXT file) (f=see files): ", strFile
        IF UCASE$(strFile) = "F" THEN
            strFile = ""
            FILES
            DO: LOOP UNTIL INKEY$ <> ""
        END IF
    LOOP UNTIL strFile <> ""
    OPEN strFile FOR BINARY AS #iFile
    IF LOF(iFile) > 0 THEN
        iRuns = iRuns + 1
        CLOSE #iFile

        ' Opens the document file to analyze it
        sTimer = TIMER
        ON TIMER(1) GOSUB ShowAdvance
        OPEN strFile FOR INPUT AS #iFile
        lLenF = LOF(iFile)
        PRINT "Looking for words in "; strFile; ". File size:"; STR$(lLenF); ". ";: iCol = POS(0): PRINT "Initializing";
        COLOR 23: PRINT "...";: COLOR 7

        ' Count how many lines has the file
        lLines = 0
        DO WHILE NOT EOF(iFile)
            LINE INPUT #iFile, strText
            lLines = lLines + 1
        LOOP
        CLOSE #iFile

        ' Shows the bar
        LOCATE , iCol: PRINT "Initialization complete."
        PRINT
        PRINT "Processing"; lLines; "lines";: COLOR 23: PRINT "...": COLOR 7
        iFil = CSRLIN
        iCol = POS(0)
        iBar = 80
        iBlock = 80 / lLines
        IF iBlock = 0 THEN iBlock = 1
        PRINT STRING$(iBar, 176)
        lLine = 0
        iStep = lLines * iBlock / iBar
        IF iStep = 0 THEN iStep = 1
        IF iStep > 20 THEN
            TIMER ON
        END IF
        OPEN strFile FOR INPUT AS #iFile
        DO WHILE NOT EOF(iFile)
            lLine = lLine + 1
            IF (lLine MOD iStep) = 0 THEN
                strBar = STRING$(iBlock * (lLine / iStep), 219)
                LOCATE iFil, 1
                PRINT strBar
                ShowCompletion
            END IF
            LINE INPUT #iFile, strText
            CountWords strText
            strKey = INKEY$
        LOOP
        ShowCompletion
        CLOSE #iFile
        TIMER OFF
        LOCATE iFil - 1, 1
        PRINT "Done!" + SPACE$(70)
        strBar = STRING$(iBar, 219)
        LOCATE iFil, 1
        PRINT strBar
        LOCATE iFil + 5, 1
        PRINT "Finishing";: COLOR 23: PRINT "...";: COLOR 7
        ShowResults

        ' Frees the RAM
        lMaxWords = 1000
        lEnd = 0
        REDIM strWords(lIni TO lMaxWords) AS STRING
        REDIM lWords(lIni TO lMaxWords) AS LONG

    ELSE
        CLOSE #iFile
        KILL strFile
        PRINT
        PRINT "Document does not exist."
    END IF

    PRINT
    PRINT "Try again? (Y/n)"
    DO
        strKey = UCASE$(INKEY$)
    LOOP UNTIL strKey = "Y" OR strKey = "N" OR strKey = CHR$(13) OR strKey = CHR$(27)
LOOP UNTIL strKey = "N" OR strKey = CHR$(27) OR iRuns >= 99

CLS
IF iRuns >= 99 THEN
    PRINT "Maximum runs reached for this session."
END IF

PRINT "End of program"
PRINT "Start date/time: "; strDate; " "; strTime
PRINT "End date/time..: "; DATE$; " "; TIME$
END
' ---End main program

ShowAdvance:
ShowCompletion
RETURN

FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
    ' Var
    DIM lFound AS LONG
    DIM lLow AS LONG
    DIM lHigh AS LONG
    DIM lMid AS LONG
    DIM strLookFor AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG

    ' Binary search for stated word in the list
    lLow = lIni
    lHigh = lEnd
    lFound = 0
    strLookFor = UCASE$(LookFor)
    DO WHILE (lFound < 1) AND (lLow <= lHigh)
        lMid = (lHigh + lLow) / 2
        IF strWords(lMid) = strLookFor THEN
            lFound = lMid
        ELSEIF strWords(lMid) > strLookFor THEN
            lHigh = lMid - 1
        ELSE
            lLow = lMid + 1
        END IF
    LOOP

    ' Should I return the position if not found?
    IF lFound = 0 AND RetPos THEN
        IF lEnd < 1 THEN
            lFound = 1
        ELSEIF strWords(lMid) > strLookFor THEN
            lFound = lMid
        ELSE
            lFound = lMid + 1
        END IF
    END IF

    ' Return the value
    BinarySearch = lFound

END FUNCTION

FUNCTION CleanWord$ (WhichWord AS STRING)
    ' Var
    DIM iSeek AS INTEGER
    DIM iStep AS INTEGER
    DIM bOK AS INTEGER
    DIM strWord AS STRING
    DIM strChar AS STRING

    strWord = WhichWord

    ' Look for trailing wrong characters
    strWord = LTRIM$(RTRIM$(strWord))
    IF LEN(strWord) > 0 THEN
        iStep = 0
        DO
            ' Determines if step will be forward or backwards
            IF iStep = 0 THEN
                iStep = -1
            ELSE
                iStep = 1
            END IF

            ' Sets the initial value of iSeek
            IF iStep = -1 THEN
                iSeek = LEN(strWord)
            ELSE
                iSeek = 1
            END IF

            bOK = FALSE
            DO
                strChar = MID$(strWord, iSeek, 1)
                SELECT CASE strChar
                    CASE "A" TO "Z"
                        bOK = TRUE
                    CASE CHR$(129) TO CHR$(154)
                        bOK = TRUE
                    CASE CHR$(160) TO CHR$(165)
                        bOK = TRUE
                END SELECT

                ' If it is not a character valid as a letter, please move one space
                IF NOT bOK THEN
                    iSeek = iSeek + iStep
                END IF

                ' If no letter was recognized, then exit the loop
                IF iSeek < 1 OR iSeek > LEN(strWord) THEN
                    bOK = TRUE
                END IF
            LOOP UNTIL bOK

            IF iStep = -1 THEN
                ' Reviews if a word was encountered
                IF iSeek > 0 THEN
                    strWord = LEFT$(strWord, iSeek)
                ELSE
                    strWord = ""
                END IF
            ELSEIF iStep = 1 THEN
                IF iSeek <= LEN(strWord) THEN
                    strWord = MID$(strWord, iSeek)
                ELSE
                    strWord = ""
                END IF
            END IF
        LOOP UNTIL iStep = 1 OR strWord = ""
    END IF

    ' Return the result
    CleanWord = strWord

END FUNCTION

SUB CountWords (FromString AS STRING)
    ' Var
    DIM iStart AS INTEGER
    DIM iLenW AS INTEGER
    DIM iLenS AS INTEGER
    DIM iLenD AS INTEGER
    DIM strString AS STRING
    DIM strWord AS STRING
    DIM lWhichWord AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG

    ' Converts to Upper Case and cleans leading and trailing spaces
    strString = UCASE$(FromString)
    strString = LTRIM$(RTRIM$(strString))

    ' Get words from string
    iStart = 1
    iLenW = 0
    iLenS = LEN(strString)
    DO WHILE iStart <= iLenS
        iLenW = INSTR(iStart, strString, " ")
        IF iLenW = 0 AND iStart <= iLenS THEN
            iLenW = iLenS + 1
        END IF
        strWord = MID$(strString, iStart, iLenW - iStart)

        ' Will remove mid dashes or apostrophe or "â€"
        iLenD = INSTR(strWord, "-")
        IF iLenD < 1 THEN
            iLenD = INSTR(strWord, "'")
            IF iLenD < 1 THEN
                iLenD = INSTR(strWord, "â€")
            END IF
        END IF
        IF iLenD >= 2 THEN
            strWord = LEFT$(strWord, iLenD - 1)
            iLenW = iStart + (iLenD - 1)
        END IF
        strWord = CleanWord(strWord)

        IF strWord <> "" THEN
            ' Look for the word to be counted
            lWhichWord = BinarySearch(strWord, FALSE)

            ' If the word doesn't exist in the list, add it
            IF lWhichWord = 0 THEN
                lWhichWord = InsertWord(strWord)
            END IF

            ' Count the word
            IF lWhichWord > 0 THEN
                lWords(lWhichWord) = lWords(lWhichWord) + 1
            END IF
        END IF
        iStart = iLenW + 1
    LOOP

END SUB

' Here a word will be inserted in the list
FUNCTION InsertWord& (WhichWord AS STRING)
    ' Var
    DIM lFound AS LONG
    DIM lWord AS LONG
    DIM strWord AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG

    ' Look for the word in the list
    strWord = UCASE$(WhichWord)
    lFound = BinarySearch(WhichWord, TRUE)
    IF lFound > 0 THEN
        ' Add one word
        lEnd = lEnd + 1

        ' Verifies if there is still room for a new word
        IF lEnd > lMaxWords THEN
            lMaxWords = lMaxWords + AddWords ' Other words
            IF lMaxWords > 32767 THEN
                IF lEnd <= 32767 THEN
                    lMaxWords = 32767
                ELSE
                    lFound = -1
                END IF
            END IF

            IF lFound > 0 THEN
                REDIM _PRESERVE strWords(lIni TO lMaxWords) AS STRING
                REDIM _PRESERVE lWords(lIni TO lMaxWords) AS LONG
            END IF
        END IF

        IF lFound > 0 THEN
            ' Move the words below this
            IF lEnd > 1 THEN
                FOR lWord = lEnd TO lFound + 1 STEP -1
                    strWords(lWord) = strWords(lWord - 1)
                    lWords(lWord) = lWords(lWord - 1)
                NEXT lWord
            END IF
            ' Insert the word in the position
            strWords(lFound) = strWord
            lWords(lFound) = 0
        END IF
    END IF

    InsertWord = lFound
END FUNCTION

SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
    ' Var
    DIM lPivot AS LONG
    DIM lLeftNIdx AS LONG
    DIM lRightNIdx AS LONG
    SHARED lWords() AS LONG
    SHARED strWords() AS STRING

    ' Classifies from highest to lowest
    lLeftNIdx = lLeftN
    lRightNIdx = lRightN
    IF (lRightN - lLeftN) > 0 THEN
        lPivot = (lLeftN + lRightN) / 2
        DO WHILE (lLeftNIdx <= lPivot) AND (lRightNIdx >= lPivot)
            IF iMode = 0 THEN ' Ascending
                DO WHILE (lWords(lLeftNIdx) < lWords(lPivot)) AND (lLeftNIdx <= lPivot)
                    lLeftNIdx = lLeftNIdx + 1
                LOOP
                DO WHILE (lWords(lRightNIdx) > lWords(lPivot)) AND (lRightNIdx >= lPivot)
                    lRightNIdx = lRightNIdx - 1
                LOOP
            ELSE ' Descending
                DO WHILE (lWords(lLeftNIdx) > lWords(lPivot)) AND (lLeftNIdx <= lPivot)
                    lLeftNIdx = lLeftNIdx + 1
                LOOP
                DO WHILE (lWords(lRightNIdx) < lWords(lPivot)) AND (lRightNIdx >= lPivot)
                    lRightNIdx = lRightNIdx - 1
                LOOP
            END IF
            SWAP lWords(lLeftNIdx), lWords(lRightNIdx)
            SWAP strWords(lLeftNIdx), strWords(lRightNIdx)
            lLeftNIdx = lLeftNIdx + 1
            lRightNIdx = lRightNIdx - 1
            IF (lLeftNIdx - 1) = lPivot THEN
                lRightNIdx = lRightNIdx + 1
                lPivot = lRightNIdx
            ELSEIF (lRightNIdx + 1) = lPivot THEN
                lLeftNIdx = lLeftNIdx - 1
                lPivot = lLeftNIdx
            END IF
        LOOP
        QuickSort lLeftN, lPivot - 1, iMode
        QuickSort lPivot + 1, lRightN, iMode
    END IF
END SUB

SUB ShowCompletion ()
    ' Var
    SHARED iFil AS INTEGER
    SHARED lLine AS LONG
    SHARED lLines AS LONG
    SHARED lEnd AS LONG

    LOCATE iFil + 1, 1
    PRINT "Lines analyzed :"; lLine
    PRINT USING "% of completion: ###%"; (lLine / lLines) * 100
    PRINT "Words found....:"; lEnd
END SUB

SUB ShowResults ()
    ' Var
    DIM iMaxL AS INTEGER
    DIM iMaxW AS INTEGER
    DIM lWord AS LONG
    DIM lHowManyWords AS LONG
    DIM strString AS STRING
    DIM strFileR AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lLenF AS LONG
    SHARED lMaxWords AS LONG
    SHARED sTimer AS SINGLE
    SHARED strFile AS STRING
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG
    SHARED strTopWords() AS STRING
    SHARED lTopWords() AS LONG
    SHARED iRuns AS INTEGER

    ' Show results

    ' Creates file name
    lWord = INSTR(strFile, ".")
    IF lWord = 0 THEN lWord = LEN(strFile)
    strFileR = LEFT$(strFile, lWord)
    IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."

    ' Retrieves the longest word found and the highest count
    FOR lWord = lIni TO lEnd
        ' Gets the longest word found
        IF LEN(strWords(lWord)) > iMaxL THEN
            iMaxL = LEN(strWords(lWord))
        END IF

        lHowManyWords = lHowManyWords + lWords(lWord)
    NEXT lWord
    IF iMaxL > 60 THEN iMaxW = 60 ELSE iMaxW = iMaxL

    ' Gets top counted
    TopCounted

    ' Shows the results
    CLS
    PRINT "File analyzed : "; strFile
    PRINT "Length of file:"; lLenF
    PRINT "Time lapse....:"; TIMER - sTimer; "seconds"
    PRINT "Words found...:"; lHowManyWords; "(Unique:"; STR$(lEnd); ")"
    PRINT "Longest word..:"; iMaxL
    PRINT
    PRINT "The"; TopCount; "most used are:"
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    PRINT " Word"; SPACE$(iMaxW - 5); "| Count"
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    strString = "\" + SPACE$(iMaxW - 2) + "\| #########,"
    FOR lWord = lIni TO TopCount
        PRINT USING strString; strTopWords(lWord); lTopWords(lWord)
    NEXT lWord
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    PRINT "See files "; strFileR + "S" + LTRIM$(STR$(iRuns)); " and "; strFileR + "C" + LTRIM$(STR$(iRuns)); " to see the full list."
END SUB

SUB TopCounted ()
    ' Var
    DIM lWord AS LONG
    DIM strFileR AS STRING
    DIM iFile AS INTEGER
    CONST Descending = 1
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG
    SHARED strTopWords() AS STRING
    SHARED lTopWords() AS LONG
    SHARED iRuns AS INTEGER
    SHARED strFile AS STRING

    ' Assigns new dimensions
    REDIM strTopWords(lIni TO TopCount) AS STRING
    REDIM lTopWords(lIni TO TopCount) AS LONG

    ' Saves the current values
    lWord = INSTR(strFile, ".")
    IF lWord = 0 THEN lWord = LEN(strFile)
    strFileR = LEFT$(strFile, lWord)
    IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."
    iFile = FREEFILE
    OPEN strFileR + "S" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
    FOR lWord = lIni TO lEnd
        WRITE #iFile, strWords(lWord), lWords(lWord)
    NEXT lWord
    CLOSE #iFile

    ' Classifies the counted in descending order
    QuickSort lIni, lEnd, Descending

    ' Now, saves the required values in the arrays
    FOR lWord = lIni TO TopCount
        strTopWords(lWord) = strWords(lWord)
        lTopWords(lWord) = lWords(lWord)
    NEXT lWord

    ' Now, saves the order from the file
    OPEN strFileR + "C" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
    FOR lWord = lIni TO lEnd
        WRITE #iFile, strWords(lWord), lWords(lWord)
    NEXT lWord
    CLOSE #iFile

END SUB
Output:
This program will count how many words are in a text file and shows the 10
most used of them.

Document to open (TXT file) (f=see files): miserabl.txt
Looking for words in miserabl.txt. File size: 3369775. Initialization complete.

Processing... Done!
Lines analyzed : 72917
% of completion: 100%
Words found....: 23288

Finishing...

Length of file: 3369775
Time lapse....: 35 seconds
Words found...: 578614 (Unique: 23538)
Longest word..: 25

The 10 most used are:
---------------------------+------------------------------------------------------------------------
Word                       | Count
---------------------------+------------------------------------------------------------------------
THE                        |     40,751
OF                         |     19,949
AND                        |     14,891
A                          |     14,430
TO                         |     13,923
IN                         |     11,189
HE                         |      9,605
WAS                        |      8,617
THAT                       |      7,833
IT                         |      6,579
---------------------------+------------------------------------------------------------------------
See files miserabl.S1 and miserabl.C1 to see the full list.

Try again? (Y/n)
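
The word-cleaning rule of the QB64 entry (a word is whatever begins and ends with a letter) can be sketched in Python as follows. This is a simplified illustration, not a translation: `clean_word` is a hypothetical name, `str.isalpha` stands in for the character ranges tested by the QB64 code, and the extra step that cuts a token at its first dash or apostrophe is left out.

```python
def clean_word(token):
    """Trim both ends of token until it begins and ends with a letter."""
    start, end = 0, len(token)
    # Skip leading characters that are not letters.
    while start < end and not token[start].isalpha():
        start += 1
    # Skip trailing characters that are not letters.
    while end > start and not token[end - 1].isalpha():
        end -= 1
    # Upper-case the result so that "This" and "this" count as the same word.
    return token[start:end].upper()
```

Characters inside the word are left alone, so only the boundaries decide what is counted; a token with no letters at all becomes the empty string and can be skipped.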


### BaCon

This version removes all punctuation, digits, tabs and carriage returns, so "This", "this" and "this." are the same word. UTF-8 characters in words are fully supported. The code could be shorter, but everything is written out explicitly for the sake of clarity.

' We do not count superfluous spaces as words
OPTION COLLAPSE TRUE

' Optional: use TRE regex library to speed up the program
PRAGMA RE tre INCLUDE <tre/regex.h> LDFLAGS -ltre

' We're using associative arrays
DECLARE frequency ASSOC NUMBER

' Load the text and remove all punctuation, digits, tabs and cr
book$ = EXTRACT$(LOAD$("miserables.txt"), "[[:punct:]]|[[:digit:]]|[\t\r]", TRUE)

' Count each word in lowercase
FOR word$ IN REPLACE$(book$, NL$, CHR$(32))
    INCR frequency(LCASE$(word$))
NEXT

' Sort the associative array and then map the index to a string array
LOOKUP frequency TO term$ SIZE x SORT DOWN

' Show results
FOR i = 0 TO 9
    PRINT term$[i], " : ", frequency(term$[i])
NEXT
Output:
the : 40440
of : 19903
and : 14738
a : 14306
to : 13630
in : 11083
he : 9452
was : 8605
that : 7535
his : 6434

## Batch File

This takes a very long time per word, so I have chosen to feed it a 200-line sample and go from there. You could cut the length of this down drastically if you didn't need to be able to recall the word at the nth position and wished only to display the top 10 words.

@echo off

call:wordCount 1 2 3 4 5 6 7 8 9 10 42 101

pause>nul
exit

:wordCount
setlocal enabledelayedexpansion

set word=100000
set line=0
for /f "delims=" %%i in (input.txt) do (
	set /a line+=1
	for %%j in (%%i) do (
		if not !skip%%j!==true (
			echo line !line! ^| word !word:~-5! - "%%~j"

			type input.txt | find /i /c "%%~j" > count.tmp
			set /p tmpvar=<count.tmp

			set tmpvar=000000000!tmpvar!
			set tmpvar=!tmpvar:~-10!
			set count[!word!]=!tmpvar! %%~j

			set "skip%%j=true"
			set /a word+=1
		)
	)
)
del count.tmp

set wordcount=0
for /f "tokens=1,2 delims= " %%i in ('set count ^| sort /+14 /r') do (
	set /a wordcount+=1
	for /f "tokens=2 delims==" %%k in ("%%i") do (
		set word[!wordcount!]=!wordcount!. %%j - %%k
	)
)

cls
for %%i in (%*) do echo !word[%%i]!
endlocal
goto:eof

Output

1. - 0000000140 I
2. - 0000000140 a
3. - 0000000118 He
4. - 0000000100 the
5. - 0000000080 an
6. - 0000000075 in
7. - 0000000066 at
8. - 0000000062 is
9. - 0000000058 on
10. - 0000000058 as
42. - 0000000010 with
101. - 0000000004 ears

## Bracmat

This solution assumes that words consist of characters that exist in both a lowercase and an uppercase version, so it won't work with many non-European alphabets.

The built-in vap function can take either two or three arguments. The first argument must be the name of a function or a function definition. The second argument must be a string. The two-argument version maps the function to each character in the string. The three-argument version splits the string at each occurrence of the third argument, which must be a single character, and applies the function to the intervening substrings. The output of vap is a space-separated list of results from the function argument.

The expression !('($arg:?A [($pivot) ?Z)) must be read as follows: The subexpression '($arg:?A [($pivot) ?Z) is a macro expression. The symbols arg and pivot, which are the right hand sides of $ operators with empty left hand side, are replaced by the actual values of !arg and !pivot. The whole subexpression is made the right hand side of a = operator with empty left hand side, e.g. =a b c d e:?A [2 ?Z. The = operator protects the subexpression against evaluation. By prefixing the expression with the ! unary operator (which normally is used to obtain the value of a variable), the pattern match operation a b c d e:?A [2 ?Z is executed, assigning a b to A and assigning c d e to Z.

The reason for using a macro expression is that evaluating a pattern match operation with a pattern variable, as in !arg:?A [!pivot ?Z, is unnecessarily slow, since !pivot is evaluated up to !pivot+1 times.
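The two calling conventions of vap, and the way this solution uses them to isolate words, can be approximated in Python. This is only a sketch of the behaviour described here, not Bracmat's implementation: the names `vap`, `letter_or_newline` and `split_words` are hypothetical, the sketch returns a list rather than a space-separated Bracmat list, and the handling of empty substrings is an assumption.

```python
from typing import Callable, Optional

def vap(func: Callable[[str], str], text: str, sep: Optional[str] = None) -> list[str]:
    """Rough stand-in for Bracmat's vap.

    Two arguments: apply func to every character of text.
    Three arguments: split text at each occurrence of sep and apply func
    to each intervening substring.
    """
    pieces = list(text) if sep is None else text.split(sep)
    return [func(piece) for piece in pieces]

def letter_or_newline(ch: str) -> str:
    # A character whose uppercase and lowercase forms coincide is treated as
    # non-alphabetic and replaced by a newline.
    return "\n" if ch.upper() == ch.lower() else ch

def split_words(text: str) -> list[str]:
    cleaned = "".join(vap(letter_or_newline, text))
    # Assumption: empty chunks between adjacent newlines are discarded.
    return [word for word in vap(str.lower, cleaned, "\n") if word]

if __name__ == "__main__":
    print(split_words("It's 2 well-dressed men"))
```

The upper/lower comparison is what makes the approach Euro-centric: characters from scripts without case distinctions are all treated as separators.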

  ( 10-most-frequent-words
  =     MergeSort                        { Local variable declarations. }
        types
        sorted-words
        frequency
        type
        most-frequent-words
    .   ( MergeSort                      { Definition of function MergeSort. }
        =   A N Z pivot
          .   !arg:? [?N                 { [?N is a subpattern that counts the number of preceding elements }
            & (   !N:>1                           { if N at least 2 ... }
                & div$(!N.2):?pivot               { divide N by 2 ... }
                & !('($arg:?A [($pivot) ?Z))      { split list in two halves A and Z ... }
                & MergeSort$!A+MergeSort$!Z       { sort each of A and Z and return sum }
              | !arg                              { else just return a single element}
              )
        )
      &   MergeSort                               { Sort }
        $ ( vap                 { Split second argument at each occurrence of third character and apply first argument to each chunk. }
          $ ( (=.low$!arg)      { Return input, lowercased. }
            .   str
              $ ( vap           { Vaporize second argument in UTF-8 or Latin-1 characters and apply first argument to each of them. }
                $ ( (
                    =
                      .   upp$!arg:low$!arg&\n  { Return newline instead of non-alphabetic character. }
                        | !arg                  { Return (Euro-centric) alphabetic character.}
                    )
                  . get$(!arg,NEW STR)          { Read input text as a single string. }
                  )
                )
            . \n                                { Split at newlines }
            )
          )
        : ?sorted-words                           { Assign sum of (frequency*lowercasedword) terms to sorted-words. }
      & :?types                                   { Initialize types as an empty list. }
      &   whl                                     { Loop until right hand side fails. }
        ' ( !sorted-words:#?frequency*%@?type+?sorted-words   { Extract first frequency*type term from sum. }
          & (!frequency.!type) !types:?types                  { Prepend (frequency.type) pair to types list}
          )
      &   MergeSort$!types                                     { Sort the list of (frequency.type) pairs. }
        : (?+[-11+?most-frequent-words|?most-frequent-words)   { Pick the last 10 terms from the sum returned by MergeSort. }
      & !most-frequent-words                                   { Return the last 10 terms. }
  )
& out$(10-most-frequent-words$"135-0.txt")      { Call 10-most-frequent-words with name of input file and print result to screen. }

Output

  (6661.it)
+ (7924.that)
+ (8622.was)
+ (9649.he)
+ (11219.in)
+ (13953.to)
+ (14546.a)
+ (14943.and)
+ (19954.of)
+ (41092.the)

## C

Library: GLib

Words are defined by the regular expression "\w+".
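For readers unfamiliar with this definition, a minimal Python sketch of counting `\w+` matches is given below. It is not the C program: `top_words` is a hypothetical name, and Python's `\w` (Unicode-aware for `str` patterns, and matching digits and underscores as well as letters) may not agree exactly with the regex engine used by GLib, so counts can differ.

```python
import re
from collections import Counter

def top_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent case-folded \\w+ matches with their counts."""
    # Every maximal run of word characters is one word, so "it's" yields
    # the two words "it" and "s".
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)

if __name__ == "__main__":
    print(top_words("The cat and the hat. THE end!", 1))
```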

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

typedef struct word_count_tag {
    const char* word;
    size_t count;
} word_count;

int compare_word_count(const void* p1, const void* p2) {
    const word_count* w1 = p1;
    const word_count* w2 = p2;
    if (w1->count > w2->count)
        return -1;
    if (w1->count < w2->count)
        return 1;
    return 0;
}

bool get_top_words(const char* filename, size_t count) {
    GError* error = NULL;
    GMappedFile* mapped_file = g_mapped_file_new(filename, FALSE, &error);
    if (mapped_file == NULL) {
        fprintf(stderr, "%s\n", error->message);
        g_error_free(error);
        return false;
    }
    const char* text = g_mapped_file_get_contents(mapped_file);
    if (text == NULL) {
        fprintf(stderr, "File %s is empty\n", filename);
        g_mapped_file_unref(mapped_file);
        return false;
    }
    gsize file_size = g_mapped_file_get_length(mapped_file);
    // Store word counts in a hash table
    GHashTable* ht = g_hash_table_new_full(g_str_hash, g_str_equal,
                                           g_free, g_free);
    GRegex* regex = g_regex_new("\\w+", 0, 0, NULL);
    GMatchInfo* match_info;
    g_regex_match_full(regex, text, file_size, 0, 0, &match_info, NULL);
    while (g_match_info_matches(match_info)) {
        char* word = g_match_info_fetch(match_info, 0);
        char* lower = g_utf8_strdown(word, -1);
        g_free(word);
        size_t* count = g_hash_table_lookup(ht, lower);
        if (count != NULL) {
            ++*count;
            g_free(lower);
        } else {
            count = g_new(size_t, 1);
            *count = 1;
            g_hash_table_insert(ht, lower, count);
        }
        g_match_info_next(match_info, NULL);
    }
    g_match_info_free(match_info);
    g_regex_unref(regex);
    g_mapped_file_unref(mapped_file);

    // Sort words in decreasing order of frequency
    size_t size = g_hash_table_size(ht);
    word_count* words = g_new(word_count, size);
    GHashTableIter iter;
    gpointer key, value;
    g_hash_table_iter_init(&iter, ht);
    for (size_t i = 0; g_hash_table_iter_next(&iter, &key, &value); ++i) {
        words[i].word = key;
        words[i].count = *(size_t*)value;
    }
    qsort(words, size, sizeof(word_count), compare_word_count);

    // Print the most common words
    if (count > size)
        count = size;
    printf("Top %zu words\n", count);
    printf("Rank\tCount\tWord\n");
    for (size_t i = 0; i < count; ++i)
        printf("%zu\t%zu\t%s\n", i + 1, words[i].count, words[i].word);
    g_free(words);
    g_hash_table_destroy(ht);
    return true;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return EXIT_FAILURE;
    }
    if (!get_top_words(argv[1], 10))
        return EXIT_FAILURE;
    return EXIT_SUCCESS;
}
Output:
Top 10 words
Rank	Count	Word
1	41039	the
2	19951	of
3	14942	and
4	14527	a
5	13941	to
6	11209	in
7	9646	he
8	8620	was
9	7922	that
10	6659	it


## C#

Translation of: D
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace WordCount {
    class Program {
        static void Main(string[] args) {
            var text = File.ReadAllText("135-0.txt").ToLower();

            var match = Regex.Match(text, "\\w+");
            Dictionary<string, int> freq = new Dictionary<string, int>();
            while (match.Success) {
                string word = match.Value;
                if (freq.ContainsKey(word)) {
                    freq[word]++;
                } else {
                    freq.Add(word, 1);
                }

                match = match.NextMatch();
            }

            Console.WriteLine("Rank  Word  Frequency");
            Console.WriteLine("====  ====  =========");
            int rank = 1;
            foreach (var elem in freq.OrderByDescending(a => a.Value).Take(10)) {
                Console.WriteLine("{0,2}    {1,-4}    {2,5}", rank++, elem.Key, elem.Value);
            }
        }
    }
}
Output:
Rank  Word  Frequency
====  ====  =========
1    the     41035
2    of      19946
3    and     14940
4    a       14577
5    to      13939
6    in      11204
7    he       9645
8    was      8619
9    that     7922
10    it       6659

## C++

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

int main(int ac, char** av) {
  std::ios::sync_with_stdio(false);
  int head = (ac > 1) ? std::atoi(av[1]) : 10;
  std::istreambuf_iterator<char> it(std::cin), eof;
  std::filebuf file;
  if (ac > 2) {
    if (file.open(av[2], std::ios::in), file.is_open()) {
      it = std::istreambuf_iterator<char>(&file);
    } else return std::cerr << "file " << av[2] << " open failed\n", 1;
  }
  auto alpha = [](unsigned c) { return c-'A' < 26 || c-'a' < 26; };
  auto lower = [](unsigned c) { return c | '\x20'; };
  std::unordered_map<std::string, int> counts;
  std::string word;
  for (; it != eof; ++it) {
    if (alpha(*it)) {
      word.push_back(lower(*it));
    } else if (!word.empty()) {
      ++counts[word];
      word.clear();
    }
  }
  if (!word.empty()) ++counts[word]; // if file ends w/o ws
  std::vector<std::pair<const std::string,int> const*> out;
  for (auto& count : counts) out.push_back(&count);
  std::partial_sort(out.begin(),
    out.size() < head ? out.end() : out.begin() + head,
    out.end(), [](auto const* a, auto const* b) {
      return a->second > b->second;
    });
  if (out.size() > head) out.resize(head);
  for (auto const& count : out) {
    std::cout << count->first << ' ' << count->second << '\n';
  }
  return 0;
}
Output:
$./a.out 10 135-0.txt the 41093 of 19954 and 14943 a 14558 to 13953 in 11219 he 9649 was 8622 that 7924 it 6661  ### Alternative Translation of: C# #include <algorithm>#include <iostream>#include <fstream>#include <map>#include <regex>#include <string>#include <vector> int main() { std::regex wordRgx("\\w+"); std::map<std::string, int> freq; std::string line; const int top = 10; std::ifstream in("135-0.txt"); if (!in.is_open()) { std::cerr << "Failed to open file\n"; return 1; } while (std::getline(in, line)) { auto words_itr = std::sregex_iterator( line.cbegin(), line.cend(), wordRgx); auto words_end = std::sregex_iterator(); while (words_itr != words_end) { auto match = *words_itr; auto word = match.str(); if (word.size() > 0) { transform (word.begin(), word.end(), word.begin(), ::tolower); auto entry = freq.find(word); if (entry != freq.end()) { entry->second++; } else { freq.insert(std::make_pair(word, 1)); } } words_itr = std::next(words_itr); } } in.close(); std::vector<std::pair<std::string, int>> pairs; for (auto iter = freq.cbegin(); iter != freq.cend(); ++iter) { pairs.push_back(*iter); } std::sort(pairs.begin(), pairs.end(), [](auto& a, auto& b) { return a.second > b.second; }); std::cout << "Rank Word Frequency\n" "==== ==== =========\n"; int rank = 1; for (auto iter = pairs.cbegin(); iter != pairs.cend() && rank <= top; ++iter) { std::printf("%2d %4s %5d\n", rank++, iter->first.c_str(), iter->second); } return 0;} Output: Rank Word Frequency ==== ==== ========= 1 the 36636 2 of 19615 3 and 14079 4 to 13535 5 a 13527 6 in 10256 7 was 8543 8 that 7324 9 he 6814 10 had 6139 ## Clojure (defn count-words [file n] (->> file slurp clojure.string/lower-case (re-seq #"\w+") frequencies (sort-by val >) (take n))) Output: user=> (count-words "135-0.txt" 10) (["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939] ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])  ## COBOL  IDENTIFICATION DIVISION. PROGRAM-ID. WordFrequency. AUTHOR. 
Bill Gunshannon. DATE-WRITTEN. 30 Jan 2020. ************************************************************ ** Program Abstract: ** Given a text file and an integer n, print the n most ** common words in the file (and the number of their ** occurrences) in decreasing frequency. ** ** A file named Parameter.txt provides this information. ** Format is: ** 12345678901234567890123456789012345678901234567890 ** |------------------|----| ** ^^^^^^^^^^^^^^^^ ^^^^ ** | | ** Source Text File Number of words with count ** 20 Characters 5 digits with leading zeroes ** ** ************************************************************ ENVIRONMENT DIVISION. INPUT-OUTPUT SECTION. FILE-CONTROL. SELECT Parameter-File ASSIGN TO "Parameter.txt" ORGANIZATION IS LINE SEQUENTIAL. SELECT Input-File ASSIGN TO Source-Text ORGANIZATION IS LINE SEQUENTIAL. SELECT Word-File ASSIGN TO "Word.txt" ORGANIZATION IS LINE SEQUENTIAL. SELECT Output-File ASSIGN TO "Output.txt" ORGANIZATION IS LINE SEQUENTIAL. SELECT Print-File ASSIGN TO "Printer.txt" ORGANIZATION IS LINE SEQUENTIAL. SELECT Sort-File ASSIGN TO DISK. DATA DIVISION. FILE SECTION. FD Parameter-File DATA RECORD IS Parameter-Record. 01 Parameter-Record. 05 Source-Text PIC X(20). 05 How-Many PIC 99999. FD Input-File DATA RECORD IS Input-Record. 01 Input-Record. 05 Input-Line PIC X(80). FD Word-File DATA RECORD IS Word-Record. 01 Word-Record. 05 Input-Word PIC X(20). FD Output-File DATA RECORD IS Output-Rec. 01 Output-Rec. 05 Output-Rec-Word PIC X(20). 05 Output-Rec-Word-Cnt PIC 9(5). FD Print-File DATA RECORD IS Print-Rec. 01 Print-Rec. 05 Print-Rec-Word PIC X(20). 05 Print-Rec-Word-Cnt PIC 9(5). SD Sort-File. 01 Sort-Rec. 05 Sort-Word PIC X(20). 05 Sort-Word-Cnt PIC 9(5). WORKING-STORAGE SECTION. 01 Eof PIC X VALUE 'F'. 01 InLine PIC X(80). 01 Word1 PIC X(20). 01 Current-Word PIC X(20). 01 Current-Word-Cnt PIC 9(5). 01 Pos PIC 99 VALUE 1. 01 Cnt PIC 99. 01 Report-Rank. 05 IRank PIC 99999 VALUE 1. 05 Rank PIC ZZZZ9. PROCEDURE DIVISION. 
Main-Program. ** ** Read the Parameters ** OPEN INPUT Parameter-File. READ Parameter-File. CLOSE Parameter-File. ** ** Open Files for first stage ** OPEN INPUT Input-File. OPEN OUTPUT Word-File. ** ** Pare\se the Source Text into a file of invidual words ** PERFORM UNTIL Eof = 'T' READ Input-File AT END MOVE 'T' TO Eof END-READ PERFORM Parse-a-Words MOVE SPACES TO Input-Record MOVE 1 TO Pos END-PERFORM. ** ** Cleanup from the first stage ** CLOSE Input-File Word-File ** ** Sort the individual words in alphabetical order ** SORT Sort-File ON ASCENDING KEY Sort-Word USING Word-File GIVING Word-File. ** ** Count each time a word is used ** PERFORM Collect-Totals. ** ** Sort data by number of usages per word ** SORT Sort-File ON DESCENDING KEY Sort-Word-Cnt USING Output-File GIVING Print-File. ** ** Show the work done ** OPEN INPUT Print-File. DISPLAY " Rank Word Frequency" PERFORM How-Many TIMES READ Print-File MOVE IRank TO Rank DISPLAY Rank " " Print-Rec ADD 1 TO IRank END-PERFORM. ** ** Cleanup ** CLOSE Print-File. CALL "C$DELETE" USING "Word.txt" ,0         CALL "C$DELETE" USING "Output.txt" ,0 STOP RUN. Parse-a-Words. INSPECT Input-Record CONVERTING '-.,"();:/[]{}!?|' TO SPACE PERFORM UNTIL Pos > FUNCTION STORED-CHAR-LENGTH(Input-Record) UNSTRING Input-Record DELIMITED BY SPACE INTO Word1 WITH POINTER Pos TALLYING IN Cnt MOVE FUNCTION TRIM(FUNCTION LOWER-CASE(Word1)) TO Word-Record IF Word-Record NOT EQUAL SPACES AND Word-Record IS ALPHABETIC THEN WRITE Word-Record END-IF END-PERFORM. Collect-Totals. 
MOVE 'F' to Eof OPEN INPUT Word-File OPEN OUTPUT Output-File READ Word-File MOVE Input-Word TO Current-Word MOVE 1 to Current-Word-Cnt PERFORM UNTIL Eof = 'T' READ Word-File AT END MOVE 'T' TO Eof END-READ IF FUNCTION TRIM(Word-Record) EQUAL FUNCTION TRIM(Current-Word) THEN ADD 1 to Current-Word-Cnt ELSE MOVE Current-Word TO Output-Rec-Word MOVE Current-Word-Cnt TO Output-Rec-Word-Cnt WRITE Output-Rec MOVE 1 to Current-Word-Cnt MOVE Word-Record TO Current-Word MOVE SPACES TO Input-Record END-IF END-PERFORM. CLOSE Word-File Output-File. END-PROGRAM.  Output:  Rank Word Frequency 1 the 40551 2 of 19806 3 and 14730 4 a 14351 5 to 13775 6 in 11074 7 he 09480 8 was 08613 9 that 07632 10 his 06446 11 it 06335 12 had 06181 13 is 06097 14 which 05135 15 with 04469  ## Common Lisp  (defun count-word (n pathname) (with-open-file (s pathname :direction :input) (loop for line = (read-line s nil nil) while line nconc (list-symb (drop-noise line)) into words finally (return (subseq (sort (pair words) #'> :key #'cdr) 0 n))))) (defun list-symb (s) (let ((*read-eval* nil)) (read-from-string (concatenate 'string "(" s ")")))) (defun drop-noise (s) (delete-if-not #'(lambda (x) (or (alpha-char-p x) (equal x #\space) (equal x #\-))) s)) (defun pair (words &aux (hash (make-hash-table)) ac) (dolist (word words) (incf (gethash word hash 0))) (maphash #'(lambda (e n) (push (,e . ,n) ac)) hash) ac)  Output: > (count-word 10 "c:/temp/135-0.txt") ((THE . 40738) (OF . 19922) (AND . 14878) (A . 14419) (TO . 13702) (IN . 11172) (HE . 9577) (WAS . 8612) (THAT . 7768) (IT . 
6467))  ## Crystal require "http/client"require "regex" # Get the text from the internetresponse = HTTP::Client.get "https://www.gutenberg.org/files/135/135-0.txt"text = response.body text .downcase .scan(/[a-zA-ZáéíóúÁÉÍÓÚâêôäüöàèìòùñ']+/) .reduce({} of String => Int32) { |hash, match| word = match hash[word] = hash.fetch(word, 0) + 1 # using fetch to set a default value (1) to the new found word hash } .to_a # convert the returned hash to an array of tuples (String, Int32) -> {word, sum} .sort { |a, b| b <=> a }[0..9] # sort and get the first 10 elements .each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result Output: 1 the 41092 2 of 19954 3 and 14943 4 a 14556 5 to 13953 6 in 11219 7 he 9649 8 was 8622 9 that 7924 10 it 6661  ## D import std.algorithm : sort;import std.array : appender, split;import std.range : take;import std.stdio : File, writefln, writeln;import std.typecons : Tuple;import std.uni : toLower; //Container for a word and how many times it has been seenalias Pair = Tuple!(string, "k", int, "v"); void main() { int[string] wcnt; //Read the file line by line foreach (line; File("135-0.txt").byLine) { //Split the words on whitespace foreach (word; line.split) { //Increment the times the word has been seen wcnt[word.toLower.idup]++; } } //Associative arrays cannot be sort, so put the key/value in an array auto wb = appender!(Pair[]); foreach(k,v; wcnt) { wb.put(Pair(k,v)); } Pair[] sw = wb.data.dup; //Sort the array, and display the top ten values writeln("Rank Word Frequency"); int rank=1; foreach (word; sw.sort!"a.v>b.v".take(10)) { writefln("%4s %-10s %9s", rank++, word.k, word.v); }} Output: Rank Word Frequency 1 the 40368 2 of 19863 3 and 14470 4 a 14277 5 to 13587 6 in 11019 7 he 9212 8 was 8346 9 that 7251 10 his 6414 ## Delphi Translation of: C#  program Word_frequency; {$APPTYPE CONSOLE} uses  System.SysUtils,  System.IOUtils,  System.Generics.Collections,  System.Generics.Defaults,  System.RegularExpressions; 
type  TWords = TDictionary<string, Integer>;   TFreqPair = TPair<string, Integer>;   TFreq = TArray<TFreqPair>; function CreateValueCompare: IComparer<TFreqPair>;begin  Result := TComparer<TFreqPair>.Construct(    function(const Left, Right: TFreqPair): Integer    begin      Result := Right.Value - Left.Value;    end);end; function WordFrequency(const Text: string): TFreq;var  words: TWords;  match: TMatch;  w: string;begin  words := TWords.Create();  match := TRegEx.Match(Text, '\w+');  while match.Success do  begin    w := match.Value;    if words.ContainsKey(w) then      words[w] := words[w] + 1    else      words.Add(w, 1);    match := match.NextMatch();  end;   Result := words.ToArray;  words.Free;  TArray.Sort<TFreqPair>(Result, CreateValueCompare);end; var  Text: string;  rank: integer;  Freq: TFreq;  w: TFreqPair; begin  Text := TFile.ReadAllText('135-0.txt').ToLower();   Freq := WordFrequency(Text);   Writeln('Rank  Word  Frequency');  Writeln('====  ====  =========');   for rank := 1 to 10 do  begin    w := Freq[rank - 1];    Writeln(format('%2d   %6s   %5d', [rank, w.Key, w.Value]));  end;   readln;end.
Output:
Rank  Word  Frequency
====  ====  =========
1      the   41040
2       of   19951
3      and   14942
4        a   14539
5       to   13941
6       in   11209
7       he    9646
8      was    8620
9     that    7922
10       it    6659


## F#

open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)
Output:
("the", 41088)
("of", 19949)
("and", 14942)
("a", 14596)
("to", 13951)
("in", 11214)
("he", 9648)
("was", 8621)
("that", 7924)
("it", 6661)


## Factor

This program reads its input from stdin, so redirect a file to it on the command line (e.g. invoking the program in Windows: >factor word-count.factor < input.txt). A word here is simply any string delimited by some combination of spaces, punctuation, or newlines.

USING: ascii io math.statistics prettyprint sequences
splitting ;
IN: rosetta-code.word-count

lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .
Output:
{
{ "the" 41021 }
{ "of" 19945 }
{ "and" 14938 }
{ "a" 14522 }
{ "to" 13938 }
{ "in" 11201 }
{ "he" 9600 }
{ "was" 8618 }
{ "that" 7822 }
{ "it" 6532 }
}


## FreeBASIC

  #Include "file.bi" type tally      as string s      as long lend type Sub quicksort(array() As String,begin As Long,Finish As Long) Dim As Long i=begin,j=finish Dim As String x =array(((I+J)\2)) While I <= J While array(I) < X :I+=1:Wend While array(J) > X :J-=1:Wend If I<=J Then Swap array(I),array(J): I+=1:J-=1 Wend If J >begin Then quicksort(array(),begin,J) If I <Finish Then quicksort(array(),I,Finish)End Sub Sub tallysort(array() As tally,begin As Long,Finish As long) Dim As Long i=begin,j=finish Dim As tally x =array(((I+J)\2)) While I <= J While array(I).l > X .l:I+=1:Wend While array(J).l < X .l:J-=1:Wend If I<=J Then Swap array(I),array(J): I+=1:J-=1 Wend If J >begin Then tallysort(array(),begin,J) If I <Finish Then tallysort(array(),I,Finish) End Sub  Function loadfile(file As String) As String	If Fileexists(file)=0 Then Print file;" not found":Sleep:End   Dim As Long  f=Freefile    Open file For Binary Access Read As #f    Dim As String text    If Lof(f) > 0 Then      text = String(Lof(f), 0)      Get #f, , text    End If    Close #f    Return textEnd Function Function String_Split(s_in As String,chars As String,result() As String) As Long    Dim As Long ctr,ctr2,k,n,LC=Len(chars)    Dim As boolean tally(Len(s_in))    #macro check_instring()    n=0    While n<Lc        If chars[n]=s_in[k] Then             tally(k)=true            If (ctr2-1) Then ctr+=1            ctr2=0            Exit While        End If        n+=1    Wend    #endmacro     #macro splice()    If tally(k) Then        If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)        ctr2=0    End If    #endmacro    '==================  LOOP TWICE =======================    For k  =0 To Len(s_in)-1        ctr2+=1:check_instring()    Next k     If ctr=0 Then         If Len(s_in) Andalso Instr(chars,Chr(s_in)) Then ctr=1':         End If    If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else  Return 0    For k  =0 To Len(s_in)-1        ctr2+=1:splice()    Next k    
'===================== Last one ========================    If ctr2>0 Then        Redim Preserve result(1 To ctr+1)        result(ctr+1)=Mid(s_in,k+1-ctr2,ctr2)    End If     Return Ubound(result)End Function Redim As String s()redim as tally t()dim as string p1,p2,deliminatorsdim as long count,jmpdim as double tm=timer Var L=loadfile("rosettalesmiserables.txt")L=lcase(L)'get deliminatorsfor n as long=1 to 96      p1+=chr(n)nextfor n as long=123 to 255    p2+=chr(n)next deliminators=p1+p2 string_split(L,deliminators,s()) quicksort(s(),lbound(s),ubound(s)) For n As Long=lbound(s)  To ubound(s)-1      if s(n+1)=s(n) then jmp+=1      if s(n+1)<>s(n) then             count+=1            redim preserve t(1 to count)            t(count).s=s(n)            t(count).l=jmp            jmp=0            end ifNext tallysort(t(),lbound(t),ubound(t))'sort by frequencyprint "frequency","word"printfor n as long=lbound(t) to lbound(t)+9      print t(n).l,t(n).s      next Printprint "time for operation  ";timer-tm;" seconds"sleep
Output:
I saved and reloaded the file as ascii text.
frequency     word

41098        the
19955        of
14939        and
14557        a
13953        to
11219        in
9648         he
8621         was
7923         that
6660         it

time for operation   1.099869600031525 seconds



## Frink

This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:

• Frink has a Unicode-aware function, wordList[str], which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.) It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out. These are likely punctuation and spaces. There is also a two-argument function, wordList[str, lang] which allows you to specify a language code e.g. "fr" to use the rules of French (or many other human languages) to perform correct word-breaking according to the rules of that language!
• The file fetched from Project Gutenberg is supposed to be encoded in UTF-8, but their servers incorrectly either declare it as Windows-1252 or send no character encoding at all, so this program specifies the encoding explicitly.
• Frink has a Unicode-aware lowercase function, lc[str] that correctly handles accented characters and may even make a string longer.
• Frink can normalize Unicode characters with its normalizeUnicode function so the same word encoded two different ways in Unicode can be treated consistently. A Unicode string can use various methods to encode what is essentially the same character/glyph. For example, the character ô can be represented as either "\u00F4" or "\u006F\u0302". The former is a "precomposed" character, "LATIN SMALL LETTER O WITH CIRCUMFLEX", and the latter is two Unicode codepoints, an o (LATIN SMALL LETTER O) followed by "COMBINING CIRCUMFLEX ACCENT". (This is usually referred to as a "decomposed" representation.) Unicode normalization rules can convert these "equivalent" encodings into a canonical representation. This makes two different strings which look equivalent to a human (but are very different in their codepoints) be treated as the same by a computer, and these programs will count them the same. Even if the Project Gutenberg document uses precomposed and decomposed representations for the same words, this program will fix it and count them the same! See the Unicode Normal Forms specification for more about these normalization rules. Frink implements all of them (NFC, NFD, NFKC, NFKD). NFC is the default in normalizeUnicode[str, encoding=NFC]. They're interesting!
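The effect of normalization on word counting can be tried outside Frink as well. The Python sketch below uses the standard `unicodedata` module to show that the precomposed and decomposed spellings of ô become identical under NFC, and that lowercasing can lengthen a string. The helper `normalize_word` and the sample words are hypothetical, and the sketch illustrates the idea rather than Frink's implementation.

```python
import unicodedata
from collections import Counter

precomposed = "\u00F4"       # LATIN SMALL LETTER O WITH CIRCUMFLEX
decomposed = "\u006F\u0302"  # LATIN SMALL LETTER O + COMBINING CIRCUMFLEX ACCENT

def normalize_word(word: str) -> str:
    # NFC composes a base letter and its combining marks into a single
    # code point where one exists; lowercasing then folds case.
    return unicodedata.normalize("NFC", word).lower()

# The raw strings differ, but their normalized forms are identical.
print(precomposed == decomposed)
print(normalize_word(precomposed) == normalize_word(decomposed))

# Three spellings of the same word end up under one key.
counts = Counter(normalize_word(w) for w in ["H\u00F4tel", "Ho\u0302tel", "h\u00F4tel"])
print(counts)

# Lowercasing can lengthen a string: U+0130 (capital I with dot above)
# lowercases to "i" followed by a combining dot above.
print(len("\u0130"), len("\u0130".lower()))
```

Without the normalization step, the second spelling would be counted as a separate word even though it renders identically.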

How many other languages in this page do all or any of this correctly?

There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:

d = new dict
for w = select[wordList[read[normalizeUnicode["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
   d.increment[lc[w], 1]

println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]
Output:
[the, 40802]
[of, 19933]
[and, 14924]
[a, 14450]
[to, 13719]
[in, 11184]
[he, 9636]
[was, 8617]
[that, 7901]
[it, 6641]


Next, a "showing off" one-liner that works in recent versions of Frink that uses the countToArray function which easily creates sorted frequency lists and the formatTable function that formats into a nice table with columns lined up, and still performs full Unicode-aware normalization, capitalization, and word-breaking:

formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]
Output:
 the 36629
of 19602
and 14063
a 13447
to 13345
in 10259
was  8541
that  7303
he  6812


## Go

Translation of: Kotlin
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "regexp"
    "sort"
    "strings"
)

type keyval struct {
    key string
    val int
}

func main() {
    reg := regexp.MustCompile(`\p{Ll}+`)
    bs, err := ioutil.ReadFile("135-0.txt")
    if err != nil {
        log.Fatal(err)
    }
    text := strings.ToLower(string(bs))
    matches := reg.FindAllString(text, -1)
    groups := make(map[string]int)
    for _, match := range matches {
        groups[match]++
    }
    var keyvals []keyval
    for k, v := range groups {
        keyvals = append(keyvals, keyval{k, v})
    }
    sort.Slice(keyvals, func(i, j int) bool {
        return keyvals[i].val > keyvals[j].val
    })
    fmt.Println("Rank  Word  Frequency")
    fmt.Println("====  ====  =========")
    for rank := 1; rank <= 10; rank++ {
        word := keyvals[rank-1].key
        freq := keyvals[rank-1].val
        fmt.Printf("%2d    %-4s    %5d\n", rank, word, freq)
    }
}
Output:
Rank  Word  Frequency
====  ====  =========
1    the     41088
2    of      19949
3    and     14942
4    a       14596
5    to      13951
6    in      11214
7    he       9648
8    was      8621
9    that     7924
10    it       6661


## Groovy

Solution:

def topWordCounts = { String content, int n ->
    def mapCounts = [:]
    content.toLowerCase().split(/\W+/).each {
        mapCounts[it] = (mapCounts[it] ?: 0) + 1
    }
    def top = (mapCounts.sort { a, b -> b.value <=> a.value }.collect{ it })[0..<n]
    println "Rank Word Frequency\n==== ==== ========="
    (0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
}

Test:

def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)

Output:

Rank Word Frequency
==== ==== =========
1 the      41036
2 of       19946
3 and      14940
4 a        14589
5 to       13939
6 in       11204
7 he        9645
8 was       8619
9 that      7922
10 it        6659

## Haskell

Translation of: Clojure
module Main where

import Control.Category   -- (>>>)
import Control.Monad      -- when
import Data.Char          -- toLower, isSpace
import Data.List          -- sortBy, (Foldable(foldl')), filter
import Data.Ord           -- Down
import System.IO          -- stdin, ReadMode, openFile, hClose
import System.Environment -- getArgs

-- containers
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import qualified Data.IntMap.Strict as IM

-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

frequencies :: Ord a => [a] -> Map a Integer
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty
{-# SPECIALIZE frequencies :: [Text] -> Map Text Integer #-}

main :: IO ()
main = do
  args <- getArgs
  (n,hand,filep) <- case length args of
    0 -> return (10,stdin,False)
    1 -> return (read $ head args,stdin,False)
    _ -> let (ns:fp:_) = args
         in fmap (\h -> (read ns,h,True)) (openFile fp ReadMode)
  T.hGetContents hand >>=
    (T.map toLower
    >>> T.split isSpace
    >>> filter (not <<< T.null)
    >>> frequencies
    >>> M.toList
    >>> sortBy (comparing (Down <<< snd)) -- sort the opposite way
    >>> take n
    >>> print)
  when filep (hClose hand)

Output:
$ ./word_count 10 < ~/doc/les_miserables*
[("the",40368),("of",19863),("and",14470),("a",14277),("to",13587),("in",11019),("he",9212),("was",8346),("that",7251),("his",6414)]


Or, perhaps a little more simply:

import qualified Data.Text.IO as T
import qualified Data.Text as T

import Data.List (group, sort, sortBy)
import Data.Ord (comparing)

frequentWords :: T.Text -> [(Int, T.Text)]
frequentWords =
  sortBy (flip $ comparing fst) .
  fmap ((,) . length <*> head) . group . sort . T.words . T.toLower

main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)

Output:
(40370,"the")
(19863,"of")
(14470,"and")
(14277,"a")
(13587,"to")
(11019,"in")
(9212,"he")
(8346,"was")
(7251,"that")
(6414,"his")


## J

Text acquisition: store the entire text from the web page http://www.gutenberg.org/files/135/135-0.txt (the plain text UTF-8 link) into a file. This Linux example uses ~/downloads/books/LesMis.txt .

Program: Reading from left to right,

•   10 {. "ten take" from an array computed by the words to the right.
•   \:~ "sort descending" by items of the array computed by whatever is to the right.
•   (#;{.)/.~ "tally linked with item" key
•   ;: "words" parses the argument to its right as a J sentence.
•   tolower changes to a common case

Hence the remainder of the J sentence must clean up the text after loading the file.

•   The parenthesized expression (a.-.Alpha_j_,' ') computes a vector of the J alphabet excluding [a-zA-Z ]
•   ((e.&(a.-.Alpha_j_,' '))(,:&' '))} substitutes the space character for the unwanted characters.
•   1!:1 reads the file named in the box <

   10{.\:~(#;{.)/.~;:tolower((e.&(a.-.Alpha_j_,' '))(,:&' '))}1!:1<jpath'~/downloads/books/LesMis.txt'
┌─────┬────┐
│41093│the │
├─────┼────┤
│19954│of  │
├─────┼────┤
│14943│and │
├─────┼────┤
│14558│a   │
├─────┼────┤
│13953│to  │
├─────┼────┤
│11219│in  │
├─────┼────┤
│9649 │he  │
├─────┼────┤
│8622 │was │
├─────┼────┤
│7924 │that│
├─────┼────┤
│6661 │it  │
└─────┴────┘


## Java

Translation of: Kotlin
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("135-0.txt");
        byte[] bytes = Files.readAllBytes(path);
        String text = new String(bytes);
        text = text.toLowerCase();
        Pattern r = Pattern.compile("\\p{javaLowerCase}+");
        Matcher matcher = r.matcher(text);
        Map<String, Integer> freq = new HashMap<>();
        while (matcher.find()) {
            String word = matcher.group();
            Integer current = freq.getOrDefault(word, 0);
            freq.put(word, current + 1);
        }

        List<Map.Entry<String, Integer>> entries = freq.entrySet()
            .stream()
            .sorted((i1, i2) -> Integer.compare(i2.getValue(), i1.getValue()))
            .limit(10)
            .collect(Collectors.toList());

        System.out.println("Rank  Word  Frequency");
        System.out.println("====  ====  =========");
        int rank = 1;
        for (Map.Entry<String, Integer> entry : entries) {
            String word = entry.getKey();
            Integer count = entry.getValue();
            System.out.printf("%2d    %-4s    %5d\n", rank++, word, count);
        }
    }
}

Output:
Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661


## jq

The following solution uses the concept of a "bag of words" (bow), here realized as a JSON object with the words as keys and the frequency of a word as the corresponding value.

To avoid issues with case folding, the "letters" here are just the ASCII alphabet and hyphen, but a "word" may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.

< 135-0.txt jq -nR --argjson n 10 '
  def bow(stream):
    reduce stream as $word ({}; .[($word|tostring)] += 1);

  bow(inputs | gsub("[^-a-zA-Z]"; " ") | splits("  *") | ascii_downcase | select(test("^[a-z][-a-z]*$")))
  | to_entries
  | sort_by(.value)
  | .[- $n :]
  | reverse
  | from_entries
'

#### Output

{
  "the": 41087,
  "of": 19937,
  "and": 14932,
  "a": 14552,
  "to": 13738,
  "in": 11209,
  "he": 9649,
  "was": 8621,
  "that": 7923,
  "it": 6661
}


## Julia

Works with: Julia version 1.0

using FreqTables

txt = read("les-mis.txt", String)
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])

Output:
Dim1   │
───────┼──────
"the"  │ 36671
"of"   │ 19618
"and"  │ 14081
"to"   │ 13541
"a"    │ 13529
"in"   │ 10265
"was"  │  8545
"that" │  7326
"he"   │  6816
"had"  │  6140


## KAP

The program below defines the function 'stats', which accepts the name of the file containing the text.

∇ stats (file) {
  content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
  sorted ← (⍋⊇⊢) content
  selection ← 1,2≢/sorted
  words ← selection / sorted
  {⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
}

Output:
┏━━━━━━━━━━━━┓
┃ "the" 40387┃
┃  "of" 19913┃
┃ "and" 14742┃
┃   "a" 14289┃
┃  "to" 13819┃
┃  "in" 11088┃
┃  "he"  9430┃
┃ "was"  8597┃
┃"that"  7516┃
┃ "his"  6435┃
┗━━━━━━━━━━━━┛


## Kotlin

The author of the Raku entry has given a good account of the difficulties with this task and, in the absence of any clarification on the various issues, I've followed a similar 'literal' approach.

So, after first converting the text to lower case, I've assumed that a word is any sequence of one or more lower-case Unicode letters and obtained the same results as the Raku version.

There is no change in the results if the numerals 0-9 are also regarded as letters.

// version 1.1.3

import java.io.File

fun main(args: Array<String>) {
    val text = File("135-0.txt").readText().toLowerCase()
    val r = Regex("""\p{javaLowerCase}+""")
    val matches = r.findAll(text)
    val wordGroups = matches.map { it.value }
                    .groupBy { it }
                    .map { Pair(it.key, it.value.size) }
                    .sortedByDescending { it.second }
                    .take(10)
    println("Rank  Word  Frequency")
    println("====  ====  =========")
    var rank = 1
    for ((word, freq) in wordGroups)
        System.out.printf("%2d    %-4s    %5d\n", rank++, word, freq)
}

Output:
Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661


## Liberty BASIC

dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
dim lines$(150000)
open "135-0.txt" for input as #txt
while EOF(#txt)=0 and total < 150000
    input #txt, lines$(total)
    total=total+1
wend
for a = 1 to total
    token$ = "?"
    index=0
    new=0
    while token$ <> ""
        new=0
        index = index + 1
        token$ = lower$(word$(lines$(a),index))
        token$=replstr$(token$,".","")
        token$=replstr$(token$,",","")
        token$=replstr$(token$,";","")
        token$=replstr$(token$,"!","")
        token$=replstr$(token$,"?","")
        token$=replstr$(token$,"-","")
        token$=replstr$(token$,"_","")
        token$=replstr$(token$,"~","")
        token$=replstr$(token$,"+","")
        token$=replstr$(token$,"0","")
        token$=replstr$(token$,"1","")
        token$=replstr$(token$,"2","")
        token$=replstr$(token$,"3","")
        token$=replstr$(token$,"4","")
        token$=replstr$(token$,"5","")
        token$=replstr$(token$,"6","")
        token$=replstr$(token$,"7","")
        token$=replstr$(token$,"8","")
        token$=replstr$(token$,"9","")
        token$=replstr$(token$,"/","")
        token$=replstr$(token$,"<","")
        token$=replstr$(token$,">","")
        token$=replstr$(token$,":","")
        for b = 1 to newwordcount
            if words$(b,1)=token$ then
                num=val(words$(b,2))+1
                num$=str$(num)
                if len(num$)=1 then num$="0000"+num$
                if len(num$)=2 then num$="000"+num$
                if len(num$)=3 then num$="00"+num$
                if len(num$)=4 then num$="0"+num$
                words$(b,2)=num$
                new=1
                exit for
            end if
        next b
        if new<>1 then newwordcount=newwordcount+1:words$(newwordcount,1)=token$:words$(newwordcount,2)="00001":print newwordcount;" ";token$
    wend
next a
print
sort words$(), 1, newwordcount, 2
print "Count Word"
print "===== ================="
for a = newwordcount to newwordcount-10 step -1
    print words$(a,2);" ";words$(a,1)
next a
print "-----------------------"
print newwordcount;" unique words found."
print "End of program"
close #txt
end

Output:
Count Word
===== =================
40292 the
19825 of
14703 and
14249 a
13594 to
122613
11061 in
09436 he
08579 was
07530 that
06428 his
-----------------------
29109 unique words found.


## Lua

Works with: lua version 5.3

-- This program takes two optional command line arguments.  The first (arg[1])
-- specifies the input file, or defaults to standard input.  The second
-- (arg[2]) specifies the number of results to show, or defaults to 10.

-- in freq, each key is a word and each value is its count
local freq = {}
for line in io.lines(arg[1]) do
    -- %a stands for any letter
    for word in string.gmatch(string.lower(line), "%a+") do
        if not freq[word] then
            freq[word] = 1
        else
            freq[word] = freq[word] + 1
        end
    end
end

-- in array, each entry is an array whose first value is the count and whose
-- second value is the word
local array = {}
for word, count in pairs(freq) do
    table.insert(array, {count, word})
end
table.sort(array, function (a, b) return a[1] > b[1] end)

for i = 1, arg[2] or 10 do
    io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
end

Output:
❯ ./wordcount.lua 135-0.txt
  41093 the
  19954 of
  14943 and
  14558 a
  13953 to
  11219 in
   9649 he
   8622 was
   7924 that
   6661 it

Relevant documentation: io.lines, gmatch, patterns like %a


## Mathematica / Wolfram Language

TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset

Output:
the 41088
of 19936
and 14931
a 14536
to 13738
in 11208
he 9607
was 8621
that 7825
it 6535


## MATLAB / Octave

function [result,count] = word_frequency()
URL='https://www.gutenberg.org/files/135/135-0.txt';
text=webread(URL);
DELIMITER={' ', ',', ';', ':', '.', '/', '*', '!', '?', '<', '>', '(', ')', '[', ']','{', '}', '&','$','§','"','”','“','-','—','‘','\t','\n','\r'};
words  = sort(strsplit(lower(text),DELIMITER));
flag   = [find(~strcmp(words(1:end-1),words(2:end))),length(words)];
dwords = words(flag);   % get distinct words, and ...
count  = diff([0,flag]);  % ... the corresponding occurrence frequency
[tmp,idx] = sort(-count);       % sort according to occurrence
result = dwords(idx);
count  = count(idx);
for k  =  1:10,
        fprintf(1,'%d\t%s\n',count(k),result{k})
end
Output:
41039   the
19950   of
14942   and
14523   a
13941   to
11208   in
9605    he
8620    was
7824    that
6533    it


## Nim

import tables, strutils, sequtils, httpclient

proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]

var client = newHttpClient()
var text = client.getContent("https://www.gutenberg.org/files/135/135-0.txt")

var wordFrequencies = text.toLowerAscii.splitWhitespace.toCountTable
wordFrequencies.sort
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
  echo alignLeft($count, 8), word

Output:
40377   the
19870   of
14469   and
14278   a
13590   to
11025   in
9213    he
8347    was
7249    that
6414    his


## Objeck

use System.IO.File;
use Collection;
use RegEx;

class Rosetta {
  function : Main(args : String[]) ~ Nil {
    if(args->Size() <> 1) {
      return;
    };

    input := FileReader->ReadFile(args[0]);
    filter := RegEx->New("\\w+");
    words := filter->Find(input);

    word_counts := StringMap->New();
    each(i : words) {
      word := words->Get(i)->As(String);
      if(word <> Nil & word->Size() > 0) {
        word := word->ToLower();
        if(word_counts->Has(word)) {
          count := word_counts->Find(word)->As(IntHolder);
          count->Set(count->Get() + 1);
        }
        else {
          word_counts->Insert(word, IntHolder->New(1));
        };
      };
    };

    count_words := IntMap->New();
    words := word_counts->GetKeys();
    each(i : words) {
      word := words->Get(i)->As(String);
      count := word_counts->Find(word)->As(IntHolder);
      count_words->Insert(count->Get(), word);
    };

    counts := count_words->GetKeys();
    counts->Sort();

    index := 1;
    "Rank\tWord\tFrequency"->PrintLine();
    "====\t====\t===="->PrintLine();
    for(i := count_words->Size() - 1; i >= 0; i -= 1;) {
      if(count_words->Size() - 10 <= i) {
        count := counts->Get(i);
        word := count_words->Find(count)->As(String);
        "{$index}\t{$word}\t{$count}"->PrintLine();
        index += 1;
      };
    };
  }
}

Output:

Rank    Word    Frequency
====    ====    ====
1       the     41036
2       of      19946
3       and     14940
4       a       14589
5       to      13939
6       in      11204
7       he      9645
8       was     8619
9       that    7922
10      it      6659


## OCaml

let () =
  let n =
    try int_of_string Sys.argv.(1)
    with _ -> 10
  in
  let ic = open_in "135-0.txt" in
  let h = Hashtbl.create 97 in
  let w = Str.regexp "[^A-Za-zéèàêâôîûœ]+" in
  try
    while true do
      let line = input_line ic in
      let words = Str.split w line in
      List.iter (fun word ->
        let word = String.lowercase_ascii word in
        match Hashtbl.find_opt h word with
        | None -> Hashtbl.add h word 1
        | Some x -> Hashtbl.replace h word (succ x)
      ) words
    done
  with End_of_file ->
    close_in ic;
    let l = Hashtbl.fold (fun word count acc -> (word, count)::acc) h [] in
    let s = List.sort (fun (_, c1) (_, c2) -> compare c2 c1) l in
    let r = List.init n (fun i -> List.nth s i) in
    List.iter (fun (word, count) ->
      Printf.printf "%d  %s\n" count word
    ) r
Output:
$ ocaml str.cma word_freq.ml
41092  the
19954  of
14943  and
14554  a
13953  to
11219  in
9649  he
8622  was
7924  that
6661  it


## Perl

Translation of: Raku
$top = 10;

open $fh, "<", '135-0.txt';
($text = join '', <$fh>) =~ tr/A-Z/a-z/
    or die "Can't open '135-0.txt': $!\n";

@matcher = (
    qr/[a-z]+/,     # simple 7-bit ASCII
    qr/\w+/,        # word characters with underscore
    qr/[a-z0-9]+/,  # word characters without underscore
);

for $reg (@matcher) {
    print "\nTop $top using regex: " . $reg . "\n";
    @matches = $text =~ /$reg/g;
    my %words;
    for $w (@matches) { $words{$w}++ };
    $c = 0;
    for $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
        printf "%-7s %6d\n", $w, $words{$w};
        last if ++$c >= $top;
    }
}

Output:
Top 10 using regex: (?^:[a-z]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661

Top 10 using regex: (?^:\w+)
the      41036
of       19946
and      14940
a        14589
to       13939
in       11204
he        9645
was       8619
that      7922
it        6659

Top 10 using regex: (?^:[a-z0-9]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661


## Phix

without javascript_semantics
?"loading..."
constant subs = '\t'&"\r\n_.,\"\'!;:?][()|=<>#/*{}[email protected]%&$",
         reps = repeat(' ',length(subs)),
         fn = open("135-0.txt","r")
string text = lower(substitute_all(get_text(fn),subs,reps))
close(fn)
sequence words = append(sort(split(text,no_empty:=true)),"")
constant wf = new_dict()
string last = words[1]
integer count = 1
for i=2 to length(words) do
    if words[i]!=last then
        setd({count,last},0,wf)
        count = 0
        last = words[i]
    end if
    count += 1
end for
count = 10
function visitor(object key, object /*data*/, object /*user_data*/)
    ?key
    count -= 1
    return count>0
end function
traverse_dict(routine_id("visitor"),0,wf,true)

Output:
loading...
{40743,"the"}
{19925,"of"}
{14881,"and"}
{14474,"a"}
{13704,"to"}
{11174,"in"}
{9623,"he"}
{8613,"was"}
{7867,"that"}
{6612,"it"}


## Phixmonti

include ..\Utilitys.pmt

"loading..." ?
"135-0.txt" "r" fopen var fn
" "
true
while
    fn fgets number? if drop fn fclose false else lower " " chain chain true endif
endwhile

"process..." ?
len for
    var i
    i get dup 96 > swap 123 < and not if 32 i set endif
endfor
split sort

"count..." ?
( ) var words
"" var prev
1 var n
len for
    var i
    i get dup prev ==
    if
        drop n 1 + var n
    else
        words ( n prev ) 0 put var words var prev 1 var n
    endif
endfor
drop
words sort
10 for
    -1 * get ?
endfor
drop
Output:
loading...
process...
count...
[41093, "the"]
[19954, "of"]
[14943, "and"]
[14558, "a"]
[13953, "to"]
[11219, "in"]
[9649, "he"]
[8622, "was"]
[7924, "that"]
[6661, "it"]

=== Press any key to exit ===

## PHP

<?php

preg_match_all('/\w+/', file_get_contents($argv[1]), $words);
$frecuency = array_count_values($words[0]);
arsort($frecuency);

echo "Rank\tWord\tFrequency\n====\t====\t=========\n";
$i = 1;
foreach ($frecuency as $word => $count) {
    echo $i . "\t" . $word . "\t" . $count . "\n";
    if ($i >= 10) {
        break;
    }
    $i++;
}
Output:
Rank  Word  Frequency
====  ====  =========
1    the   36636
2     of   19615
3    and   14079
4     to   13535
5      a   13527
6     in   10256
7    was    8543
8   that    7324
9     he    6814


## PicoLisp

(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^:;")
(setq *Skip (chop *Delim))

(de word+ NIL
   (prog1
      (lowc (till *Delim T))
      (while (member (peek) *Skip) (char)) ) )

(off B)
(in "135-0.txt"
   (until (eof)
      (let W (word+)
         (if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
   (println L (val L)) )

Output:
"the" 41088
"of" 19949
"and" 14942
"a" 14545
"to" 13950
"in" 11214
"he" 9647
"was" 8620
"that" 7924
"it" 6661


## Prolog

Works with: SWI Prolog
print_top_words(File, N):-
    read_file_to_string(File, String, [encoding(utf8)]),
    re_split("\\w+", String, Words),
    lower_case(Words, Lower),
    sort(1, @=<, Lower, Sorted),
    merge_words(Sorted, Counted),
    sort(2, @>, Counted, Top_words),
    writef("Top %w words:\nRank\tCount\tWord\n", [N]),
    print_top_words(Top_words, N, 1).

lower_case([_], []):-!.
lower_case([_, Word|Words], [Lower - 1|Rest]):-
    string_lower(Word, Lower),
    lower_case(Words, Rest).

merge_words([], []):-!.
merge_words([Word - C1, Word - C2|Words], Result):-
    !,
    C is C1 + C2,
    merge_words([Word - C|Words], Result).
merge_words([W|Words], [W|Rest]):-
    merge_words(Words, Rest).

print_top_words([], _, _):-!.
print_top_words(_, 0, _):-!.
print_top_words([Word - Count|Rest], N, R):-
    writef("%w\t%w\t%w\n", [R, Count, Word]),
    N1 is N - 1,
    R1 is R + 1,
    print_top_words(Rest, N1, R1).

main:-
    print_top_words("135-0.txt", 10).

Output:
Top 10 words:
Rank    Count   Word
1       41040   the
2       19951   of
3       14942   and
4       14539   a
5       13941   to
6       11209   in
7       9646    he
8       8620    was
9       7922    that
10      6659    it


## PureBasic

EnableExplicit

Structure wordcount
  wkey$
  count.i
EndStructure

Define token.c, word$, idx.i, start.i, arg$
NewMap wordmap.i()
NewList wordlist.wordcount()

If OpenConsole("")
  arg$ = ProgramParameter(0)
  If arg$ = "" : End 1 : EndIf
  start = ElapsedMilliseconds()
  If ReadFile(0, arg$, #PB_Ascii)
    While Not Eof(0)
      token = ReadCharacter(0, #PB_Ascii)
      Select token
        Case 'A' To 'Z', 'a' To 'z'
          word$ + LCase(Chr(token))
        Default
          If word$
            wordmap(word$) + 1
            word$ = ""
          EndIf
      EndSelect
    Wend
    CloseFile(0)
    ForEach wordmap()
      AddElement(wordlist())
      wordlist()\wkey$ = MapKey(wordmap())
      wordlist()\count = wordmap()
    Next
    SortStructuredList(wordlist(), #PB_Sort_Descending, OffsetOf(wordcount\count), TypeOf(wordcount\count))
    PrintN("Elapsed milliseconds: " + Str(ElapsedMilliseconds() - start))
    PrintN("File: " + GetFilePart(arg$))
    PrintN(~"Rank\tCount\t\t  Word")
    If FirstElement(wordlist())
      For idx = 1 To 10
        Print(RSet(Str(idx), 2))
        Print(~"\t")
        Print(wordlist()\wkey$)
        Print(~"\t\t")
        PrintN(RSet(Str(wordlist()\count), 6))
        If NextElement(wordlist()) = 0
          Break
        EndIf
      Next
    EndIf
  EndIf
  Input()
EndIf

End
Output:
Elapsed milliseconds: 462
File: 135-0.txt
Rank	Count		  Word
1	the		 41093
2	of		 19954
3	and		 14943
4	a		 14558
5	to		 13953
6	in		 11219
7	he		  9649
8	was		  8622
9	that		  7924
10	it		  6661


## Python

### Collections

#### Python2.7

import collections
import re
import string
import sys

def main():
  counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
  print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":
  main()
Output:
$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939), ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

#### Python3.6

from collections import Counter
from re import findall

les_mis_file = 'les_mis_135-0.txt'

def _count_words(fname):
    with open(fname) as f:
        text = f.read()
    words = findall(r'\w+', text.lower())
    return Counter(words)

def most_common_words_in_file(fname, n):
    counts = _count_words(fname)
    for word, count in [['WORD', 'COUNT']] + counts.most_common(n):
        print(f'{word:>10} {count:>6}')


if __name__ == "__main__":
    n = int(input('How many?: '))
    most_common_words_in_file(les_mis_file, n)

Output:
How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659

### Sorted and groupby

Works with: Python version 3.7

"""
Word count task from Rosetta Code
http://www.rosettacode.org/wiki/Word_count#Python
"""
from itertools import (groupby,
                       starmap)
from operator import itemgetter
from pathlib import Path
from typing import (Iterable,
                    List,
                    Tuple)

FILEPATH = Path('lesMiserables.txt')
COUNT = 10


def main():
    words_and_counts = most_frequent_words(FILEPATH)
    print(*words_and_counts[:COUNT], sep='\n')


def most_frequent_words(filepath: Path,
                        *,
                        encoding: str = 'utf-8') -> List[Tuple[str, int]]:
    """
    A list of word-frequency pairs sorted by their occurrences.
    The words are read from the given file.
    """
    def word_and_frequency(word: str,
                           words_group: Iterable[str]) -> Tuple[str, int]:
        return word, capacity(words_group)

    file_contents = filepath.read_text(encoding=encoding)
    words = file_contents.lower().split()
    grouped_words = groupby(sorted(words))
    words_and_frequencies = starmap(word_and_frequency, grouped_words)
    return sorted(words_and_frequencies, key=itemgetter(1), reverse=True)


def capacity(iterable: Iterable) -> int:
    """Returns a number of elements in an iterable"""
    return sum(1 for _ in iterable)


if __name__ == '__main__':
    main()

Output:
('the', 40372)
('of', 19868)
('and', 14472)
('a', 14278)
('to', 13589)
('in', 11024)
('he', 9213)
('was', 8347)
('that', 7250)
('his', 6414)

### Collections, Sorted and Lambda

#!/usr/bin/python3
import collections
import re

count = 10

with open("135-0.txt") as f:
    text = f.read()

word_freq = sorted(
    collections.Counter(sorted(re.split(r"\W+", text.lower()))).items(),
    key=lambda c: c[1],  # sort by the count, not the word
    reverse=True,
)

for i in range(len(word_freq)):
    print("[{:2d}] {:>10} : {}".format(i + 1, word_freq[i][0], word_freq[i][1]))
    if i == count - 1:
        break

Output:
[ 1]        the : 41039
[ 2]         of : 19951
[ 3]        and : 14942
[ 4]          a : 14527
[ 5]         to : 13941
[ 6]         in : 11209
[ 7]         he : 9646
[ 8]        was : 8620
[ 9]       that : 7922
[10]         it : 6659


## R

I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.

wordcount<-function(file,n){
  punctuation=c("","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
  wordlist=scan(file,what=character())
  wordlist=tolower(wordlist)
  for(i in 1:length(punctuation)){
    wordlist=gsub(punctuation[i],"",wordlist,fixed=T)
  }
  df=data.frame("Word"=sort(unique(wordlist)),"Count"=rep(0,length(unique(wordlist))))
  for(i in 1:length(unique(wordlist))){
    df[i,2]=length(which(wordlist==df[i,1]))
  }
  df=df[order(df[,2],decreasing = T),]
  row.names(df)=1:nrow(df)
  return(df[1:n,])
}
Output:
> wordcount("MobyDick.txt",10)
Word Count
1   the 14346
2    of  6590
3   and  6340
4     a  4611
5    to  4572
6    in  4130
7  that  2903
8   his  2516
9    it  2308
10    i  1845


## Racket

#lang racket

(define (all-words f (case-fold string-downcase))
  (map case-fold (regexp-match* #px"\\w+" (file->string f))))

(define (l.|l| l) (cons (car l) (length l)))

(define (counts l (>? >)) (sort (map l.|l| (group-by values l)) >? #:key cdr))

(module+ main
  (take (counts (all-words "data/les-mis.txt")) 10))
Output:
'(("the" . 41036)
("of" . 19946)
("and" . 14940)
("a" . 14589)
("to" . 13939)
("in" . 11204)
("he" . 9645)
("was" . 8619)
("that" . 7922)
("it" . 6659))

## Raku

(formerly Perl 6)

Works with: Rakudo version 2020.08.1

Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.

This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is full of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.

We might try /A-Za-z/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those are letters, so words would be incorrectly split up (Misérables would be counted as 'mis' and 'rables', which is probably not what we want).

Actually, in this case /A-Za-z/ returns very nearly the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the". The text has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça" so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /A-Za-z/, but it is mostly by accident.

A more accurate regex matcher would be some kind of Unicode aware /\w/ minus underscore. It may also be useful, depending on your requirements, to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.
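
As a rough illustration of the points above, here is a minimal Python sketch (not part of the Raku entry) that runs three comparable matchers over a tiny made-up sentence. The names `SAMPLE`, `MATCHERS` and `count_words` are hypothetical, and Python's `\w` is assumed to behave like the Unicode-aware matcher described in the text.

```python
import re
from collections import Counter

# Hypothetical sample text, not taken from the novel.
SAMPLE = "The _the_ Alèthe said: ça, the-the."

# Python analogues of three of the matchers discussed above (names are made up).
MATCHERS = {
    "ascii": r"[a-z]+",           # 7-bit ASCII letters only
    "word": r"\w+",               # Unicode word characters, underscore included
    "no_underscore": r"[^\W_]+",  # Unicode word characters minus underscore
}


def count_words(text, pattern):
    """Lower-case the text and count every match of the pattern."""
    return Counter(re.findall(pattern, text.lower()))


counts = {name: count_words(SAMPLE, pat) for name, pat in MATCHERS.items()}

for name, counter in counts.items():
    print(name, counter["the"], sorted(counter))
```

The ASCII matcher splits the accented name and so over-counts "the", the plain `\w+` matcher keeps the underscore markup attached and so under-counts it, and the underscore-free matcher sits in between.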

Here is a sample that shows the result when using various different matchers.

sub MAIN ($filename, $top = 10) {
    my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
    my @matcher = (
        rx/ <[a..z]>+ /,    # simple 7-bit ASCII
        rx/ \w+ /,          # word characters with underscore
        rx/ <[\w]-[_]>+ /,  # word characters without underscore
        rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /   # word characters without underscore but with hyphens and contractions
    );
    for @matcher -> $reg {
        say "\nTop $top using regex: ", $reg.raku;
        .put for $file.comb( $reg ).Bag.sort(-*.value)[^$top];
    }
}
Output:

Passing in the file name and 10:

Top 10 using regex: rx/ <[a..z]>+ /
the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ \w+ /
the	41035
of	19946
and	14940
a	14577
to	13939
in	11204
he	9645
was	8619
that	7922
it	6659

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088
of	19949
and	14942
a	14596
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081
of	19930
and	14934
a	14587
to	13735
in	11204
he	9607
was	8620
that	7825
it	6535

It can be difficult to figure out what words the different regexes do or don't match. Here are the three more complex regexes, each with a list of the "words" it treats differently from /a..z/; i.e., each listed token is lumped into the corresponding top 10 word count by /a..z/ but not by this regex.

Top 10 using regex: rx/ \w+ /
the	41035	alèthe _the _the_
of	19946	of_ _of_
and	14940	_and_ paternoster_and
a	14577	_ça aïe ça keksekça aérostiers _a poréa panathenæa
to	13939	to_ _to
in	11204	_in
he	9645	_he
was	8619	_was
that	7922	_that
it	6659	_it

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088	alèthe
of	19949
and	14942
a	14596	poréa ça aérostiers panathenæa aïe keksekça
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081	will-o'-the-wisps alèthe skip-the-gutter police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change jean-the-screw will-o'-the-wisp
and	14934	come-and-see so-and-so cock-and-bull hide-and-seek sambre-and-meuse
a	14587	keksekça l'a ça now-a-days vis-a-vis a-dreaming police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change poréa panathenæa aérostiers a-hunting aïe die-of-hunger-if-you-have-a-fire
to	13735	to-morrow to-day hand-to-hand to-night well-to-do face-to-face
in	11204	in-pace son-in-law father-in-law whippers-in general-in-chief sons-in-law
he	9607	he's he'll
was	8620	police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change
that	7825	that's pick-me-down-that
it	6535	it's it'll
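
Listings like the ones above can be produced in several ways; the following Python sketch shows one possible approach under stated assumptions. The function `absorbed_tokens`, the constant names and the sample sentence are all hypothetical, and the patterns are Python approximations of the Raku regexes rather than exact equivalents.

```python
import re


def absorbed_tokens(text, pattern, target):
    """Tokens matched by `pattern` that an ASCII-only matcher would have
    broken up, yielding an extra occurrence of `target`."""
    found = set()
    for token in re.findall(pattern, text.lower()):
        if token != target and target in re.findall(r"[a-z]+", token):
            found.add(token)
    return sorted(found)


# Hypothetical sample text.
SAMPLE = "Alèthe met _the_ will-o'-the-wisp; the end."

WORD = r"\w+"
# Letters and digits, optionally joined by ', - or '- (an approximation).
COMPOUND = r"[^\W_]+(?:(?:'-|'|-)[^\W_]+)*"

with_underscore = absorbed_tokens(SAMPLE, WORD, "the")
with_compounds = absorbed_tokens(SAMPLE, COMPOUND, "the")
print(with_underscore)
print(with_compounds)
```

For each candidate token the sketch re-tokenizes it with the ASCII matcher and keeps it only if the target word falls out as a separate piece.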

One nice thing is that the Raku program isn't special-cased: it will work out of the box for any text or language.

Russian? No problem.

$ raku wf 14741-0.txt 5
Top 5 using regex: rx/ <[a..z]>+ /
the	176
of	119
gutenberg	93
project	87
to	80

Top 5 using regex: rx/ \w+ /
и	860
в	579
не	290
на	222
ты	195

Top 5 using regex: rx/ <[\w]-[_]>+ /
и	860
в	579
не	290
на	222
ты	195

Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
и	860
в	579
не	290
на	222
ты	195

Greek? Sure, why not.

$ raku wf 39963-0.txt 5
Top 5 using regex: rx/ <[a..z]>+ /
the	187
of	123
gutenberg	93
project	87
to	82

Top 5 using regex: rx/ \w+ /
και	1628
εις	986
δε	982
του	895
των	859

Top 5 using regex: rx/ <[\w]-[_]>+ /
και	1628
εις	986
δε	982
του	895
των	859

Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
και	1628
εις	986
δε	982
του	895
των	859

Of course, for the first matcher we are asking specifically to match Latin ASCII, so we end up with... well... Latin ASCII; the other three match word characters from any Unicode script.

## REXX

### version 1

This REXX version doesn't need to sort the list of words.
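
One way to list the top words without sorting is repeated maximum selection, with tied words sharing a rank. The Python sketch below illustrates that general idea only; it is not a translation of the REXX program, and the function name and sample counts are hypothetical.

```python
def top_without_sorting(counts, top):
    """Return (rank, word, count) triples for at least the `top` ranks,
    found by repeatedly scanning for the current maximum count."""
    remaining = dict(counts)  # work on a copy so the caller's dict is untouched
    ranked = []
    rank = 1
    while remaining and rank <= top:
        best = max(remaining.values())
        tied = [word for word, count in remaining.items() if count == best]
        for word in tied:
            ranked.append((rank, word, best))
            del remaining[word]  # plays the role of nullifying the count
        rank += len(tied)        # tied words use up several ranks
    return ranked


# Hypothetical counts for illustration.
sample_counts = {"the": 5, "of": 3, "and": 3, "a": 1}
result = top_without_sorting(sample_counts, 3)
print(result)
```

Each pass costs one scan of the remaining words, so this is attractive when only a handful of ranks is wanted from a large vocabulary.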

Extra code was added to handle some foreign letters   (non-Latin)   and also handle most accented letters.

This version recognizes all the accented letters that are present in the required/specified text (file)   (and some other non-Latin letters as well).

This means that the word     Alèthe     is treated as one word,   not as two words     Al   the     (and not thereby adding two separate words).

This version also supports words that contain embedded apostrophes ( ' )
[that is, within a word,   but not those words that start or end with an apostrophe;   for those encapsulated words,   the apostrophe is elided].

Thus,   it's   is counted separately from   it   and/or   its.

Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file.
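
The apostrophe rule described above (keep embedded apostrophes, drop one that starts or ends a word) can be sketched in Python as follows. This is an illustrative approximation, not the REXX logic itself: it assumes words are already separated by whitespace, and `clean_token` and `count_tokens` are hypothetical names.

```python
from collections import Counter


def clean_token(token):
    """Drop one leading and one trailing apostrophe; keep embedded ones."""
    if token.startswith("'"):
        token = token[1:]
    if token.endswith("'"):
        token = token[:-1]
    return token


def count_tokens(text):
    """Lower-case, split on whitespace, clean each token, skip empty ones."""
    cleaned = (clean_token(tok) for tok in text.lower().split())
    return Counter(tok for tok in cleaned if tok)


# Hypothetical sample text.
tally = count_tokens("It's 'its' it 'tis '")
print(tally)
```

With this rule `it's`, `its` and `it` are three separate entries, while a quoted `'its'` is folded into `its`.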

/*REXX pgm displays top 10 words in a file (includes foreign letters),  case is ignored.*/
parse arg fID top .                              /*obtain optional arguments from the CL*/
if fID=='' | fID==","  then fID= 'les_mes.txt'   /*None specified? Then use the default.*/
if top=='' | top==","  then top= 10              /*  "      "        "   "   "     "    */
call init                                        /*initialize varied bunch of variables.*/
call rdr
say right('word', 40)  " "  center(' rank ', 6)  "  count "   /*display title for output*/
say right('════', 40)  " "  center('══════', 6)  " ═══════"   /*   "    title separator.*/

     do  until otops==tops | tops>top            /*process enough words to satisfy  TOP.*/
     WL=;         mk= 0;    otops= tops          /*initialize the word list (to a NULL).*/

          do n=1  for c;    z= !.n;      k= @.z  /*process the list of words in the file*/
          if k==mk  then WL= WL z                /*handle cases of tied number of words.*/
          if k> mk  then do;  mk=k;  WL=z;  end  /*this word count is the current max.  */
          end   /*n*/

     wr= max( length(' rank '), length(top) )    /*find the maximum length of the rank #*/

          do d=1  for words(WL);  y= word(WL, d) /*process all words in the  word list. */
          if d==1  then w= max(10, length(@.y) ) /*use length of the first number used. */
          say right(y, 40)         right( commas(tops), wr)          right(commas(@.y), w)
          @.y= .                                 /*nullify word count for next go 'round*/
          end   /*d*/                            /* [↑]  this allows a non-sorted list. */

     tops= tops + words(WL)                      /*correctly handle any  tied  rankings.*/
     end        /*until*/
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: parse arg ?;  do jc=length(?)-3  to 1  by -3; ?=insert(',', ?, jc); end;  return ?
16bit:  do k=1 for xs; _=word(x,k); $=changestr('├'left(_,1),$,right(_,1)); end;  return
/*──────────────────────────────────────────────────────────────────────────────────────*/
init:   x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü';     xs= words(x)
        abcL="abcdefghijklmnopqrstuvwxyz'"       /*lowercase letters of Latin alphabet. */
        abcU= abcL;            upper abcU        /*uppercase version of Latin alphabet. */
        accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ'       /*some lowercase accented characters.  */
        accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ'       /*  "  uppercase    "         "        */
        accG= 'αßΓπΣσµτΦΘΩδφε'                   /*  "  upper/lowercase Greek letters.  */
        ll= abcL || abcL ||accL ||accL || accG               /*chars of  after letters. */
        uu= abcL || abcU ||accL ||accU || accG || xrange()   /*  "    " before    "     */
        @.= 0;    q= "'";    totW= 0;    !.= @.;    c= 0;    tops= 1;          return
/*──────────────────────────────────────────────────────────────────────────────────────*/
rdr:   do #=0  while lines(fID)\==0; $=linein(fID) /*loop whilst there're lines in file.*/
       if pos('├',$) \== 0  then call 16bit       /*are there any  16-bit  characters ?*/
       $= translate($, ll, uu)                    /*trans. uppercase letters to lower. */
          do while $\= '';  parse var $  z  $      /*process each word in the  $  line. */
          parse var  z     z1  2  zr  ''  -1  zL   /*obtain: first, middle, & last char.*/
          if z1==q  then do; z=zr; if z==''  then iterate; end /*starts with apostrophe?*/
          if zL==q  then z= strip(left(z, length(z) - 1))      /*ends     "       "    ?*/
          if z==''  then iterate                               /*if Z is now null, skip.*/
          if @.z==0  then do;  c=c+1; !.c=z;  end  /*bump word cnt; assign word to array*/
          totW= totW + 1;      @.z= @.z + 1        /*bump total words; bump a word count*/
          end   /*while*/
       end      /*#*/

    say commas(totW)     ' words found  ('commas(c)    "unique)  in "    commas(#),
                         ' records read from file: '     fID;        say;          return
output   when using the default inputs:
574,122  words found  (23,414 unique)  in  67,663  records read from file:  les_mes.txt

word    rank    count
════   ══════  ═══════
the      1     41,088
of      2     19,949
and      3     14,942
a      4     14,595
to      5     13,950
in      6     11,214
he      7      9,607
was      8      8,620
that      9      7,826
it     10      6,535


To see a list of the top 1,000 words that show (among other things) words like   it's   and other accented words, see the   discussion   page for this task.

### version 2

Inspired by version 1 and adapted for ooRexx. It ignores all characters other than a-z and A-Z (which are translated to a-z).

/*REXX program   reads  and  displays  a  count  of words a file.  Word case is ignored.*/
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
abcABC=abc||translate(abc)
parse arg fID_top                                /*obtain optional arguments from the CL*/
Parse Var fid_top fid ',' top
if fID=='' then fID= 'mis.TXT'                   /* Use default if not specified        */
if top=='' then top= 10                          /* Use default if not specified        */
occ.=0                                           /* occurrences of word (stem) in file  */
wn=0
Do While lines(fid)>0                            /*loop whilst there are lines in file. */
  line=linein(fID)
  line=translate(line,abc||abc,abcABC||xrange('00'x,'ff'x)) /*use only lowercase letters*/
  Do While line<>''
    Parse Var line word line                       /* take a word                         */
    If occ.word=0 Then Do                          /* not yet in word list                */
      wn=wn+1
      word.wn=word
      End
    occ.word=occ.word+1
    End
  End
Say 'We found' wn 'different words'
say right('word',40) ' rank   count '            /* header                              */
say right('----',40) '------ -------'            /* separator.                          */
tops=0
Do Until tops>=top | tops>=wn                    /*process enough words to satisfy  TOP.*/
  max_occ=0
  tl=''                                          /*initialize (possibly) a list of words*/
  Do wi=1 To wn                                  /*process the list of words in the file*/
    word=word.wi                                 /* take a word from the list           */
    Select
      When occ.word>max_occ Then Do              /* most occurrences so far             */
        tl=word                                  /* candidate for output                */
        max_occ=occ.word                         /* current maximum occurrences         */
        End
      When occ.word=max_occ Then Do              /* tied                                */
        tl=tl word                               /* add to output candidate             */
        End
      Otherwise                                  /* no candidate (yet)                  */
        Nop
      End
    End
    do d=1 for words(tl)
      word=word(tl,d)
      say right(word,40) right(tops+1,4) right(occ.word,8)
      occ.word=0                                /*nullify this word count for next time*/
      End
    tops=tops+words(tl)                         /*correctly handle the tied rankings.  */
  end
Say time('E') 'seconds elapsed'
Output:
We found 22820 different words
word  rank   count
---- ------ -------
the    1    41089
of    2    19949
and    3    14942
a    4    14608
to    5    13951
in    6    11214
he    7     9648
was    8     8621
that    9     7924
it   10     6661
1.750000 seconds elapsed
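
For readers less familiar with REXX's translate built-in, the filtering step of version 2 can be approximated in Python. This is only a rough sketch and not part of the original entry: the names TABLE and words_in are hypothetical, and it assumes the text was decoded so that every character has a code point below 256 (for example as Latin-1), mirroring the byte-oriented xrange('00'x,'ff'x) table.

```python
import string

# Hypothetical stand-in for the REXX translate() call:
# every code point below 256 becomes a blank ...
TABLE = {code_point: " " for code_point in range(256)}
# ... except A-Z and a-z, which both map to the lowercase letter.
for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
    TABLE[ord(lower)] = lower
    TABLE[ord(upper)] = lower


def words_in(line):
    """Return the lowercase a-z words of one line; all other characters separate words."""
    return line.translate(TABLE).split()
```

Under this rule accented letters act as separators, so a word such as Misérables is counted as the two words mis and rables.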

## Ring

# project : Word count

fp = fopen("Miserables.txt","r")
str = fread(fp, getFileSize(fp))
fclose(fp)

mis =substr(str, " ", nl)
mis = lower(mis)
mis = str2list(mis)
count = list(len(mis))
ready = []
for n = 1 to len(mis)
     flag = 0
     for m = 1 to len(mis)
           if mis[n] = mis[m] and n != m
              for p = 1 to len(ready)
                    if m = ready[p]
                       flag = 1
                    ok
              next
              if flag = 0
                 count[n] = count[n] + 1
              ok
           ok
     next
     if flag = 0
        add(ready, n)
     ok
next
for n = 1 to len(count)
     for m = n + 1 to len(count)
          if count[m] > count[n]
             temp = count[n]
             count[n] = count[m]
             count[m] = temp
             temp = mis[n]
             mis[n] = mis[m]
             mis[m] = temp
          ok
     next
next
for n = 1 to 10
     see mis[n] + " " + (count[n] + 1) + nl
next

func getFileSize fp
        c_filestart = 0
        c_fileend = 2
        fseek(fp,0,c_fileend)
        nfilesize = ftell(fp)
        fseek(fp,0,c_filestart)
        return nfilesize

func swap(a, b)
        temp = a
        a = b
        b = temp
        return [a, b]

Output:

the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661


## Ruby

class String
  def wc
  n = Hash.new(0)
  downcase.scan(/[A-Za-zÀ-ÿ]+/) { |g| n[g] += 1 }
  n.sort{|n,g| n[1]<=>g[1]}
  end
end

open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }
Output:
it->6661
that->7924
was->8621
he->9648
in->11214
to->13951
a->14596
and->14942
of->19949
the->41088


### Tally and max_by

Works with: Ruby version 2.7
RE = /[[:alpha:]]+/
count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }
Output:
the->41092
of->19954
and->14943
a->14546
to->13953
in->11219
he->9649
was->8622
that->7924
it->6661


### Chain of Enumerables

wf = File.read("135-0.txt", :encoding => "UTF-8")
  .downcase
  .scan(/\w+/)
  .each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
  .sort_by { |k, v| v }
  .reverse
  .take(10)
  .each_with_index { |w, i|
  printf "[%2d] %10s : %d\n",
         i += 1,
         w[0],
         w[1]
}
Output:
[ 1]        the : 41040
[ 2]         of : 19951
[ 3]        and : 14942
[ 4]          a : 14539
[ 5]         to : 13941
[ 6]         in : 11209
[ 7]         he : 9646
[ 8]        was : 8620
[ 9]       that : 7922
[10]         it : 6659


## Rust

use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

extern crate regex;
use regex::Regex;

fn word_count(file: File, n: usize) {
    let word_regex = Regex::new("(?i)[a-z']+").unwrap();

    let mut words = HashMap::new();
    for line in BufReader::new(file).lines() {
        word_regex
            .find_iter(&line.expect("Read error"))
            .map(|m| m.as_str())
            .for_each(|word| {
                *words.entry(word.to_lowercase()).or_insert(0) += 1;
            });
    }

    let mut words: Vec<_> = words.iter().collect();
    words.sort_unstable_by_key(|&(word, count)| (Reverse(count), word));

    for (word, count) in words.iter().take(n) {
        println!("{:8} {:>8}", word, count);
    }
}

fn main() {
    word_count(File::open("135-0.txt").expect("File open error"), 10)
}
Output:
the         41083
of          19948
and         14941
a           14604
to          13951
in          11212
he           9604
was          8621
that         7824
it           6534


## Scala

### Featuring online remote file as input

Output:

Best seen running in your browser Scastie (remote JVM).

import scala.io.Source

object WordCount extends App {

  val url = "http://www.gutenberg.org/files/135/135-0.txt"
  val header = "Rank Word  Frequency\n==== ======== ======"

  def wordCnt =
    Source.fromURL(url).getLines()
      .filter(_.nonEmpty)
      .flatMap(_.split("""\W+""")).toSeq
      .groupBy(_.toLowerCase())
      .mapValues(_.size).toSeq
      .sortWith { case ((_, v0), (_, v1)) => v0 > v1 }
      .take(10).zipWithIndex

  println(header)
  wordCnt.foreach {
    case ((word, count), rank) => println(f"${rank + 1}%4d $word%-8s $count%6d")
  }

  println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")

}
Output:
Rank Word  Frequency
==== ======== ======
1 the       41036
2 of        19946
3 and       14940
4 a         14589
5 to        13939
6 in        11204
7 he         9645
8 was        8619
9 that       7922
10 it         6659

Successfully completed without errors. [total 4528 ms]

## Seed7

The Seed7 program uses the function getHttp to get the file 135-0.txt directly from Project Gutenberg. The library scanfile.s7i provides getSimpleSymbol, which reads words from a file. The words are converted to lower case so that "The" and "the" are counted as the same word.

$ include "seed7_05.s7i";
  include "gethttp.s7i";
  include "strifile.s7i";
  include "scanfile.s7i";
  include "chartype.s7i";
  include "console.s7i";

const type: wordHash is hash [string] integer;
const type: countHash is hash [integer] array string;

const proc: main is func
  local
    var file: inFile is STD_NULL;
    var string: aWord is "";
    var wordHash: numberOfWords is wordHash.EMPTY_HASH;
    var countHash: countWords is countHash.EMPTY_HASH;
    var array integer: countKeys is 0 times 0;
    var integer: index is 0;
    var integer: number is 0;
  begin
    OUT := STD_CONSOLE;
    inFile := openStrifile(getHttp("www.gutenberg.org/files/135/135-0.txt"));
    while hasNext(inFile) do
      aWord := lower(getSimpleSymbol(inFile));
      if aWord <> "" and aWord[1] in letter_char then
        if aWord in numberOfWords then
          incr(numberOfWords[aWord]);
        else
          numberOfWords @:= [aWord] 1;
        end if;
      end if;
    end while;
    countWords := flip(numberOfWords);
    countKeys := sort(keys(countWords));
    writeln("Word Frequency");
    for index range length(countKeys) downto length(countKeys) - 9 do
      number := countKeys[index];
      for aWord range sort(countWords[number]) do
        writeln(aWord rpad 8 <& number);
      end for;
    end for;
  end func;
Output:
Word Frequency
the     41036
of      19946
and     14940
a       14589
to      13939
in      11204
he      9645
was     8619
that    7922
it      6659


## Sidef

var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')

file.open_r.each { |line|
    line.lc.scan(/[\pL]+/).each { |word|
        count{word} := 0 ++
    }
}

var top = count.sort_by {|_,v| v }.last(10).flip

top.each { |pair|
    say "#{pair.key}\t-> #{pair.value}"
}
Output:
the -> 41088
of -> 19949
and -> 14942
a -> 14596
to -> 13951
in -> 11214
he -> 9648
was -> 8621
that -> 7924
it -> 6661


## Simula

COMMENT COMPILE WITH
$ cim -m64 word-count.sim
;
BEGIN

    COMMENT ----- CLASSES FOR GENERAL USE ;

    ! ABSTRACT HASH KEY TYPE ;
    CLASS HASHKEY;
    VIRTUAL:
        PROCEDURE HASH IS
            INTEGER PROCEDURE HASH;;
        PROCEDURE EQUALTO IS
            BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;;
    BEGIN
    END HASHKEY;

    ! ABSTRACT HASH VALUE TYPE ;
    CLASS HASHVAL;
    BEGIN
        ! THERE IS NOTHING REQUIRED FOR THE VALUE TYPE ;
    END HASHVAL;

    CLASS HASHMAP;
    BEGIN
        CLASS INNERHASHMAP(N); INTEGER N;
        BEGIN

            INTEGER PROCEDURE INDEX(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.INDEX: NONE IS NOT A VALID KEY");
                I := MOD(K.HASH,N);
            LOOP:
                IF KEYTABLE(I) == NONE OR ELSE KEYTABLE(I).EQUALTO(K) THEN
                    INDEX := I
                ELSE BEGIN
                    I := IF I+1 = N THEN 0 ELSE I+1;
                    GO TO LOOP;
                END;
            END INDEX;

            ! PUT SOMETHING IN ;
            PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
            BEGIN
                INTEGER I;
                IF V == NONE THEN
                    ERROR("HASHMAP.PUT: NONE IS NOT A VALID VALUE");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN BEGIN
                    IF SIZE = N THEN
                        ERROR("HASHMAP.PUT: TABLE FILLED COMPLETELY");
                    KEYTABLE(I) :- K;
                    VALTABLE(I) :- V;
                    SIZE := SIZE+1;
                END ELSE
                    VALTABLE(I) :- V;
            END PUT;

            ! GET SOMETHING OUT ;
            REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.GET: NONE IS NOT A VALID KEY");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN
                    GET :- NONE ! ERROR("HASHMAP.GET: KEY NOT FOUND");
                ELSE
                    GET :- VALTABLE(I);
            END GET;

            PROCEDURE CLEAR;
            BEGIN
                INTEGER I;
                FOR I := 0 STEP 1 UNTIL N-1 DO BEGIN
                    KEYTABLE(I) :- NONE;
                    VALTABLE(I) :- NONE;
                END;
                SIZE := 0;
            END CLEAR;

            ! DATA MEMBERS OF CLASS HASHMAP ;
            REF(HASHKEY) ARRAY KEYTABLE(0:N-1);
            REF(HASHVAL) ARRAY VALTABLE(0:N-1);
            INTEGER SIZE;

        END INNERHASHMAP;

        PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
        BEGIN
            IF IMAP.SIZE >= 0.75 * IMAP.N THEN
            BEGIN
                COMMENT RESIZE HASHMAP ;
                REF(INNERHASHMAP) NEWIMAP;
                REF(ITERATOR) IT;
                NEWIMAP :- NEW INNERHASHMAP(2 * IMAP.N);
                IT :- NEW ITERATOR(THIS HASHMAP);
                WHILE IT.MORE DO
                BEGIN
                    REF(HASHKEY) KEY;
                    KEY :- IT.NEXT;
                    NEWIMAP.PUT(KEY, IMAP.GET(KEY));
                END;
                IMAP.CLEAR;
                IMAP :- NEWIMAP;
            END;
            IMAP.PUT(K, V);
        END;

        REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            GET :- IMAP.GET(K);

        PROCEDURE CLEAR;
            IMAP.CLEAR;

        INTEGER PROCEDURE SIZE;
            SIZE := IMAP.SIZE;

        REF(INNERHASHMAP) IMAP;

        IMAP :- NEW INNERHASHMAP(16);
    END HASHMAP;

    CLASS ITERATOR(H); REF(HASHMAP) H;
    BEGIN
        INTEGER POS,KEYCOUNT;

        BOOLEAN PROCEDURE MORE;
            MORE := KEYCOUNT < H.SIZE;

        REF(HASHKEY) PROCEDURE NEXT;
        BEGIN
            INSPECT H DO
            INSPECT IMAP DO
            BEGIN
                WHILE KEYTABLE(POS) == NONE DO
                    POS := POS+1;
                NEXT :- KEYTABLE(POS);
                KEYCOUNT := KEYCOUNT+1;
                POS := POS+1;
            END;
        END NEXT;

    END ITERATOR;

    COMMENT ----- PROBLEM SPECIFIC CLASSES ;

    HASHKEY CLASS TEXTHASHKEY(T); VALUE T; TEXT T;
    BEGIN
        INTEGER PROCEDURE HASH;
        BEGIN
            INTEGER I;
            T.SETPOS(1);
            WHILE T.MORE DO
                I := 31*I+RANK(T.GETCHAR);
            HASH := I;
        END HASH;
        BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;
            EQUALTO := T = K QUA TEXTHASHKEY.T;
    END TEXTHASHKEY;

    HASHVAL CLASS COUNTER;
    BEGIN
        INTEGER COUNT;
    END COUNTER;

    REF(INFILE) INF;
    REF(HASHMAP) MAP;
    REF(TEXTHASHKEY) KEY;
    REF(COUNTER) VAL;
    REF(ITERATOR) IT;
    TEXT LINE, WORD;
    INTEGER I, J, MAXCOUNT, LINENO;
    INTEGER ARRAY MAXCOUNTS(1:10);
    REF(TEXTHASHKEY) ARRAY MAXWORDS(1:10);

    WORD :- BLANKS(1000);
    MAP :- NEW HASHMAP;

    COMMENT MAP WORDS TO COUNTERS ;

    INF :- NEW INFILE("135-0.txt");
    INF.OPEN(BLANKS(4096));
    WHILE NOT INF.LASTITEM DO
    BEGIN
        BOOLEAN INWORD;

        PROCEDURE SAVE;
        BEGIN
            IF WORD.POS > 1 THEN
            BEGIN
                KEY :- NEW TEXTHASHKEY(WORD.SUB(1, WORD.POS - 1));
                VAL :- MAP.GET(KEY);
                IF VAL == NONE THEN
                BEGIN
                    VAL :- NEW COUNTER;
                    MAP.PUT(KEY, VAL);
                END;
                VAL.COUNT := VAL.COUNT + 1;
                WORD := " ";
                WORD.SETPOS(1);
            END;
        END SAVE;

        LINENO := LINENO + 1;
        LINE :- COPY(INF.IMAGE).STRIP; INF.INIMAGE;

        COMMENT SEARCH WORDS IN LINE ;
        COMMENT A WORD IS ANY SEQUENCE OF LETTERS ;

        INWORD := FALSE;
        LINE.SETPOS(1);
        WHILE LINE.MORE DO
        BEGIN
            CHARACTER CH;
            CH := LINE.GETCHAR;
            IF CH >= 'a' AND CH <= 'z' THEN
                CH := CHAR(RANK(CH) - RANK('a') + RANK('A'));
            IF CH >= 'A' AND CH <= 'Z' THEN
            BEGIN
                IF NOT INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := TRUE;
                END;
                WORD.PUTCHAR(CH);
            END ELSE
            BEGIN
                IF INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := FALSE;
                END;
            END;
        END;
        SAVE; COMMENT LAST WORD ;
    END;
    INF.CLOSE;

    COMMENT FIND 10 MOST COMMON WORDS ;

    IT :- NEW ITERATOR(MAP);
    WHILE IT.MORE DO
    BEGIN
        KEY :- IT.NEXT;
        VAL :- MAP.GET(KEY);
        FOR I := 1 STEP 1 UNTIL 10 DO
        BEGIN
            IF VAL.COUNT >= MAXCOUNTS(I) THEN
            BEGIN
                FOR J := 10 STEP -1 UNTIL I + 1 DO
                BEGIN
                    MAXCOUNTS(J) := MAXCOUNTS(J - 1);
                    MAXWORDS(J) :- MAXWORDS(J - 1);
                END;
                MAXCOUNTS(I) := VAL.COUNT;
                MAXWORDS(I) :- KEY;
                GO TO BREAK;
            END;
        END;
    BREAK:
    END;

    COMMENT OUTPUT 10 MOST COMMON WORDS ;

    FOR I := 1 STEP 1 UNTIL 10 DO
    BEGIN
        IF MAXWORDS(I) =/= NONE THEN
        BEGIN
            OUTINT(MAXCOUNTS(I), 10);
            OUTTEXT(" ");
            OUTTEXT(MAXWORDS(I) QUA TEXTHASHKEY.T);
            OUTIMAGE;
        END;
    END;

END
Output:
     41089 THE
     19949 OF
     14942 AND
     14608 A
     13951 TO
     11214 IN
      9648 HE
      8621 WAS
      7924 THAT
      6661 IT

6 garbage collection(s) in 0.2 seconds.


## Swift

import Foundation

func printTopWords(path: String, count: Int) throws {
    // load file contents into a string
    let text = try String(contentsOfFile: path, encoding: String.Encoding.utf8)
    var dict = Dictionary<String, Int>()
    // split text into words, convert to lowercase and store word counts in dict
    let regex = try NSRegularExpression(pattern: "\\w+")
    regex.enumerateMatches(in: text, range: NSRange(text.startIndex..., in: text)) {
        (match, _, _) in
        guard let match = match else { return }
        let word = String(text[Range(match.range, in: text)!]).lowercased()
        dict[word, default: 0] += 1
    }
    // sort words by number of occurrences
    let wordCounts = dict.sorted(by: {$0.1 > $1.1})
    // print the top count words
    print("Rank\tWord\tCount")
    for (i, (word, n)) in wordCounts.prefix(count).enumerated() {
        print("\(i + 1)\t\(word)\t\(n)")
    }
}

do {
    try printTopWords(path: "135-0.txt", count: 10)
} catch {
    print(error.localizedDescription)
}
Output:
Rank	Word	Count
1	the	41039
2	of	19951
3	and	14942
4	a	14527
5	to	13941
6	in	11209
7	he	9646
8	was	8620
9	that	7922
10	it	6659


## Tcl

lassign $argv head
while { [gets stdin line] >= 0 } {
    foreach word [regexp -all -inline {[A-Za-z]+} $line] {
        dict incr wordcount [string tolower $word]
    }
}

set sorted [lsort -stride 2 -index 1 -int -decr $wordcount]
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
    puts "$count\t$word"
}

./wordcount-di.tcl 10 < 135-0.txt

Output:
41093   the
19954   of
14943   and
14558   a
13953   to
11219   in
9649    he
8622    was
7924    that
6661    it


## TMG

McIlroy's Unix TMG:

/* Input format: N text                                         */
/* Only lowercase letters can constitute a word in text.        */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence.              */

progrm: readn/error
        table(freq) table(chain) [firstword = ~0]
loop:   not(!<<>>) output
    |   [j=777] batch/loop loop;                   /* Main loop */

/* To use less stack, divide input into batches.                */
/* (Avoid interpreting entire input as a single "sentence".)    */
batch:  [j<=0?] succ
     |  word/skip [j--] skip batch;
skip:   string(other);
not:    params(1) (any($1) fail | ());

readn:  string(!<<0123456789>>) readint(n) skip;
error:  diag(( ={ <ERROR: input must start with a number> * } ));

/* Process a word */
word:   smark any(letter) string(letter) scopy locate/new [freq[k]++] newmax;
locate: find(freq, k);
new:    enter(freq, k) [freq[k] = 1] newmax
        [firstword = firstword==~0 ? k : firstword]
        enter(chain, i) [chain[i]=prevword] [prevword=k];
newmax: [max = max<freq[k] ? freq[k] : max];

/* Output logic */
output: [next=max]
outmax: [max=next] [next=0] [max>0?] [j = prevword] cycle/outmax;
cycle:  [i = j] [k = freq[i]] [n>0?]
        ( [max==freq[i]?] parse(wn)
        | [(freq[i]<max) & (next<freq[i])?] [next = freq[i]]
        | ())
        [i != firstword?] [j = chain[i]] cycle;
wn:     getnam(freq, i) [k = freq[i]] decimal(k) [n--] = { 2 < > 1 * };

/* Reads decimal integer */
readint: proc(n;i) ignore(<<>>) [n=0] inta
int1:    [n = n*12+i] inta\int1;
inta:    char(i) [i<72?] [(i =- 60)>=0?];

/* Variables */
prevword:  0; /* Head of the linked list */
firstword: 0; /* First word's index to know where to stop output */
k: 0;
i: 0;
j: 0;
n: 0; /* Number of most frequent words to display */

max: 0; /* Current highest number of occurrences */
next: 0; /* Next highest number of occurrences */

/* Tables */
freq: 0;
chain: 0;

/* Character classes */
letter: <<abcdefghijklmnopqrstuvwxyz>>;
other: !<<abcdefghijklmnopqrstuvwxyz>>;

Unix TMG didn't have a tolower builtin. Therefore, you would use it together with tr:

cat file | tr A-Z a-z > file1; ./a.out file1

Additionally, because 1972 TMG only understood ASCII characters, you might want to strip the diacritics (e.g., é → e):

cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1

## UNIX Shell

Works with: Bash
Works with: zsh

This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.

#!/bin/sh
<"$1" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | head -n "$2"
Output:
$ ./wordcount.sh 135-0.txt 10
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it


### Original + URL import

This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.

curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q
Output:
41096 the
19955 of
14939 and
14558 a
13954 to
11218 in
9649 he
8622 was
7924 that
6661 it
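
To make the role of each pipeline stage explicit, the same steps can be mirrored in Python. This is an illustrative sketch rather than a translation endorsed by the task: the function name top_words is hypothetical, and ties may come out in a different order than from sort -rn, which the task allows.

```python
import re
from collections import Counter


def top_words(text, n):
    """Mirror the shell pipeline stage by stage on an in-memory string."""
    # tr -cs A-Za-z '\n' : every run of non-letters becomes one separator
    words = re.findall(r"[A-Za-z]+", text)
    # tr A-Z a-z : fold upper case to lower case
    words = [word.lower() for word in words]
    # sort | uniq -c : count identical words
    counts = Counter(words)
    # sort -rn | sed ${n}q : keep the n largest counts
    return counts.most_common(n)
```

For example, top_words("the the the of of a", 2) yields the pairs for "the" and "of" with counts 3 and 2.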

## VBA

In order to use it, you have to adapt the PATHFILE Const.

Option Explicit

Private Const PATHFILE As String = "C:\HOME\VBA\ROSETTA"

Sub Main()
Dim arr
Dim Dict As Object
Dim Book As String, temp As String
Dim T#
T = Timer
   Book = ExtractTxt(PATHFILE & "\les miserables.txt")
   temp = RemovePunctuation(Book)
   temp = UCase(temp)
   arr = Split(temp, " ")
   Set Dict = CreateObject("Scripting.Dictionary")
   FillDictionary Dict, arr
   Erase arr
   SortDictByFreq Dict, arr
   DisplayTheTopMostUsedWords arr, 10

Debug.Print "Words different in this book : " & Dict.Count
Debug.Print "-------------------------"
Debug.Print ""
Debug.Print "Optionally : "
Debug.Print "Frequency of the word MISERABLE : " & DisplayFrequencyOf("MISERABLE", Dict)
Debug.Print "Frequency of the word DISASTER : " & DisplayFrequencyOf("DISASTER", Dict)
Debug.Print "Frequency of the word ROSETTA_CODE : " & DisplayFrequencyOf("ROSETTA_CODE", Dict)
Debug.Print "-------------------------"
Debug.Print "Execution Time : " & Format(Timer - T, "0.000") & " sec."
End Sub

Private Function ExtractTxt(strFile As String) As String
'http://rosettacode.org/wiki/File_input/output#VBA
Dim i As Integer
   i = FreeFile
   Open strFile For Input As #i
       'LOF takes the file number returned by FreeFile, not a hard-coded 1
       ExtractTxt = Input(LOF(i), #i)
   Close #i
End Function

Private Function RemovePunctuation(strBook As String) As String
Dim T, i As Integer, temp As String
Const PUNCT As String = """,;:!?."
   T = Split(StrConv(PUNCT, vbUnicode), Chr(0))
   temp = strBook
   For i = LBound(T) To UBound(T) - 1
      temp = Replace(temp, T(i), " ")
   Next
   temp = Replace(temp, "--", " ")
   temp = Replace(temp, "...", " ")
   temp = Replace(temp, vbCrLf, " ")
   RemovePunctuation = Replace(temp, "  ", " ")
End Function

Private Sub FillDictionary(d As Object, a As Variant)
Dim L As Long
   For L = LBound(a) To UBound(a)
      If a(L) <> "" Then _
         d(a(L)) = d(a(L)) + 1
   Next
End Sub

Private Sub SortDictByFreq(d As Object, myArr As Variant)
Dim K
Dim L As Long
   ReDim myArr(1 To d.Count, 1 To 2)
   For Each K In d.keys
      L = L + 1
      myArr(L, 1) = K
      myArr(L, 2) = CLng(d(K))
   Next
   SortArray myArr, LBound(myArr), UBound(myArr), 2
End Sub

Private Sub SortArray(a, Le As Long, Ri As Long, Col As Long)
Dim ref As Long, L As Long, r As Long, temp As Variant
   ref = a((Le + Ri) \ 2, Col)
   L = Le
   r = Ri
   Do
         Do While a(L, Col) < ref
            L = L + 1
         Loop
         Do While ref < a(r, Col)
            r = r - 1
         Loop
         If L <= r Then
            temp = a(L, 1)
            a(L, 1) = a(r, 1)
            a(r, 1) = temp
            temp = a(L, 2)
            a(L, 2) = a(r, 2)
            a(r, 2) = temp
            L = L + 1
            r = r - 1
         End If
   Loop While L <= r
   If L < Ri Then SortArray a, L, Ri, Col
   If Le < r Then SortArray a, Le, r, Col
End Sub

Private Sub DisplayTheTopMostUsedWords(arr As Variant, Nb As Long)
Dim L As Long, i As Integer
   i = 1
   Debug.Print "Rank Word    Frequency"
   Debug.Print "==== ======= ========="
   For L = UBound(arr) To UBound(arr) - Nb + 1 Step -1
      Debug.Print Left(CStr(i) & "    ", 5) & Left(arr(L, 1) & "       ", 8) & " " & Format(arr(L, 2), "0 000")
      i = i + 1
   Next
End Sub

Private Function DisplayFrequencyOf(Word As String, d As Object) As Long
   If d.Exists(Word) Then _
      DisplayFrequencyOf = d(Word)
End Function
Output:
Words different in this book : 25884
-------------------------
Rank Word    Frequency
==== ======= =========
1    THE      40 831
2    OF       19 807
3    AND      14 860
4    A        14 453
5    TO       13 641
6    IN       11 133
7    HE       9 598
8    WAS      8 617
9    THAT     7 807
10   IT       6 517

Optionally :
Frequency of the word MISERABLE : 35
Frequency of the word DISASTER : 12
Frequency of the word ROSETTA_CODE : 0
-------------------------
Execution Time : 7,785 sec.

## Wren

Translation of: Go
Library: Wren-str
Library: Wren-sort
Library: Wren-fmt
Library: Wren-pattern

I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.

Not very quick (runs in about 47 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.

If the Go example is re-run today (21 October 2020), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 2 years ago.
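
The 'letter' rule described above (letters and digits with code points up to 255, with underscore, hyphen and apostrophe acting as separators) can be sketched in Python as follows. This is a rough illustration only: is_letter and tokens are hypothetical names, and Python's isalpha/isdecimal classification is an assumption that may not match the Wren-pattern module's character class exactly.

```python
def is_letter(ch):
    # Assumption: a "letter" is an alphabetic character or decimal digit
    # whose code point is below 256; "_", "-" and "'" are therefore excluded.
    return ord(ch) < 256 and (ch.isalpha() or ch.isdecimal())


def tokens(text):
    """Split text into lowercase words made of 'letters' as defined above."""
    words, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch.lower())
        elif current:
            # a non-letter ends the word collected so far
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Under these assumptions a compound such as well-dressed yields two words, while Misérables stays a single word because é has a code point below 256.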

import "io" for File
import "/str" for Str
import "/sort" for Sort
import "/fmt" for Fmt
import "/pattern" for Pattern

var fileName = "135-0.txt"
var text = File.read(fileName).trimEnd()
var groups = {}
// match runs of A-Z, a-z, 0-9 and any non-ASCII letters with code-points < 256
var p = Pattern.new("+1&w")
var lines = text.split("\n")
for (line in lines) {
    var ms = p.findAll(line)
    for (m in ms) {
        var t = Str.lower(m.text)
        groups[t] = groups.containsKey(t) ? groups[t] + 1 : 1
    }
}
var keyVals = groups.toList
Sort.quick(keyVals, 0, keyVals.count - 1) { |i, j| (j.value - i.value).sign }
System.print("Rank  Word  Frequency")
System.print("====  ====  =========")
for (rank in 1..10) {
    var word = keyVals[rank-1].key
    var freq = keyVals[rank-1].value
    Fmt.print("$2d    $-4s    $5d", rank, word, freq)
}
Output:
Rank  Word  Frequency
====  ====  =========
 1    the     41092
 2    of      19954
 3    and     14943
 4    a       14546
 5    to      13953
 6    in      11219
 7    he       9649
 8    was      8622
 9    that     7924
10    it       6661


## XQuery

let $maxentries := 10,
    $uri := 'https://www.gutenberg.org/files/135/135-0.txt'
return
<words in="{$uri}" top="{$maxentries}"> {
(
  let $doc := unparsed-text($uri),
      $tokens := (
               tokenize($doc, '\W+')[normalize-space()]
               ! lower-case(.)
               ! normalize-unicode(., 'NFC')
      )
  return
    for $token in $tokens
    let $key := $token
    group by $key
    let $count := count($token)
    order by $count descending
    return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
}</words>
Output:
<words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
  <word key="the" count="41092"/>
  <word key="of" count="19954"/>
  <word key="and" count="14943"/>
  <word key="a" count="14545"/>
  <word key="to" count="13953"/>
  <word key="in" count="11219"/>
  <word key="he" count="9649"/>
  <word key="was" count="8622"/>
  <word key="that" count="7924"/>
  <word key="it" count="6661"/>
</words>

## zkl

fname,count := vm.arglist;	// grab command line args

   // words may have leading or trailing "_", ie "the" and "_the"
File(fname).pump(Void,"toLower",  // read the file line by line and hash words
   RegExp("[a-z]+").pump.fp1(Dictionary().incV))  // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();
Output:
$ zkl bbb ~/Documents/Les\ Miserables.txt 10
the,41089
of,19949
and,14942
a,14608
to,13951
in,11214
he,9648
was,8621
that,7924
it,6661