Word frequency

Task

Given a text file and an integer n, print/display the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

A word is a sequence of one or more contiguous letters.
You are free to define what a letter is.
Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
You may treat a compound word like well-dressed as either one word or two.
The word it's could also be one or two words as you see fit.
You may also choose not to support non US-ASCII characters.
Assume words will not span multiple lines.
Don't worry about normalization of word spelling differences.
Treat color and colour as two distinct words.
Uppercase letters are considered equivalent to their lowercase counterparts.
Words of equal frequency can be listed in any order.
Feel free to explicitly state the thoughts behind the program decisions.

Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.
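
As a language-neutral illustration of the rules above, here is a minimal Python sketch. It assumes that a word is a run of the ASCII letters a-z after case folding, which is only one of the choices the task permits. The function name top_words and the sample sentence are invented for this example, and it is not one of the task's reference solutions.

```python
import re
from collections import Counter

def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs."""
    # Assumption: a "letter" is a-z only; upper case is folded to lower case first.
    words = re.findall(r"[a-z]+", text.lower())
    # most_common sorts by decreasing count.
    return Counter(words).most_common(n)

sample = "The cat and the hat. The END, the end!"
for word, count in top_words(sample, 2):
    print(count, word)
```

For the real input, the text could be read with open("135-0.txt", encoding="utf-8").read() and passed to top_words together with 10. Under this definition accented letters split words, so a different letter set would give slightly different counts.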

History

This task was originally taken from the "Programming Pearls" column in Communications of the ACM, June 1986, Volume 29, Number 6. In that column Donald Knuth solves the problem using literate programming, and Doug McIlroy then critiques the solution and shows that the problem can be solved with a six-line Unix shell script (provided as an example below).

References

McIlroy's program

11l

DefaultDict[String, Int] cnt
L(word) re:‘\w+’.find_strings(File(‘135-0.txt’).read().lowercase())
   cnt[word]++
print(sorted(cnt.items(), key' wordc -> wordc[1], reverse' 1B)[0.<10])

Output:

[(the, 41045), (of, 19953), (and, 14939), (a, 14527), (to, 13942), (in, 11210), (he, 9646), (was, 8620), (that, 7922), (it, 6659)]

Ada

This version uses a character set to match the valid characters of a token. Another version could instead take a pointer to a function returning a Boolean (which would allow functions such as Is_Alphanumeric to be used), but as far as I know there is no "Find_Token" subprogram that accepts one.
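
The predicate-based alternative mentioned above can be sketched in Python (not Ada) to show the shape of such a routine. find_token is a hypothetical helper written for this illustration, loosely modelled on the first/last results that Find_Token produces; it is not part of any library.

```python
def find_token(line, start, is_valid):
    """Return (first, last) indices of the next run of valid characters
    at or after start, or None if there is none. Hypothetical helper."""
    first = start
    while first < len(line) and not is_valid(line[first]):
        first += 1
    if first == len(line):
        return None
    last = first
    while last + 1 < len(line) and is_valid(line[last + 1]):
        last += 1
    return first, last

def tokens(line, is_valid=str.isalnum):
    """Yield every token in line, using any boolean character test."""
    pos = 0
    while (span := find_token(line, pos, is_valid)) is not None:
        first, last = span
        yield line[first:last + 1]
        pos = last + 1

print(list(tokens("well-dressed, it's 1862!")))
```

Passing a different predicate, for example str.isalpha, changes what counts as a letter without touching the scanning loop.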

Works with: Ada version 2012

with Ada.Command_Line;
with Ada.Text_IO;
with Ada.Integer_Text_IO;
with Ada.Strings.Maps;
with Ada.Strings.Fixed;
with Ada.Characters.Handling;
with Ada.Containers.Indefinite_Ordered_Maps;
with Ada.Containers.Indefinite_Ordered_Sets;
with Ada.Containers.Ordered_Maps;

procedure Word_Frequency is
    package TIO renames Ada.Text_IO;

    package String_Counters is new Ada.Containers.Indefinite_Ordered_Maps(String, Natural);
    package String_Sets is new Ada.Containers.Indefinite_Ordered_Sets(String);
    package Sorted_Counters is new Ada.Containers.Ordered_Maps
      (Natural,
       String_Sets.Set,
       "=" => String_Sets."=",
       "<" => ">");
    -- for sorting by decreasing number of occurrences and ascending lexical order

    procedure Increment(Key : in String; Element : in out Natural) is
    begin
        Element := Element + 1;
    end Increment;

    path : constant String := Ada.Command_Line.Argument(1);
    how_many : Natural := 10;
    set : constant Ada.Strings.Maps.Character_Set := Ada.Strings.Maps.To_Set(ranges => (('a', 'z'), ('0', '9')));
    F : TIO.File_Type;
    first : Positive;
    last : Natural;
    from : Positive;
    counter : String_Counters.Map;
    sorted_counts : Sorted_Counters.Map;
    C1 : String_Counters.Cursor;
    C2 : Sorted_Counters.Cursor;
    tmp_set : String_Sets.Set;
begin
    -- read file and count words
    TIO.Open(F, name => path, mode => TIO.In_File);
    while not TIO.End_Of_File(F) loop
       declare
          line : constant String := Ada.Characters.Handling.To_Lower(TIO.Get_Line(F));
       begin
          from := line'First;
          loop
             Ada.Strings.Fixed.Find_Token(line(from .. line'Last), set, Ada.Strings.Inside, first, last);
             exit when last < First;
             C1 := counter.Find(line(first .. last));
             if String_Counters.Has_Element(C1) then
                counter.Update_Element(C1, Increment'Access);
             else
                counter.Insert(line(first .. last), 1);
             end if;
             from := last + 1;
          end loop;
       end;
    end loop;
    TIO.Close(F);

    -- fill Natural -> StringSet Map
    C1 := counter.First;
    while String_Counters.Has_Element(C1) loop
       if sorted_counts.Contains(String_Counters.Element(C1)) then
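          tmp_set := sorted_counts.Element(String_Counters.Element(C1));
          tmp_set.Include(String_Counters.Key(C1));
          -- Element returns a copy, so store the updated set back in the map
          sorted_counts.Replace(String_Counters.Element(C1), tmp_set);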
       else
          sorted_counts.Include(String_Counters.Element(C1), String_Sets.To_Set(String_Counters.Key(C1)));
       end if;
       String_Counters.Next(C1);
    end loop;

    -- output
    C2 := sorted_counts.First;
    while Sorted_Counters.Has_Element(C2) loop
       for Item of Sorted_Counters.Element(C2) loop
          Ada.Integer_Text_IO.Put(TIO.Standard_Output, Sorted_Counters.Key(C2), width => 9);
          TIO.Put(TIO.Standard_Output, " ");
          TIO.Put_Line(Item);
       end loop;
       Sorted_Counters.Next(C2);
       how_many := how_many - 1;
       exit when how_many = 0;
    end loop;
end Word_Frequency;

Output:

$ ./word_frequency 135-0.txt
    41093 the
    19954 of
    14943 and
    14558 a
    13953 to
    11219 in
     9649 he
     8622 was
     7924 that
     6661 it

ALGOL 68

Works with: ALGOL 68G version Any - tested with release 2.8.3.win32

Uses the associative array implementations in ALGOL_68/prelude.

# find the n most common words in a file                             #
# use the associative array in the Associative array/iteration task  #
# but with integer values                                            #
PR read "aArrayBase.a68" PR
MODE AAKEY   = STRING;
MODE AAVALUE = INT;
AAVALUE init element value = 0;
# returns text converted to upper case                                 #
OP   TOUPPER  = ( STRING text )STRING:
     BEGIN
        STRING result := text;
        FOR ch pos FROM LWB result TO UPB result DO
            IF is lower( result[ ch pos ] ) THEN result[ ch pos ] := to upper( result[ ch pos ] ) FI
        OD;
        result
     END # TOUPPER # ;
# returns text converted to an INT or -1 if text is not a number     #
OP   TOINT    = ( STRING text )INT:
     BEGIN
        INT  result     := 0;
        BOOL is numeric := TRUE;
        FOR ch pos FROM LWB text TO UPB text WHILE is numeric DO
            CHAR c = text[ ch pos ];
            is numeric := is numeric AND c >= "0" AND c <= "9";
            IF is numeric THEN ( result *:= 10 ) +:= ABS c - ABS "0" FI        
        OD;
        IF is numeric THEN result ELSE -1 FI
     END # TOINT # ;
# returns TRUE if c is a letter, FALSE otherwise                     #
OP   ISLETTER    = ( CHAR c )BOOL:
        IF ( c >= "a" AND c <= "z" )
        OR ( c >= "A" AND c <= "Z" )
        THEN TRUE
        ELSE char in string( c, NIL, "ÇåçêëÆôöÿÖØáóÔ" )
        FI # ISLETTER # ;
# get the file name and number of words from the command line        #
STRING file name       := "pg-les-misrables.txt";
INT    number of words := 10;
FOR arg pos TO argc - 1 DO
    STRING arg upper = TOUPPER argv( arg pos );
    IF   arg upper = "FILE"   THEN
        file name := argv( arg pos + 1 )
    ELIF arg upper  = "NUMBER" THEN
        number of words := TOINT argv( arg pos + 1 )
    FI
OD;
IF  FILE input file;
    open( input file, file name, stand in channel ) /= 0
THEN
    # failed to open the file #
    print( ( "Unable to open """ + file name + """", newline ) )
ELSE
    # file opened OK #
    print( ( "Processing: ", file name, newline ) );
    BOOL at eof := FALSE;
    BOOL at eol := FALSE;
    # set the EOF handler for the file #
    on logical file end( input file, ( REF FILE f )BOOL:
                                     BEGIN
                                         # note that we reached EOF on the #
                                         # latest read #
                                         at eof := TRUE;
                                         # return TRUE so processing can continue #
                                         TRUE
                                     END
                       );
    # set the end-of-line handler for the file so get word can see line boundaries #
    on line end( input file
               , ( REF FILE f )BOOL:
                 BEGIN
                     # note we reached end-of-line #
                     at eol := TRUE;
                     # return FALSE to use the default eol handling  #
                     # i.e. just get the next character              #
                     FALSE
                 END
               );
    # get the words from the file and store the counts in an associative array #
    REF AARRAY words := INIT LOC AARRAY;
    INT word count := 0;
    CHAR c := " ";
    WHILE get( input file, ( c ) );
          NOT at eof
    DO
        WHILE NOT ISLETTER c AND NOT at eof DO get( input file, ( c ) ) OD;
        STRING word := "";
        at eol      := FALSE;
        WHILE ISLETTER c AND NOT at eol AND NOT at eof DO word +:= c; get( input file, ( c ) ) OD;
        word count +:= 1;
        words // TOUPPER word +:= 1
    OD;
    close( input file );
    print( ( file name, " contains ", whole( word count, 0 ), " words", newline ) );
    # find the most used words                                       #
    [ number of words ]STRING top words;
    [ number of words ]INT    top counts;
    FOR i TO number of words DO top words[ i ] := ""; top counts[ i ] := 0 OD;
    REF AAELEMENT w := FIRST words;
    WHILE w ISNT nil element DO
        INT    count  = value OF w;
        STRING word   = key   OF w;
        BOOL   found := FALSE;
        FOR i TO number of words WHILE NOT found DO
            IF count > top counts[ i ] THEN
                # found a word that is used more than a current      #
                # most used word                                     #
                found := TRUE;
                # move the other words down one place                #
                FOR move pos FROM number of words BY - 1 TO i + 1 DO
                    top counts[ move pos ] := top counts[ move pos - 1 ];
                    top words [ move pos ] := top words [ move pos - 1 ]
                OD;
                # install the new word                               #
                top counts[ i ] := count;
                top words [ i ] := word
            FI
        OD;
        w := NEXT words
    OD;
    print( ( whole( number of words, 0 ), " most used words:", newline ) );
    print( ( " count  word", newline ) );
    FOR i TO number of words DO
        print( ( whole( top counts[ i ], -6 ), ": ", top words[ i ], newline ) )
    OD
FI

Output:

Processing: pg-les-misrables.txt
pg-les-misrables.txt contains 578381 words
10 most used words:
 count  word
 39333: THE
 19154: OF
 14628: AND
 14229: A
 13431: TO
 11275: HE
 10879: IN
  8236: WAS
  7527: THAT
  6491: IT

APL

Works with: GNU APL

⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
⍝⍝ Project Gutenberg is in UTF-8.)
⍝⍝
⍝⍝ Use Unix 'iconv' if required
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← lowerAndStrip s;stripped;mixedCase
 ⍝⍝ Convert text to lowercase, punctuation and newlines to spaces
 stripped ← '               abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
 mixedCase ← ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
 r ← stripped[mixedCase ⍳ s]
∇

⍝⍝ Return the _n_ most frequent words and a count of their occurrences
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
∇r ← n wordCount fname ;D;wl;sidx;swv;pv;wc;uw;sortOrder
  D ← lowerAndStrip (⎕fio['read_file'] fname)  ⍝ raw text with newlines
  wl ← (~ D ∊ ' ') ⊂ D
  sidx ← ⍒wl
  swv ← wl[sidx]
  pv ← +\ 1,~2 ≡/ swv
  wc ← ∊ ⍴¨ pv ⊂ pv
  uw ← 1 ⊃¨ pv ⊂ swv
  sortOrder ← ⍒wc
  r ← n↑[2] uw[sortOrder],[0.5]wc[sortOrder]
∇

      5 wordCount '135-0.txt'
   the    of   and     a    to 
 41042 19952 14938 14526 13942

AppleScript

(*
    For simplicity here, words are considered to be uninterrupted sequences of letters and/or digits.
    The set text is too messy to warrant faffing around with anything more sophisticated.
    The first letter in each word is upper-cased and the rest lower-cased for case equivalence and presentation.
    Where more than n words qualify for the top n or fewer places, all are included in the result.
*)

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

on wordFrequency(filePath, n)
    set |⌘| to current application
    
    -- Get the text and "capitalize" it (lower-case except for the first letters in words).
    set theText to |⌘|'s class "NSString"'s stringWithContentsOfFile:(filePath) usedEncoding:(missing value) |error|:(missing value)
    set theText to theText's capitalizedStringWithLocale:(|⌘|'s class "NSLocale"'s currentLocale()) -- Yosemite compatible.
    -- Split it at the non-word characters.
    set nonWordCharacters to |⌘|'s class "NSCharacterSet"'s alphanumericCharacterSet()'s invertedSet()
    set theWords to theText's componentsSeparatedByCharactersInSet:(nonWordCharacters)
    
    -- Use a counted set to count the individual words' occurrences.
    set countedSet to |⌘|'s class "NSCountedSet"'s alloc()'s initWithArray:(theWords)
    
    -- Build a list of word/frequency records, excluding any empty strings left over from the splitting above.
    set mutableSet to |⌘|'s class "NSMutableSet"'s setWithSet:(countedSet)
    tell mutableSet to removeObject:("")
    script o
        property discreteWords : mutableSet's allObjects() as list
        property wordsAndFrequencies : {}
    end script
    set discreteWordCount to (count o's discreteWords)
    repeat with i from 1 to discreteWordCount
        set thisWord to item i of o's discreteWords
        set end of o's wordsAndFrequencies to {thisWord:thisWord, frequency:(countedSet's countForObject:(thisWord)) as integer}
    end repeat
    
    -- Convert to NSMutableArray, reverse-sort the result on the frequencies, and convert back to list.
    set wordsAndFrequencies to |⌘|'s class "NSMutableArray"'s arrayWithArray:(o's wordsAndFrequencies)
    set descendingByFrequency to |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("frequency") ascending:(false)
    tell wordsAndFrequencies to sortUsingDescriptors:({descendingByFrequency})
    set o's wordsAndFrequencies to wordsAndFrequencies as list
    
    if (discreteWordCount > n) then
        -- If there are more than n records, check for any immediately following the nth which may have the same frequency as it.
        set nthHighestFrequency to frequency of item n of o's wordsAndFrequencies
        set qualifierCount to n
        repeat with i from (n + 1) to discreteWordCount
            if (frequency of item i of o's wordsAndFrequencies = nthHighestFrequency) then
                set qualifierCount to i
            else
                exit repeat
            end if
        end repeat
    else
        -- Otherwise reduce n to the actual number of discrete words.
        set n to discreteWordCount
        set qualifierCount to discreteWordCount
    end if
    
    -- Compose a text report from the qualifying words and frequencies.
    if (qualifierCount = n) then
        set output to {"The " & n & " most frequently occurring words in the file are:"}
    else
        set output to {(qualifierCount as text) & " words share the " & ((n as text) & " highest frequencies in the file:")}
    end if
    repeat with i from 1 to qualifierCount
        set {thisWord:thisWord, frequency:frequency} to item i of o's wordsAndFrequencies
        set end of output to thisWord & ":      " & (tab & frequency)
    end repeat
    set astid to AppleScript's text item delimiters
    set AppleScript's text item delimiters to linefeed
    set output to output as text
    set AppleScript's text item delimiters to astid
    
    return output
end wordFrequency

-- Test code:
set filePath to POSIX path of ((path to desktop as text) & "www.rosettacode.org:Word frequency:135-0.txt")
set n to 10
return wordFrequency(filePath, n)

Output:

"The 10 most frequently occurring words in the file are:
The:      	41092
Of:      	19954
And:      	14943
A:      	14545
To:      	13953
In:      	11219
He:      	9649
Was:      	8622
That:      	7924
It:      	6661"

Arturo

findFrequency: function [file, count][
    freqs: #[]
    r: {/[[:alpha:]]+/}
    loop flatten map split.lines read file 'l -> match lower l r 'word [
        if not? key? freqs word -> freqs\[word]: 0
        freqs\[word]: freqs\[word] + 1
    ]
    freqs: sort.values.descending freqs
    result: new []
    loop 0..dec count 'x [
        'result ++ @[@[get keys freqs x, get values freqs x]]
    ]
    return result
]

loop findFrequency "https://www.gutenberg.org/files/135/135-0.txt" 10 'pair [
    print pair
]

Output:

the 41096 
of 19955 
and 14939 
a 14558 
to 13954 
in 11218 
he 9649 
was 8622 
that 7924 
it 6661

AutoHotkey

URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt"
FileRead, H, % A_temp "\tempfile.txt"
FileDelete,  % A_temp "\tempfile.txt"
words := []
while pos := RegExMatch(H, "\b[[:alpha:]]+\b", m, A_Index=1?1:pos+StrLen(m))
	words[m] := words[m] ? words[m] + 1 : 1
for word, count in words
	list .= count "`t" word "`r`n"
Sort, list, RN
loop, parse, list, `n, `r
{
	result .= A_LoopField "`r`n"
	if A_Index = 10
		break
}
MsgBox % "Freq`tWord`n" result
return

Output:

Freq	Word
41036	The
19946	of
14940	and
14589	A
13939	TO
11204	in
9645	HE
8619	WAS
7922	THAT
6659	it

AWK

# syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
#
# A word is anything separated by white space.
# Therefore "this" and "this." are different.
# But "This" and "this" are identical.
# As I am "free to define what a letter is" I have chosen to allow
# numerics and all special characters as they are usually considered
# parts of words in text processing applications.
#
{   nbytes += length($0) + 2 # +2 for CR/LF
    nfields += NF
    $0 = tolower($0)
    for (i=1; i<=NF; i++) {
      arr[$i]++
    }
}
END {
    show = (show == "") ? 10 : show
    width1 = length(show)
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in arr) {
      if (width2 == 0) { width2 = length(arr[i]) }
      if (n++ >= show) { break }
      printf("%*d %*d %s\n",width1,n,width2,arr[i],i)
    }
    printf("input: %d records, %d bytes, %d words of which %d are unique\n",NR,nbytes,nfields,length(arr))
    exit(0)
}

Output:

 1 40372 the
 2 19868 of
 3 14472 and
 4 14278 a
 5 13589 to
 6 11024 in
 7  9213 he
 8  8347 was
 9  7250 that
10  6414 his
input: 73829 records, 3369772 bytes, 568744 words of which 50394 are unique

BASIC

QB64

This program is rather long, but it fulfils the requirement in QB64. It "cleans" each word, so that a word is anything that begins and ends with a letter, and it keeps the words and their counts in arrays. QB64 handles the job with impressive speed, even for a file as large as Les Miserables.txt.
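
The listing below keeps its words in a sorted array and locates them by binary search (its BinarySearch and InsertWord routines). That bookkeeping can be sketched compactly in Python. This is an illustration of the idea rather than a translation of the QB64 program: the names count_word, words and counts are invented here, and the standard bisect module stands in for the hand-written search.

```python
from bisect import bisect_left

def count_word(words, counts, word):
    """Increment word's count, keeping the parallel lists sorted by word."""
    word = word.upper()
    pos = bisect_left(words, word)  # binary search for the word's slot
    if pos < len(words) and words[pos] == word:
        counts[pos] += 1            # already known: just count it
    else:
        words.insert(pos, word)     # new word: insert at its sorted position
        counts.insert(pos, 1)

words, counts = [], []
for w in "the cat saw The other cat".split():
    count_word(words, counts, w)
print(list(zip(words, counts)))
```

Keeping the list sorted makes each lookup logarithmic, at the price of shifting elements on every insertion, which is the same trade-off the QB64 arrays make.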

OPTION _EXPLICIT

' SUBs and FUNCTIONs
DECLARE SUB CountWords (FromString AS STRING)
DECLARE SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
DECLARE SUB ShowResults ()
DECLARE SUB ShowCompletion ()
DECLARE SUB TopCounted ()
DECLARE FUNCTION InsertWord& (WhichWord AS STRING)
DECLARE FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
DECLARE FUNCTION CleanWord$ (WhichWord AS STRING)

' Var
DIM iFile AS INTEGER
DIM iCol AS INTEGER
DIM iFil AS INTEGER
DIM iStep AS INTEGER
DIM iBar AS INTEGER
DIM iBlock AS INTEGER
DIM lIni AS LONG
DIM lEnd AS LONG
DIM lLines AS LONG
DIM lLine AS LONG
DIM lLenF AS LONG
DIM iRuns AS INTEGER
DIM lMaxWords AS LONG
DIM sTimer AS SINGLE
DIM strFile AS STRING
DIM strKey AS STRING
DIM strText AS STRING
DIM strDate AS STRING
DIM strTime AS STRING
DIM strBar AS STRING
DIM lWords AS LONG
DIM strWords AS STRING
CONST AddWords = 100
CONST TopCount = 10
CONST FALSE = 0, TRUE = NOT FALSE

' Initialize
iFile = FREEFILE
lIni = 1
strDate = DATE$
strTime = TIME$
lEnd = 0
lMaxWords = 1000
REDIM strWords(lIni TO lMaxWords) AS STRING
REDIM lWords(lIni TO lMaxWords) AS LONG
REDIM lTopWords(1) AS LONG
REDIM strTopWords(1) AS STRING

' ---Main program loop
$RESIZE:SMOOTH
DO
    DO
        CLS
        PRINT "This program will count how many words are in a text file and shows the 10"
        PRINT "most used of them."
        PRINT
        INPUT "Document to open (TXT file) (f=see files): ", strFile
        IF UCASE$(strFile) = "F" THEN
            strFile = ""
            FILES
            DO: LOOP UNTIL INKEY$ <> ""
        END IF
    LOOP UNTIL strFile <> ""
    OPEN strFile FOR BINARY AS #iFile
    IF LOF(iFile) > 0 THEN
        iRuns = iRuns + 1
        CLOSE #iFile

        ' Opens the document file to analyze it
        sTimer = TIMER
        ON TIMER(1) GOSUB ShowAdvance
        OPEN strFile FOR INPUT AS #iFile
        lLenF = LOF(iFile)
        PRINT "Looking for words in "; strFile; ". File size:"; STR$(lLenF); ". ";: iCol = POS(0): PRINT "Initializing";
        COLOR 23: PRINT "...";: COLOR 7

        ' Count how many lines the file has
        lLines = 0
        DO WHILE NOT EOF(iFile)
            LINE INPUT #iFile, strText
            lLines = lLines + 1
        LOOP
        CLOSE #iFile

        ' Shows the bar
        LOCATE , iCol: PRINT "Initialization complete."
        PRINT
        PRINT "Processing"; lLines; "lines";: COLOR 23: PRINT "...": COLOR 7
        iFil = CSRLIN
        iCol = POS(0)
        iBar = 80
        iBlock = 80 / lLines
        IF iBlock = 0 THEN iBlock = 1
        PRINT STRING$(iBar, 176)
        lLine = 0
        iStep = lLines * iBlock / iBar
        IF iStep = 0 THEN iStep = 1
        IF iStep > 20 THEN
            TIMER ON
        END IF
        OPEN strFile FOR INPUT AS #iFile
        DO WHILE NOT EOF(iFile)
            lLine = lLine + 1
            IF (lLine MOD iStep) = 0 THEN
                strBar = STRING$(iBlock * (lLine / iStep), 219)
                LOCATE iFil, 1
                PRINT strBar
                ShowCompletion
            END IF
            LINE INPUT #iFile, strText
            CountWords strText
            strKey = INKEY$
        LOOP
        ShowCompletion
        CLOSE #iFile
        TIMER OFF
        LOCATE iFil - 1, 1
        PRINT "Done!" + SPACE$(70)
        strBar = STRING$(iBar, 219)
        LOCATE iFil, 1
        PRINT strBar
        LOCATE iFil + 5, 1
        PRINT "Finishing";: COLOR 23: PRINT "...";: COLOR 7
        ShowResults

        ' Frees the RAM
        lMaxWords = 1000
        lEnd = 0
        REDIM strWords(lIni TO lMaxWords) AS STRING
        REDIM lWords(lIni TO lMaxWords) AS LONG

    ELSE
        CLOSE #iFile
        KILL strFile
        PRINT
        PRINT "Document does not exist."
    END IF
    PRINT
    PRINT "Try again? (Y/n)"
    DO
        strKey = UCASE$(INKEY$)
    LOOP UNTIL strKey = "Y" OR strKey = "N" OR strKey = CHR$(13) OR strKey = CHR$(27)
LOOP UNTIL strKey = "N" OR strKey = CHR$(27) OR iRuns >= 99

CLS
IF iRuns >= 99 THEN
    PRINT "Maximum runs reached for this session."
END IF

PRINT "End of program"
PRINT "Start date/time: "; strDate; " "; strTime
PRINT "End date/time..: "; DATE$; " "; TIME$
END
' ---End main program

ShowAdvance:
ShowCompletion
RETURN

FUNCTION BinarySearch& (LookFor AS STRING, RetPos AS INTEGER)
    ' Var
    DIM lFound AS LONG
    DIM lLow AS LONG
    DIM lHigh AS LONG
    DIM lMid AS LONG
    DIM strLookFor AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG

    ' Binary search for stated word in the list
    lLow = lIni
    lHigh = lEnd
    lFound = 0
    strLookFor = UCASE$(LookFor)
    DO WHILE (lFound < 1) AND (lLow <= lHigh)
        lMid = (lHigh + lLow) / 2
        IF strWords(lMid) = strLookFor THEN
            lFound = lMid
        ELSEIF strWords(lMid) > strLookFor THEN
            lHigh = lMid - 1
        ELSE
            lLow = lMid + 1
        END IF
    LOOP

    ' Should I return the position if not found?
    IF lFound = 0 AND RetPos THEN
        IF lEnd < 1 THEN
            lFound = 1
        ELSEIF strWords(lMid) > strLookFor THEN
            lFound = lMid
        ELSE
            lFound = lMid + 1
        END IF
    END IF

    ' Return the value
    BinarySearch = lFound
    
END FUNCTION

FUNCTION CleanWord$ (WhichWord AS STRING)
    ' Var
    DIM iSeek AS INTEGER
    DIM iStep AS INTEGER
    DIM bOK AS INTEGER
    DIM strWord AS STRING
    DIM strChar AS STRING

    strWord = WhichWord

    ' Strip non-letter characters from the end, then from the start
    strWord = LTRIM$(RTRIM$(strWord))
    IF LEN(strWord) > 0 THEN
        iStep = 0
        DO
            ' Determines if step will be forward or backwards
            IF iStep = 0 THEN
                iStep = -1
            ELSE
                iStep = 1
            END IF

            ' Sets the initial value of iSeek
            IF iStep = -1 THEN
                iSeek = LEN(strWord)
            ELSE
                iSeek = 1
            END IF

            bOK = FALSE
            DO
                strChar = MID$(strWord, iSeek, 1)
                SELECT CASE strChar
                    CASE "A" TO "Z"
                        bOK = TRUE
                    CASE CHR$(129) TO CHR$(154)
                        bOK = TRUE
                    CASE CHR$(160) TO CHR$(165)
                        bOK = TRUE
                END SELECT

                ' If the character is not a valid letter, move one position
                IF NOT bOK THEN
                    iSeek = iSeek + iStep
                END IF

                ' If no letter was recognized, then exit the loop
                IF iSeek < 1 OR iSeek > LEN(strWord) THEN
                    bOK = TRUE
                END IF
            LOOP UNTIL bOK

            IF iStep = -1 THEN
                ' Reviews if a word was encountered
                IF iSeek > 0 THEN
                    strWord = LEFT$(strWord, iSeek)
                ELSE
                    strWord = ""
                END IF
            ELSEIF iStep = 1 THEN
                IF iSeek <= LEN(strWord) THEN
                    strWord = MID$(strWord, iSeek)
                ELSE
                    strWord = ""
                END IF
            END IF
        LOOP UNTIL iStep = 1 OR strWord = ""
    END IF

    ' Return the result
    CleanWord = strWord

END FUNCTION

SUB CountWords (FromString AS STRING)
    ' Var
    DIM iStart AS INTEGER
    DIM iLenW AS INTEGER
    DIM iLenS AS INTEGER
    DIM iLenD AS INTEGER
    DIM strString AS STRING
    DIM strWord AS STRING
    DIM lWhichWord AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG
  
    ' Converts to Upper Case and cleans leading and trailing spaces
    strString = UCASE$(FromString)
    strString = LTRIM$(RTRIM$(strString))

    ' Get words from string
    iStart = 1
    iLenW = 0
    iLenS = LEN(strString)
    DO WHILE iStart <= iLenS
        iLenW = INSTR(iStart, strString, " ")
        IF iLenW = 0 AND iStart <= iLenS THEN
            iLenW = iLenS + 1
        END IF
        strWord = MID$(strString, iStart, iLenW - iStart)

        ' Will remove mid dashes or apostrophe or "â€"
        iLenD = INSTR(strWord, "-")
        IF iLenD < 1 THEN
            iLenD = INSTR(strWord, "'")
            IF iLenD < 1 THEN
                iLenD = INSTR(strWord, "â€")
            END IF
        END IF
        IF iLenD >= 2 THEN
            strWord = LEFT$(strWord, iLenD - 1)
            iLenW = iStart + (iLenD - 1)
        END IF
        strWord = CleanWord(strWord)

        IF strWord <> "" THEN
            ' Look for the word to be counted
            lWhichWord = BinarySearch(strWord, FALSE)

            ' If the word doesn't exist in the list, add it
            IF lWhichWord = 0 THEN
                lWhichWord = InsertWord(strWord)
            END IF

            ' Count the word
            IF lWhichWord > 0 THEN
                lWords(lWhichWord) = lWords(lWhichWord) + 1
            END IF
        END IF
        iStart = iLenW + 1
    LOOP

END SUB

' Here a word will be inserted in the list
FUNCTION InsertWord& (WhichWord AS STRING)
    ' Var
    DIM lFound AS LONG
    DIM lWord AS LONG
    DIM strWord AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG

    ' Look for the word in the list
    strWord = UCASE$(WhichWord)
    lFound = BinarySearch(WhichWord, TRUE)
    IF lFound > 0 THEN
        ' Add one word
        lEnd = lEnd + 1

        ' Verifies if there is still room for a new word
        IF lEnd > lMaxWords THEN
            lMaxWords = lMaxWords + AddWords ' Other words
            IF lMaxWords > 32767 THEN
                IF lEnd <= 32767 THEN
                    lMaxWords = 32767
                ELSE
                    lFound = -1
                END IF
            END IF

            IF lFound > 0 THEN
                REDIM _PRESERVE strWords(lIni TO lMaxWords) AS STRING
                REDIM _PRESERVE lWords(lIni TO lMaxWords) AS LONG
            END IF
        END IF

        IF lFound > 0 THEN
            ' Move the words below this
            IF lEnd > 1 THEN
                FOR lWord = lEnd TO lFound + 1 STEP -1
                    strWords(lWord) = strWords(lWord - 1)
                    lWords(lWord) = lWords(lWord - 1)
                NEXT lWord
            END IF
    
            ' Insert the word in the position
            strWords(lFound) = strWord
            lWords(lFound) = 0
        END IF
    END IF

    InsertWord = lFound
END FUNCTION

SUB QuickSort (lLeftN AS LONG, lRightN AS LONG, iMode AS INTEGER)
    ' Var
    DIM lPivot AS LONG
    DIM lLeftNIdx AS LONG
    DIM lRightNIdx AS LONG
    SHARED lWords() AS LONG
    SHARED strWords() AS STRING

    ' Classifies from highest to lowest
    lLeftNIdx = lLeftN
    lRightNIdx = lRightN
    IF (lRightN - lLeftN) > 0 THEN
        lPivot = (lLeftN + lRightN) / 2
        DO WHILE (lLeftNIdx <= lPivot) AND (lRightNIdx >= lPivot)
            IF iMode = 0 THEN ' Ascending
                DO WHILE (lWords(lLeftNIdx) < lWords(lPivot)) AND (lLeftNIdx <= lPivot)
                    lLeftNIdx = lLeftNIdx + 1
                LOOP
                DO WHILE (lWords(lRightNIdx) > lWords(lPivot)) AND (lRightNIdx >= lPivot)
                    lRightNIdx = lRightNIdx - 1
                LOOP
            ELSE ' Descending
                DO WHILE (lWords(lLeftNIdx) > lWords(lPivot)) AND (lLeftNIdx <= lPivot)
                    lLeftNIdx = lLeftNIdx + 1
                LOOP
                DO WHILE (lWords(lRightNIdx) < lWords(lPivot)) AND (lRightNIdx >= lPivot)
                    lRightNIdx = lRightNIdx - 1
                LOOP
            END IF
            SWAP lWords(lLeftNIdx), lWords(lRightNIdx)
            SWAP strWords(lLeftNIdx), strWords(lRightNIdx)
            lLeftNIdx = lLeftNIdx + 1
            lRightNIdx = lRightNIdx - 1
            IF (lLeftNIdx - 1) = lPivot THEN
                lRightNIdx = lRightNIdx + 1
                lPivot = lRightNIdx
            ELSEIF (lRightNIdx + 1) = lPivot THEN
                lLeftNIdx = lLeftNIdx - 1
                lPivot = lLeftNIdx
            END IF
        LOOP
        QuickSort lLeftN, lPivot - 1, iMode
        QuickSort lPivot + 1, lRightN, iMode
    END IF
END SUB

SUB ShowCompletion ()
    ' Var
    SHARED iFil AS INTEGER
    SHARED lLine AS LONG
    SHARED lLines AS LONG
    SHARED lEnd AS LONG

    LOCATE iFil + 1, 1
    PRINT "Lines analyzed :"; lLine
    PRINT USING "% of completion: ###%"; (lLine / lLines) * 100
    PRINT "Words found....:"; lEnd
END SUB

SUB ShowResults ()
    ' Var
    DIM iMaxL AS INTEGER
    DIM iMaxW AS INTEGER
    DIM lWord AS LONG
    DIM lHowManyWords AS LONG
    DIM strString AS STRING
    DIM strFileR AS STRING
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lLenF AS LONG
    SHARED lMaxWords AS LONG
    SHARED sTimer AS SINGLE
    SHARED strFile AS STRING
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG
    SHARED strTopWords() AS STRING
    SHARED lTopWords() AS LONG
    SHARED iRuns AS INTEGER

    ' Show results

    ' Creates file name
    lWord = INSTR(strFile, ".")
    IF lWord = 0 THEN lWord = LEN(strFile)
    strFileR = LEFT$(strFile, lWord)
    IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."

    ' Retrieves the longest word found and the highest count
    FOR lWord = lIni TO lEnd
        ' Gets the longest word found
        IF LEN(strWords(lWord)) > iMaxL THEN
            iMaxL = LEN(strWords(lWord))
        END IF

        lHowManyWords = lHowManyWords + lWords(lWord)
    NEXT lWord
    IF iMaxL > 60 THEN iMaxW = 60 ELSE iMaxW = iMaxL

    ' Gets top counted
    TopCounted

    ' Shows the results
    CLS
    PRINT "File analyzed : "; strFile
    PRINT "Length of file:"; lLenF
    PRINT "Time lapse....:"; TIMER - sTimer;"seconds"
    PRINT "Words found...:"; lHowManyWords; "(Unique:"; STR$(lEnd); ")"
    PRINT "Longest word..:"; iMaxL
    PRINT
    PRINT "The"; TopCount; "most used are:"
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    PRINT " Word"; SPACE$(iMaxW - 5); "| Count"
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    strString = "\" + SPACE$(iMaxW - 2) + "\| #########,"
    FOR lWord = lIni TO TopCount
        PRINT USING strString; strTopWords(lWord); lTopWords(lWord)
    NEXT lWord
    PRINT STRING$(iMaxW, "-"); "+"; STRING$(80 - (iMaxW + 1), "-")
    PRINT "See files "; strFileR + "S" + LTRIM$(STR$(iRuns)); " and "; strFileR + "C" + LTRIM$(STR$(iRuns)); " to see the full list."
END SUB

SUB TopCounted ()
    ' Var
    DIM lWord AS LONG
    DIM strFileR AS STRING
    DIM iFile AS INTEGER
    CONST Descending = 1
    SHARED lIni AS LONG
    SHARED lEnd AS LONG
    SHARED lMaxWords AS LONG
    SHARED strWords() AS STRING
    SHARED lWords() AS LONG
    SHARED strTopWords() AS STRING
    SHARED lTopWords() AS LONG
    SHARED iRuns AS INTEGER
    SHARED strFile AS STRING

    ' Assigns new dimensions
    REDIM strTopWords(lIni TO TopCount) AS STRING
    REDIM lTopWords(lIni TO TopCount) AS LONG

    ' Saves the current values
    lWord = INSTR(strFile, ".")
    IF lWord = 0 THEN lWord = LEN(strFile)
    strFileR = LEFT$(strFile, lWord)
    IF RIGHT$(strFileR, 1) <> "." THEN strFileR = strFileR + "."
    iFile = FREEFILE
    OPEN strFileR + "S" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
    FOR lWord = lIni TO lEnd
        WRITE #iFile, strWords(lWord), lWords(lWord)
    NEXT lWord
    CLOSE #iFile

    ' Classifies the counted in descending order
    QuickSort lIni, lEnd, Descending

    ' Now, saves the required values in the arrays
    FOR lWord = lIni TO TopCount
        strTopWords(lWord) = strWords(lWord)
        lTopWords(lWord) = lWords(lWord)
    NEXT lWord

    ' Now, saves the sorted list to the file
    OPEN strFileR + "C" + LTRIM$(STR$(iRuns)) FOR OUTPUT AS #iFile
    FOR lWord = lIni TO lEnd
        WRITE #iFile, strWords(lWord), lWords(lWord)
    NEXT lWord
    CLOSE #iFile

END SUB

Output:

This program will count how many words are in a text file and shows the 10 
most used of them.

Document to open (TXT file) (f=see files): miserabl.txt
Looking for words in miserabl.txt. File size: 3369775. Initialization complete.

Processing... Done!
Lines analyzed : 72917
% of completion: 100%
Words found....: 23288

Finishing...

Length of file: 3369775
Time lapse....: 35 seconds
Words found...: 578614 (Unique: 23538)
Longest word..: 25

The 10 most used are:
---------------------------+------------------------------------------------------------------------
Word                       | Count
---------------------------+------------------------------------------------------------------------
THE                        |     40,751
OF                         |     19,949
AND                        |     14,891
A                          |     14,430
TO                         |     13,923
IN                         |     11,189
HE                         |      9,605
WAS                        |      8,617
THAT                       |      7,833
IT                         |      6,579
---------------------------+------------------------------------------------------------------------
See files miserabl.S1 and miserabl.C1 to see the full list.

Try again? (Y/n)

BaCon

This version removes all punctuation, digits, tabs and carriage returns, so "This", "this" and "this." are counted as the same word. UTF-8 characters in words are fully supported. The code could be shorter, but for the sake of clarity everything has been written out explicitly.
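As a rough illustration of this normalization (a sketch in Python, not a translation of the BaCon program), the snippet below deletes ASCII punctuation, digits, tabs and carriage returns, lowercases the text and splits on whitespace. The function name normalize_and_count is hypothetical, and string.punctuation covers ASCII only, whereas the [[:punct:]] class used in the BaCon code may cover more.

```python
import string
from collections import Counter

def normalize_and_count(text):
    """Delete punctuation, digits, tabs and CRs, then count lowercase words."""
    # Assumption: ASCII punctuation only; a regex class may match more.
    drop = string.punctuation + string.digits + "\t\r"
    cleaned = text.translate(str.maketrans("", "", drop))
    # split() with no argument collapses runs of whitespace, so empty
    # strings are never counted as words.
    return Counter(cleaned.lower().split())

counts = normalize_and_count('This, this and "this."\r\nThat 42 times')
print(counts.most_common(1))  # [('this', 3)]
```

Because characters are deleted rather than replaced by spaces, a word such as "it's" becomes "its" and "well-dressed" becomes "welldressed" in this sketch.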

' We do not count superfluous spaces as words
OPTION COLLAPSE TRUE

' Optional: use TRE regex library to speed up the program
PRAGMA RE tre INCLUDE <tre/regex.h> LDFLAGS -ltre

' We're using associative arrays
DECLARE frequency ASSOC NUMBER

' Load the text and remove all punctuation, digits, tabs and cr
book$ = EXTRACT$(LOAD$("miserables.txt"), "[[:punct:]]|[[:digit:]]|[\t\r]", TRUE)

' Count each word in lowercase
FOR word$ IN REPLACE$(book$, NL$, CHR$(32))
    INCR frequency(LCASE$(word$))
NEXT

' Sort the associative array and then map the index to a string array
LOOKUP frequency TO term$ SIZE x SORT DOWN

' Show results
FOR i = 0 TO 9
    PRINT term$[i], " : ", frequency(term$[i])
NEXT

Output:

the : 40440
of : 19903
and : 14738
a : 14306
to : 13630
in : 11083
he : 9452
was : 8605
that : 7535
his : 6434

Batch File

This takes a very long time per word, so I have chosen to feed it a 200-line sample.

You could shorten this drastically if you did not need to recall the word at the nth position and only wished to display the top 10 words.
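The script below rescans the whole input with find /i /c for every new word, which is why it is slow. Note also that find /c reports the number of lines containing the string, not the number of whole-word occurrences. The following Python sketch contrasts the two measures on a made-up three-line sample; the function names and the sample lines are illustrative assumptions, not part of the batch solution.

```python
import re
from collections import Counter

lines = ["A cat sat", "a bat and a hat", "The end"]

def lines_containing(needle, lines):
    """Count lines that contain needle as a case-insensitive substring."""
    return sum(needle.lower() in line.lower() for line in lines)

def word_occurrences(lines):
    """Count whole words (runs of a-z) in a single pass over the input."""
    return Counter(w for line in lines
                   for w in re.findall(r"[a-z]+", line.lower()))

words = word_occurrences(lines)
print(lines_containing("at", lines), words["at"])  # 2 0
print(lines_containing("a", lines), words["a"])    # 2 3
```

Short strings such as "a" match inside many other words, so a substring-based line count can differ considerably from a true word count.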

@echo off

call:wordCount 1 2 3 4 5 6 7 8 9 10 42 101

pause>nul
exit

:wordCount
setlocal enabledelayedexpansion

set word=100000
set line=0
for /f "delims=" %%i in (input.txt) do (
	set /a line+=1
	for %%j in (%%i) do (
		if not !skip%%j!==true (
			echo line !line! ^| word !word:~-5! - "%%~j"
			
			type input.txt | find /i /c "%%~j" > count.tmp
			set /p tmpvar=<count.tmp
			
			set tmpvar=000000000!tmpvar!
			set tmpvar=!tmpvar:~-10!
			set count[!word!]=!tmpvar! %%~j
			
			set "skip%%j=true"
			set /a word+=1
		)
	)
)
del count.tmp

set wordcount=0
for /f "tokens=1,2 delims= " %%i in ('set count ^| sort /+14 /r') do (
	set /a wordcount+=1
	for /f "tokens=2 delims==" %%k in ("%%i") do (
		set word[!wordcount!]=!wordcount!. %%j - %%k
	)
)

cls
for %%i in (%*) do echo !word[%%i]!
endlocal
goto:eof

Output:

1.  - 0000000140 I
2.  - 0000000140 a
3.  - 0000000118 He
4.  - 0000000100 the
5.  - 0000000080 an
6.  - 0000000075 in
7.  - 0000000066 at
8.  - 0000000062 is
9.  - 0000000058 on
10.  - 0000000058 as
42.  - 0000000010 with
101.  - 0000000004 ears

Bracmat

This solution assumes that words consist of characters that exist in both a lowercase and an uppercase version, so it won't work with many non-European alphabets.

The built-in vap function can take either two or three arguments. The first argument must be the name of a function or a function definition. The second argument must be a string. The two-argument version maps the function to each character in the string. The three-argument version splits the string at each occurrence of the third argument, which must be a single character, and applies the function to the intervening substrings. The output of vap is a space-separated list of results from the function argument.

The expression !('($arg:?A [($pivot) ?Z)) must be read as follows:

The subexpression '($arg:?A [($pivot) ?Z) is a macro expression. The symbols arg and pivot, which are the right hand sides of $ operators with empty left hand side, are replaced by the actual values of !arg and !pivot. The whole subexpression is made the right hand side of a = operator with empty left hand side, e.g. =a b c d e:?A [2 ?Z. The = operator protects the subexpression against evaluation. By prefixing the expression with the ! unary operator (which normally is used to obtain the value of a variable), the pattern match operation a b c d e:?A [2 ?Z is executed, assigning a b to A and assigning c d e to Z.

The reason for using a macro expression is that evaluating a pattern match with a variable in the pattern, as in !arg:?A [!pivot ?Z, is unnecessarily slow, since !pivot is evaluated up to !pivot+1 times.
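For readers unfamiliar with Bracmat, the split-and-sum idea can be sketched in Python. This is only an analogy: list slices stand in for the pattern ?A [pivot ?Z, and collections.Counter addition stands in for Bracmat's +, which (as the comments in the listing indicate) combines equal terms into frequency*word products. The name merge_count is hypothetical.

```python
from collections import Counter

def merge_count(words):
    """Split the list at the midpoint and add the tallies of both halves."""
    n = len(words)
    if n > 1:
        pivot = n // 2
        # words[:pivot] plays the role of A, words[pivot:] the role of Z.
        return merge_count(words[:pivot]) + merge_count(words[pivot:])
    return Counter(words)  # zero or one word: nothing left to split

words = "the cat and the dog and the bird".split()
tally = merge_count(words)
# Sort ascending by count and take the tail, as the Bracmat code does.
print(sorted(tally.items(), key=lambda kv: kv[1])[-2:])
# [('and', 2), ('the', 3)]
```

The recursion depth is logarithmic in the number of words, which is the reason for splitting in halves instead of adding one word at a time.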

  ( 10-most-frequent-words
  =     MergeSort                        { Local variable declarations. }
        types
        sorted-words
        frequency
        type
        most-frequent-words
    .   ( MergeSort                      { Definition of function MergeSort. }
        =   A N Z pivot
          .   !arg:? [?N                 { [?N is a subpattern that counts the number of preceding elements }
            & (   !N:>1                           { if N at least 2 ... }
                & div$(!N.2):?pivot               {     divide N by 2 ... }
                & !('($arg:?A [($pivot) ?Z))      {     split list in two halves A and Z ... }
                & MergeSort$!A+MergeSort$!Z       {     sort each of A and Z and return sum }
              | !arg                              { else just return a single element}
              )
        )
      &     MergeSort             { Sort }
          $ ( vap                 { Split second argument at each occurrence of third character and apply first argument to each chunk. }
            $ ( (=.low$!arg)      { Return input, lowercased. }
              .   str
                $ ( vap           { Vaporize second argument in UTF-8 or Latin-1 characters and apply first argument to each of them. }
                  $ ( (
                      =
                        .   upp$!arg:low$!arg&\n  { Return newline instead of non-alphabetic character. }
                          | !arg                  { Return (Euro-centric) alphabetic character.}
                      )
                    . get$(!arg,NEW STR) { Read input text as a single string. }
                    )
                  )
              . \n                       { Split at newlines }
              )
            )
        : ?sorted-words                  { Assign sum of (frequency*lowercasedword) terms to sorted-words. }
      & :?types                          { Initialize types as an empty list. }
      &   whl                            { Loop until right hand side fails. }
        ' ( !sorted-words:#?frequency*%@?type+?sorted-words    { Extract first frequency*type term from sum. }
          & (!frequency.!type) !types:?types                   { Prepend (frequency.type) pair to types list}
          )
      &   MergeSort$!types                                     { Sort the list of (frequency.type) pairs. }
        : (?+[-11+?most-frequent-words|?most-frequent-words)   { Pick the last 10 terms from the sum returned by MergeSort. }
      & !most-frequent-words                                   { Return the last 10 terms. }
  )
& out$(10-most-frequent-words$"135-0.txt")      { Call 10-most-frequent-words with name of input file and print result to screen. }

Output:

  (6661.it)
+ (7924.that)
+ (8622.was)
+ (9649.he)
+ (11219.in)
+ (13953.to)
+ (14546.a)
+ (14943.and)
+ (19954.of)
+ (41092.the)

C

Library: GLib

Words are defined by the regular expression "\w+".
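To make the consequences of this definition concrete, here is a small sketch in Python (not part of the C solution). It uses Python's re module with the re.ASCII flag as a stand-in; GLib's regex engine may treat non-ASCII letters differently, so take this as an approximation only. The sample sentence is made up.

```python
import re

text = "It's a well-dressed man_1 in 1862"
# \w matches letters, digits and the underscore, so apostrophes and
# hyphens split words while digits and underscores stay inside them.
tokens = re.findall(r"\w+", text.lower(), flags=re.ASCII)
print(tokens)
# ['it', 's', 'a', 'well', 'dressed', 'man_1', 'in', '1862']
```

Under this definition numbers such as "1862" are counted as words, which is one reason the totals differ slightly between solutions on this page.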

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

typedef struct word_count_tag {
    const char* word;
    size_t count;
} word_count;

int compare_word_count(const void* p1, const void* p2) {
    const word_count* w1 = p1;
    const word_count* w2 = p2;
    if (w1->count > w2->count)
        return -1;
    if (w1->count < w2->count)
        return 1;
    return 0;
}

bool get_top_words(const char* filename, size_t count) {
    GError* error = NULL;
    GMappedFile* mapped_file = g_mapped_file_new(filename, FALSE, &error);
    if (mapped_file == NULL) {
        fprintf(stderr, "%s\n", error->message);
        g_error_free(error);
        return false;
    }
    const char* text = g_mapped_file_get_contents(mapped_file);
    if (text == NULL) {
        fprintf(stderr, "File %s is empty\n", filename);
        g_mapped_file_unref(mapped_file);
        return false;
    }
    gsize file_size = g_mapped_file_get_length(mapped_file);
    // Store word counts in a hash table
    GHashTable* ht = g_hash_table_new_full(g_str_hash, g_str_equal,
                                           g_free, g_free);
    GRegex* regex = g_regex_new("\\w+", 0, 0, NULL);
    GMatchInfo* match_info;
    g_regex_match_full(regex, text, file_size, 0, 0, &match_info, NULL);
    while (g_match_info_matches(match_info)) {
        char* word = g_match_info_fetch(match_info, 0);
        char* lower = g_utf8_strdown(word, -1);
        g_free(word);
        size_t* count = g_hash_table_lookup(ht, lower);
        if (count != NULL) {
            ++*count;
            g_free(lower);
        } else {
            count = g_new(size_t, 1);
            *count = 1;
            g_hash_table_insert(ht, lower, count);
        }
        g_match_info_next(match_info, NULL);
    }
    g_match_info_free(match_info);
    g_regex_unref(regex);
    g_mapped_file_unref(mapped_file);

    // Sort words in decreasing order of frequency
    size_t size = g_hash_table_size(ht);
    word_count* words = g_new(word_count, size);
    GHashTableIter iter;
    gpointer key, value;
    g_hash_table_iter_init(&iter, ht);
    for (size_t i = 0; g_hash_table_iter_next(&iter, &key, &value); ++i) {
        words[i].word = key;
        words[i].count = *(size_t*)value;
    }
    qsort(words, size, sizeof(word_count), compare_word_count);

    // Print the most common words
    if (count > size)
        count = size;
    printf("Top %zu words\n", count);
    printf("Rank\tCount\tWord\n");
    for (size_t i = 0; i < count; ++i)
        printf("%zu\t%zu\t%s\n", i + 1, words[i].count, words[i].word);
    g_free(words);
    g_hash_table_destroy(ht);
    return true;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return EXIT_FAILURE;
    }
    if (!get_top_words(argv[1], 10))
        return EXIT_FAILURE;
    return EXIT_SUCCESS;
}

Output:

Top 10 words
Rank	Count	Word
1	41039	the
2	19951	of
3	14942	and
4	14527	a
5	13941	to
6	11209	in
7	9646	he
8	8620	was
9	7922	that
10	6659	it

C#

Translation of: D

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace WordCount {
    class Program {
        static void Main(string[] args) {
            var text = File.ReadAllText("135-0.txt").ToLower();

            var match = Regex.Match(text, "\\w+");
            Dictionary<string, int> freq = new Dictionary<string, int>();
            while (match.Success) {
                string word = match.Value;
                if (freq.ContainsKey(word)) {
                    freq[word]++;
                } else {
                    freq.Add(word, 1);
                }

                match = match.NextMatch();
            }

            Console.WriteLine("Rank  Word  Frequency");
            Console.WriteLine("====  ====  =========");
            int rank = 1;
            foreach (var elem in freq.OrderByDescending(a => a.Value).Take(10)) {
                Console.WriteLine("{0,2}    {1,-4}    {2,5}", rank++, elem.Key, elem.Value);
            }
        }
    }
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41035
 2    of      19946
 3    and     14940
 4    a       14577
 5    to      13939
 6    in      11204
 7    he       9645
 8    was      8619
 9    that     7922
10    it       6659

C++

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

int main(int ac, char** av) {
  std::ios::sync_with_stdio(false);
  int head = (ac > 1) ? std::atoi(av[1]) : 10;
  std::istreambuf_iterator<char> it(std::cin), eof;
  std::filebuf file;
  if (ac > 2) {
    if (file.open(av[2], std::ios::in), file.is_open()) {
      it = std::istreambuf_iterator<char>(&file);
    } else return std::cerr << "file " << av[2] << " open failed\n", 1;
  }
  auto alpha = [](unsigned c) { return c-'A' < 26 || c-'a' < 26; };
  auto lower = [](unsigned c) { return c | '\x20'; };
  std::unordered_map<std::string, int> counts;
  std::string word;
  for (; it != eof; ++it) {
    if (alpha(*it)) {
      word.push_back(lower(*it));
    } else if (!word.empty()) {
      ++counts[word];
      word.clear();
    }
  }
  if (!word.empty()) ++counts[word]; // if file ends w/o ws
  std::vector<std::pair<const std::string,int> const*> out;
  for (auto& count : counts) out.push_back(&count);
  std::partial_sort(out.begin(),
    out.size() < head ? out.end() : out.begin() + head,
    out.end(), [](auto const* a, auto const* b) {
      return a->second > b->second;
    });
  if (out.size() > head) out.resize(head);
  for (auto const& count : out) {
    std::cout << count->first << ' ' << count->second << '\n';
  }
  return 0;
}

Output:

$ ./a.out 10 135-0.txt 
the 41093
of 19954
and 14943
a 14558
to 13953
in 11219
he 9649
was 8622
that 7924
it 6661

Alternative

Translation of: C#

#include <algorithm>
#include <cctype>
#include <cstdio>
#include <iostream>
#include <fstream>
#include <map>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::regex wordRgx("\\w+");
    std::map<std::string, int> freq;
    std::string line;
    const int top = 10;

    std::ifstream in("135-0.txt");
    if (!in.is_open()) {
        std::cerr << "Failed to open file\n";
        return 1;
    }
    while (std::getline(in, line)) {
        auto words_itr = std::sregex_iterator(
            line.cbegin(), line.cend(), wordRgx);
        auto words_end = std::sregex_iterator();
        while (words_itr != words_end) {
            auto match = *words_itr;
            auto word = match.str();
            if (word.size() > 0) {
                std::transform(word.begin(), word.end(), word.begin(), ::tolower);
                auto entry = freq.find(word);
                if (entry != freq.end()) {
                    entry->second++;
                } else {
                    freq.insert(std::make_pair(word, 1));
                }
            }
            words_itr = std::next(words_itr);
        }
    }
    in.close();

    std::vector<std::pair<std::string, int>> pairs;
    for (auto iter = freq.cbegin(); iter != freq.cend(); ++iter) {
        pairs.push_back(*iter);
    }
    std::sort(pairs.begin(), pairs.end(), [](auto& a, auto& b) {
        return a.second > b.second;
    });

    std::cout << "Rank  Word  Frequency\n"
                 "====  ====  =========\n";
    int rank = 1;
    for (auto iter = pairs.cbegin(); iter != pairs.cend() && rank <= top; ++iter) {
        std::printf("%2d   %4s   %5d\n", rank++, iter->first.c_str(), iter->second);
    }

    return 0;
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the   36636
 2     of   19615
 3    and   14079
 4     to   13535
 5      a   13527
 6     in   10256
 7    was    8543
 8   that    7324
 9     he    6814
10    had    6139

C++20

Translation of: C#

#include <algorithm>
#include <iostream>
#include <format>
#include <fstream>
#include <map>
#include <ranges>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::ifstream in("135-0.txt");
    std::string text{
        std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}
    };
    in.close();

    std::regex word_rx("\\w+");
    std::map<std::string, int> freq;
    for (const auto& a : std::ranges::subrange(
        std::sregex_iterator{ text.cbegin(),text.cend(), word_rx }, std::sregex_iterator{}
    ))
    {
        auto word = a.str();
        std::transform(word.begin(), word.end(), word.begin(), ::tolower);
        freq[word]++;
    }

    std::vector<std::pair<std::string, int>> pairs;
    for (const auto& elem : freq)
    {
        pairs.push_back(elem);
    }

    std::ranges::sort(pairs, std::ranges::greater{}, &std::pair<std::string, int>::second);

    std::cout << "Rank  Word  Frequency\n"
        "====  ====  =========\n";
    for (int rank=1; const auto& [word, count] : pairs | std::views::take(10))
    {
        std::cout << std::format("{:2}   {:>4}   {:5}\n", rank++, word, count);
    }
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the   41043
 2     of   19952
 3    and   14938
 4      a   14539
 5     to   13942
 6     in   11208
 7     he    9646
 8    was    8620
 9   that    7922
10     it    6659

Clojure

(defn count-words [file n]
  (->> file
    slurp
    clojure.string/lower-case
    (re-seq #"\w+")
    frequencies
    (sort-by val >)
    (take n)))

Output:

user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939]
 ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])

COBOL

       IDENTIFICATION DIVISION.
       PROGRAM-ID. WordFrequency.
       AUTHOR.  Bill Gunshannon.
       DATE-WRITTEN.  30 Jan 2020.
      ************************************************************
      ** Program Abstract:
      **   Given a text file and an integer n, print the n most
      **   common words in the file (and the number of their
      **   occurrences) in decreasing frequency.
      **
      **   A file named Parameter.txt provides this information.
      **   Format is:
      **   12345678901234567890123456789012345678901234567890
      **   |------------------|----|
      **     ^^^^^^^^^^^^^^^^  ^^^^
      **          |              |
      **     Source Text File   Number of words with count
      **       20 Characters      5 digits with leading zeroes
      **
      **
      ************************************************************
       
       ENVIRONMENT DIVISION.
       
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
            SELECT Parameter-File ASSIGN TO "Parameter.txt"
                 ORGANIZATION IS LINE SEQUENTIAL.
            SELECT Input-File ASSIGN TO Source-Text
                 ORGANIZATION IS LINE SEQUENTIAL.
            SELECT Word-File ASSIGN TO "Word.txt"
                 ORGANIZATION IS LINE SEQUENTIAL.
            SELECT Output-File ASSIGN TO "Output.txt"
                 ORGANIZATION IS LINE SEQUENTIAL.
            SELECT Print-File ASSIGN TO "Printer.txt"
                 ORGANIZATION IS LINE SEQUENTIAL.
            SELECT Sort-File     ASSIGN TO DISK.
       
       DATA DIVISION.
       
       FILE SECTION.
       
       FD  Parameter-File
           DATA RECORD IS Parameter-Record.
       01  Parameter-Record.
           05 Source-Text               PIC X(20).
           05 How-Many                  PIC 99999.

       FD  Input-File
           DATA RECORD IS Input-Record.
       01  Input-Record.
           05 Input-Line                PIC X(80).

       FD  Word-File
           DATA RECORD IS Word-Record.
       01  Word-Record.
           05 Input-Word               PIC X(20).

       FD  Output-File
           DATA RECORD IS Output-Rec.
       01  Output-Rec.
           05  Output-Rec-Word         PIC X(20).
           05  Output-Rec-Word-Cnt     PIC 9(5).

       FD  Print-File
           DATA RECORD IS Print-Rec.
       01  Print-Rec.
           05  Print-Rec-Word          PIC X(20).
           05  Print-Rec-Word-Cnt      PIC 9(5).
       
       SD  Sort-File.
       01  Sort-Rec.
           05  Sort-Word               PIC X(20).
           05  Sort-Word-Cnt           PIC 9(5).
       
       
       WORKING-STORAGE SECTION.
       
       01 Eof                    PIC X     VALUE 'F'.
       01 InLine                 PIC X(80).
       01 Word1                  PIC X(20).
       01 Current-Word           PIC X(20).
       01 Current-Word-Cnt       PIC 9(5).
       01 Pos                    PIC 99
                 VALUE 1.
       01 Cnt                    PIC 99.
       01 Report-Rank.
          05  IRank              PIC 99999
                 VALUE 1.
          05 Rank                PIC ZZZZ9.
       
       PROCEDURE DIVISION.
       
       Main-Program.
      **
      **  Read the Parameters
      **
         OPEN INPUT Parameter-File.
         READ Parameter-File.
         CLOSE Parameter-File.

      **
      **  Open Files for first stage
      **
         OPEN INPUT  Input-File.
         OPEN OUTPUT  Word-File.

      **
      **  Parse the Source Text into a file of individual words
      **
         PERFORM UNTIL Eof = 'T'
            READ Input-File 
               AT END MOVE 'T' TO Eof
            END-READ

         PERFORM Parse-a-Words

         MOVE SPACES TO Input-Record
         MOVE 1 TO Pos
         END-PERFORM.
     
      **
      **  Cleanup from the first stage
      **
         CLOSE Input-File Word-File

      **
      **  Sort the individual words in alphabetical order
      **
         SORT Sort-File
              ON ASCENDING KEY Sort-Word
              USING Word-File
              GIVING Word-File.

      **
      **  Count each time a word is used
      **
         PERFORM Collect-Totals.

      **
      **  Sort data by number of usages per word
      **
         SORT Sort-File
              ON DESCENDING KEY Sort-Word-Cnt
              USING Output-File
              GIVING Print-File.

      **
      **  Show the work done
      **
         OPEN INPUT Print-File.
            DISPLAY " Rank  Word               Frequency"
         PERFORM How-Many TIMES
            READ Print-File
            MOVE IRank TO Rank
            DISPLAY Rank "  " Print-Rec
            ADD 1 TO IRank
         END-PERFORM.

      **
      **  Cleanup
      **
         CLOSE Print-File.
         CALL "C$DELETE" USING "Word.txt" ,0
         CALL "C$DELETE" USING "Output.txt" ,0

         STOP RUN.
         

        Parse-a-Words.
          INSPECT Input-Record CONVERTING '-.,"();:/[]{}!?|' TO SPACE 
          PERFORM UNTIL Pos > FUNCTION STORED-CHAR-LENGTH(Input-Record) 


          UNSTRING Input-Record DELIMITED BY SPACE INTO Word1 
                    WITH POINTER Pos TALLYING IN Cnt 
          MOVE FUNCTION TRIM(FUNCTION LOWER-CASE(Word1)) TO Word-Record
          
          IF Word-Record NOT EQUAL SPACES AND Word-Record IS ALPHABETIC
             THEN WRITE Word-Record
          END-IF

          END-PERFORM.

       Collect-Totals.
          MOVE 'F' to Eof
          OPEN INPUT Word-File
          OPEN OUTPUT Output-File
             READ Word-File
             MOVE Input-Word TO Current-Word
             MOVE 1 to Current-Word-Cnt
          PERFORM UNTIL Eof = 'T'
             READ Word-File
                AT END MOVE 'T' TO Eof
             END-READ

             IF FUNCTION TRIM(Word-Record) 
                    EQUAL 
                           FUNCTION TRIM(Current-Word)
                THEN
                     ADD 1 to Current-Word-Cnt
                ELSE
                     MOVE Current-Word TO Output-Rec-Word
                     MOVE Current-Word-Cnt TO Output-Rec-Word-Cnt
                     WRITE Output-Rec
                     MOVE 1 to Current-Word-Cnt
                     MOVE Word-Record TO Current-Word
                     MOVE SPACES TO Input-Record
            END-IF 
           
          END-PERFORM.
          CLOSE Word-File Output-File.
       END-PROGRAM.

Output:

 Rank  Word               Frequency
    1  the                 40551
    2  of                  19806
    3  and                 14730
    4  a                   14351
    5  to                  13775
    6  in                  11074
    7  he                  09480
    8  was                 08613
    9  that                07632
   10  his                 06446
   11  it                  06335
   12  had                 06181
   13  is                  06097
   14  which               05135
   15  with                04469

Common Lisp

(defun count-word (n pathname)
  (with-open-file (s pathname :direction :input)
    (loop for line = (read-line s nil nil) while line
          nconc (list-symb (drop-noise line)) into words
          finally (return (subseq (sort (pair words)
                                        #'> :key #'cdr)
                                  0 n)))))

(defun list-symb (s)
  (let ((*read-eval* nil))
    (read-from-string (concatenate 'string "(" s ")"))))

(defun drop-noise (s)
  (delete-if-not #'(lambda (x) (or (alpha-char-p x)
                                   (equal x #\space)
                                   (equal x #\-))) s))

(defun pair (words &aux (hash (make-hash-table)) ac)
  (dolist (word words) (incf (gethash word hash 0)))
  (maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)

Output:

> (count-word 10 "c:/temp/135-0.txt")
((THE . 40738) (OF . 19922) (AND . 14878) (A . 14419) (TO . 13702) (IN . 11172)
 (HE . 9577) (WAS . 8612) (THAT . 7768) (IT . 6467))

Crystal

require "http/client"
require "regex"

# Get the text from the internet
response = HTTP::Client.get "https://www.gutenberg.org/files/135/135-0.txt"
text = response.body

text
  .downcase
  .scan(/[a-zA-ZáéíóúÁÉÍÓÚâêôäüöàèìòùñ']+/)
  .reduce({} of String => Int32) { |hash, match|
    word = match[0]
    hash[word] = hash.fetch(word, 0) + 1 # fetch returns the default value 0 for a newly found word
    hash
  }
  .to_a                                        # convert the returned hash to an array of tuples (String, Int32) -> {word, sum}
  .sort { |a, b| b[1] <=> a[1] }[0..9]         # sort and get the first 10 elements
  .each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result

Output:

1        the     41092
2        of      19954
3        and     14943
4        a       14556
5        to      13953
6        in      11219
7        he      9649
8        was     8622
9        that    7924
10       it      6661

D

import std.algorithm : sort;
import std.array : appender, split;
import std.range : take;
import std.stdio : File, writefln, writeln;
import std.typecons : Tuple;
import std.uni : toLower;

//Container for a word and how many times it has been seen
alias Pair = Tuple!(string, "k", int, "v");

void main() {
    int[string] wcnt;

    //Read the file line by line
    foreach (line; File("135-0.txt").byLine) {
        //Split the words on whitespace
        foreach (word; line.split) {
            //Increment the times the word has been seen
            wcnt[word.toLower.idup]++;
        }
    }

    //Associative arrays cannot be sorted, so put the key/value pairs in an array
    auto wb = appender!(Pair[]);
    foreach(k,v; wcnt) {
        wb.put(Pair(k,v));
    }
    Pair[] sw = wb.data.dup;

    //Sort the array, and display the top ten values
    writeln("Rank  Word        Frequency");
    int rank=1;
    foreach (word; sw.sort!"a.v>b.v".take(10)) {
        writefln("%4s  %-10s  %9s", rank++, word.k, word.v);
    }
}

Output:

Rank  Word        Frequency
   1  the             40368
   2  of              19863
   3  and             14470
   4  a               14277
   5  to              13587
   6  in              11019
   7  he               9212
   8  was              8346
   9  that             7251
  10  his              6414

Delphi

Library: System.SysUtils

Library: System.IOUtils

Library: System.Generics.Collections

Library: System.Generics.Defaults

Library: System.RegularExpressions

Translation of: C#

program Word_frequency;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.IOUtils,
  System.Generics.Collections,
  System.Generics.Defaults,
  System.RegularExpressions;

type
  TWords = TDictionary<string, Integer>;

  TFreqPair = TPair<string, Integer>;

  TFreq = TArray<TFreqPair>;

function CreateValueCompare: IComparer<TFreqPair>;
begin
  Result := TComparer<TFreqPair>.Construct(
    function(const Left, Right: TFreqPair): Integer
    begin
      Result := Right.Value - Left.Value;
    end);
end;

function WordFrequency(const Text: string): TFreq;
var
  words: TWords;
  match: TMatch;
  w: string;
begin
  words := TWords.Create();
  match := TRegEx.Match(Text, '\w+');
  while match.Success do
  begin
    w := match.Value;
    if words.ContainsKey(w) then
      words[w] := words[w] + 1
    else
      words.Add(w, 1);
    match := match.NextMatch();
  end;

  Result := words.ToArray;
  words.Free;
  TArray.Sort<TFreqPair>(Result, CreateValueCompare);
end;

var
  Text: string;
  rank: integer;
  Freq: TFreq;
  w: TFreqPair;

begin
  Text := TFile.ReadAllText('135-0.txt').ToLower();

  Freq := WordFrequency(Text);

  Writeln('Rank  Word  Frequency');
  Writeln('====  ====  =========');

  for rank := 1 to 10 do
  begin
    w := Freq[rank - 1];
    Writeln(format('%2d   %6s   %5d', [rank, w.Key, w.Value]));
  end;

  readln;
end.

Output:

Rank  Word  Frequency
====  ====  =========
 1      the   41040
 2       of   19951
 3      and   14942
 4        a   14539
 5       to   13941
 6       in   11209
 7       he    9646
 8      was    8620
 9     that    7922
10       it    6659

F#

open System.IO
open System.Text.RegularExpressions
let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt")
[for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n)

Output:

("the", 41088)
("of", 19949)
("and", 14942)
("a", 14596)
("to", 13951)
("in", 11214)
("he", 9648)
("was", 8621)
("that", 7924)
("it", 6661)

Factor

This program reads its text from stdin, so redirect a file into it on the command line (e.g. invoking the program in Windows: >factor word-count.factor < input.txt ). The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines.

USING: ascii io math.statistics prettyprint sequences
splitting ;
IN: rosetta-code.word-count

lines " " join " .,?!:;()\"-" split harvest [ >lower ] map
sorted-histogram <reversed> 10 head .

Output:

{
    { "the" 41021 }
    { "of" 19945 }
    { "and" 14938 }
    { "a" 14522 }
    { "to" 13938 }
    { "in" 11201 }
    { "he" 9600 }
    { "was" 8618 }
    { "that" 7822 }
    { "it" 6532 }
}
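
For comparison, the same word definition can be sketched in Python. This is only an illustration under stated assumptions: the function name top_words and the sample sentence are made up here, and the delimiter set is copied from the Factor source above (with newline added, since the Factor code joins lines with spaces).

```python
import re
from collections import Counter

# Same delimiter set as the Factor source: space . , ? ! : ; ( ) " -
# Newline is added because the Factor code joins lines with spaces first.
DELIMITERS = " .,?!:;()\"-\n"

def top_words(text, n=10):
    """Return the n most common delimiter-separated tokens, lowercased."""
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    # Dropping empty strings plays the role of Factor's `harvest`.
    tokens = [t.lower() for t in re.split(pattern, text) if t]
    return Counter(tokens).most_common(n)

if __name__ == "__main__":
    print(top_words("The cat, the dog; the-bird!", 1))
```

Note that an apostrophe is not in the delimiter set, so a token such as it's stays one word under this definition.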

FreeBASIC

 #Include "file.bi"
 type tally
      as string s
      as long l
end type
 
Sub quicksort(array() As String,begin As Long,Finish As Long)
 Dim As Long i=begin,j=finish
 Dim As String x =array(((I+J)\2))
 While I <= J
 While array(I) < X :I+=1:Wend
 While array(J) > X :J-=1:Wend
 If I<=J Then Swap array(I),array(J): I+=1:J-=1
 Wend
 If J >begin Then quicksort(array(),begin,J)
 If I <Finish Then quicksort(array(),I,Finish)
End Sub

Sub tallysort(array() As tally,begin As Long,Finish As long)
 Dim As Long i=begin,j=finish
 Dim As tally x =array(((I+J)\2))
 While I <= J
 While array(I).l > X .l:I+=1:Wend
 While array(J).l < X .l:J-=1:Wend
 If I<=J Then Swap array(I),array(J): I+=1:J-=1
 Wend
 If J >begin Then tallysort(array(),begin,J)
 If I <Finish Then tallysort(array(),I,Finish)
 End Sub


Function loadfile(file As String) As String
	If Fileexists(file)=0 Then Print file;" not found":Sleep:End
   Dim As Long  f=Freefile
    Open file For Binary Access Read As #f
    Dim As String text
    If Lof(f) > 0 Then
      text = String(Lof(f), 0)
      Get #f, , text
    End If
    Close #f
    Return text
End Function

Function String_Split(s_in As String,chars As String,result() As String) As Long
    Dim As Long ctr,ctr2,k,n,LC=Len(chars)
    Dim As boolean tally(Len(s_in))
    #macro check_instring()
    n=0
    While n<Lc
        If chars[n]=s_in[k] Then 
            tally(k)=true
            If (ctr2-1) Then ctr+=1
            ctr2=0
            Exit While
        End If
        n+=1
    Wend
    #endmacro
    
    #macro splice()
    If tally(k) Then
        If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)
        ctr2=0
    End If
    #endmacro
    '==================  LOOP TWICE =======================
    For k  =0 To Len(s_in)-1
        ctr2+=1:check_instring()
    Next k
     If ctr=0 Then
         If Len(s_in) Andalso Instr(chars,Chr(s_in[0])) Then ctr=1':
         End If
    If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else  Return 0
    For k  =0 To Len(s_in)-1
        ctr2+=1:splice()
    Next k
    '===================== Last one ========================
    If ctr2>0 Then
        Redim Preserve result(1 To ctr+1)
        result(ctr+1)=Mid(s_in,k+1-ctr2,ctr2)
    End If
   
    Return Ubound(result)
End Function

Redim As String s()
redim as tally t()
dim as string p1,p2,deliminators
dim as long count,jmp
dim as double tm=timer

Var L=loadfile("rosettalesmiserables.txt")
L=lcase(L)
'get deliminators
for n as long=1 to 96
      p1+=chr(n)
next
for n as long=123 to 255
    p2+=chr(n)
next

deliminators=p1+p2

string_split(L,deliminators,s())

quicksort(s(),lbound(s),ubound(s))

For n As Long=lbound(s)  To ubound(s)-1
      if s(n+1)=s(n) then jmp+=1
      if s(n+1)<>s(n) then 
            count+=1
            redim preserve t(1 to count)
            t(count).s=s(n)
            t(count).l=jmp
            jmp=0
            end if
Next

tallysort(t(),lbound(t),ubound(t))'sort by frequency
print "frequency","word"
print
for n as long=lbound(t) to lbound(t)+9
      print t(n).l,t(n).s
      next

Print
print "time for operation  ";timer-tm;" seconds"
sleep

Output:

I saved and reloaded the file as ascii text.
frequency     word

 41098        the
 19955        of
 14939        and
 14557        a
 13953        to
 11219        in
 9648         he
 8621         was
 7923         that
 6660         it

time for operation   1.099869600031525 seconds

Frink

This example shows some of the subtle and non-obvious power of Frink in processing text files in a language-aware and Unicode-aware fashion:

Frink has a Unicode-aware function, wordList[str], which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.) It returns words, spaces, and punctuation marks separately. For the purposes of this program, "words" that do not contain any alphanumeric characters (as decided by the Unicode standard) are filtered out. These are likely punctuation and spaces. There is also a two-argument function, wordList[str, lang] which allows you to specify a language code e.g. "fr" to use the rules of French (or many other human languages) to perform correct word-breaking according to the rules of that language!

The file fetched from Project Gutenberg is supposed to be encoded in UTF-8, but their servers either incorrectly declare it as Windows-1252 or send no character encoding at all, so this program specifies the encoding explicitly.
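
The effect of a wrong encoding label can be illustrated with a short Python sketch (Python is used only for illustration; the sample word is an assumption and nothing here is Frink code). Decoding UTF-8 bytes as Windows-1252 corrupts accented letters, which is why the bytes should be decoded explicitly as UTF-8.

```python
# UTF-8 bytes for a word containing an accented letter.
raw = "Mis\u00e9rables".encode("utf-8")

# Trusting a wrong Windows-1252 label turns the single letter into two characters.
wrong = raw.decode("cp1252")

# Decoding explicitly as UTF-8 recovers the original text.
right = raw.decode("utf-8")

print(len(wrong), len(right))  # 11 10
```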

Frink has a Unicode-aware lowercase function, lc[str] that correctly handles accented characters and may even make a string longer.
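
As a hedged illustration of lowercasing that lengthens a string, here is a small Python sketch (not Frink code). It relies on the Unicode case mapping for U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE, whose lowercase form is two code points.

```python
upper = "\u0130"       # LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = upper.lower()  # 'i' followed by COMBINING DOT ABOVE (U+0307)

print(len(upper), len(lower))  # 1 2
```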

Frink can normalize Unicode characters with its normalizeUnicode function, so the same word encoded two different ways in Unicode is treated consistently. A Unicode string can encode what is essentially the same character/glyph in more than one way. For example, the character ô can be represented as either "\u00F4" or "\u006F\u0302". The former is a "precomposed" character, "LATIN SMALL LETTER O WITH CIRCUMFLEX", and the latter is two Unicode codepoints, an o (LATIN SMALL LETTER O) followed by "COMBINING CIRCUMFLEX ACCENT". (This is usually referred to as a "decomposed" representation.) Unicode normalization rules convert these "equivalent" encodings into a canonical representation, so two strings that look the same to a human (but differ in their codepoints) are treated as the same by a computer, and these programs count them as the same word. Even if the Project Gutenberg document mixes precomposed and decomposed representations of the same words, this program counts them together. See the Unicode Normal Forms specification for more about these normalization rules. Frink implements all of them (NFC, NFD, NFKC, NFKD). NFC is the default in normalizeUnicode[str, encoding=NFC]. They're interesting!
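
The precomposed/decomposed distinction can be checked with a minimal Python sketch, using the standard unicodedata module rather than Frink's normalizeUnicode. The helper name nfc is an assumption made for this example.

```python
import unicodedata

precomposed = "\u00F4"       # LATIN SMALL LETTER O WITH CIRCUMFLEX
decomposed = "\u006F\u0302"  # 'o' followed by COMBINING CIRCUMFLEX ACCENT

def nfc(s):
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# The raw strings differ, but their NFC forms are identical.
print(precomposed == decomposed)            # False
print(nfc(precomposed) == nfc(decomposed))  # True
```

Without the normalization step, a word counter would tally the two spellings as different words.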

How many other languages in this page do all or any of this correctly?

There are two sample programs below. First, a simple but powerful method that works in old versions of Frink:

d = new dict
for w = select[wordList[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]], %r/[[:alnum:]]/ ]
   d.increment[lc[w], 1]

println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]

Output:

[the, 40802]
[of, 19933]
[and, 14924]
[a, 14450]
[to, 13719]
[in, 11184]
[he, 9636]
[was, 8617]
[that, 7901]
[it, 6641]

Next, a "showing off" one-liner for recent versions of Frink. It uses the countToArray function, which easily creates sorted frequency lists, and the formatTable function, which formats them into a table with aligned columns. It still performs full Unicode-aware normalization, lowercasing, and word-breaking:

formatTable[first[countToArray[select[wordList[lc[normalizeUnicode[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]]]], %r/[[:alnum:]]/ ]], 10], "right"]

Output:

 the 36629
  of 19602
 and 14063
   a 13447
  to 13345
  in 10259
 was  8541
that  7303
  he  6812
 had  6133

FutureBasic

Task said: "Feel free to explicitly state the thoughts behind the program decisions." Thus the heavy comments.

include "NSLog.incl"

local fn WordFrequency( textStr as CFStringRef, caseSensitive as Boolean, ascendingOrder as Boolean ) as CFStringRef
'~'1
CFStringRef     wrd
CFDictionaryRef dict

// Depending on the value of the caseSensitive Boolean function parameter above, lowercase incoming text
if caseSensitive == NO then textStr = fn StringLowercaseString( textStr )

// Trim non-alphabetic characters from string, and separate individual words with a space
CFStringRef tempStr = fn ArrayComponentsJoinedByString( fn StringComponentsSeparatedByCharactersInSet( textStr, fn CharacterSetInvertedSet( fn CharacterSetLetterSet ) ), @" " )

// Prepare separators to parse string into array
CFMutableCharacterSetRef separators = fn MutableCharacterSetInit

// Informally, this set is the set of all non-whitespace characters used to separate linguistic units in scripts, such as periods, dashes, parentheses, and so on.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetPunctuationSet )

// A character set containing all the whitespace and newline characters including characters in Unicode General Category Z*, U+000A U+000D, and U+0085.
MutableCharacterSetFormUnionWithCharacterSet( separators, fn CharacterSetWhitespaceAndNewlineSet )

// Create array of separated words
CFArrayRef tempArr = fn StringComponentsSeparatedByCharactersInSet( tempStr, separators )

// Create a counted set with each word and its frequency
CountedSetRef freqencies = fn CountedSetWithArray( tempArr )

// Enumerate each word-frequency pair in the counted set...
EnumeratorRef enumRef = fn CountedSetObjectEnumerator( freqencies )

// .. and use it to create array of words in counted set
CFArrayRef array = fn EnumeratorAllObjects( enumRef )

// Create an empty mutable array
CFMutableArrayRef wordArr = fn MutableArrayWithCapacity( 0 )

// Create word counter
NSInteger totalWords = 0
// Enumerate each unique word, get its frequency, create its own key/value pair dictionary, add each dictionary into master array
for wrd in array
totalWords++
// Create dictionary with frequency and matching word
dict = @{ @"count":fn NumberWithUnsignedInteger( fn CountedSetCountForObject( freqencies, wrd ) ), @"object":wrd }
// Add each dictionary to the master mutable array, checking for a valid word by length
if ( fn StringLength( wrd ) != 0 )
MutableArrayAddObject( wordArr, dict )
end if
next

// Store the total words as a global application property
AppSetProperty( @"totalWords", fn StringWithFormat( @"%d", totalWords - 1 ) )

// Sort the array in ascending or descending order as determined by the ascendingOrder Boolean function input parameter
SortDescriptorRef descriptors = fn SortDescriptorWithKey( @"count", ascendingOrder )
CFArrayRef sortedArray = fn ArraySortedArrayUsingDescriptors( wordArr, @[descriptors] )

// Create an empty mutable string
CFMutableStringRef mutStr = fn MutableStringWithCapacity( 0 )

// Use each dictionary in sorted array to build the formatted output string
NSInteger count = 1
for dict in sortedArray
MutableStringAppendString( mutStr, fn StringWithFormat( @"%-7d %-7lu %@\n", count, fn StringIntegerValue( fn DictionaryValueForKey( dict, @"count" ) ), fn DictionaryValueForKey( dict, @"object"  ) ) )
count++
next

// Create an immutable output string from the mutable string
CFStringRef resultStr = fn StringWithFormat( @"%@", mutStr )
end fn = resultStr


local fn ParseTextFromWebsite( webSite as CFStringRef )
// Convert incoming string to URL
CFURLRef textURL = fn URLWithString( webSite )
// Read contents of URL into a string
CFStringRef textStr = fn StringWithContentsOfURL( textURL, NSUTF8StringEncoding, NULL )

// Start timer
CFAbsoluteTime startTime = fn CFAbsoluteTimeGetCurrent
// Calculate frequency of words in text and sort by occurrence
CFStringRef frequencyStr = fn WordFrequency( textStr, NO, NO )
// Log results and post-processing time
NSLogClear
NSLog( @"%@", frequencyStr )
NSLog( @"Total unique words in document: %@", fn AppProperty( @"totalWords" ) )
// Stop timer and log elapsed processing time
NSLog( @"Elapsed time: %f milliseconds.", ( fn CFAbsoluteTimeGetCurrent - startTime ) * 1000.0 )
end fn

dispatchglobal
// Pass url for Les Misérables on Project Gutenberg and parse in background
fn ParseTextFromWebsite( @"https://www.gutenberg.org/files/135/135-0.txt" )
dispatchend

HandleEvents

Output:

1       41095   the
2       19955   of
3       14939   and
4       14546   a
5       13954   to
6       11218   in
7       9649    he
8       8622    was
9       7924    that
10      6661    it
11      6470    his
12      6193    is

//-------------------

22900   1       millstones
22901   1       fumbles
22902   1       shunned
22903   1       avoids
22904   1       poitevin
22905   1       muleteer
22906   1       idolizes
22907   1       lapsed
22908   1       reptitalmus
22909   1       bled
22910   1       isabella

Total unique words in document: 22910
Elapsed time: 595.407963 milliseconds.

Go

Translation of: Kotlin

package main

import (
    "fmt"
    "log"
    "os"
    "regexp"
    "sort"
    "strings"
)

type keyval struct {
    key string
    val int
}

func main() {
    reg := regexp.MustCompile(`\p{Ll}+`)
    // os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+)
    bs, err := os.ReadFile("135-0.txt")
    if err != nil {
        log.Fatal(err)
    }
    text := strings.ToLower(string(bs))
    matches := reg.FindAllString(text, -1)
    groups := make(map[string]int)
    for _, match := range matches {
        groups[match]++
    }
    var keyvals []keyval
    for k, v := range groups {
        keyvals = append(keyvals, keyval{k, v})
    }
    sort.Slice(keyvals, func(i, j int) bool {
        return keyvals[i].val > keyvals[j].val
    })
    fmt.Println("Rank  Word  Frequency")
    fmt.Println("====  ====  =========")
    for rank := 1; rank <= 10; rank++ {
        word := keyvals[rank-1].key
        freq := keyvals[rank-1].val
        fmt.Printf("%2d    %-4s    %5d\n", rank, word, freq)
    }
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

Groovy

Solution:

def topWordCounts = { String content, int n ->
    def mapCounts = [:]
    content.toLowerCase().split(/\W+/).each {
        mapCounts[it] = (mapCounts[it] ?: 0) + 1
    }
    def top = (mapCounts.sort { a, b -> b.value <=> a.value }.collect{ it })[0..<n]
    println "Rank Word Frequency\n==== ==== ========="
    (0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }
}

Test:

def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text
topWordCounts(rawText, 10)

Output:

Rank Word Frequency
==== ==== =========
   1 the      41036
   2 of       19946
   3 and      14940
   4 a        14589
   5 to       13939
   6 in       11204
   7 he        9645
   8 was       8619
   9 that      7922
  10 it        6659

Haskell

Lazy IO with pure Map, arrows

Translation of: Clojure

module Main where

import Control.Category   -- (>>>), (<<<)
import Control.Monad      (when)
import Data.Char          -- toLower, isSpace
import Data.List          -- sortBy, (Foldable(foldl')), filter -- '
import Data.Ord           -- Down
import System.IO          -- stdin, ReadMode, openFile, hClose
import System.Environment -- getArgs

-- containers
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import qualified Data.IntMap.Strict as IM

-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

frequencies :: Ord a => [a] -> Map a Integer
frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty -- '
{-# SPECIALIZE frequencies :: [Text] -> Map Text Integer #-}

main :: IO ()
main = do
  args <- getArgs
  (n,hand,filep) <- case length args of
    0 -> return (10,stdin,False)
    1 -> return (read $ head args,stdin,False)
    _ -> let (ns:fp:_) = args
         in fmap (\h -> (read ns,h,True)) (openFile fp ReadMode)
  T.hGetContents hand >>=
    (T.map toLower
      >>> T.split isSpace
      >>> filter (not <<< T.null)
      >>> frequencies
      >>> M.toList
      >>> sortBy (comparing (Down <<< snd)) -- sort the opposite way
      >>> take n
      >>> print)
  when filep (hClose hand)

Output:

$ ./word_count 10 < ~/doc/les_miserables*
[("the",40368),("of",19863),("and",14470),("a",14277),("to",13587),("in",11019),("he",9212),("was",8346),("that",7251),("his",6414)]

Lazy IO, map of IORefs

Using IORefs as values in the map seems to give a ~2x speedup on large files. The below code is based on https://github.com/composewell/streamly-examples/blob/master/examples/WordFrequency.hs , but still using lazy IO to avoid the extra library dependency (in production you should use a streaming library like streamly/conduit/io-streams):

module Main where

import Control.Monad      (foldM, when)
import Data.Char          (isSpace, toLower)
import Data.List          (sortOn, filter)
import Data.Ord           (Down(..))
import System.IO          (stdin, IOMode(..), openFile, hClose)
import System.Environment (getArgs)
import Data.IORef         (IORef(..), newIORef, readIORef, modifyIORef') -- '

-- containers
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as M

-- text
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

frequencies :: [Text] -> IO (HashMap Text (IORef Int))
frequencies = foldM (flip (M.alterF alter)) M.empty
 where
  alter Nothing    = Just <$> newIORef (1 :: Int)
  alter (Just ref) = modifyIORef' ref (+ 1) >> return (Just ref) -- '

main :: IO ()
main = do
  args <- getArgs
  when (length args /= 1) (error "expecting 1 arg (number of words to print)")
  let maxw = read $ head args -- no error handling, to simplify the example
  T.hGetContents stdin >>= \contents -> do
    freqtable <- frequencies $ filter (not . T.null) $ T.split isSpace $ T.map toLower contents
    counts <-
        let readRef (w, ref) = do
                cnt <- readIORef ref
                return (w, cnt)
         in mapM readRef $ M.toList freqtable
    print $ take maxw $ sortOn (Down . snd) counts

Output:

$ ./word_count 10 < ~/doc/les_miserables*
[("the",40378),("of",19869),("and",14468),("a",14278),("to",13590),("in",11025),("he",9213),("was",8347),("that",7249),("his",6414)]

Lazy IO, short code, but not streaming

Or, perhaps a little more simply, though not streaming (will read everything into memory, don't use on big files):

import qualified Data.Text.IO as T
import qualified Data.Text as T

import Data.List (group, sort, sortBy)
import Data.Ord (comparing)

frequentWords :: T.Text -> [(Int, T.Text)]
frequentWords =
  sortBy (flip $ comparing fst) .
  fmap ((,) . length <*> head) . group . sort . T.words . T.toLower

main :: IO ()
main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)

Output:

(40370,"the")
(19863,"of")
(14470,"and")
(14277,"a")
(13587,"to")
(11019,"in")
(9212,"he")
(8346,"was")
(7251,"that")
(6414,"his")
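
For readers unfamiliar with Haskell, the sort-then-group idea used above can be sketched in Python. This is an illustrative analogue rather than a translation: frequent_words is a hypothetical name, words are split on whitespace only (much as T.words does), and the entire word list is held in memory, as in the Haskell version.

```python
from itertools import groupby

def frequent_words(text):
    """Return (count, word) pairs, most frequent first."""
    # Lowercase, split on whitespace, and sort so equal words become adjacent.
    words = sorted(text.lower().split())
    # groupby collects runs of equal adjacent items, so sorting first is essential.
    pairs = [(len(list(run)), word) for word, run in groupby(words)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

if __name__ == "__main__":
    for pair in frequent_words("a b a c b a")[:2]:
        print(pair)
```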

J

Text acquisition: store the entire text from the web page http://www.gutenberg.org/files/135/135-0.txt (the plain text UTF-8 link) into a file. This Linux example uses ~/downloads/books/LesMis.txt .

Program: reading from left to right,
10 {. "ten take" from the array computed by the words to its right.
\:~ "sort descending" the items of the array computed by whatever is to its right.
(#;{.)/.~ "tally linked with item", applied by key.
;: "words" parses the argument to its right as a J sentence.
tolower changes the text to a common case.

Hence the remainder of the J sentence must clean the text after loading the file.

(a.-.Alpha_j_,' ') computes a vector of the J alphabet excluding [a-zA-Z ].
((e.&(a.-.Alpha_j_,' '))`(,:&' '))} substitutes the space character for the unwanted characters.
1!:1 reads the file named in the box <.

   10{.\:~(#;{.)/.~;:tolower((e.&(a.-.Alpha_j_,' '))`(,:&' '))}1!:1<jpath'~/downloads/books/LesMis.txt'
┌─────┬────┐
│41093│the │
├─────┼────┤
│19954│of  │
├─────┼────┤
│14943│and │
├─────┼────┤
│14558│a   │
├─────┼────┤
│13953│to  │
├─────┼────┤
│11219│in  │
├─────┼────┤
│9649 │he  │
├─────┼────┤
│8622 │was │
├─────┼────┤
│7924 │that│
├─────┼────┤
│6661 │it  │
└─────┴────┘
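
Read from right to left, the J sentence is a pipeline: read the file, blank out everything except ASCII letters and spaces, lowercase, split into words, tally, sort descending, and take ten. A rough Python sketch of the same pipeline follows. The function name top_words is an assumption made for illustration, and the sketch takes a string instead of reading a file.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return (count, word) pairs for the n most common words."""
    # Replace every character that is not an ASCII letter with a space.
    cleaned = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, split on runs of whitespace, and tally.
    tally = Counter(cleaned.lower().split())
    # most_common sorts by descending count and keeps the first n entries.
    return [(count, word) for word, count in tally.most_common(n)]

if __name__ == "__main__":
    print(top_words("The end. THE END; the beginning?", 2))
```

Under this rule an apostrophe is a separator, so a word such as don't is counted as don and t.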

Java

This is relatively simple in Java.
I used a URL class to download the content, a BufferedReader class to examine the text line by line, a Pattern and Matcher to identify words, and a Map to hold the counts.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

void printWordFrequency() throws URISyntaxException, IOException {
    URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("(\\w+)");
        Matcher matcher;
        String line;
        String word;
        Map<String, Integer> map = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while (matcher.find()) {
                word = matcher.group().toLowerCase();
                if (map.containsKey(word)) {
                    map.put(word, map.get(word) + 1);
                } else {
                    map.put(word, 1);
                }
            }
        }
        /* print out top 10 */
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        list.sort(Map.Entry.comparingByValue());
        Collections.reverse(list);
        int count = 1;
        for (Map.Entry<String, Integer> value : list) {
            System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
            if (count++ == 10) break;
        }
    }
}

Output:

the                  41,043
of                   19,952
and                  14,938
a                    14,539
to                   13,942
in                   11,208
he                    9,646
was                   8,620
that                  7,922
it                    6,659
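
The approach used in the first Java example (scan line by line, match runs of word characters, count them in a map) is language-independent. A minimal Python sketch, assuming the lines have already been obtained from some source: count_words is a hypothetical helper, and re.ASCII is used so that \w matches only [A-Za-z0-9_], as Java's \w does by default.

```python
import re
from collections import Counter

# With re.ASCII, \w matches only [A-Za-z0-9_].
WORD = re.compile(r"\w+", re.ASCII)

def count_words(lines):
    """Count lowercased \\w+ matches across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD.findall(line))
    return counts

if __name__ == "__main__":
    sample = ["The quick fox", "jumps over the lazy dog; the end."]
    for word, count in count_words(sample).most_common(3):
        print(f"{word:<20}{count:>7,}")
```

Because \w also matches digits and underscores, tokens such as x_1 or 1862 count as words under this definition.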

An alternate demonstration

Translation of: Kotlin

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("135-0.txt");
        byte[] bytes = Files.readAllBytes(path);
        String text = new String(bytes);
        text = text.toLowerCase();

        Pattern r = Pattern.compile("\\p{javaLowerCase}+");
        Matcher matcher = r.matcher(text);
        Map<String, Integer> freq = new HashMap<>();
        while (matcher.find()) {
            String word = matcher.group();
            Integer current = freq.getOrDefault(word, 0);
            freq.put(word, current + 1);
        }

        List<Map.Entry<String, Integer>> entries = freq.entrySet()
            .stream()
            .sorted((i1, i2) -> Integer.compare(i2.getValue(), i1.getValue()))
            .limit(10)
            .collect(Collectors.toList());

        System.out.println("Rank  Word  Frequency");
        System.out.println("====  ====  =========");
        int rank = 1;
        for (Map.Entry<String, Integer> entry : entries) {
            String word = entry.getKey();
            Integer count = entry.getValue();
            System.out.printf("%2d    %-4s    %5d\n", rank++, word, count);
        }
    }
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

jq

The following solution uses the concept of a "bag of words" (bow), here realized as a JSON object with the words as keys and the frequency of a word as the corresponding value.

To avoid issues with case folding, the "letters" here are just the ASCII alphabet and hyphen, but a "word" may not begin with a hyphen. Thus "the-the" would count as one word, and "-the" would be excluded.

< 135-0.txt jq -nR --argjson n 10 '
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

bow(inputs | gsub("[^-a-zA-Z]"; " ") | splits("  *") | ascii_downcase | select(test("^[a-z][-a-z]*$")))
| to_entries
| sort_by(.value)
| .[- $n :]
| reverse
| from_entries
'

Output:

{
  "the": 41087,
  "of": 19937,
  "and": 14932,
  "a": 14552,
  "to": 13738,
  "in": 11209,
  "he": 9649,
  "was": 8621,
  "that": 7923,
  "it": 6661
}
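
The "bag of words" idea and the hyphen rule can be sketched in Python as follows. This is an illustrative analogue of the jq filter, not a translation: bag_of_words is a name chosen here, and a Counter stands in for the JSON object.

```python
import re
from collections import Counter

# A word starts with a letter and continues with letters or hyphens.
TOKEN = re.compile(r"[a-z][-a-z]*")

def bag_of_words(lines):
    """Map each accepted word to its frequency."""
    bow = Counter()
    for line in lines:
        # Anything other than an ASCII letter or hyphen becomes a separator.
        for token in re.sub(r"[^-a-zA-Z]", " ", line).lower().split():
            # Tokens that begin with a hyphen, such as "-the", are dropped whole.
            if TOKEN.fullmatch(token):
                bow[token] += 1
    return bow

if __name__ == "__main__":
    print(bag_of_words(["the-the -the The"]))
```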

Julia

Works with: Julia version 1.0

using FreqTables

txt = read("les-mis.txt", String)
words = split(replace(txt, r"\P{L}"i => " "))
table = sort(freqtable(words); rev=true)
println(table[1:10])

Output:

Dim1   │
───────┼──────
"the"  │ 36671
"of"   │ 19618
"and"  │ 14081
"to"   │ 13541
"a"    │ 13529
"in"   │ 10265
"was"  │  8545
"that" │  7326
"he"   │  6816
"had"  │  6140

K

Works with: ngn/k

common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
 ("of";19898)
 ("and";14658)
 (,"a";14517)
 ("to";13695)
 ("in";11134)
 ("he";9405)
 ("was";8361)
 ("that";7592)
 ("his";6446))

(The relatively easy to read output format here is arguably less useful than the table produced by common but it would have been more concise to have common generate it directly.)

KAP

The program below defines the function 'stats', which accepts the name of a file containing the text.

∇ stats (file) {
	content ← "[\\h,.\"'\n-]+" regex:split unicode:toLower io:readFile file
	sorted ← (⍋⊇⊢) content
	selection ← 1,2≢/sorted
	words ← selection / sorted
	{⍵[10↑⍒⍵[;1];]} words ,[0.5] ≢¨ sorted ⊂⍨ +\selection
}

Output:

┏━━━━━━━━━━━━┓
┃ "the" 40387┃
┃  "of" 19913┃
┃ "and" 14742┃
┃   "a" 14289┃
┃  "to" 13819┃
┃  "in" 11088┃
┃  "he"  9430┃
┃ "was"  8597┃
┃"that"  7516┃
┃ "his"  6435┃
┗━━━━━━━━━━━━┛

Kotlin

The author of the Raku entry has given a good account of the difficulties with this task and, in the absence of any clarification on the various issues, I've followed a similar 'literal' approach.

So, after first converting the text to lower case, I've assumed that a word is any sequence of one or more lower-case Unicode letters and obtained the same results as the Raku version.

There is no change in the results if the numerals 0-9 are also regarded as letters.

// version 1.1.3

import java.io.File

fun main(args: Array<String>) {
    val text = File("135-0.txt").readText().toLowerCase()
    val r = Regex("""\p{javaLowerCase}+""")
    val matches = r.findAll(text)
    val wordGroups = matches.map { it.value }
                    .groupBy { it }
                    .map { Pair(it.key, it.value.size) }
                    .sortedByDescending { it.second }
                    .take(10)
    println("Rank  Word  Frequency")
    println("====  ====  =========")
    var rank = 1
    for ((word, freq) in wordGroups) 
        System.out.printf("%2d    %-4s    %5d\n", rank++, word, freq)   
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661
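
The word definition used here (lowercase the text, then take maximal runs of lowercase Unicode letters) can be approximated in Python. This is a sketch under assumptions: Python's re module has no \p{...} classes, so the Unicode general category "Ll" is tested per character, which may differ slightly from Java's notion of a lowercase character. The names is_lower_letter and top_words are made up for this example.

```python
import unicodedata
from collections import Counter
from itertools import groupby

def is_lower_letter(ch):
    # "Ll" is the Unicode general category for lowercase letters.
    return unicodedata.category(ch) == "Ll"

def top_words(text, n=10):
    """Return the n most common runs of lowercase letters in the lowercased text."""
    lowered = text.lower()
    # groupby yields alternating runs of letters and non-letters; keep the letters.
    words = ("".join(run)
             for keep, run in groupby(lowered, key=is_lower_letter) if keep)
    return Counter(words).most_common(n)

if __name__ == "__main__":
    print(top_words("\u00c9lan \u00e9lan ELAN x2y", 1))
```

Digits act as separators in this sketch, so x2y is counted as the two words x and y.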

Liberty BASIC

dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count
dim lines$(150000)
open "135-0.txt" for input as #txt
while EOF(#txt)=0 and total < 150000
    input #txt, lines$(total)
    total=total+1
wend
for a = 1 to total
    token$ = "?"
    index=0
    new=0
    while token$ <> ""
        new=0
        index = index + 1
        token$ = lower$(word$(lines$(a),index))
        token$=replstr$(token$,".","")
        token$=replstr$(token$,",","")
        token$=replstr$(token$,";","")
        token$=replstr$(token$,"!","")
        token$=replstr$(token$,"?","")
        token$=replstr$(token$,"-","")
        token$=replstr$(token$,"_","")
        token$=replstr$(token$,"~","")
        token$=replstr$(token$,"+","")
        token$=replstr$(token$,"0","")
        token$=replstr$(token$,"1","")
        token$=replstr$(token$,"2","")
        token$=replstr$(token$,"3","")
        token$=replstr$(token$,"4","")
        token$=replstr$(token$,"5","")
        token$=replstr$(token$,"6","")
        token$=replstr$(token$,"7","")
        token$=replstr$(token$,"8","")
        token$=replstr$(token$,"9","")
        token$=replstr$(token$,"/","")
        token$=replstr$(token$,"<","")
        token$=replstr$(token$,">","")
        token$=replstr$(token$,":","")
        for b = 1 to newwordcount
            if words$(b,1)=token$ then
                num=val(words$(b,2))+1
                num$=str$(num)
                if len(num$)=1 then num$="0000"+num$
                if len(num$)=2 then num$="000"+num$
                if len(num$)=3 then num$="00"+num$
                if len(num$)=4 then num$="0"+num$
                words$(b,2)=num$
                new=1
                exit for
            end if
        next b
        if new<>1 then newwordcount=newwordcount+1:words$(newwordcount,1)=token$:words$(newwordcount,2)="00001":print newwordcount;" ";token$
    wend
next a
print
sort words$(), 1, newwordcount, 2
print "Count Word"
print "===== ================="
for a = newwordcount to newwordcount-10 step -1
    print words$(a,2);" ";words$(a,1)
next a
print "-----------------------"
print newwordcount;" unique words found."
print "End of program"
close #txt
end

Output:

Count Word
===== =================
40292 the
19825 of
14703 and
14249 a
13594 to
122613
11061 in
09436 he
08579 was
07530 that
06428 his
-----------------------
29109 unique words found.

Lua

Works with: lua version 5.3

-- This program takes two optional command line arguments.  The first (arg[1])
-- specifies the input file, or defaults to standard input.  The second
-- (arg[2]) specifies the number of results to show, or defaults to 10.

-- in freq, each key is a word and each value is its count
local freq = {}
for line in io.lines(arg[1]) do
	-- %a stands for any letter
	for word in string.gmatch(string.lower(line), "%a+") do
		if not freq[word] then
			freq[word] = 1
		else
			freq[word] = freq[word] + 1
		end
	end
end

-- in array, each entry is an array whose first value is the count and whose
-- second value is the word
local array = {}
for word, count in pairs(freq) do
	table.insert(array, {count, word})
end
table.sort(array, function (a, b) return a[1] > b[1] end)

for i = 1, arg[2] or 10 do
	io.write(string.format('%7d %s\n', array[i][1] , array[i][2]))
end

Output:

❯ ./wordcount.lua 135-0.txt
  41093 the
  19954 of
  14943 and
  14558 a
  13953 to
  11219 in
   9649 he
   8622 was
   7924 that
   6661 it

Relevant documentation: io.lines gmatch patterns like %a

Mathematica / Wolfram Language

TakeLargest[10]@WordCounts[Import["https://www.gutenberg.org/files/135/135-0.txt"], IgnoreCase->True]//Dataset

Output:

the   41088
of    19936
and   14931
a     14536
to    13738
in    11208
he    9607
was   8621
that  7825
it    6535

MATLAB / Octave

function [result,count] = word_frequency()
URL='https://www.gutenberg.org/files/135/135-0.txt';
text=webread(URL);
DELIMITER={' ', ',', ';', ':', '.', '/', '*', '!', '?', '<', '>', '(', ')', '[', ']','{', '}', '&','$','§','"','”','“','-','—','‘','\t','\n','\r'};
words  = sort(strsplit(lower(text),DELIMITER));
flag   = [find(~strcmp(words(1:end-1),words(2:end))),length(words)]; 
dwords = words(flag);   % get distinct words, and ...
count  = diff([0,flag]);  % ... the corresponding occurrence frequency
[tmp,idx] = sort(-count);       % sort according to occurrence
result = dwords(idx);
count  = count(idx);
for k  =  1:10,
        fprintf(1,'%d\t%s\n',count(k),result{k})
end

Output:

41039   the
19950   of
14942   and
14523   a
13941   to
11208   in
9605    he
8620    was
7824    that
6533    it

Nim

import tables, strutils, sequtils, httpclient

proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]

var client = newHttpClient()
var text = client.getContent("https://www.gutenberg.org/files/135/135-0.txt")

var wordFrequencies = text.toLowerAscii.splitWhitespace.toCountTable
wordFrequencies.sort
for (word, count) in toSeq(wordFrequencies.pairs).take(10):
  echo alignLeft($count, 8), word

Output:

40377   the
19870   of
14469   and
14278   a
13590   to
11025   in
9213    he
8347    was
7249    that
6414    his

Objeck

use System.IO.File;
use Collection;
use RegEx;

class Rosetta {
  function : Main(args : String[]) ~ Nil {
    if(args->Size() <> 1) {
      return;
    };
    
    input := FileReader->ReadFile(args[0]);
    filter := RegEx->New("\\w+");
    words := filter->Find(input);
    
    word_counts := StringMap->New();
    each(i : words) {
      word := words->Get(i)->As(String);
      if(word <> Nil & word->Size() > 0) {
        word := word->ToLower();
        if(word_counts->Has(word)) {
          count := word_counts->Find(word)->As(IntHolder);
          count->Set(count->Get() + 1);
        }
        else {
          word_counts->Insert(word, IntHolder->New(1));
        };
      };
    };  
    
    count_words := IntMap->New();
    words := word_counts->GetKeys();
    each(i : words) {
      word := words->Get(i)->As(String);
      count := word_counts->Find(word)->As(IntHolder);
      count_words->Insert(count->Get(), word);
    };
    
    counts := count_words->GetKeys();
    counts->Sort();
    
    index := 1;
    "Rank\tWord\tFrequency"->PrintLine();
    "====\t====\t===="->PrintLine();
    for(i := count_words->Size() - 1; i >= 0; i -= 1;) {
      if(count_words->Size() - 10 <= i) {
        count := counts->Get(i);
        word := count_words->Find(count)->As(String);
        "{$index}\t{$word}\t{$count}"->PrintLine();
        index += 1;
      };
    };
  }
}

Output:

Rank    Word    Frequency
====    ====    ====
1       the     41036
2       of      19946
3       and     14940
4       a       14589
5       to      13939
6       in      11204
7       he      9645
8       was     8619
9       that    7922
10      it      6659

OCaml

let () =
  let n =
    try int_of_string Sys.argv.(1)
    with _ -> 10
  in
  let ic = open_in "135-0.txt" in
  let h = Hashtbl.create 97 in
  let w = Str.regexp "[^A-Za-zéèàêâôîûœ]+" in
  try
    while true do
      let line = input_line ic in
      let words = Str.split w line in
      List.iter (fun word ->
        let word = String.lowercase_ascii word in
        match Hashtbl.find_opt h word with
        | None -> Hashtbl.add h word 1
        | Some x -> Hashtbl.replace h word (succ x)
      ) words
    done
  with End_of_file ->
    close_in ic;
    let l = Hashtbl.fold (fun word count acc -> (word, count)::acc) h [] in
    let s = List.sort (fun (_, c1) (_, c2) -> compare c2 c1) l in
    let r = List.init n (fun i -> List.nth s i) in
    List.iter (fun (word, count) ->
      Printf.printf "%d  %s\n" count word
    ) r

Output:

$ ocaml str.cma word_freq.ml 
41092  the
19954  of
14943  and
14554  a
13953  to
11219  in
9649  he
8622  was
7924  that
6661  it

Perl

Translation of: Raku

use strict;
use warnings;
use utf8;

my $top = 10;

open my $fh, '<', 'ref/word-count.txt';
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;

my @matcher = (
    qr/[a-z]+/,     # simple 7-bit ASCII
    qr/\w+/,        # word characters with underscore
    qr/[a-z0-9]+/,  # word characters without underscore
);

for my $reg (@matcher) {
    print "\nTop $top using regex: " . $reg . "\n";
    my @matches = $text =~ /$reg/g;
    my %words;
    for my $w (@matches) { $words{$w}++ };
    my $c = 0;
    for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
        printf "%-7s %6d\n", $w, $words{$w};
        last if ++$c >= $top;
    }
}

Output:

Top 10 using regex: (?^:[a-z]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661

Top 10 using regex: (?^:\w+)
the      41036
of       19946
and      14940
a        14589
to       13939
in       11204
he        9645
was       8619
that      7922
it        6659

Top 10 using regex: (?^:[a-z0-9]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661

Phix

without javascript_semantics
?"loading..."
constant subs = '\t'&"\r\n_.,\"\'!;:?][()|=<>#/*{}+@%&$",
         reps = repeat(' ',length(subs)),
         fn = open("135-0.txt","r")
string text = lower(substitute_all(get_text(fn),subs,reps))
close(fn)
sequence words = append(sort(split(text,no_empty:=true)),"")
constant wf = new_dict()
string last = words[1]
integer count = 1
for i=2 to length(words) do
    if words[i]!=last then
        setd({count,last},0,wf)
        count = 0
        last = words[i]
    end if
    count += 1
end for
count = 10
function visitor(object key, object /*data*/, object /*user_data*/)
    ?key
    count -= 1
    return count>0
end function
traverse_dict(routine_id("visitor"),0,wf,true)

Output:

loading...
{40743,"the"}
{19925,"of"}
{14881,"and"}
{14474,"a"}
{13704,"to"}
{11174,"in"}
{9623,"he"}
{8613,"was"}
{7867,"that"}
{6612,"it"}

Phixmonti

include ..\Utilitys.pmt

"loading..." ?
"135-0.txt" "r" fopen var fn
" "
true
while
    fn fgets number? if drop fn fclose false else lower " " chain chain true endif
endwhile

"process..." ?
len for
    var i
    i get dup 96 > swap 123 < and not if 32 i set endif
endfor
split sort

"count..." ?
( ) var words
"" var prev
1 var n
len for
    var i
    i get dup prev ==
    if
        drop n 1 + var n
    else
        words ( n prev ) 0 put var words var prev 1 var n
    endif
endfor
drop
words sort
10 for
    -1 * get ?
endfor
drop

Output:

loading...
process...
count...
[41093, "the"]
[19954, "of"]
[14943, "and"]
[14558, "a"]
[13953, "to"]
[11219, "in"]
[9649, "he"]
[8622, "was"]
[7924, "that"]
[6661, "it"]

=== Press any key to exit ===

PHP

<?php

preg_match_all('/\w+/', file_get_contents($argv[1]), $words);
$frequency = array_count_values($words[0]);
arsort($frequency);

echo "Rank\tWord\tFrequency\n====\t====\t=========\n";
$i = 1;
foreach ($frequency as $word => $count) {
    echo $i . "\t" . $word . "\t" . $count . "\n";
    if ($i >= 10) {
        break;
    }
    $i++;
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the   36636
 2     of   19615
 3    and   14079
 4     to   13535
 5      a   13527
 6     in   10256
 7    was    8543
 8   that    7324
 9     he    6814
10    had    6139

Picat

To get the book proper, the header and footer are removed. Here are some tests with different sets of characters to split the words (split_chars/2).

main =>
  NTop = 10,
  File = "les_miserables.txt",
  Chars = read_file_chars(File),

  % Remove the Project Gutenberg header/footer
  find(Chars,"*** START OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",_,HeaderEnd),  
  find(Chars,"*** END OF THE PROJECT GUTENBERG EBOOK LES MISÉRABLES ***",FooterStart,_),

  Book = [to_lowercase(C) : C in slice(Chars,HeaderEnd+1,FooterStart-1)],

  % Split into words (different set of split characters)
  member(SplitType,[all,space_punct,space]),
  println(split_type=SplitType),
  split_chars(SplitType,SplitChars),
  Words = split(Book,SplitChars),

  println(freq(Words).to_list.sort_down(2).take(NTop)),
  nl,
  fail.

freq(L) = Freq =>
  Freq = new_map(),
  foreach(E in L)
    Freq.put(E,Freq.get(E,0)+1)
  end.

% different set of split chars
split_chars(all,"\n\r \t,;!.?()[]”\"-“—-__‘’*").
split_chars(space_punct,"\n\r \t,;!.?").
split_chars(space,"\n\r \t").

Output:

split_type = all
[the = 40907,of = 19830,and = 14872,a = 14487,to = 13872,in = 11157,he = 9645,was = 8618,that = 7908,it = 6626]

split_type = space_punct
[the = 40193,of = 19779,and = 14668,a = 14227,to = 13538,in = 11033,he = 9455,was = 8604,that = 7576,” = 6578]

split_type = space
[the = 40193,of = 19747,and = 14402,a = 14222,to = 13512,in = 10964,he = 9211,was = 8345,that = 7235,his = 6414]

The result is slightly different if the header/footer are not removed:

split_type = all
[the = 41094,of = 19952,and = 14939,a = 14545,to = 13954,in = 11218,he = 9647,was = 8620,that = 7922,it = 6641]

split_type = space_punct
[the = 40378,of = 19901,and = 14734,a = 14284,to = 13620,in = 11094,he = 9457,was = 8606,that = 7590,” = 6578]

split_type = space
[the = 40378,of = 19869,and = 14468,a = 14278,to = 13590,in = 11025,he = 9213,was = 8347,that = 7249,his = 6414]
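
The two ideas in the Picat entry (dropping the Project Gutenberg header/footer, then splitting on different character sets) can be sketched in Python. This is only an illustration: the marker prefixes are taken from the markers in the Picat code, and the names book_body and top_words are hypothetical helpers, not part of any entry.

```python
import re
from collections import Counter

# Assumption: the book text is delimited by lines starting with these markers.
START_MARK = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARK = "*** END OF THE PROJECT GUTENBERG EBOOK"


def book_body(text):
    """Return the text between the marker lines, or all of it if a marker is missing."""
    start = text.find(START_MARK)
    end = text.find(END_MARK)
    if start == -1 or end == -1 or end <= start:
        return text
    line_end = text.find("\n", start)  # end of the start-marker line
    if line_end == -1 or line_end > end:
        return text
    return text[line_end + 1:end]


def top_words(text, split_chars, n=10):
    """Lowercase, split on any character in split_chars, return the n most common tokens."""
    pattern = "[" + re.escape(split_chars) + "]+"
    tokens = [t for t in re.split(pattern, text.lower()) if t]
    return Counter(tokens).most_common(n)
```

Calling top_words(book_body(text), "\n\r \t,;!.?") and top_words(book_body(text), "\n\r \t") on the same text shows how much the choice of split characters moves the counts, which is the effect visible in the outputs above.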

PicoLisp

(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^\(\):;")
(setq *Skip (chop *Delim))

(de word+ NIL
   (prog1
      (lowc (till *Delim T))
      (while (member (peek) *Skip) (char)) ) )

(off B)
(in "135-0.txt"
   (until (eof)
      (let W (word+)
         (if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )
(for L (head 10 (flip (by val sort (idx 'B))))
   (println L (val L)) )

Output:

"the" 41088
"of" 19949
"and" 14942
"a" 14545
"to" 13950
"in" 11214
"he" 9647
"was" 8620
"that" 7924
"it" 6661

Prolog

Works with: SWI Prolog

print_top_words(File, N):-
    read_file_to_string(File, String, [encoding(utf8)]),
    re_split("\\w+", String, Words),
    lower_case(Words, Lower),
    sort(1, @=<, Lower, Sorted),
    merge_words(Sorted, Counted),
    sort(2, @>, Counted, Top_words),
    writef("Top %w words:\nRank\tCount\tWord\n", [N]),
    print_top_words(Top_words, N, 1).

lower_case([_], []):-!.
lower_case([_, Word|Words], [Lower - 1|Rest]):-
    string_lower(Word, Lower),
    lower_case(Words, Rest).

merge_words([], []):-!.
merge_words([Word - C1, Word - C2|Words], Result):-
    !,
    C is C1 + C2,
    merge_words([Word - C|Words], Result).
merge_words([W|Words], [W|Rest]):-
    merge_words(Words, Rest).

print_top_words([], _, _):-!.
print_top_words(_, 0, _):-!.
print_top_words([Word - Count|Rest], N, R):-
    writef("%w\t%w\t%w\n", [R, Count, Word]),
    N1 is N - 1,
    R1 is R + 1,
    print_top_words(Rest, N1, R1).

main:-
    print_top_words("135-0.txt", 10).

Output:

Top 10 words:
Rank	Count	Word
1	41040	the
2	19951	of
3	14942	and
4	14539	a
5	13941	to
6	11209	in
7	9646	he
8	8620	was
9	7922	that
10	6659	it

PureBasic

EnableExplicit

Structure wordcount
  wkey$
  count.i
EndStructure

Define token.c, word$, idx.i, start.i, arg$
NewMap wordmap.i()
NewList wordlist.wordcount()

If OpenConsole("")  
  arg$ = ProgramParameter(0)
  If arg$ = "" : End 1 : EndIf  
  start = ElapsedMilliseconds()
  If ReadFile(0, arg$, #PB_Ascii)
    While Not Eof(0)
      token = ReadCharacter(0, #PB_Ascii)
      Select token
        Case 'A' To 'Z', 'a' To 'z'
          word$ + LCase(Chr(token))
        Default
          If word$
            wordmap(word$) + 1
            word$ = ""
          EndIf
      EndSelect    
    Wend
    CloseFile(0)
    ForEach wordmap()
      AddElement(wordlist())
      wordlist()\wkey$ = MapKey(wordmap())
      wordlist()\count = wordmap()
    Next
    SortStructuredList(wordlist(), #PB_Sort_Descending, OffsetOf(wordcount\count), TypeOf(wordcount\count))
    PrintN("Elapsed milliseconds: " + Str(ElapsedMilliseconds() - start))
    PrintN("File: " + GetFilePart(arg$))
    PrintN(~"Rank\tCount\t\t  Word")
    If FirstElement(wordlist())
      For idx = 1 To 10
        Print(RSet(Str(idx), 2))
        Print(~"\t")
        Print(wordlist()\wkey$)
        Print(~"\t\t")
        PrintN(RSet(Str(wordlist()\count), 6))         
        If NextElement(wordlist()) = 0
          Break
        EndIf
      Next
    EndIf  
  EndIf
  Input()
EndIf

End

Output:

Elapsed milliseconds: 462
File: 135-0.txt
Rank	Count		  Word
 1	the		 41093
 2	of		 19954
 3	and		 14943
 4	a		 14558
 5	to		 13953
 6	in		 11219
 7	he		  9649
 8	was		  8622
 9	that		  7924
10	it		  6661

Python

Collections

Python2.7

import collections
import re
import string
import sys

def main():
  counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
  print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":
  main()

Output:

$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939),
 ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

Python3.6

from collections import Counter
from re import findall

les_mis_file = 'les_mis_135-0.txt'

def _count_words(fname):
    with open(fname) as f:
        text = f.read()
    words = findall(r'\w+', text.lower())
    return Counter(words)

def most_common_words_in_file(fname, n):
    counts = _count_words(fname)
    for word, count in [['WORD', 'COUNT']] + counts.most_common(n):
        print(f'{word:>10} {count:>6}')


if __name__ == "__main__":
    n = int(input('How many?: '))
    most_common_words_in_file(les_mis_file, n)

Output:

How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659

Sorted and groupby

Works with: Python version 3.7

"""
Word count task from Rosetta Code
http://www.rosettacode.org/wiki/Word_count#Python
"""
from itertools import (groupby,
                       starmap)
from operator import itemgetter
from pathlib import Path
from typing import (Iterable,
                    List,
                    Tuple)


FILEPATH = Path('lesMiserables.txt')
COUNT = 10


def main():
    words_and_counts = most_frequent_words(FILEPATH)
    print(*words_and_counts[:COUNT], sep='\n')


def most_frequent_words(filepath: Path,
                        *,
                        encoding: str = 'utf-8') -> List[Tuple[str, int]]:
    """
    A list of word-frequency pairs sorted by their occurrences.
    The words are read from the given file.
    """
    def word_and_frequency(word: str,
                           words_group: Iterable[str]) -> Tuple[str, int]:
        return word, capacity(words_group)

    file_contents = filepath.read_text(encoding=encoding)
    words = file_contents.lower().split()
    grouped_words = groupby(sorted(words))
    words_and_frequencies = starmap(word_and_frequency, grouped_words)
    return sorted(words_and_frequencies, key=itemgetter(1), reverse=True)


def capacity(iterable: Iterable) -> int:
    """Returns a number of elements in an iterable"""
    return sum(1 for _ in iterable)


if __name__ == '__main__':
    main()

Output:

('the', 40372)
('of', 19868)
('and', 14472)
('a', 14278)
('to', 13589)
('in', 11024)
('he', 9213)
('was', 8347)
('that', 7250)
('his', 6414)

Collections, Sorted and Lambda

#!/usr/bin/python3
import collections
import re

count = 10

with open("135-0.txt") as f:
    text = f.read()

word_freq = sorted(
    collections.Counter(sorted(re.split(r"\W+", text.lower()))).items(),
    key=lambda c: c[1],
    reverse=True,
)

for i in range(len(word_freq)):
    print("[{:2d}] {:>10} : {}".format(i + 1, word_freq[i][0], word_freq[i][1]))
    if i == count - 1:
        break

Output:

[ 1]        the : 41039
[ 2]         of : 19951
[ 3]        and : 14942
[ 4]          a : 14527
[ 5]         to : 13941
[ 6]         in : 11209
[ 7]         he : 9646
[ 8]        was : 8620
[ 9]       that : 7922
[10]         it : 6659

R

Version 1

I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens.

wordcount<-function(file,n){
  punctuation=c("`","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
  wordlist=scan(file,what=character())
  wordlist=tolower(wordlist)
  for(i in 1:length(punctuation)){
    wordlist=gsub(punctuation[i],"",wordlist,fixed=T)
  }
  df=data.frame("Word"=sort(unique(wordlist)),"Count"=rep(0,length(unique(wordlist))))
  for(i in 1:length(unique(wordlist))){
    df[i,2]=length(which(wordlist==df[i,1]))
  }
  df=df[order(df[,2],decreasing = T),]
  row.names(df)=1:nrow(df)
  return(df[1:n,])
}

Output:

> wordcount("MobyDick.txt",10)
Read 212793 items
   Word Count
1   the 14346
2    of  6590
3   and  6340
4     a  4611
5    to  4572
6    in  4130
7  that  2903
8   his  2516
9    it  2308
10    i  1845
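
The apostrophe rule described for Version 1 (strip "'s" but keep other contractions and hyphens) can be sketched in Python as follows. The punctuation set and the helper names normalize and count_words are assumptions made for illustration; unlike the R code, this sketch also drops tokens that become empty.

```python
from collections import Counter

# Assumption: punctuation to delete outright; hyphens and apostrophes are kept.
PUNCT = "`~!@#$%^&*()_+={[}]|\\:;\"<,>.?/"
_DELETE_PUNCT = str.maketrans("", "", PUNCT)


def normalize(token):
    """Lowercase, delete punctuation, then delete every "'s" (like gsub with fixed=TRUE)."""
    token = token.lower().translate(_DELETE_PUNCT)
    return token.replace("'s", "")


def count_words(text, n=10):
    """Split on whitespace, normalize each token, and return the n most common words."""
    tokens = (normalize(t) for t in text.split())
    return Counter(t for t in tokens if t).most_common(n)
```

With this rule "mom" and "mom's" fall into the same bucket, while "they" and "they're" stay distinct.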

Version 2

This version is purely functional using the native pipe operator in R 4.1+ and runs in less than a second.

word_frequency_pipeline <- function(file=NULL, n=10) {
	
  file |> 
    vroom::vroom_lines() |> 
    stringi::stri_split_boundaries(type="word", skip_word_none=T, skip_word_number=T) |>
    unlist() |>
    tolower() |> 
    table() |> 
    sort(decreasing = T) |>
    (\(.) .[1:n])() |> 
    data.frame()
	
}

Output:

> word_frequency_pipeline("~/../Downloads/135-0.txt")
   Var1  Freq
1   the 41042
2    of 19952
3   and 14938
4     a 14526
5    to 13942
6    in 11208
7    he  9605
8   was  8620
9  that  7824
10   it  6533

Racket

#lang racket

(define (all-words f (case-fold string-downcase))
  (map case-fold (regexp-match* #px"\\w+" (file->string f))))

(define (l.|l| l) (cons (car l) (length l)))

(define (counts l (>? >)) (sort (map l.|l| (group-by values l)) >? #:key cdr))

(module+ main
  (take (counts (all-words "data/les-mis.txt")) 10))

Output:

'(("the" . 41036)
  ("of" . 19946)
  ("and" . 14940)
  ("a" . 14589)
  ("to" . 13939)
  ("in" . 11204)
  ("he" . 9645)
  ("was" . 8619)
  ("that" . 7922)
  ("it" . 6659))

Raku

(formerly Perl 6)

Works with: Rakudo version 2022.07

Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.

This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is full of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.

We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various diacritics. Those are letters, so words will be incorrectly split up; (Misérables will be counted as 'mis' and 'rables', probably not what we want.)

Actually, in this case /A-Za-z/ returns very nearly the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the". The text has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça" so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /A-Za-z/, but it is mostly by accident.

A more accurate regex matcher would be some kind of Unicode aware /\w/ minus underscore. It may also be useful, depending on your requirements, to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.
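
The difference between the matchers discussed above can be sketched in Python, whose re module is also Unicode-aware for str patterns. The MATCHERS table and the tokens helper are hypothetical names used only for this illustration.

```python
import re

MATCHERS = {
    "ascii": re.compile(r"[a-z]+"),           # simple 7-bit ASCII letters
    "word": re.compile(r"\w+"),               # word characters, underscore included
    "no_underscore": re.compile(r"[^\W_]+"),  # word characters minus underscore (digits remain)
}


def tokens(kind, text):
    """Return the lowercase tokens that the named matcher finds in text."""
    return MATCHERS[kind].findall(text.lower())
```

For the string "_The_ Misérables of Alèthe", the "ascii" matcher splits the accented words into fragments (including a spurious "the" from "Alèthe"), the "word" matcher keeps the underscores attached as "_the_", and the "no_underscore" matcher yields the four intended words.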

Here is a sample that shows the result when using various different matchers.

sub MAIN ($filename, UInt $top = 10) {
    my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
    my @matcher = 
        rx/ <[a..z]>+ /,    # simple 7-bit ASCII
        rx/ \w+ /,          # word characters with underscore
        rx/ <[\w]-[_]>+ /,  # word characters without underscore
        rx/ [<[\w]-[_]>+]+ % < ' - '- > /  # word characters without underscore but with hyphens and contractions
    ;
    for @matcher -> $reg {
        say "\nTop $top using regex: ", $reg.raku;
        my @words = $file.comb($reg).Bag.sort(-*.value)[^$top];
        my $length = max @words».key».chars;
        printf "%-{$length}s %d\n", .key, .value for @words;
    }
}

Output:

Passing in the file name and 10:

Top 10 using regex: rx/ <[a..z]>+ /
the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ \w+ /
the	41035
of	19946
and	14940
a	14577
to	13939
in	11204
he	9645
was	8619
that	7922
it	6659

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088
of	19949
and	14942
a	14596
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081
of	19930
and	14934
a	14587
to	13735
in	11204
he	9607
was	8620
that	7825
it	6535

It can be difficult to figure out which words the different regexes do or don't match. Here are the three more complex regexes, each with a list of "words" that it treats differently from /a..z/; i.e., each listed word contributes to one of the top 10 counts under /a..z/ but not under this regex.

Top 10 using regex: rx/ \w+ /
the	41035	alèthe _the _the_
of	19946	of_ _of_
and	14940	_and_ paternoster_and
a	14577	_ça aïe ça keksekça aérostiers _a poréa panathenæa
to	13939	to_ _to
in	11204	_in
he	9645	_he
was	8619	_was
that	7922	_that
it	6659	_it

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088	alèthe
of	19949	
and	14942	
a	14596	poréa ça aérostiers panathenæa aïe keksekça
to	13951	
in	11214	
he	9648	
was	8621	
that	7924	
it	6661	

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081	will-o'-the-wisps alèthe skip-the-gutter police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change jean-the-screw will-o'-the-wisp
of	19930	chromate-of-lead-colored die-of-hunger die-of-cold-if-you-have-bread police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change unheard-of die-of-hunger-if-you-have-a-fire
and	14934	come-and-see so-and-so cock-and-bull hide-and-seek sambre-and-meuse
a	14587	keksekça l'a ça now-a-days vis-a-vis a-dreaming police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change poréa panathenæa aérostiers a-hunting aïe die-of-hunger-if-you-have-a-fire
to	13735	to-morrow to-day hand-to-hand to-night well-to-do face-to-face
in	11204	in-pace son-in-law father-in-law whippers-in general-in-chief sons-in-law
he	9607	he's he'll
was	8620	police-agent-ja-vert-was-found-drowned-un-der-a-boat-of-the-pont-au-change
that	7825	that's pick-me-down-that
it	6535	it's it'll

One nice thing is this isn't special cased. It will work out of the box for any text / language.

Russian? No problem.

$ raku wf 14741-0.txt 5

Top 5 using regex: rx/ <[a..z]>+ /
the	176
of	119
gutenberg	93
project	87
to	80

Top 5 using regex: rx/ \w+ /
и	860
в	579
не	290
на	222
ты	195

Top 5 using regex: rx/ <[\w]-[_]>+ /
и	860
в	579
не	290
на	222
ты	195

Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
и	860
в	579
не	290
на	222
ты	195

Greek? Sure, why not.

$ raku wf 39963-0.txt 5

Top 5 using regex: rx/ <[a..z]>+ /
the	187
of	123
gutenberg	93
project	87
to	82

Top 5 using regex: rx/ \w+ /
και	1628
εις	986
δε	982
του	895
των	859

Top 5 using regex: rx/ <[\w]-[_]>+ /
και	1628
εις	986
δε	982
του	895
των	859

Top 5 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
και	1628
εις	986
δε	982
του	895
των	859

Of course, for the first matcher, we are asking specifically to match Latin ASCII, so we end up with... well... Latin ASCII; but the other 3 match any Unicode characters.
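
The same contrast can be reproduced on a small scale in Python. This is a rough sketch with a made-up sample string; the helper name top is hypothetical.

```python
import re
from collections import Counter


def top(text, pattern, n=5):
    """Return the n most common lowercase matches of pattern in text."""
    return Counter(re.findall(pattern, text.lower())).most_common(n)


# Made-up sample: English boilerplate mixed with a few Russian words.
sample = "Project Gutenberg: и в не и в и the"

ascii_top = top(sample, r"[a-z]+")       # sees only the Latin boilerplate
unicode_top = top(sample, r"[^\W_]+")    # sees words in any script
```

Here ascii_top contains only "project", "gutenberg" and "the", while unicode_top is led by "и" with a count of 3.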

REXX

version 1

This REXX version doesn't need to sort the list of words.

Extra code was added to handle some foreign letters (non-Latin) and also handle most accented letters.

This version recognizes all the accented letters that are present in the required/specified text (file) (and some other non-Latin letters as well).

This means that the word Alèthe is treated as one word, not as two words Al the (and not thereby adding two separate words).

This version also supports words that contain embedded apostrophes ( ' )
[that is, within a word, but not those words that start or end with an apostrophe; for those encapsulated words, the apostrophe is elided].

Thus, it's is counted separately from it and/or its.

Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file.
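
The "no sort" idea mentioned above amounts to repeatedly scanning for the current maximum count and emitting every word tied at that count under the same rank. A minimal Python sketch of that selection, assuming the counts are already in a dictionary (ranked_without_sorting is a hypothetical name):

```python
def ranked_without_sorting(counts, top):
    """Return (rank, word, count) tuples by repeated maximum scans instead of a sort."""
    remaining = dict(counts)   # work on a copy so the caller's counts survive
    results = []
    rank = 1
    while remaining and rank <= top:
        best = max(remaining.values())
        tied = [w for w, c in remaining.items() if c == best]
        for word in tied:
            results.append((rank, word, best))
            del remaining[word]          # plays the role of nullifying the count
        rank += len(tied)                # tied words share a rank; the next rank skips ahead
    return results
```

Each round costs one pass over the remaining words, so this is attractive when only a handful of top entries are wanted and a full sort of all unique words would be wasted effort.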

/*REXX pgm displays top 10 words in a file (includes foreign letters),  case is ignored.*/
parse arg fID top .                              /*obtain optional arguments from the CL*/
if fID=='' | fID==","  then fID= 'les_mes.txt'   /*None specified? Then use the default.*/
if top=='' | top==","  then top= 10              /*  "      "        "   "   "     "    */
call init                                        /*initialize varied bunch of variables.*/
call rdr
say right('word', 40)  " "  center(' rank ', 6)  "  count "   /*display title for output*/
say right('════', 40)  " "  center('══════', 6)  " ═══════"   /*   "    title separator.*/

     do  until otops==tops | tops>top            /*process enough words to satisfy  TOP.*/
     WL=;         mk= 0;    otops= tops          /*initialize the word list (to a NULL).*/

          do n=1  for c;    z= !.n;      k= @.z  /*process the list of words in the file*/
          if k==mk  then WL= WL z                /*handle cases of tied number of words.*/
          if k> mk  then do;  mk=k;  WL=z;  end  /*this word count is the current max.  */
          end   /*n*/

     wr= max( length(' rank '), length(top) )    /*find the maximum length of the rank #*/

          do d=1  for words(WL);  y= word(WL, d) /*process all words in the  word list. */
          if d==1  then w= max(10, length(@.y) ) /*use length of the first number used. */
          say right(y, 40)         right( commas(tops), wr)          right(commas(@.y), w)
          @.y= .                                 /*nullify word count for next go 'round*/
          end   /*d*/                            /* [↑]  this allows a non-sorted list. */

     tops= tops + words(WL)                      /*correctly handle any  tied  rankings.*/
     end        /*until*/
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: parse arg ?;  do jc=length(?)-3  to 1  by -3; ?=insert(',', ?, jc); end;  return ?
16bit:  do k=1 for xs; _=word(x,k); $=changestr('├'left(_,1),$,right(_,1)); end;  return
/*──────────────────────────────────────────────────────────────────────────────────────*/
init:   x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü';     xs= words(x)
        abcL="abcdefghijklmnopqrstuvwxyz'"       /*lowercase letters of Latin alphabet. */
        abcU= abcL;            upper abcU        /*uppercase version of Latin alphabet. */
        accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ'       /*some lowercase accented characters.  */
        accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ'       /*  "  uppercase    "         "        */
        accG= 'αßΓπΣσµτΦΘΩδφε'                   /*  "  upper/lowercase Greek letters.  */
        ll= abcL || abcL ||accL ||accL || accG               /*chars of  after letters. */
        uu= abcL || abcU ||accL ||accU || accG || xrange()   /*  "    " before    "     */
        @.= 0;    q= "'";    totW= 0;    !.= @.;    c= 0;    tops= 1;          return
/*──────────────────────────────────────────────────────────────────────────────────────*/
rdr:   do #=0  while lines(fID)\==0; $=linein(fID) /*loop whilst there're lines in file.*/
       if pos('├', $) \== 0  then call 16bit       /*are there any  16-bit  characters ?*/
       $= translate( $, ll, uu)                    /*trans. uppercase letters to lower. */
          do while $ \= '';    parse var  $  z  $  /*process each word in the  $  line. */
          parse var  z     z1  2  zr  ''  -1  zL   /*obtain: first, middle, & last char.*/
          if z1==q  then do; z=zr; if z==''  then iterate; end /*starts with apostrophe?*/
          if zL==q  then z= strip(left(z, length(z) - 1))      /*ends     "       "    ?*/
          if z==''  then iterate                               /*if Z is now null, skip.*/
          if @.z==0  then do;  c=c+1; !.c=z;  end  /*bump word cnt; assign word to array*/
          totW= totW + 1;      @.z= @.z + 1        /*bump total words; bump a word count*/
          end   /*while*/
       end      /*#*/
    say commas(totW)     ' words found  ('commas(c)    "unique)  in "    commas(#),
                         ' records read from file: '     fID;        say;          return

output when using the default inputs:

574,122  words found  (23,414 unique)  in  67,663  records read from file:  les_mes.txt

                                    word    rank    count
                                    ════   ══════  ═══════
                                     the      1     41,088
                                      of      2     19,949
                                     and      3     14,942
                                       a      4     14,595
                                      to      5     13,950
                                      in      6     11,214
                                      he      7      9,607
                                     was      8      8,620
                                    that      9      7,826
                                      it     10      6,535

To see a list of the top 1,000 words that show (among other things) words like it's and other accented words, see the discussion page for this task.

version 2

Inspired by version 1 and adapted for ooRexx. It ignores all characters other than a-z and A-Z (which are translated to a-z).
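
The translate step described here (fold A-Z to a-z and turn every other byte into a blank) has a close analogue in Python's bytes.translate, which also takes a 256-entry table. The following is a sketch of that one step, not a port of the REXX program; count_words is a hypothetical helper that expects lines as bytes.

```python
import string
from collections import Counter

# Build a 256-byte table: letters map to lowercase, everything else to a space.
_table = bytearray(b" " * 256)
for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
    _table[ord(lower)] = ord(lower)
    _table[ord(upper)] = ord(lower)
TABLE = bytes(_table)


def count_words(lines):
    """Count a-z words in an iterable of bytes lines, ignoring case and all other bytes."""
    counts = Counter()
    for line in lines:
        words = line.translate(TABLE).split()
        counts.update(w.decode("ascii") for w in words)
    return counts
```

Because non-ASCII bytes become blanks, an accented word such as "café" in UTF-8 is cut down to "caf", which is the trade-off this version accepts.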

/*REXX program  reads  and  displays  a count of words in a file.  Word case is ignored.*/
Call time 'R'
abc='abcdefghijklmnopqrstuvwxyz'
abcABC=abc||translate(abc)
parse arg fID_top                                /*obtain optional arguments from the CL*/
Parse Var fid_top fid ',' top
if fID=='' then fID= 'mis.TXT'                   /* Use default if not specified        */
if top=='' then top= 10                          /* Use default if not specified        */
occ.=0                                           /* occurrences of word (stem) in file  */
wn=0
Do While lines(fid)>0                            /*loop whilst there are lines in file. */
  line=linein(fID)
  line=translate(line,abc||abc,abcABC||xrange('00'x,'ff'x)) /*use only lowercase letters*/
  Do While line<>''
    Parse Var line word line                       /* take a word                         */
    If occ.word=0 Then Do                          /* not yet in word list                */
      wn=wn+1
      word.wn=word
      End
    occ.word=occ.word+1
    End
  End
Say 'We found' wn 'different words'
say right('word',40) ' rank   count '            /* header                              */
say right('----',40) '------ -------'            /* separator.                          */
tops=0
Do Until tops>=top | tops>=wn                    /*process enough words to satisfy  TOP.*/
  max_occ=0
  tl=''                                          /*initialize (possibly) a list of words*/
  Do wi=1 To wn                                  /*process the list of words in the file*/
    word=word.wi                                 /* take a word from the list           */
    Select
      When occ.word>max_occ Then Do              /* most occurrences so far             */
        tl=word                                  /* candidate for output                */
        max_occ=occ.word                         /* current maximum occurrences         */
        End
      When occ.word=max_occ Then Do              /* tied                                */
        tl=tl word                               /* add to output candidate             */
        End
      Otherwise                                  /* no candidate (yet)                  */
        Nop
      End
    End
    do d=1 for words(tl)
      word=word(tl,d)
      say right(word,40) right(tops+1,4) right(occ.word,8)
      occ.word=0                                /*nullify this word count for next time*/
      End
    tops=tops+words(tl)                         /*correctly handle the tied rankings.  */
  end
Say time('E') 'seconds elapsed'

Output:

We found 22820 different words
                                    word  rank   count
                                    ---- ------ -------
                                     the    1    41089
                                      of    2    19949
                                     and    3    14942
                                       a    4    14608
                                      to    5    13951
                                      in    6    11214
                                      he    7     9648
                                     was    8     8621
                                    that    9     7924
                                      it   10     6661
1.750000 seconds elapsed
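
The two ideas behind version 2 are the translate step (letters are folded to lowercase, every other character becomes a blank) and the output loop that gives tied words the same rank. The following Python sketch illustrates both under stated assumptions: the names letters_only and ranked_top are hypothetical and not part of the REXX program, and any character outside A-Z and a-z, including non-ASCII ones, is treated as a separator.

```python
from collections import Counter


def letters_only(line):
    """Fold A-Z to a-z and turn every other character into a blank."""
    out = []
    for ch in line:
        if "a" <= ch <= "z":
            out.append(ch)
        elif "A" <= ch <= "Z":
            out.append(ch.lower())
        else:
            out.append(" ")
    return "".join(out)


def ranked_top(lines, top=10):
    """Return (word, rank, count) rows; words with equal counts share a rank."""
    occ = Counter()
    for line in lines:
        occ.update(letters_only(line).split())
    ordered = occ.most_common()          # highest counts first
    rows = []
    i = 0
    while i < len(ordered) and i < top:
        count = ordered[i][1]
        j = i
        # emit the whole group of tied words with the same rank (i + 1)
        while j < len(ordered) and ordered[j][1] == count:
            rows.append((ordered[j][0], i + 1, count))
            j += 1
        i = j
    return rows


if __name__ == "__main__":
    for word, rank, count in ranked_top(["The cat and the dog.", "A cat; a DOG!"], 5):
        print(f"{word:>40} {rank:4d} {count:8d}")
```

As in the REXX loop, a group of tied words is always printed in full, so the result can contain more than top rows.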

Ring

# project : Word count

fp = fopen("Miserables.txt","r")
str = fread(fp, getFileSize(fp))
fclose(fp) 

mis =substr(str, " ", nl)
mis = lower(mis)
mis = str2list(mis)
count = list(len(mis))
ready = []
for n = 1 to len(mis)
     flag = 0
     for m = 1 to len(mis)
           if mis[n] = mis[m] and n != m
              for p = 1 to len(ready)
                    if m = ready[p]
                       flag = 1
                    ok
              next
              if flag = 0
                 count[n] = count[n] + 1                 
              ok
           ok
     next
     if flag = 0
        add(ready, n)
     ok
next
for n = 1 to len(count)
     for m = n + 1 to len(count)
          if count[m] > count[n]
             temp = count[n]
             count[n] = count[m]
             count[m] = temp
             temp = mis[n]
             mis[n] = mis[m]
             mis[m] = temp
          ok
     next
next
for n = 1 to 10
     see mis[n] + " " + (count[n] + 1) + nl
next

func getFileSize fp
        c_filestart = 0
        c_fileend = 2
        fseek(fp,0,c_fileend)
        nfilesize = ftell(fp)
        fseek(fp,0,c_filestart)
        return nfilesize

func swap(a, b)
        temp = a
        a = b
        b = temp
        return [a, b]

Output:

the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Ruby

class String
  def wc
    n = Hash.new(0)
    downcase.scan(/[A-Za-zÀ-ÿ]+/) { |g| n[g] += 1 }
    n.sort { |a, b| a[1] <=> b[1] }
  end
end

open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} }

Output:

it->6661
that->7924
was->8621
he->9648
in->11214
to->13951
a->14596
and->14942
of->19949
the->41088

Tally and max_by

Works with: Ruby version 2.7

RE = /[[:alpha:]]+/
count =  open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last)
count.each{|ar| puts ar.join("->") }

Output:

the->41092
of->19954
and->14943
a->14546
to->13953
in->11219
he->9649
was->8622
that->7924
it->6661

Chain of Enumerables

wf = File.read("135-0.txt", :encoding => "UTF-8")
  .downcase
  .scan(/\w+/)
  .each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
  .sort_by { |k, v| v }
  .reverse
  .take(10)
  .each_with_index { |w, i|
  printf "[%2d] %10s : %d\n",
         i += 1,
         w[0],
         w[1]
}

Output:

[ 1]        the : 41040
[ 2]         of : 19951
[ 3]        and : 14942
[ 4]          a : 14539
[ 5]         to : 13941
[ 6]         in : 11209
[ 7]         he : 9646
[ 8]        was : 8620
[ 9]       that : 7922
[10]         it : 6659

Rust

use std::cmp::Reverse;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

extern crate regex;
use regex::Regex;

fn word_count(file: File, n: usize) {
    let word_regex = Regex::new("(?i)[a-z']+").unwrap();

    let mut words = HashMap::new();
    for line in BufReader::new(file).lines() {
        word_regex
            .find_iter(&line.expect("Read error"))
            .map(|m| m.as_str())
            .for_each(|word| {
                *words.entry(word.to_lowercase()).or_insert(0) += 1;
            });
    }

    let mut words: Vec<_> = words.iter().collect();
    words.sort_unstable_by_key(|&(word, count)| (Reverse(count), word));

    for (word, count) in words.iter().take(n) {
        println!("{:8} {:>8}", word, count);
    }
}

fn main() {
    word_count(File::open("135-0.txt").expect("File open error"), 10)
}

Output:

the         41083
of          19948
and         14941
a           14604
to          13951
in          11212
he           9604
was          8621
that         7824
it           6534

Scala

Featuring online remote file as input

Best seen running in your browser with Scastie (remote JVM).

import scala.io.Source

object WordCount extends App {

  val url = "http://www.gutenberg.org/files/135/135-0.txt"
  val header = "Rank Word  Frequency\n==== ======== ======"

  def wordCnt =
    Source.fromURL(url).getLines()
      .filter(_.nonEmpty)
      .flatMap(_.split("""\W+""")).toSeq
      .groupBy(_.toLowerCase())
      .mapValues(_.size).toSeq
      .sortWith { case ((_, v0), (_, v1)) => v0 > v1 }
      .take(10).zipWithIndex

  println(header)
  wordCnt.foreach {
    case ((word, count), rank) => println(f"${rank + 1}%4d $word%-8s $count%6d")
  }

  println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")

}

Output:

Rank Word  Frequency
==== ======== ======
   1 the       41036
   2 of        19946
   3 and       14940
   4 a         14589
   5 to        13939
   6 in        11204
   7 he         9645
   8 was        8619
   9 that       7922
  10 it         6659

Successfully completed without errors. [total 4528 ms]

Seed7

The Seed7 program uses the function getHttp to get the file 135-0.txt directly from Project Gutenberg. The library scanfile.s7i provides getSimpleSymbol, which reads words from a file. The words are converted to lower case, so that "The" and "the" are counted as the same word.

$ include "seed7_05.s7i";
  include "gethttp.s7i";
  include "strifile.s7i";
  include "scanfile.s7i";
  include "chartype.s7i";
  include "console.s7i";

const type: wordHash is hash [string] integer;
const type: countHash is hash [integer] array string;

const proc: main is func
  local
    var file: inFile is STD_NULL;
    var string: aWord is "";
    var wordHash: numberOfWords is wordHash.EMPTY_HASH;
    var countHash: countWords is countHash.EMPTY_HASH;
    var array integer: countKeys is 0 times 0;
    var integer: index is 0;
    var integer: number is 0;
  begin
    OUT := STD_CONSOLE;
    inFile := openStrifile(getHttp("www.gutenberg.org/files/135/135-0.txt"));
    while hasNext(inFile) do
      aWord := lower(getSimpleSymbol(inFile));
      if aWord <> "" and aWord[1] in letter_char then
        if aWord in numberOfWords then
          incr(numberOfWords[aWord]);
        else
          numberOfWords @:= [aWord] 1;
        end if;
      end if;
    end while;
    countWords := flip(numberOfWords);
    countKeys := sort(keys(countWords));
    writeln("Word    Frequency");
    for index range length(countKeys) downto length(countKeys) - 9 do
      number := countKeys[index];
      for aWord range sort(countWords[number]) do
        writeln(aWord rpad 8 <& number);
      end for;
    end for;
  end func;

Output:

Word    Frequency
the     41036
of      19946
and     14940
a       14589
to      13939
in      11204
he      9645
was     8619
that    7922
it      6659
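
The Seed7 program ranks words by flipping the word-to-count hash into a count-to-words hash and then walking the largest counts. A minimal Python sketch of that inversion follows; the function names flip and top_by_count are chosen for illustration and the sample counts are made up.

```python
from collections import defaultdict


def flip(word_counts):
    """Invert {word: count} into {count: [words with that count]}."""
    by_count = defaultdict(list)
    for word, count in word_counts.items():
        by_count[count].append(word)
    return by_count


def top_by_count(word_counts, n=10):
    """List (word, count) for the n largest distinct counts, words sorted."""
    by_count = flip(word_counts)
    rows = []
    for count in sorted(by_count, reverse=True)[:n]:
        for word in sorted(by_count[count]):
            rows.append((word, count))
    return rows


if __name__ == "__main__":
    sample = {"the": 3, "of": 2, "and": 2, "a": 1}   # illustrative data only
    for word, count in top_by_count(sample, 2):
        print(f"{word:<8}{count}")
```

Because the selection is made over distinct counts, words that tie on a count are all listed, so more than n words may be printed.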

Sidef

var count = Hash()
var file = File(ARGV[0] \\ '135-0.txt')

file.open_r.each { |line|
    line.lc.scan(/[\pL]+/).each { |word|
        count{word} := 0 ++
    }
}

var top = count.sort_by {|_,v| v }.last(10).flip

top.each { |pair|
    say "#{pair.key}\t-> #{pair.value}"
}

Output:

the	-> 41088
of	-> 19949
and	-> 14942
a	-> 14596
to	-> 13951
in	-> 11214
he	-> 9648
was	-> 8621
that	-> 7924
it	-> 6661

Simula

COMMENT COMPILE WITH
$ cim -m64 word-count.sim
;
BEGIN

    COMMENT ----- CLASSES FOR GENERAL USE ;

    ! ABSTRACT HASH KEY TYPE ;
    CLASS HASHKEY;
    VIRTUAL:
        PROCEDURE HASH IS
            INTEGER PROCEDURE HASH;;
        PROCEDURE EQUALTO IS
            BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;;
    BEGIN
    END HASHKEY;

    ! ABSTRACT HASH VALUE TYPE ;
    CLASS HASHVAL;
    BEGIN
        ! THERE IS NOTHING REQUIRED FOR THE VALUE TYPE ;
    END HASHVAL;

    CLASS HASHMAP;
    BEGIN
        CLASS INNERHASHMAP(N); INTEGER N;
        BEGIN

            INTEGER PROCEDURE INDEX(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.INDEX: NONE IS NOT A VALID KEY");
                I := MOD(K.HASH,N);
            LOOP:
                IF KEYTABLE(I) == NONE OR ELSE KEYTABLE(I).EQUALTO(K) THEN
                    INDEX := I
                ELSE BEGIN
                    I := IF I+1 = N THEN 0 ELSE I+1;
                    GO TO LOOP;
                END;
            END INDEX;

            ! PUT SOMETHING IN ;
            PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
            BEGIN
                INTEGER I;
                IF V == NONE THEN
                    ERROR("HASHMAP.PUT: NONE IS NOT A VALID VALUE");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN BEGIN
                    IF SIZE = N THEN
                        ERROR("HASHMAP.PUT: TABLE FILLED COMPLETELY");
                    KEYTABLE(I) :- K;
                    VALTABLE(I) :- V;
                    SIZE := SIZE+1;
                END ELSE
                    VALTABLE(I) :- V;
            END PUT;

            ! GET SOMETHING OUT ;
            REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            BEGIN
                INTEGER I;
                IF K == NONE THEN
                    ERROR("HASHMAP.GET: NONE IS NOT A VALID KEY");
                I := INDEX(K);
                IF KEYTABLE(I) == NONE THEN
                    GET :- NONE ! ERROR("HASHMAP.GET: KEY NOT FOUND");
                ELSE
                    GET :- VALTABLE(I);
            END GET;

            PROCEDURE CLEAR;
            BEGIN
                INTEGER I;
                FOR I := 0 STEP 1 UNTIL N-1 DO BEGIN
                    KEYTABLE(I) :- NONE;
                    VALTABLE(I) :- NONE;
                END;
                SIZE := 0;
            END CLEAR;

            ! DATA MEMBERS OF CLASS HASHMAP ;
            REF(HASHKEY) ARRAY KEYTABLE(0:N-1);
            REF(HASHVAL) ARRAY VALTABLE(0:N-1);
            INTEGER SIZE;

        END INNERHASHMAP;

        PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
        BEGIN
            IF IMAP.SIZE >= 0.75 * IMAP.N THEN
            BEGIN
                COMMENT RESIZE HASHMAP ;
                REF(INNERHASHMAP) NEWIMAP;
                REF(ITERATOR) IT;
                NEWIMAP :- NEW INNERHASHMAP(2 * IMAP.N);
                IT :- NEW ITERATOR(THIS HASHMAP);
                WHILE IT.MORE DO
                BEGIN
                    REF(HASHKEY) KEY;
                    KEY :- IT.NEXT;
                    NEWIMAP.PUT(KEY, IMAP.GET(KEY));
                END;
                IMAP.CLEAR;
                IMAP :- NEWIMAP;
            END;
            IMAP.PUT(K, V);
        END;

        REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
            GET :- IMAP.GET(K);

        PROCEDURE CLEAR;
            IMAP.CLEAR;

        INTEGER PROCEDURE SIZE;
            SIZE := IMAP.SIZE;

        REF(INNERHASHMAP) IMAP;

        IMAP :- NEW INNERHASHMAP(16);
    END HASHMAP;

    CLASS ITERATOR(H); REF(HASHMAP) H;
    BEGIN
        INTEGER POS,KEYCOUNT;

        BOOLEAN PROCEDURE MORE;
            MORE := KEYCOUNT < H.SIZE;

        REF(HASHKEY) PROCEDURE NEXT;
        BEGIN
            INSPECT H DO
            INSPECT IMAP DO
            BEGIN
                WHILE KEYTABLE(POS) == NONE DO
                    POS := POS+1;
                NEXT :- KEYTABLE(POS);
                KEYCOUNT := KEYCOUNT+1;
                POS := POS+1;
            END;
        END NEXT;

    END ITERATOR;

    COMMENT ----- PROBLEM SPECIFIC CLASSES ;

    HASHKEY CLASS TEXTHASHKEY(T); VALUE T; TEXT T;
    BEGIN
        INTEGER PROCEDURE HASH;
        BEGIN
            INTEGER I;
            T.SETPOS(1);
            WHILE T.MORE DO
                I := 31*I+RANK(T.GETCHAR);
            HASH := I;
        END HASH;
        BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;
            EQUALTO := T = K QUA TEXTHASHKEY.T;
    END TEXTHASHKEY;

    HASHVAL CLASS COUNTER;
    BEGIN
        INTEGER COUNT;
    END COUNTER;

    REF(INFILE) INF;
    REF(HASHMAP) MAP;
    REF(TEXTHASHKEY) KEY;
    REF(COUNTER) VAL;
    REF(ITERATOR) IT;
    TEXT LINE, WORD;
    INTEGER I, J, MAXCOUNT, LINENO;
    INTEGER ARRAY MAXCOUNTS(1:10);
    REF(TEXTHASHKEY) ARRAY MAXWORDS(1:10);

    WORD :- BLANKS(1000);
    MAP :- NEW HASHMAP;
  
    COMMENT MAP WORDS TO COUNTERS ;

    INF :- NEW INFILE("135-0.txt");
    INF.OPEN(BLANKS(4096));
    WHILE NOT INF.LASTITEM DO
    BEGIN
        BOOLEAN INWORD;

        PROCEDURE SAVE;
        BEGIN
            IF WORD.POS > 1 THEN
            BEGIN
                KEY :- NEW TEXTHASHKEY(WORD.SUB(1, WORD.POS - 1));
                VAL :- MAP.GET(KEY);
                IF VAL == NONE THEN
                BEGIN
                    VAL :- NEW COUNTER;
                    MAP.PUT(KEY, VAL);
                END;
                VAL.COUNT := VAL.COUNT + 1;
                WORD := " ";
                WORD.SETPOS(1);
            END;
        END SAVE;

        LINENO := LINENO + 1;
        LINE :- COPY(INF.IMAGE).STRIP; INF.INIMAGE;

        COMMENT SEARCH WORDS IN LINE ;
        COMMENT A WORD IS ANY SEQUENCE OF LETTERS ;

        INWORD := FALSE;
        LINE.SETPOS(1);
        WHILE LINE.MORE DO
        BEGIN
            CHARACTER CH;
            CH := LINE.GETCHAR;
            IF CH >= 'a' AND CH <= 'z' THEN
                CH := CHAR(RANK(CH) - RANK('a') + RANK('A'));
            IF CH >= 'A' AND CH <= 'Z' THEN
            BEGIN
                IF NOT INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := TRUE;
                END;
                WORD.PUTCHAR(CH);
            END ELSE
            BEGIN
                IF INWORD THEN
                BEGIN
                    SAVE;
                    INWORD := FALSE;
                END;
            END;
        END;
        SAVE; COMMENT LAST WORD ;
    END;
    INF.CLOSE;

    COMMENT FIND 10 MOST COMMON WORDS ;

    IT :- NEW ITERATOR(MAP);
    WHILE IT.MORE DO
    BEGIN
        KEY :- IT.NEXT;
        VAL :- MAP.GET(KEY);
        FOR I := 1 STEP 1 UNTIL 10 DO
        BEGIN
            IF VAL.COUNT >= MAXCOUNTS(I) THEN
            BEGIN
                FOR J := 10 STEP -1 UNTIL I + 1 DO
                BEGIN
                    MAXCOUNTS(J) := MAXCOUNTS(J - 1);
                    MAXWORDS(J) :- MAXWORDS(J - 1);
                END;
                MAXCOUNTS(I) := VAL.COUNT;
                MAXWORDS(I) :- KEY;
                GO TO BREAK;
            END;
        END;
    BREAK:
    END;

    COMMENT OUTPUT 10 MOST COMMON WORDS ;

    FOR I := 1 STEP 1 UNTIL 10 DO
    BEGIN
        IF MAXWORDS(I) =/= NONE THEN
        BEGIN
            OUTINT(MAXCOUNTS(I), 10);
            OUTTEXT(" ");
            OUTTEXT(MAXWORDS(I) QUA TEXTHASHKEY.T);
            OUTIMAGE;
        END;
    END;

END

Output:

     41089 THE
     19949 OF
     14942 AND
     14608 A
     13951 TO
     11214 IN
      9648 HE
      8621 WAS
      7924 THAT
      6661 IT

6 garbage collection(s) in 0.2 seconds.

Smalltalk

The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.

Cuis Smalltalk, ASCII

Works with: Cuis version 6.0

(StandardFileStream new open: 'lesms10.txt' forWrite: false)
	contents asLowercase substrings asBag sortedCounts first: 10.

Output:

an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his')

Squeak Smalltalk, ASCII

Works with: Squeak version 6.0

(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
	contents asLowercase substrings asBag sortedCounts first: 10.

Output:

{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'}

Swift

import Foundation

func printTopWords(path: String, count: Int) throws {
    // load file contents into a string
    let text = try String(contentsOfFile: path, encoding: String.Encoding.utf8)
    var dict = Dictionary<String, Int>()
    // split text into words, convert to lowercase and store word counts in dict
    let regex = try NSRegularExpression(pattern: "\\w+")
    regex.enumerateMatches(in: text, range: NSRange(text.startIndex..., in: text)) {
        (match, _, _) in
        guard let match = match else { return }
        let word = String(text[Range(match.range, in: text)!]).lowercased()
        dict[word, default: 0] += 1
    }
    // sort words by number of occurrences
    let wordCounts = dict.sorted(by: {$0.1 > $1.1})
    // print the top count words
    print("Rank\tWord\tCount")
    for (i, (word, n)) in wordCounts.prefix(count).enumerated() {
        print("\(i + 1)\t\(word)\t\(n)")
    }
}

do {
    try printTopWords(path: "135-0.txt", count: 10)
} catch {
    print(error.localizedDescription)
}

Output:

Rank	Word	Count
1	the	41039
2	of	19951
3	and	14942
4	a	14527
5	to	13941
6	in	11209
7	he	9646
8	was	8620
9	that	7922
10	it	6659

Tcl

lassign $argv head
while { [gets stdin line] >= 0 } {
    foreach word [regexp -all -inline {[A-Za-z]+} $line] {
        dict incr wordcount [string tolower $word]
    }
}

set sorted [lsort -stride 2 -index 1 -int -decr $wordcount]
foreach {word count} [lrange $sorted 0 [expr {$head * 2 - 1}]] {
    puts "$count\t$word"
}

./wordcount-di.tcl 10 < 135-0.txt

Output:

41093   the
19954   of
14943   and
14558   a
13953   to
11219   in
9649    he
8622    was
7924    that
6661    it

TMG

McIlroy's Unix TMG:

/* Input format: N text                                         */
/* Only lowercase letters can constitute a word in text.        */
/* (c) 2020, Andrii Makukha, 2-clause BSD licence.              */

progrm: readn/error
        table(freq) table(chain) [firstword = ~0]
loop:   not(!<<>>) output
    |   [j=777] batch/loop loop;                   /* Main loop */

/* To use less stack, divide input into batches.                */
/* (Avoid interpreting entire input as a single "sentence".)    */
batch:  [j<=0?] succ
     |  word/skip [j--] skip batch;
skip:   string(other);
not:    params(1) (any($1) fail | ());
readn:  string(!<<0123456789>>) readint(n) skip;
error:  diag(( ={ <ERROR: input must start with a number> * } ));

/* Process a word */
word:   smark any(letter) string(letter) scopy
        locate/new
        [freq[k]++] newmax;
locate: find(freq, k);
new:    enter(freq, k)
        [freq[k] = 1] newmax
        [firstword = firstword==~0 ? k : firstword]
        enter(chain, i) [chain[i]=prevword] [prevword=k];
newmax: [max = max<freq[k] ? freq[k] : max];

/* Output logic */
output: [next=max]
outmax: [max=next] [next=0] [max>0?] [j = prevword] cycle/outmax;
cycle:  [i = j] [k = freq[i]] [n>0?]
        ( [max==freq[i]?] parse(wn)
        | [(freq[i]<max) & (next<freq[i])?] [next = freq[i]]
        | ())
        [i != firstword?] [j = chain[i]] cycle;
wn:     getnam(freq, i) [k = freq[i]] decimal(k) [n--]
        = { 2 < > 1 * };

/* Reads decimal integer */
readint:  proc(n;i) ignore(<<>>) [n=0] inta
int1:     [n = n*12+i] inta\int1;
inta:     char(i) [i<72?] [(i =- 60)>=0?];

/* Variables */
prevword:   0;  /* Head of the linked list */
firstword:  0;  /* First word's index to know where to stop output */
k: 0;
i: 0;
j: 0;
n: 0;           /* Number of most frequent words to display */ 
max:  0;        /* Current highest number of occurrences */
next: 0;        /* Next highest number of occurrences */

/* Tables */
freq:   0;
chain:  0;

/* Character classes */
letter:   <<abcdefghijklmnopqrstuvwxyz>>;
other:   !<<abcdefghijklmnopqrstuvwxyz>>;

Unix TMG didn't have a tolower builtin, so the program is used together with tr:

cat file | tr A-Z a-z > file1; ./a.out file1

Additionally, because 1972 TMG only understood ASCII characters, you might want to strip the diacritics first (e.g., é → e):

cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1
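
The preprocessing described above (strip diacritics, then fold to lowercase) can be sketched in Python with the standard unicodedata module. This is only a rough stand-in for the uni2ascii and tr commands, not an equivalent: it removes combining marks after decomposition, so letters without a decomposition (for example ø or æ) are left as they are. The function name to_plain_lower is hypothetical.

```python
import unicodedata


def to_plain_lower(text):
    """Strip combining diacritics (é -> e) and convert to lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    # drop the combining marks that NFKD split off from their base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    print(to_plain_lower("Les Misérables"))
```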

Transd

#lang transd

MainModule: {
    _start: (λ locals: cnt 0
        (with fs FileStream() words String()
            (open-r fs "/mnt/text/Literature/Miserables.txt")
            (textin fs words)

            (with v ( -|
                (split (tolower words))
                (group-by)
                (regroup-by (λ v Vector<String>() -> Int() (size v))))

            (for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i)) 
                (+= cnt 1) (if (> cnt 10) break))
    )))
}

Output:

the:     40379
of:      19869
and:     14468
a:       14278
to:      13590
in:      11025
he:      9213
was:     8347
that:    7249
his:     6414
had:     6051

UNIX Shell

Works with: Bash

Works with: zsh

This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task.

#!/bin/sh
<"$1" tr -cs A-Za-z '\n' | tr A-Z a-z | LC_ALL=C sort | uniq -c | sort -rn | head -n "$2"

Output:

$ ./wordcount.sh 135-0.txt 10 
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it

Original + URL import

This is Doug McIlroy's original solution but follows other solutions in importing the task's text file from the web and directly specifying the 10 most commonly used words.

curl "https://www.gutenberg.org/files/135/135-0.txt" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q

Output:

41096 the
19955 of
14939 and
14558 a
13954 to
11218 in
9649 he
8622 was
7924 that
6661 it
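
For readers less familiar with the Unix tools, the stages of the pipeline can be mirrored one by one in Python. This is a sketch for comparison rather than a faithful emulation: the function name pipeline_top is made up, empty tokens are dropped, and the order of words with equal counts may differ from what sort -rn produces.

```python
import re
from itertools import groupby


def pipeline_top(text, n):
    """Return the n most frequent words as (count, word) pairs."""
    # tr -cs A-Za-z '\n' : squeeze each run of non-letters into one newline
    one_per_line = re.sub(r"[^A-Za-z]+", "\n", text)
    # tr A-Z a-z : fold to lowercase
    words = one_per_line.lower().split()
    # sort | uniq -c : count runs of identical adjacent words
    counted = [(len(list(group)), word) for word, group in groupby(sorted(words))]
    # sort -rn : highest counts first
    counted.sort(key=lambda pair: pair[0], reverse=True)
    # head -n / sed Nq : keep the first n lines
    return counted[:n]


if __name__ == "__main__":
    sample = "The quick brown fox. The lazy dog; the end!"
    for count, word in pipeline_top(sample, 3):
        print(count, word)
```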

VBA

To use it, set the PATHFILE constant to the folder that contains the text file.

Option Explicit

Private Const PATHFILE As String = "C:\HOME\VBA\ROSETTA"

Sub Main()
Dim arr
Dim Dict As Object
Dim Book As String, temp As String
Dim T#
T = Timer
   Book = ExtractTxt(PATHFILE & "\les miserables.txt")
   temp = RemovePunctuation(Book)
   temp = UCase(temp)
   arr = Split(temp, " ")
   Set Dict = CreateObject("Scripting.Dictionary")
   FillDictionary Dict, arr
   Erase arr
   SortDictByFreq Dict, arr
   DisplayTheTopMostUsedWords arr, 10

Debug.Print "Words different in this book : " & Dict.Count
Debug.Print "-------------------------"
Debug.Print ""
Debug.Print "Optionally : "
Debug.Print "Frequency of the word MISERABLE : " & DisplayFrequencyOf("MISERABLE", Dict)
Debug.Print "Frequency of the word DISASTER : " & DisplayFrequencyOf("DISASTER", Dict)
Debug.Print "Frequency of the word ROSETTA_CODE : " & DisplayFrequencyOf("ROSETTA_CODE", Dict)
Debug.Print "-------------------------"
Debug.Print "Execution Time : " & Format(Timer - T, "0.000") & " sec."
End Sub

Private Function ExtractTxt(strFile As String) As String
'http://rosettacode.org/wiki/File_input/output#VBA
Dim i As Integer
   i = FreeFile
   Open strFile For Input As #i
       ExtractTxt = Input(LOF(i), #i)
   Close #i
End Function

Private Function RemovePunctuation(strBook As String) As String
Dim T, i As Integer, temp As String
Const PUNCT As String = """,;:!?."
   T = Split(StrConv(PUNCT, vbUnicode), Chr(0))
   temp = strBook
   For i = LBound(T) To UBound(T) - 1
      temp = Replace(temp, T(i), " ")
   Next
   temp = Replace(temp, "--", " ")
   temp = Replace(temp, "...", " ")
   temp = Replace(temp, vbCrLf, " ")
   RemovePunctuation = Replace(temp, "  ", " ")
End Function

Private Sub FillDictionary(d As Object, a As Variant)
Dim L As Long
   For L = LBound(a) To UBound(a)
      If a(L) <> "" Then _
         d(a(L)) = d(a(L)) + 1
   Next
End Sub

Private Sub SortDictByFreq(d As Object, myArr As Variant)
Dim K
Dim L As Long
   ReDim myArr(1 To d.Count, 1 To 2)
   For Each K In d.keys
      L = L + 1
      myArr(L, 1) = K
      myArr(L, 2) = CLng(d(K))
   Next
   SortArray myArr, LBound(myArr), UBound(myArr), 2
End Sub

Private Sub SortArray(a, Le As Long, Ri As Long, Col As Long)
Dim ref As Long, L As Long, r As Long, temp As Variant
   ref = a((Le + Ri) \ 2, Col)
   L = Le
   r = Ri
   Do
         Do While a(L, Col) < ref
            L = L + 1
         Loop
         Do While ref < a(r, Col)
            r = r - 1
         Loop
         If L <= r Then
            temp = a(L, 1)
            a(L, 1) = a(r, 1)
            a(r, 1) = temp
            temp = a(L, 2)
            a(L, 2) = a(r, 2)
            a(r, 2) = temp
            L = L + 1
            r = r - 1
         End If
   Loop While L <= r
   If L < Ri Then SortArray a, L, Ri, Col
   If Le < r Then SortArray a, Le, r, Col
End Sub

Private Sub DisplayTheTopMostUsedWords(arr As Variant, Nb As Long)
Dim L As Long, i As Integer
   i = 1
   Debug.Print "Rank Word    Frequency"
   Debug.Print "==== ======= ========="
   For L = UBound(arr) To UBound(arr) - Nb + 1 Step -1
      Debug.Print Left(CStr(i) & "    ", 5) & Left(arr(L, 1) & "       ", 8) & " " & Format(arr(L, 2), "0 000")
      i = i + 1
   Next
End Sub

Private Function DisplayFrequencyOf(Word As String, d As Object) As Long
   If d.Exists(Word) Then _
      DisplayFrequencyOf = d(Word)
End Function

Output:

Words different in this book : 25884
-------------------------
Rank Word    Frequency
==== ======= =========
1    THE      40 831
2    OF       19 807
3    AND      14 860
4    A        14 453
5    TO       13 641
6    IN       11 133
7    HE       9 598
8    WAS      8 617
9    THAT     7 807
10   IT       6 517

Optionally : 
Frequency of the word MISERABLE : 35
Frequency of the word DISASTER : 12
Frequency of the word ROSETTA_CODE : 0
-------------------------
Execution Time : 7,785 sec.

Wren

Translation of: Go

Library: Wren-str

Library: Wren-sort

Library: Wren-fmt

Library: Wren-pattern

I've taken the view that 'letter' means either a letter or a digit for Unicode codepoints up to 255. I haven't included underscore, hyphen, or apostrophe, as these usually separate the parts of compound words.

Not very quick (runs in about 15 seconds on my system), though this is partly due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.

If the Go example is re-run today (17 February 2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 5 years ago.

import "io" for File
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern

var fileName = "135-0.txt"
var text = File.read(fileName).trimEnd()
var groups = {}
// match runs of A-Z, a-z, 0-9 and any non-ASCII letters with code-points < 256
var p = Pattern.new("+1&w")
var lines = text.split("\n")
for (line in lines) {
    var ms = p.findAll(line)
    for (m in ms) {
        var t = Str.lower(m.text)
        groups[t] = groups.containsKey(t) ? groups[t] + 1 : 1
    }
}
var keyVals = groups.toList
Sort.quick(keyVals, 0, keyVals.count - 1) { |i, j| (j.value - i.value).sign }
System.print("Rank  Word  Frequency")
System.print("====  ====  =========")
for (rank in 1..10) {
    var word = keyVals[rank-1].key
    var freq = keyVals[rank-1].value
    Fmt.print("$2d    $-4s    $5d", rank, word, freq)
}

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41092
 2    of      19954
 3    and     14943
 4    a       14546
 5    to      13953
 6    in      11219
 7    he       9649
 8    was      8622
 9    that     7924
10    it       6661
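
The definition of a letter used above (letters or digits with code points below 256; underscore, hyphen, and apostrophe act as separators) can be approximated in Python as follows. This is a sketch under assumptions: is_letter and split_words are hypothetical names, and Python's str.isalnum may classify a few Latin-1 characters (such as superscript digits) differently from the Wren pattern module.

```python
from itertools import groupby


def is_letter(ch):
    """Treat letters and digits with code points below 256 as word characters."""
    return ord(ch) < 256 and ch.isalnum()


def split_words(line):
    """Split a line into lowercase runs of word characters."""
    return ["".join(run).lower()
            for keep, run in groupby(line, key=is_letter) if keep]


if __name__ == "__main__":
    print(split_words("Café 42 well-dressed it's_ok"))
```

With this definition well-dressed counts as two words and it's as two words, matching the choice described for the Wren example.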

XQuery

let $maxentries := 10,
    $uri := 'https://www.gutenberg.org/files/135/135-0.txt'
return
<words in="{$uri}" top="{$maxentries}"> {
(
  let $doc := unparsed-text($uri),
  $tokens := (
               tokenize($doc, '\W+')[normalize-space()]
                 ! lower-case(.) 
                 ! normalize-unicode(., 'NFC')
             )
  return
    for $token in $tokens
    let $key := $token
    group by $key
    let $count := count($token)
    order by $count descending
    return <word key="{$key}" count="{$count}"/>
)[position()=(1 to $maxentries)]
}</words>

Output:

<words in="https://www.gutenberg.org/files/135/135-0.txt" top="10">
  <word key="the" count="41092"/>
  <word key="of" count="19954"/>
  <word key="and" count="14943"/>
  <word key="a" count="14545"/>
  <word key="to" count="13953"/>
  <word key="in" count="11219"/>
  <word key="he" count="9649"/>
  <word key="was" count="8622"/>
  <word key="that" count="7924"/>
  <word key="it" count="6661"/>
</words>

zkl

fname,count := vm.arglist;	// grab command line args

   // words may have leading or trailing "_", ie "the" and "_the"
File(fname).pump(Void,"toLower",  // read the file line by line and hash words
   RegExp("[a-z]+").pump.fp1(Dictionary().incV))  // line-->(word:count,..)
.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list
.pump(String,Void.Xplode,"%s,%s\n".fmt).println();

Output:

$ zkl bbb ~/Documents/Les\ Miserables.txt 10
the,41089
of,19949
and,14942
a,14608
to,13951
in,11214
he,9648
was,8621
that,7924
it,6661