Word frequency

Task

Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

A word is a sequence of one or more contiguous letters
You are free to define what a letter is. Underscores, accented letters, apostrophes, and other special characters can be handled at the example writer's discretion. For example, you may treat a compound word like "well-dressed" as either one word or two. The word "it's" could also be one or two words as you see fit. You may also choose not to support non US-ASCII characters. Feel free to explicitly state the thoughts behind the program decisions.
Assume words will not span multiple lines.
Do not worry about normalization of word spelling differences. Treat "color" and "colour" as two distinct words.
Uppercase letters are considered equivalent to their lowercase counterparts
Words of equal frequency can be listed in any order

Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.

History

This task was originally taken from programming pearls from Communications of the ACM June 1986 Volume 29 Number 6 where this problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy, demonstrating solving the problem in a 6 line Unix shell script (provided as an example below).

References

McIlroy's program

Ada

This version uses a character set to match valid characters in a token. Another version could use a pointer to a function returning a boolean to match valid characters (allowing to use functions such as Is_Alphanumeric), but AFAIK there is no "Find_Token" method that uses one.

Works with: Ada version 2012

<lang Ada>with Ada.Command_Line; with Ada.Text_IO; with Ada.Integer_Text_IO; with Ada.Strings.Maps; with Ada.Strings.Fixed; with Ada.Characters.Handling; with Ada.Containers.Indefinite_Ordered_Maps; with Ada.Containers.Indefinite_Ordered_Sets; with Ada.Containers.Ordered_Maps;

procedure Word_Frequency is

   package TIO renames Ada.Text_IO;

   package String_Counters is new Ada.Containers.Indefinite_Ordered_Maps(String, Natural);
   package String_Sets is new Ada.Containers.Indefinite_Ordered_Sets(String);
   package Sorted_Counters is new Ada.Containers.Ordered_Maps
     (Natural,
      String_Sets.Set,
      "=" => String_Sets."=",
      "<" => ">");
   -- for sorting by decreasing number of occurrences and ascending lexical order

   procedure Increment(Key : in String; Element : in out Natural) is
   begin
       Element := Element + 1;
   end Increment;

   path : constant String := Ada.Command_Line.Argument(1);
   how_many : Natural := 10;
   set : constant Ada.Strings.Maps.Character_Set := Ada.Strings.Maps.To_Set(ranges => (('a', 'z'), ('0', '9')));
   F : TIO.File_Type;
   first : Positive;
   last : Natural;
   from : Positive;
   counter : String_Counters.Map;
   sorted_counts : Sorted_Counters.Map;
   C1 : String_Counters.Cursor;
   C2 : Sorted_Counters.Cursor;
   tmp_set : String_Sets.Set;

begin

   -- read file and count words
   TIO.Open(F, name => path, mode => TIO.In_File);
   while not TIO.End_Of_File(F) loop
      declare
         line : constant String := Ada.Characters.Handling.To_Lower(TIO.Get_Line(F));
      begin
         from := line'First;
         loop
            Ada.Strings.Fixed.Find_Token(line(from .. line'Last), set, Ada.Strings.Inside, first, last);
            exit when last < First;
            C1 := counter.Find(line(first .. last));
            if String_Counters.Has_Element(C1) then
               counter.Update_Element(C1, Increment'Access);
            else
               counter.Insert(line(first .. last), 1);
            end if;
            from := last + 1;
         end loop;
      end;
   end loop;
   TIO.Close(F);

   -- fill Natural -> StringSet Map
   C1 := counter.First;
   while String_Counters.Has_Element(C1) loop
      if sorted_counts.Contains(String_Counters.Element(C1)) then
         tmp_set := sorted_counts.Element(String_Counters.Element(C1));
         tmp_set.Include(String_Counters.Key(C1));
      else
         sorted_counts.Include(String_Counters.Element(C1), String_Sets.To_Set(String_Counters.Key(C1)));
      end if;
      String_Counters.Next(C1);
   end loop;

   -- output
   C2 := sorted_counts.First;
   while Sorted_Counters.Has_Element(C2) loop
      for Item of Sorted_Counters.Element(C2) loop
         Ada.Integer_Text_IO.Put(TIO.Standard_Output, Sorted_Counters.Key(C2), width => 9);
         TIO.Put(TIO.Standard_Output, " ");
         TIO.Put_Line(Item);
      end loop;
      Sorted_Counters.Next(C2);
      how_many := how_many - 1;
      exit when how_many = 0;
   end loop;

end Word_Frequency; </lang>

Output:

$ ./word_frequency 135-0.txt
    41093 the
    19954 of
    14943 and
    14558 a
    13953 to
    11219 in
     9649 he
     8622 was
     7924 that
     6661 it

AutoHotkey

<lang AutoHotkey>URLDownloadToFile, http://www.gutenberg.org/files/135/135-0.txt, % A_temp "\tempfile.txt" FileRead, H, % A_temp "\tempfile.txt" FileDelete, % A_temp "\tempfile.txt" words := [] while pos := RegExMatch(H, "\balpha:+\b", m, A_Index=1?1:pos+StrLen(m)) words[m] := words[m] ? words[m] + 1 : 1 for word, count in words list .= count "`t" word "`r`n" Sort, list, RN loop, parse, list, `n, `r { result .= A_LoopField "`r`n" if A_Index = 10 break } MsgBox % "Freq`tWord`n" result return</lang>

Outputs:

Freq	Word
41036	The
19946	of
14940	and
14589	A
13939	TO
11204	in
9645	HE
8619	WAS
7922	THAT
6659	it

AWK

syntax: GAWK -f WORD_FREQUENCY.AWK [-v show=x] LES_MISERABLES.TXT
A word is anything separated by white space.
Therefor "this" and "this." are different.
But "This" and "this" are identical.
As I am "free to define what a letter is" I have chosen to allow
numerics and all special characters as they are usually considered
parts of words in text processing applications.

{ nbytes += length($0) + 2 # +2 for CR/LF

   nfields += NF
   $0 = tolower($0)
   for (i=1; i<=NF; i++) {
     arr[$i]++
   }

} END {

   show = (show == "") ? 10 : show
   width1 = length(show)
   PROCINFO["sorted_in"] = "@val_num_desc"
   for (i in arr) {
     if (width2 == 0) { width2 = length(arr[i]) }
     if (n++ >= show) { break }
     printf("%*d %*d %s\n",width1,n,width2,arr[i],i)
   }
   printf("input: %d records, %d bytes, %d words of which %d are unique\n",NR,nbytes,nfields,length(arr))
   exit(0)

} </lang>

Output:

 1 40372 the
 2 19868 of
 3 14472 and
 4 14278 a
 5 13589 to
 6 11024 in
 7  9213 he
 8  8347 was
 9  7250 that
10  6414 his
input: 73829 records, 3369772 bytes, 568744 words of which 50394 are unique

Batch File

This takes a very long time per word thus I have chosen to feed it a 200 line sample and go from there.

You could cut the length of this down drastically if you didn't need to be able to recall the word at nth position and wished only to display the top 10 words.

<lang dos> @echo off

call:wordCount 1 2 3 4 5 6 7 8 9 10 42 101

pause>nul exit

wordCount

setlocal enabledelayedexpansion

set word=100000 set line=0 for /f "delims=" %%i in (input.txt) do ( set /a line+=1 for %%j in (%%i) do ( if not !skip%%j!==true ( echo line !line! ^| word !word:~-5! - "%%~j"

type input.txt | find /i /c "%%~j" > count.tmp set /p tmpvar=<count.tmp

set tmpvar=000000000!tmpvar! set tmpvar=!tmpvar:~-10! set count[!word!]=!tmpvar! %%~j

set "skip%%j=true" set /a word+=1 ) ) ) del count.tmp

set wordcount=0 for /f "tokens=1,2 delims= " %%i in ('set count ^| sort /+14 /r') do ( set /a wordcount+=1 for /f "tokens=2 delims==" %%k in ("%%i") do ( set word[!wordcount!]=!wordcount!. %%j - %%k ) )

cls for %%i in (%*) do echo !word[%%i]! endlocal goto:eof

</lang>

Output

1.  - 0000000140 I
2.  - 0000000140 a
3.  - 0000000118 He
4.  - 0000000100 the
5.  - 0000000080 an
6.  - 0000000075 in
7.  - 0000000066 at
8.  - 0000000062 is
9.  - 0000000058 on
10.  - 0000000058 as
42.  - 0000000010 with
101.  - 0000000004 ears

C

Library: GLib

Known issue: this code assumes the ASCII character set, but the input file is UTF-8. <lang c>#include <ctype.h>

include <stdbool.h>
include <stdio.h>
include <glib.h>

// A word is defined to be any run of characters for which isalpha // returns true, i.e. upper or lower case letters. All words are // converted to lower case. bool get_word(FILE* in, GString* word) {

   int c;
   g_string_set_size(word, 0);
   while ((c = getc(in)) != EOF) {
       if (isalpha(c)) {
           g_string_append_c(word, tolower(c));
           break;
       }
   }
   if (word->len == 0)
       return false;
   while ((c = getc(in)) != EOF) {
       if (!isalpha(c))
           break;
       g_string_append_c(word, tolower(c));
   }
   return true;

}

typedef struct word_count_tag {

   const char* word;
   size_t count;

} word_count;

int compare_word_count(const void* p1, const void* p2) {

   const word_count* w1 = p1;
   const word_count* w2 = p2;
   if (w1->count > w2->count)
       return -1;
   if (w1->count < w2->count)
       return 1;
   return 0;

}

void get_top_words(FILE* in, size_t count) {

   GHashTable* ht = g_hash_table_new_full(g_str_hash, g_str_equal,
                                          g_free, g_free);
   // Store word counts in the hash table
   GString* word = g_string_sized_new(64);
   while (get_word(in, word)) {
       gpointer value = g_hash_table_lookup(ht, word->str);
       if (value != NULL) {
           size_t* count = value;
           ++*count;
       } else {
           size_t* count = g_new(size_t, 1);
           *count = 1;
           g_hash_table_insert(ht, g_strdup(word->str), count);
       }
   }
   g_string_free(word, TRUE);

   // Sort words in decreasing order of frequency
   size_t size = g_hash_table_size(ht);
   word_count* words = g_new(word_count, size);
   GHashTableIter iter;
   gpointer key, value;
   g_hash_table_iter_init(&iter, ht);
   for (size_t i = 0; g_hash_table_iter_next(&iter, &key, &value); ++i) {
       words[i].word = (char*)key;
       words[i].count = *(size_t*)value;
   }
   qsort(words, size, sizeof(word_count), compare_word_count);

   // Print the most common words
   if (count > size)
       count = size;
   printf("Top %lu words\n", count);
   printf("Rank\tCount\tWord\n");
   for (size_t i = 0; i < count; ++i)
       printf("%lu\t%lu\t%s\n", i + 1, words[i].count, words[i].word);
   g_free(words);
   g_hash_table_destroy(ht);

}

int main(int argc, char** argv) {

   if (argc != 2) {
       fprintf(stderr, "usage: %s file\n", argv[0]);
       return 1;
   }
   FILE* in = fopen(argv[1], "r");
   if (in == NULL) {
       perror(argv[1]);
       return 1;
   }
   get_top_words(in, 15);
   fclose(in);
   return 0;

}</lang>

Output:

Top 15 words
Rank	Count	Word
1	41093	the
2	19954	of
3	14943	and
4	14558	a
5	13953	to
6	11219	in
7	9649	he
8	8622	was
9	7924	that
10	6661	it
11	6470	his
12	6195	is
13	6182	had
14	5149	which
15	4530	with

C#

Translation of: D

<lang csharp>using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text.RegularExpressions;

namespace WordCount {

   class Program {
       static void Main(string[] args) {
           var text = File.ReadAllText("135-0.txt").ToLower();

           var match = Regex.Match(text, "\\w+");
           Dictionary<string, int> freq = new Dictionary<string, int>();
           while (match.Success) {
               string word = match.Value;
               if (freq.ContainsKey(word)) {
                   freq[word]++;
               } else {
                   freq.Add(word, 1);
               }

               match = match.NextMatch();
           }

           Console.WriteLine("Rank  Word  Frequency");
           Console.WriteLine("====  ====  =========");
           int rank = 1;
           foreach (var elem in freq.OrderByDescending(a => a.Value).Take(10)) {
               Console.WriteLine("{0,2}    {1,-4}    {2,5}", rank++, elem.Key, elem.Value);
           }
       }
   }

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41035
 2    of      19946
 3    and     14940
 4    a       14577
 5    to      13939
 6    in      11204
 7    he       9645
 8    was      8619
 9    that     7922
10    it       6659

C++

Translation of: C#

<lang cpp>#include <algorithm>

include <iostream>
include <fstream>
include <map>
include <regex>
include <string>
include <vector>

int main() {

   using namespace std;
   regex wordRgx("\\w+");
   map<string, int> freq;
   string line;

   ifstream in("135-0.txt");
   if (!in.is_open()) {
       cerr << "Failed to open file\n";
       return 1;
   }
   while (getline(in, line)) {
       auto words_itr = sregex_iterator(line.cbegin(), line.cend(), wordRgx);
       auto words_end = sregex_iterator();
       while (words_itr != words_end) {
           auto match = *words_itr;
           auto word = match.str();
           if (word.size() > 0) {
               auto entry = freq.find(word);
               if (entry != freq.end()) {
                   entry->second++;
               } else {
                   freq.insert(make_pair(word, 1));
               }
           }
           words_itr = next(words_itr);
       }
   }
   in.close();

   vector<pair<string, int>> pairs;
   for (auto iter = freq.cbegin(); iter != freq.cend(); ++iter) {
       pairs.push_back(*iter);
   }
   sort(pairs.begin(), pairs.end(), [=](pair<string, int>& a, pair<string, int>& b) {
       return a.second > b.second;
   });

   cout << "Rank  Word  Frequency\n";
   cout << "====  ====  =========\n";
   int rank = 1;
   for (auto iter = pairs.cbegin(); iter != pairs.cend() && rank <= 10; ++iter) {
       printf("%2d   %4s   %5d\n", rank++, iter->first.c_str(), iter->second);
   }

   return 0;

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the   36636
 2     of   19615
 3    and   14079
 4     to   13535
 5      a   13527
 6     in   10256
 7    was    8543
 8   that    7324
 9     he    6814
10    had    6139

Clojure

<lang clojure>(defn count-words [file n]

 (->> file
   slurp
   clojure.string/lower-case
   (re-seq #"\w+")
   frequencies
   (sort-by val >)
   (take n)))</lang>

Output:

user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939]
 ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])

COBOL

      IDENTIFICATION DIVISION.
      PROGRAM-ID. WordFrequency.
      AUTHOR.  Bill Gunshannon.
      DATE-WRITTEN.  30 Jan 2020.
     ************************************************************
     ** Program Abstract:
     **   Given a text file and an integer n, print the n most
     **   common words in the file (and the number of their
     **   occurrences) in decreasing frequency.
     **
     **   A file named Parameter.txt provides this information.
     **   Format is:
     **   12345678901234567890123456789012345678901234567890
     **   |------------------|----|
     **     ^^^^^^^^^^^^^^^^  ^^^^
     **          |              |
     **     Source Text File   Number of words with count
     **       20 Characters      5 digits with leading zeroes
     **
     **
     ************************************************************
      
      ENVIRONMENT DIVISION.
      
      INPUT-OUTPUT SECTION.
      FILE-CONTROL.
           SELECT Parameter-File ASSIGN TO "Parameter.txt"
                ORGANIZATION IS LINE SEQUENTIAL.
           SELECT Input-File ASSIGN TO Source-Text
                ORGANIZATION IS LINE SEQUENTIAL.
           SELECT Word-File ASSIGN TO "Word.txt"
                ORGANIZATION IS LINE SEQUENTIAL.
           SELECT Output-File ASSIGN TO "Output.txt"
                ORGANIZATION IS LINE SEQUENTIAL.
           SELECT Print-File ASSIGN TO "Printer.txt"
                ORGANIZATION IS LINE SEQUENTIAL.
           SELECT Sort-File     ASSIGN TO DISK.
      
      DATA DIVISION.
      
      FILE SECTION.
      
      FD  Parameter-File
          DATA RECORD IS Parameter-Record.
      01  Parameter-Record.
          05 Source-Text               PIC X(20).
          05 How-Many                  PIC 99999.

      FD  Input-File
          DATA RECORD IS Input-Record.
      01  Input-Record.
          05 Input-Line                PIC X(80).

      FD  Word-File
          DATA RECORD IS Word-Record.
      01  Word-Record.
          05 Input-Word               PIC X(20).

      FD  Output-File
          DATA RECORD IS Output-Rec.
      01  Output-Rec.
          05  Output-Rec-Word         PIC X(20).
          05  Output-Rec-Word-Cnt     PIC 9(5).

      FD  Print-File
          DATA RECORD IS Print-Rec.
      01  Print-Rec.
          05  Print-Rec-Word          PIC X(20).
          05  Print-Rec-Word-Cnt      PIC 9(5).
      
      SD  Sort-File.
      01  Sort-Rec.
          05  Sort-Word               PIC X(20).
          05  Sort-Word-Cnt           PIC 9(5).
      
      
      WORKING-STORAGE SECTION.
      
      01 Eof                    PIC X     VALUE 'F'.
      01 InLine                 PIC X(80).
      01 Word1                  PIC X(20).
      01 Current-Word           PIC X(20).
      01 Current-Word-Cnt       PIC 9(5).
      01 Pos                    PIC 99
                VALUE 1.
      01 Cnt                    PIC 99.
      01 Report-Rank.
         05  IRank              PIC 99999
                VALUE 1.
         05 Rank                PIC ZZZZ9.
      
      PROCEDURE DIVISION.
      
      Main-Program.
     **
     **  Read the Parameters
     **
        OPEN INPUT Parameter-File.
        READ Parameter-File.
        CLOSE Parameter-File.

     **
     **  Open Files for first stage
     **
        OPEN INPUT  Input-File.
        OPEN OUTPUT  Word-File.

     **
     **  Pare\se the Source Text into a file of invidual words
     **
        PERFORM UNTIL Eof = 'T'
           READ Input-File 
              AT END MOVE 'T' TO Eof
           END-READ

        PERFORM Parse-a-Words

        MOVE SPACES TO Input-Record
        MOVE 1 TO Pos
        END-PERFORM.
    
     **
     **  Cleanup from the first stage
     **
        CLOSE Input-File Word-File

     **
     **  Sort the individual words in alphabetical order
     **
        SORT Sort-File
             ON ASCENDING KEY Sort-Word
             USING Word-File
             GIVING Word-File.

     **
     **  Count each time a word is used
     **
        PERFORM Collect-Totals.

     **
     **  Sort data by number of usages per word
     **
        SORT Sort-File
             ON DESCENDING KEY Sort-Word-Cnt
             USING Output-File
             GIVING Print-File.

     **
     **  Show the work done
     **
        OPEN INPUT Print-File.
           DISPLAY " Rank  Word               Frequency"
        PERFORM How-Many TIMES
           READ Print-File
           MOVE IRank TO Rank
           DISPLAY Rank "  " Print-Rec
           ADD 1 TO IRank
        END-PERFORM.

     **
     **  Cleanup
     **
        CLOSE Print-File.
        CALL "C$DELETE" USING "Word.txt" ,0
        CALL "C$DELETE" USING "Output.txt" ,0

        STOP RUN.

       Parse-a-Words.
         INSPECT Input-Record CONVERTING '-.,"();:/[]{}!?|' TO SPACE 
         PERFORM UNTIL Pos > FUNCTION STORED-CHAR-LENGTH(Input-Record)

         UNSTRING Input-Record DELIMITED BY SPACE INTO Word1 
                   WITH POINTER Pos TALLYING IN Cnt 
         MOVE FUNCTION TRIM(FUNCTION LOWER-CASE(Word1)) TO Word-Record
         
         IF Word-Record NOT EQUAL SPACES AND Word-Record IS ALPHABETIC
            THEN WRITE Word-Record
         END-IF

         END-PERFORM.

      Collect-Totals.
         MOVE 'F' to Eof
         OPEN INPUT Word-File
         OPEN OUTPUT Output-File
            READ Word-File
            MOVE Input-Word TO Current-Word
            MOVE 1 to Current-Word-Cnt
         PERFORM UNTIL Eof = 'T'
            READ Word-File
               AT END MOVE 'T' TO Eof
            END-READ

            IF FUNCTION TRIM(Word-Record) 
                   EQUAL 
                          FUNCTION TRIM(Current-Word)
               THEN
                    ADD 1 to Current-Word-Cnt
               ELSE
                    MOVE Current-Word TO Output-Rec-Word
                    MOVE Current-Word-Cnt TO Output-Rec-Word-Cnt
                    WRITE Output-Rec
                    MOVE 1 to Current-Word-Cnt
                    MOVE Word-Record TO Current-Word
                    MOVE SPACES TO Input-Record
           END-IF 
          
         END-PERFORM.
         CLOSE Word-File Output-File.
      END-PROGRAM.

</lang>

Output:

 Rank  Word               Frequency
    1  the                 40551
    2  of                  19806
    3  and                 14730
    4  a                   14351
    5  to                  13775
    6  in                  11074
    7  he                  09480
    8  was                 08613
    9  that                07632
   10  his                 06446
   11  it                  06335
   12  had                 06181
   13  is                  06097
   14  which               05135
   15  with                04469

Common Lisp

<lang lisp> (defun count-word (n pathname)

 (with-open-file (s pathname :direction :input)
   (loop for line = (read-line s nil nil) while line
         nconc (list-symb (drop-noise line)) into words
         finally (return (subseq (sort (pair words)
                                       #'> :key #'cdr)
                                 0 n)))))

 (defun list-symb (s)
   (let ((*read-eval* nil))
     (read-from-string (concatenate 'string "(" s ")"))))

(defun drop-noise (s)

 (delete-if-not #'(lambda (x) (or (alpha-char-p x)
                                  (equal x #\space)
                                  (equal x #\-))) s))

(defun pair (words &aux (hash (make-hash-table)) ac)

 (dolist (word words) (incf (gethash word hash 0)))
 (maphash #'(lambda (e n) (push `(,e . ,n) ac)) hash) ac)

</lang>

Output:

> (count-word 10 "c:/temp/135-0.txt")
((THE . 40738) (OF . 19922) (AND . 14878) (A . 14419) (TO . 13702) (IN . 11172)
 (HE . 9577) (WAS . 8612) (THAT . 7768) (IT . 6467))

Crystal

<lang ruby>require "http/client" require "regex"

Get the text from the internet

response = HTTP::Client.get "https://www.gutenberg.org/files/135/135-0.txt" text = response.body

text

 .downcase
 .scan(/[a-zA-ZáéíóúÁÉÍÓÚâêôäüöàèìòùñ']+/)
 .reduce({} of String => Int32) { |hash, match|
   word = match[0]
   hash[word] = hash.fetch(word, 0) + 1 # using fetch to set a default value (1) to the new found word
   hash
 }
 .to_a                                        # convert the returned hash to an array of tuples (String, Int32) -> {word, sum}
 .sort { |a, b| b[1] <=> a[1] }[0..9]         # sort and get the first 10 elements
 .each_with_index(1) { |(word, n), i| puts "#{i} \t #{word} \t #{n}" } # print the result

</lang>

Output:

1        the     41092
2        of      19954
3        and     14943
4        a       14556
5        to      13953
6        in      11219
7        he      9649
8        was     8622
9        that    7924
10       it      6661

D

<lang D>import std.algorithm : sort; import std.array : appender, split; import std.range : take; import std.stdio : File, writefln, writeln; import std.typecons : Tuple; import std.uni : toLower;

//Container for a word and how many times it has been seen alias Pair = Tuple!(string, "k", int, "v");

void main() {

   int[string] wcnt;

   //Read the file line by line
   foreach (line; File("135-0.txt").byLine) {
       //Split the words on whitespace
       foreach (word; line.split) {
           //Increment the times the word has been seen
           wcnt[word.toLower.idup]++;
       }
   }

   //Associative arrays cannot be sort, so put the key/value in an array
   auto wb = appender!(Pair[]);
   foreach(k,v; wcnt) {
       wb.put(Pair(k,v));
   }
   Pair[] sw = wb.data.dup;

   //Sort the array, and display the top ten values
   writeln("Rank  Word        Frequency");
   int rank=1;
   foreach (word; sw.sort!"a.v>b.v".take(10)) {
       writefln("%4s  %-10s  %9s", rank++, word.k, word.v);
   }

}</lang>

Output:

Rank  Word        Frequency
   1  the             40368
   2  of              19863
   3  and             14470
   4  a               14277
   5  to              13587
   6  in              11019
   7  he               9212
   8  was              8346
   9  that             7251
  10  his              6414

F#

<lang fsharp> open System.IO open System.Text.RegularExpressions let g=Regex("[A-Za-zÀ-ÿ]+").Matches(File.ReadAllText "135-0.txt") [for n in g do yield n.Value.ToLower()]|>List.countBy(id)|>List.sortBy(fun n->(-(snd n)))|>List.take 10|>List.iter(fun n->printfn "%A" n) </lang>

Output:

("the", 41088)
("of", 19949)
("and", 14942)
("a", 14596)
("to", 13951)
("in", 11214)
("he", 9648)
("was", 8621)
("that", 7924)
("it", 6661)

Factor

This program expects stdin to read from a file via the command line. ( e.g. invoking the program in Windows: >factor word-count.factor < input.txt ) The definition of a word here is simply any string surrounded by some combination of spaces, punctuation, or newlines. <lang factor> USING: ascii io math.statistics prettyprint sequences splitting ; IN: rosetta-code.word-count

lines " " join " .,?!:;()\"-" split harvest [ >lower ] map sorted-histogram <reversed> 10 head . </lang>

Output:

{
    { "the" 41021 }
    { "of" 19945 }
    { "and" 14938 }
    { "a" 14522 }
    { "to" 13938 }
    { "in" 11201 }
    { "he" 9600 }
    { "was" 8618 }
    { "that" 7822 }
    { "it" 6532 }
}

Frink

Frink has a Unicode-aware function, wordList[str], which intelligently enumerates through the words in a string (and correctly handles compound words, hyphenated words, accented characters, etc.) It returns words, spaces, and punctuation marks. Results that do not contain alphanumeric characters are filtered out.

The file fetched from Project Gutenberg is supposed to be encoded in UTF-8 character encoding, but their servers incorrectly send that it is Windows-1252 encoded, so this program fixes that.

Frink has a Unicode-aware lowercase function, lc[str] that correctly handles accented characters and may even make a string longer.

<lang frink>d = new dict for w = select[wordList[read["https://www.gutenberg.org/files/135/135-0.txt", "UTF-8"]], %r/alnum:/ ]

  d.increment[lc[w], 1]

println[join["\n", first[reverse[sort[array[d], {|a,b| a@1 <=> b@1}]], 10]]]</lang>

Output:

[the, 40802]
[of, 19933]
[and, 14924]
[a, 14450]
[to, 13719]
[in, 11184]
[he, 9636]
[was, 8617]
[that, 7901]
[it, 6641]

Go

Translation of: Kotlin

<lang go>package main

import (

   "fmt"
   "io/ioutil"
   "log"
   "regexp"
   "sort"
   "strings"

)

type keyval struct {

   key string
   val int

}

func main() {

   reg := regexp.MustCompile(`\p{Ll}+`)
   bs, err := ioutil.ReadFile("135-0.txt")
   if err != nil {
       log.Fatal(err)
   }
   text := strings.ToLower(string(bs))
   matches := reg.FindAllString(text, -1)
   groups := make(map[string]int)
   for _, match := range matches {
       groups[match]++
   }
   var keyvals []keyval
   for k, v := range groups {
       keyvals = append(keyvals, keyval{k, v})
   }
   sort.Slice(keyvals, func(i, j int) bool {
       return keyvals[i].val > keyvals[j].val
   })
   fmt.Println("Rank  Word  Frequency")
   fmt.Println("====  ====  =========")
   for rank := 1; rank <= 10; rank++ {
       word := keyvals[rank-1].key
       freq := keyvals[rank-1].val
       fmt.Printf("%2d    %-4s    %5d\n", rank, word, freq)
   }

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

Groovy

Solution: <lang groovy>def topWordCounts = { String content, int n ->

   def mapCounts = [:]
   content.toLowerCase().split(/\W+/).each {
       mapCounts[it] = (mapCounts[it] ?: 0) + 1
   }
   def top = (mapCounts.sort { a, b -> b.value <=> a.value }.collect{ it })[0..<n]
   println "Rank Word Frequency\n==== ==== ========="
   (0..<n).each { printf ("%4d %-4s %9d\n", it+1, top[it].key, top[it].value) }

}</lang>

Test: <lang groovy>def rawText = "http://www.gutenberg.org/files/135/135-0.txt".toURL().text topWordCounts(rawText, 10)</lang>

Output:

Rank Word Frequency
==== ==== =========
   1 the      41036
   2 of       19946
   3 and      14940
   4 a        14589
   5 to       13939
   6 in       11204
   7 he        9645
   8 was       8619
   9 that      7922
  10 it        6659

Haskell

Translation of: Clojure

<lang Haskell>module Main where

import Control.Category -- (>>>) import Data.Char -- toLower, isSpace import Data.List -- sortBy, (Foldable(foldl')), filter import Data.Ord -- Down import System.IO -- stdin, ReadMode, openFile, hClose import System.Environment -- getArgs

-- containers import Data.Map.Strict (Map) import qualified Data.Map.Strict as M import qualified Data.IntMap.Strict as IM

-- text import Data.Text (Text) import qualified Data.Text as T import qualified Data.Text.IO as T

frequencies :: Ord a => [a] -> Map a Integer frequencies = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty {-# SPECIALIZE frequencies :: [Text] -> Map Text Integer #-}

main :: IO () main = do

 args <- getArgs
 (n,hand,filep) <- case length args of
   0 -> return (10,stdin,False)
   1 -> return (read $ head args,stdin,False)
   _ -> let (ns:fp:_) = args
        in fmap (\h -> (read ns,h,True)) (openFile fp ReadMode)
 T.hGetContents hand >>=
   (T.map toLower
     >>> T.split isSpace
     >>> filter (not <<< T.null)
     >>> frequencies
     >>> M.toList
     >>> sortBy (comparing (Down <<< snd)) -- sort the opposite way
     >>> take n
     >>> print)
 when filep (hClose hand)</lang>

Output:

$ ./word_count 10 < ~/doc/les_miserables*
[("the",40368),("of",19863),("and",14470),("a",14277),("to",13587),("in",11019),("he",9212),("was",8346),("that",7251),("his",6414)]

Or, perhaps a little more simply: <lang haskell>import qualified Data.Text.IO as T import qualified Data.Text as T

import Data.List (group, sort, sortBy) import Data.Ord (comparing)

frequentWords :: T.Text -> [(Int, T.Text)] frequentWords =

 sortBy (flip $ comparing fst) .
 fmap ((,) . length <*> head) . group . sort . T.words . T.toLower

main :: IO () main = T.readFile "miserables.txt" >>= (mapM_ print . take 10 . frequentWords)</lang>

Output:

(40370,"the")
(19863,"of")
(14470,"and")
(14277,"a")
(13587,"to")
(11019,"in")
(9212,"he")
(8346,"was")
(7251,"that")
(6414,"his")

J

Text acquisition: store the entire text from the web page http://www.gutenberg.org/files/135/135-0.txt (the plain text UTF-8 link) into a file. This linux example uses ~/downloads/books/LesMis.txt .

Program: Reading from left to right, 10 {. "ten take" from an array computed by words to the right. \:~ "sort descending" by items of the array computed by whatever is to the right. (#;{.)/.~ "tally linked with item" key ;: "words" parses the argument to its right as a j sentence. tolower changes to a common case

Hence the remainder of the j sentence must clean after loading the file.

The parenthesized expression (a.-.Alpha_j_,' ') computes to a vector of the j alphabet excluding [a-zA-Z ] ((e.&(a.-.Alpha_j_,' '))`(,:&' '))} substitutes space character for the unwanted characters. 1!:1 reads the file named in the box <

   10{.\:~(#;{.)/.~;:tolower((e.&(a.-.Alpha_j_,' '))`(,:&' '))}1!:1<jpath'~/downloads/books/LesMis.txt'
┌─────┬────┐
│41093│the │
├─────┼────┤
│19954│of  │
├─────┼────┤
│14943│and │
├─────┼────┤
│14558│a   │
├─────┼────┤
│13953│to  │
├─────┼────┤
│11219│in  │
├─────┼────┤
│9649 │he  │
├─────┼────┤
│8622 │was │
├─────┼────┤
│7924 │that│
├─────┼────┤
│6661 │it  │
└─────┴────┘

Java

Translation of: Kotlin

<lang Java>import java.io.IOException; import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.regex.Matcher; import java.util.regex.Pattern; import java.util.stream.Collectors;

public class WordCount {

   public static void main(String[] args) throws IOException {
       Path path = Paths.get("135-0.txt");
       byte[] bytes = Files.readAllBytes(path);
       String text = new String(bytes);
       text = text.toLowerCase();

       Pattern r = Pattern.compile("\\p{javaLowerCase}+");
       Matcher matcher = r.matcher(text);
       Map<String, Integer> freq = new HashMap<>();
       while (matcher.find()) {
           String word = matcher.group();
           Integer current = freq.getOrDefault(word, 0);
           freq.put(word, current + 1);
       }

       List<Map.Entry<String, Integer>> entries = freq.entrySet()
           .stream()
           .sorted((i1, i2) -> Integer.compare(i2.getValue(), i1.getValue()))
           .limit(10)
           .collect(Collectors.toList());

       System.out.println("Rank  Word  Frequency");
       System.out.println("====  ====  =========");
       int rank = 1;
       for (Map.Entry<String, Integer> entry : entries) {
           String word = entry.getKey();
           Integer count = entry.getValue();
           System.out.printf("%2d    %-4s    %5d\n", rank++, word, count);
       }
   }

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

Julia

Works with: Julia version 1.0

<lang julia> using FreqTables

txt = read("les-mis.txt", String) words = split(replace(txt, r"\P{L}"i => " ")) table = sort(freqtable(words); rev=true) println(table[1:10])</lang>

Output:

Dim1   │
───────┼──────
"the"  │ 36671
"of"   │ 19618
"and"  │ 14081
"to"   │ 13541
"a"    │ 13529
"in"   │ 10265
"was"  │  8545
"that" │  7326
"he"   │  6816
"had"  │  6140

Kotlin

The author of the Raku entry has given a good account of the difficulties with this task and, in the absence of any clarification on the various issues, I've followed a similar 'literal' approach.

So, after first converting the text to lower case, I've assumed that a word is any sequence of one or more lower-case Unicode letters and obtained the same results as the Raku version.

There is no change in the results if the numerals 0-9 are also regarded as letters. <lang scala>// version 1.1.3

import java.io.File

fun main(args: Array<String>) {

   val text = File("135-0.txt").readText().toLowerCase()
   val r = Regex("""\p{javaLowerCase}+""")
   val matches = r.findAll(text)
   val wordGroups = matches.map { it.value }
                   .groupBy { it }
                   .map { Pair(it.key, it.value.size) }
                   .sortedByDescending { it.second }
                   .take(10)
   println("Rank  Word  Frequency")
   println("====  ====  =========")
   var rank = 1
   for ((word, freq) in wordGroups) 
       System.out.printf("%2d    %-4s    %5d\n", rank++, word, freq)

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the     41088
 2    of      19949
 3    and     14942
 4    a       14596
 5    to      13951
 6    in      11214
 7    he       9648
 8    was      8621
 9    that     7924
10    it       6661

Liberty BASIC

<lang lb>dim words$(100000,2)'words$(a,1)=the word, words$(a,2)=the count dim lines$(150000) open "135-0.txt" for input as #txt while EOF(#txt)=0 and total < 150000

   input #txt, lines$(total)
   total=total+1

wend for a = 1 to total

   token$ = "?"
   index=0
   new=0
   while token$ <> ""
       new=0
       index = index + 1
       token$ = lower$(word$(lines$(a),index))
       token$=replstr$(token$,".","")
       token$=replstr$(token$,",","")
       token$=replstr$(token$,";","")
       token$=replstr$(token$,"!","")
       token$=replstr$(token$,"?","")
       token$=replstr$(token$,"-","")
       token$=replstr$(token$,"_","")
       token$=replstr$(token$,"~","")
       token$=replstr$(token$,"+","")
       token$=replstr$(token$,"0","")
       token$=replstr$(token$,"1","")
       token$=replstr$(token$,"2","")
       token$=replstr$(token$,"3","")
       token$=replstr$(token$,"4","")
       token$=replstr$(token$,"5","")
       token$=replstr$(token$,"6","")
       token$=replstr$(token$,"7","")
       token$=replstr$(token$,"8","")
       token$=replstr$(token$,"9","")
       token$=replstr$(token$,"/","")
       token$=replstr$(token$,"<","")
       token$=replstr$(token$,">","")
       token$=replstr$(token$,":","")
       for b = 1 to newwordcount
           if words$(b,1)=token$ then
               num=val(words$(b,2))+1
               num$=str$(num)
               if len(num$)=1 then num$="0000"+num$
               if len(num$)=2 then num$="000"+num$
               if len(num$)=3 then num$="00"+num$
               if len(num$)=4 then num$="0"+num$
               words$(b,2)=num$
               new=1
               exit for
           end if
       next b
       if new<>1 then newwordcount=newwordcount+1:words$(newwordcount,1)=token$:words$(newwordcount,2)="00001":print newwordcount;" ";token$
   wend

next a print sort words$(), 1, newwordcount, 2 print "Count Word" print "===== =================" for a = newwordcount to newwordcount-10 step -1

   print words$(a,2);" ";words$(a,1)

next a print "-----------------------" print newwordcount;" unique words found." print "End of program" close #txt end </lang>

Output:

Count Word
===== =================
40292 the
19825 of
14703 and
14249 a
13594 to
122613
11061 in
09436 he
08579 was
07530 that
06428 his
-----------------------
29109 unique words found.

Lua

Works with: lua version 5.3

<lang lua> -- This program takes two command line arguments. The first (arg[1]) -- specifies the input file; if it is absent, then standard input is used. -- The second (arg[2]) refers to the number of results to show; if it is -- absent, default to the top 10 words.

local freq = {} for line in io.lines(arg[1]) do local lowerline = string.lower(line) for word in string.gmatch(lowerline, "%a+") do if not freq[word] then freq[word] = 1 else freq[word] = freq[word] + 1 end end end

local array = {} for word, count in pairs(freq) do table.insert(array, {word, count}) end

table.sort(array, function (a, b) return a[2] > b[2] end)

for i = 1, arg[2] or 10 do io.write(string.format('%7d %s\n', array[i][2] , array[i][1])) end </lang>

Output:

❯ ./wordcount.lua 135-0.txt
  41093 the
  19954 of
  14943 and
  14558 a
  13953 to
  11219 in
   9649 he
   8622 was
   7924 that
   6661 it

Nim

<lang Nim>import tables, strutils, sequtils, httpclient

proc take[T](s: openArray[T], n: int): seq[T] = s[0 ..< min(n, s.len)]

var client = newHttpClient() var text = client.getContent("https://www.gutenberg.org/files/135/135-0.txt")

var wordFrequencies = text.toLowerAscii.splitWhitespace.toCountTable wordFrequencies.sort for (word, count) in toSeq(wordFrequencies.pairs).take(10):

 echo alignLeft($count, 8), word</lang>

Output:

40372   the
19868   of
14472   and
14278   a
13589   to
11024   in
9213    he
8347    was
7250    that
6414    his

Objeck

<lang objeck>use System.IO.File; use Collection; use RegEx;

class Rosetta {

 function : Main(args : String[]) ~ Nil {
   if(args->Size() <> 1) {
     return;
   };
   
   input := FileReader->ReadFile(args[0]);
   filter := RegEx->New("\\w+");
   words := filter->Find(input);
   
   word_counts := StringMap->New();
   each(i : words) {
     word := words->Get(i)->As(String);
     if(word <> Nil & word->Size() > 0) {
       word := word->ToLower();
       if(word_counts->Has(word)) {
         count := word_counts->Find(word)->As(IntHolder);
         count->Set(count->Get() + 1);
       }
       else {
         word_counts->Insert(word, IntHolder->New(1));
       };
     };
   };  
   
   count_words := IntMap->New();
   words := word_counts->GetKeys();
   each(i : words) {
     word := words->Get(i)->As(String);
     count := word_counts->Find(word)->As(IntHolder);
     count_words->Insert(count->Get(), word);
   };
   
   counts := count_words->GetKeys();
   counts->Sort();
   
   index := 1;
   "Rank\tWord\tFrequency"->PrintLine();
   "====\t====\t===="->PrintLine();
   for(i := count_words->Size() - 1; i >= 0; i -= 1;) {
     if(count_words->Size() - 10 <= i) {
       count := counts->Get(i);
       word := count_words->Find(count)->As(String);
       "{$index}\t{$word}\t{$count}"->PrintLine();
       index += 1;
     };
   };
 }

}</lang>

Output:

Rank    Word    Frequency
====    ====    ====
1       the     41036
2       of      19946
3       and     14940
4       a       14589
5       to      13939
6       in      11204
7       he      9645
8       was     8619
9       that    7922
10      it      6659

OCaml

<lang ocaml>let () =

 let n =
   try int_of_string Sys.argv.(1)
   with _ -> 10
 in
 let ic = open_in "135-0.txt" in
 let h = Hashtbl.create 97 in
 let w = Str.regexp "[^A-Za-zéèàêâôîûœ]+" in
 try
   while true do
     let line = input_line ic in
     let words = Str.split w line in
     List.iter (fun word ->
       let word = String.lowercase_ascii word in
       match Hashtbl.find_opt h word with
       | None -> Hashtbl.add h word 1
       | Some x -> Hashtbl.replace h word (succ x)
     ) words
   done
 with End_of_file ->
   close_in ic;
   let l = Hashtbl.fold (fun word count acc -> (word, count)::acc) h [] in
   let s = List.sort (fun (_, c1) (_, c2) -> compare c2 c1) l in
   let r = List.init n (fun i -> List.nth s i) in
   List.iter (fun (word, count) ->
     Printf.printf "%d  %s\n" count word
   ) r</lang>

Output:

$ ocaml str.cma word_freq.ml 
41092  the
19954  of
14943  and
14554  a
13953  to
11219  in
9649  he
8622  was
7924  that
6661  it

Perl

Translation of: Raku

<lang perl>$top = 10;

open $fh, "<", '135-0.txt'; ($text = join , <$fh>) =~ tr/A-Z/a-z/;

@matcher = (

   qr/[a-z]+/,     # simple 7-bit ASCII
   qr/\w+/,        # word characters with underscore
   qr/[a-z0-9]+/,  # word characters without underscore

);

for $reg (@matcher) {

   print "\nTop $top using regex: " . $reg . "\n";
   @matches = $text =~ /$reg/g;
   my %words;
   for $w (@matches) { $words{$w}++ };
   $c = 0;
   for $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
       printf "%-7s %6d\n", $w, $words{$w};
       last if ++$c >= $top;
   }

}</lang>

Output:

Top 10 using regex: (?^:[a-z]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661

Top 10 using regex: (?^:\w+)
the      41036
of       19946
and      14940
a        14589
to       13939
in       11204
he        9645
was       8619
that      7922
it        6659

Top 10 using regex: (?^:[a-z0-9]+)
the      41089
of       19949
and      14942
a        14608
to       13951
in       11214
he        9648
was       8621
that      7924
it        6661

Phix

<lang Phix>?"loading..." constant subs = "\t\r\n_.,\"\'!;:?][()|=<>#/*{}+@%&$",

        reps = repeat(' ',length(subs)),
        fn = open("135-0.txt","r")

string text = lower(substitute_all(get_text(fn),subs,reps)) close(fn) sequence words = append(sort(split(text,no_empty:=true)),"") constant wf = new_dict() string last = words[1] integer count = 1 for i=2 to length(words) do

   if words[i]!=last then
       setd({count,last},0,wf)
       count = 0
       last = words[i]
   end if
   count += 1

end for count = 10 function visitor(object key, object /*data*/, object /*user_data*/)

   ?key
   count -= 1
   return count>0

end function traverse_dict(routine_id("visitor"),0,wf,true)</lang>

Output:

loading...
{40743,"the"}
{19925,"of"}
{14881,"and"}
{14474,"a"}
{13704,"to"}
{11174,"in"}
{9623,"he"}
{8613,"was"}
{7867,"that"}
{6612,"it"}

PHP

<lang php> <?php

preg_match_all('/\w+/', file_get_contents($argv[1]), $words); $frecuency = array_count_values($words[0]); arsort($frecuency);

echo "Rank\tWord\tFrequency\n====\t====\t=========\n"; $i = 1; foreach ($frecuency as $word => $count) {

   echo $i . "\t" . $word . "\t" . $count . "\n";
   if ($i >= 10) {
       break;
   }
   $i++;

}</lang>

Output:

Rank  Word  Frequency
====  ====  =========
 1    the   36636
 2     of   19615
 3    and   14079
 4     to   13535
 5      a   13527
 6     in   10256
 7    was    8543
 8   that    7324
 9     he    6814
10    had    6139

PicoLisp

<lang PicoLisp>(setq *Delim " ^I^J^M-_.,\"'*[]?!&@#$%^:;") (setq *Skip (chop *Delim))

(de word+ NIL

  (prog1
     (lowc (till *Delim T))
     (while (member (peek) *Skip) (char)) ) )

(off B) (in "135-0.txt"

  (until (eof)
     (let W (word+)
        (if (idx 'B W T) (inc (car @)) (set W 1)) ) ) )

(for L (head 10 (flip (by val sort (idx 'B))))

  (println L (val L)) )</lang>

Output:

"the" 41088
"of" 19949
"and" 14942
"a" 14545
"to" 13950
"in" 11214
"he" 9647
"was" 8620
"that" 7924
"it" 6661

Prolog

Works with: SWI Prolog

<lang prolog>print_top_words(File, N):-

   read_file_to_string(File, String, [encoding(utf8)]),
   re_split("\\w+", String, Words),
   lower_case(Words, Lower),
   sort(1, @=<, Lower, Sorted),
   merge_words(Sorted, Counted),
   sort(2, @>, Counted, Top_words),
   writef("Top %w words:\nRank\tCount\tWord\n", [N]),
   print_top_words(Top_words, N, 1).

lower_case([_], []):-!. lower_case([_, Word|Words], [Lower - 1|Rest]):-

   string_lower(Word, Lower),
   lower_case(Words, Rest).

merge_words([], []):-!. merge_words([Word - C1, Word - C2|Words], Result):-

   !,
   C is C1 + C2,
   merge_words([Word - C|Words], Result).

merge_words([W|Words], [W|Rest]):-

   merge_words(Words, Rest).

print_top_words([], _, _):-!. print_top_words(_, 0, _):-!. print_top_words([Word - Count|Rest], N, R):-

   writef("%w\t%w\t%w\n", [R, Count, Word]),
   N1 is N - 1,
   R1 is R + 1,
   print_top_words(Rest, N1, R1).

main:-

   print_top_words("135-0.txt", 15).</lang>

Output:

Top 15 words:
Rank	Count	Word
1	41040	the
2	19951	of
3	14942	and
4	14539	a
5	13941	to
6	11209	in
7	9646	he
8	8620	was
9	7922	that
10	6659	it
11	6469	his
12	6195	is
13	6181	had
14	5149	which
15	4530	with

Python

Collections

Python2.7

<lang python>import collections import re import string import sys

def main():

 counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
 print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":

 main()</lang>

Output:

$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939),
 ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

Python3.6

<lang python>from collections import Counter from re import findall

les_mis_file = 'les_mis_135-0.txt'

def _count_words(fname):

   with open(fname) as f:
       text = f.read()
   words = findall(r'\w+', text.lower())
   return Counter(words)

def most_common_words_in_file(fname, n):

   counts = _count_words(fname)
   for word, count in 'WORD', 'COUNT' + counts.most_common(n):
       print(f'{word:>10} {count:>6}')

if __name__ == "__main__":

   n = int(input('How many?: '))
   most_common_words_in_file(les_mis_file, n)</lang>

Output:

How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659

Sorted and groupby

Works with: Python version 3.7

<lang python>""" Word count task from Rosetta Code http://www.rosettacode.org/wiki/Word_count#Python """ from itertools import (groupby,

                      starmap)

from operator import itemgetter from pathlib import Path from typing import (Iterable,

                   List,
                   Tuple)

FILEPATH = Path('lesMiserables.txt') COUNT = 10

def main():

   words_and_counts = most_frequent_words(FILEPATH)
   print(*words_and_counts[:COUNT], sep='\n')

def most_frequent_words(filepath: Path,

                       *,
                       encoding: str = 'utf-8') -> List[Tuple[str, int]]:
   """
   A list of word-frequency pairs sorted by their occurrences.
   The words are read from the given file.
   """
   def word_and_frequency(word: str,
                          words_group: Iterable[str]) -> Tuple[str, int]:
       return word, capacity(words_group)

   file_contents = filepath.read_text(encoding=encoding)
   words = file_contents.lower().split()
   grouped_words = groupby(sorted(words))
   words_and_frequencies = starmap(word_and_frequency, grouped_words)
   return sorted(words_and_frequencies, key=itemgetter(1), reverse=True)

def capacity(iterable: Iterable) -> int:

   """Returns a number of elements in an iterable"""
   return sum(1 for _ in iterable)

if __name__ == '__main__':

   main()

</lang>

Output:

('the', 40372)
('of', 19868)
('and', 14472)
('a', 14278)
('to', 13589)
('in', 11024)
('he', 9213)
('was', 8347)
('that', 7250)
('his', 6414)

R

I chose to remove apostrophes only if they're followed by an s (so "mom" and "mom's" will show up as the same word but "they" and "they're" won't). I also chose not to remove hyphens. <lang R> wordcount<-function(file,n){

 punctuation=c("`","~","!","@","#","$","%","^","&","*","(",")","_","+","=","{","[","}","]","|","\\",":",";","\"","<",",",">",".","?","/","'s")
 wordlist=scan(file,what=character())
 wordlist=tolower(wordlist)
 for(i in 1:length(punctuation)){
   wordlist=gsub(punctuation[i],"",wordlist,fixed=T)
 }
 df=data.frame("Word"=sort(unique(wordlist)),"Count"=rep(0,length(unique(wordlist))))
 for(i in 1:length(unique(wordlist))){
   df[i,2]=length(which(wordlist==df[i,1]))
 }
 df=df[order(df[,2],decreasing = T),]
 row.names(df)=1:nrow(df)
 return(df[1:n,])

} </lang>

Output:

> wordcount("MobyDick.txt",10)
Read 212793 items
   Word Count
1   the 14346
2    of  6590
3   and  6340
4     a  4611
5    to  4572
6    in  4130
7  that  2903
8   his  2516
9    it  2308
10    i  1845

Racket

<lang racket>#lang racket

(define (all-words f (case-fold string-downcase))

 (map case-fold (regexp-match* #px"\\w+" (file->string f))))

(define (l.|l| l) (cons (car l) (length l)))

(define (counts l (>? >)) (sort (map l.|l| (group-by values l)) >? #:key cdr))

(module+ main

 (take (counts (all-words "data/les-mis.txt")) 10))</lang>

Output:

'(("the" . 41036)
  ("of" . 19946)
  ("and" . 14940)
  ("a" . 14589)
  ("to" . 13939)
  ("in" . 11204)
  ("he" . 9645)
  ("was" . 8619)
  ("that" . 7922)
  ("it" . 6659))

Raku

(formerly Perl 6)

Works with: Rakudo version 2017.07

Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.

This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is full of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.

We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various accented glyphs. Those are letters, so words will be incorrectly split up; (Misérables will be counted as 'mis' and 'rables', probably not what we want.)

Actually, in this case /A-Za-z/ returns very nearly the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text, gets incorrectly split into Al & the, and incorrectly reports 41089 occurrences of "the". The text has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça" so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /A-Za-z/, but it is mostly by accident.

A more accurate regex matcher would be some kind of Unicode aware /\w/ minus underscore. It may also be useful, depending on your requirements, to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.

Here is a sample that shows the result when using various different matchers. <lang perl6>sub MAIN ($filename, $top = 10) {

   my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
   my @matcher = (
       rx/ <[a..z]>+ /,    # simple 7-bit ASCII
       rx/ \w+ /,          # word characters with underscore
       rx/ <[\w]-[_]>+ /,  # word characters without underscore
       rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /   # word characters without underscore but with hyphens and contractions
   );
   for @matcher -> $reg {
       say "\nTop $top using regex: ", $reg.perl;
       .put for $file.comb( $reg ).Bag.sort(-*.value)[^$top];
   }

}</lang>

Output:

Passing in the file name and 10:

Top 10 using regex: rx/ <[a..z]>+ /
the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ \w+ /
the	41035
of	19946
and	14940
a	14577
to	13939
in	11204
he	9645
was	8619
that	7922
it	6659

Top 10 using regex: rx/ <[\w]-[_]>+ /
the	41088
of	19949
and	14942
a	14596
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the	41081
of	19930
and	14934
a	14587
to	13735
in	11204
he	9607
was	8620
that	7825
it	6535

REXX

version 1

This REXX version doesn't need to sort the list of words.

Currently, this version recognizes all the accented (non-Latin) accented letters that are present in the text (file) that is specified to be used (and some other non-Latin letters as well). This means that the word Alèthe is treated as one word, not as two words Al the (and not thereby adding two words).

This version also supports words that contain embedded apostrophes ( ' ) [that is, within a word, but not those words that start or end with an apostrophe; for those words, the apostrophe is elided].

Thus, it's is counted separately from it or its.

Since REXX doesn't support UTF-8 encodings, code was added to this REXX version to support the accented letters in the mandated input file. <lang rexx>/*REXX pgm displays top 10 words in a file (includes foreign letters), case is ignored.*/ parse arg fID top . /*obtain optional arguments from the CL*/ if fID== | fID=="," then fID= 'les_mes.TXT' /*None specified? Then use the default.*/ if top== | top=="," then top= 10 /* " " " " " " */ @.=0; c=0; abcL="abcdefghijklmnopqrstuvwxyz'" /*initialize word list, count; alphabet*/ q= "'"; abcU= abcL; upper abcU /*define uppercase version of alphabet*/ totW=0; accL= 'üéâÄàÅÇêëèïîìéæôÖòûùÿáíóúÑ' /* " " of some accented chrs*/

             accU= 'ÜéâäàåçêëèïîìÉÆôöòûùÿáíóúñ' /*   "   lowercase accented characters.*/
             accG= 'αßΓπΣσµτΦΘΩδφε'             /*   "   some upper/lower Greek letters*/

a=abcL || abcL ||accL ||accL || accG /* " char string of after letters.*/ b=abcL || abcU ||accL ||accU || accG || xrange() /* " char string of before " */ x= 'Çà åÅ çÇ êÉ ëÉ áà óâ ªæ ºç ¿è ⌐é ¬ê ½ë «î »ï ▒ñ ┤ô ╣ù ╗û ╝ü' /*list of 16-bit chars.*/ xs= words(x) /*num. " " " */ !.= /*define the original word instances. */

    do #=0  while lines(fID)\==0; $=linein(fID) /*loop whilst there are lines in file. */
    if pos('├', $)\==0  then do k=1  for xs;     _=word(x, k)      /*any 16-bit chars? */
                             $=changestr('├'left(_, 1), $, right(_, 1) )     /*convert.*/
                             end   /*k*/
    $=translate( $, a, b)                       /*remove superfluous blanks in the line*/
        do while $\=;      parse var  $  z  $ /*now, process each word in the $ list.*/
        parse var  z     z1  2  zr    -1  zL  /*extract:  first, middle, & last char.*/
        if z1==q  then do; z=zr; if z== then iterate; end  /*starts with apostrophe? */
        if zL==q  then z=strip(left(z, length(z) - 1))       /*ends     "       "      */
        if z==  then iterate                               /*if Z is now null,  skip.*/
        if @.z==0  then do;  c=c+1; !.c=z; end  /*bump word count; assign word to array*/
        totW=totW + 1;       @.z=@.z + 1        /*bump total words & count of the word.*/
        end   /*while*/
    end       /*#*/

say commas(totW) ' words found ('commas(c) "unique) in " commas(#),

                    ' records read from file: '     fID;                              say

say right('word', 40) " " center(' rank ', 6) " count " /*display title for output*/ say right('════', 40) " " center('══════', 6) " ═══════" /* " title separator.*/ tops=1

    do  until otops==tops | tops>top            /*process enough words to satisfy  TOP.*/
    WL=;         mk=0;     otops=tops           /*initialize the word list (to a NULL).*/
         do n=1  for c;    z=!.n;      k=@.z    /*process the list of words in the file*/
         if k==mk  then WL=WL z                 /*handle cases of tied number of words.*/
         if k> mk  then do;  mk=k;  WL=z;  end  /*this word count is the current max.  */
         end   /*n*/
    wr=max( length(' rank '), length(top) )     /*find the maximum length of the rank #*/
         do d=1  for words(WL);  _=word(WL, d)  /*process all words in the  word list. */
         if d==1  then w=max(10, length(@._) )  /*use length of the first number used. */
         say right(@._, 40)          right(commas(tops), wr)        right(commas(@._), w)
         @._= -1                                /*nullify word count for next go around*/
         end   /*d*/                            /* [↑]  this allows a non-sorted list. */
    tops=tops + words(WL)                       /*correctly handle any  tied  rankings.*/
    end        /*until*/

exit /*stick a fork in it, we're all done. */ /*──────────────────────────────────────────────────────────────────────────────────────*/ commas: procedure; parse arg _; n=_'.9'; #=123456789; b=verify(n, #, "M")

       e=verify(n, #'0', , verify(n, #"0.", 'M') ) - 4
                do j=e  to b  by -3;   _=insert(',', _, j);   end  /*j*/;        return _</lang>

output when using the default inputs:

574,122  words found  (23,414 unique)  in  67,663  records read from file:  les_mes.TXT

                                    word    rank    count
                                    ════   ══════  ═══════
                                     the      1     41,088
                                      of      2     19,949
                                     and      3     14,942
                                       a      4     14,595
                                      to      5     13,950
                                      in      6     11,214
                                      he      7      9,607
                                     was      8      8,620
                                    that      9      7,826
                                      it     10      6,535

To see a list of the top 1,000 words that show (among other things) words like it's and other accented words, see the discussion page.

version 2

Inspired by version 1 and adapted for ooRexx. It ignores all characters other than a-z and A-Z (which are translated to a-z). <lang>/*REXX program reads and displays a count of words a file. Word case is ignored.*/ Call time 'R' abc='abcdefghijklmnopqrstuvwxyz' abcABC=abc||translate(abc) parse arg fID_top /*obtain optional arguments from the CL*/ Parse Var fid_top fid ',' top if fID== then fID= 'mis.TXT' /* Use default if not specified */ if top== then top= 10 /* Use default if not specified */ occ.=0 /* occurrences of word (stem) in file */ wn=0 Do While lines(fid)>0 /*loop whilst there are lines in file. */

 line=linein(fID)
 line=translate(line,abc||abc,abcABC||xrange('00'x,'ff'x)) /*use only lowercase letters*/
 Do While line<>
   Parse Var line word line                       /* take a word                         */
   If occ.word=0 Then Do                          /* not yet in word list                */
     wn=wn+1
     word.wn=word
     End
   occ.word=occ.word+1
   End
 End

Say 'We found' wn 'different words' say right('word',40) ' rank count ' /* header */ say right('----',40) '------ -------' /* separator. */ tops=0 Do Until tops>=top | tops>=wn /*process enough words to satisfy TOP.*/

 max_occ=0
 tl=                                          /*initialize (possibly) a list of words*/
 Do wi=1 To wn                                  /*process the list of words in the file*/
   word=word.wi                                 /* take a word from the list           */
   Select
     When occ.word>max_occ Then Do              /* most occurrences so far             */
       tl=word                                  /* candidate for output                */
       max_occ=occ.word                         /* current maximum occurrences         */
       End
     When occ.word=max_occ Then Do              /* tied                                */
       tl=tl word                               /* add to output candidate             */
       End
     Otherwise                                  /* no candidate (yet)                  */
       Nop
     End
   End
   do d=1 for words(tl)
     word=word(tl,d)
     say right(word,40) right(tops+1,4) right(occ.word,8)
     occ.word=0                                /*nullify this word count for next time*/
     End
   tops=tops+words(tl)                         /*correctly handle the tied rankings.  */
 end

Say time('E') 'seconds elapsed'</lang>

Output:

We found 22820 different words
                                    word  rank   count
                                    ---- ------ -------
                                     the    1    41089
                                      of    2    19949
                                     and    3    14942
                                       a    4    14608
                                      to    5    13951
                                      in    6    11214
                                      he    7     9648
                                     was    8     8621
                                    that    9     7924
                                      it   10     6661
1.750000 seconds elapsed

Ring

project : Word count

fp = fopen("Miserables.txt","r") str = fread(fp, getFileSize(fp)) fclose(fp)

mis =substr(str, " ", nl) mis = lower(mis) mis = str2list(mis) count = list(len(mis)) ready = [] for n = 1 to len(mis)

    flag = 0
    for m = 1 to len(mis)
          if mis[n] = mis[m] and n != m
             for p = 1 to len(ready)
                   if m = ready[p]
                      flag = 1
                   ok
             next
             if flag = 0
                count[n] = count[n] + 1                 
             ok
          ok
    next
    if flag = 0
       add(ready, n)
    ok

next for n = 1 to len(count)

    for m = n + 1 to len(count)
         if count[m] > count[n]
            temp = count[n]
            count[n] = count[m]
            count[m] = temp
            temp = mis[n]
            mis[n] = mis[m]
            mis[m] = temp
         ok
    next

next for n = 1 to 10

    see mis[n] + " " + (count[n] + 1) + nl

       c_filestart = 0
       c_fileend = 2
       fseek(fp,0,c_fileend)
       nfilesize = ftell(fp)
       fseek(fp,0,c_filestart)
       return nfilesize

func swap(a, b)

       temp = a
       a = b
       b = temp
       return [a, b]

</lang> Output:

the	41089
of	19949
and	14942
a	14608
to	13951
in	11214
he	9648
was	8621
that	7924
it	6661

Ruby

<lang ruby> class String

 def wc
 n = Hash.new(0)
 downcase.scan(/[A-Za-zÀ-ÿ]+/) { |g| n[g] += 1 }
 n.sort{|n,g| n[1]<=>g[1]}
 end

end

open('135-0.txt') { |n| n.read.wc[-10,10].each{|n| puts n[0].to_s+"->"+n[1].to_s} } </lang>

Output:

it->6661
that->7924
was->8621
he->9648
in->11214
to->13951
a->14596
and->14942
of->19949
the->41088

Tally and max_by

Works with: Ruby version 2.7

<lang ruby>RE = /alpha:+/ count = open("135-0.txt").read.downcase.scan(RE).tally.max_by(10, &:last) count.each{|ar| puts ar.join("->") } </lang>

Output:

the->41092
of->19954
and->14943
a->14546
to->13953
in->11219
he->9649
was->8622
that->7924
it->6661

Rust

<lang Rust>use std::cmp::Reverse; use std::collections::HashMap; use std::fs::File; use std::io::{BufRead, BufReader};

extern crate regex; use regex::Regex;

fn word_count(file: File, n: usize) {

   let word_regex = Regex::new("(?i)[a-z']+").unwrap();

   let mut words = HashMap::new();
   for line in BufReader::new(file).lines() {
       word_regex
           .find_iter(&line.expect("Read error"))
           .map(|m| m.as_str())
           .for_each(|word| {
               *words.entry(word.to_lowercase()).or_insert(0) += 1;
           });
   }

   let mut words: Vec<_> = words.iter().collect();
   words.sort_unstable_by_key(|&(word, count)| (Reverse(count), word));

   for (word, count) in words.iter().take(n) {
       println!("{:8} {:>8}", word, count);
   }

}

fn main() {

   word_count(File::open("135-0.txt").expect("File open error"), 10)

}</lang>

Output:

the         41083
of          19948
and         14941
a           14604
to          13951
in          11212
he           9604
was          8621
that         7824
it           6534

Scala

Featuring online remote file as input

Output:

Best seen running in your browser Scastie (remote JVM). <lang Scala>import scala.io.Source

object WordCount extends App {

 val url = "http://www.gutenberg.org/files/135/135-0.txt"
 val header = "Rank Word  Frequency\n==== ======== ======"

 def wordCnt =
   Source.fromURL(url).getLines()
     .filter(_.nonEmpty)
     .flatMap(_.split("""\W+""")).toSeq
     .groupBy(_.toLowerCase())
     .mapValues(_.size).toSeq
     .sortWith { case ((_, v0), (_, v1)) => v0 > v1 }
     .take(10).zipWithIndex

 println(header)
 wordCnt.foreach {
   case ((word, count), rank) => println(f"${rank + 1}%4d $word%-8s $count%6d")
 }

 println(s"\nSuccessfully completed without errors. [total ${scala.compat.Platform.currentTime - executionStart} ms]")

}</lang>

Output:

Rank Word  Frequency
==== ======== ======
   1 the       41036
   2 of        19946
   3 and       14940
   4 a         14589
   5 to        13939
   6 in        11204
   7 he         9645
   8 was        8619
   9 that       7922
  10 it         6659

Successfully completed without errors. [total 4528 ms]

Seed7

The Seed7 program uses the function getHttp, to get the file 135-0.txt directly from Gutemberg. The library scanfile.s7i provides getSimpleSymbol, to get words from a fle. The words are converted to lower case, to assure that "The" and "the" are considered the same.

<lang seed7>$ include "seed7_05.s7i";

 include "gethttp.s7i";
 include "strifile.s7i";
 include "scanfile.s7i";
 include "chartype.s7i";
 include "console.s7i";

const type: wordHash is hash [string] integer; const type: countHash is hash [integer] array string;

const proc: main is func

 local
   var file: inFile is STD_NULL;
   var string: aWord is "";
   var wordHash: numberOfWords is wordHash.EMPTY_HASH;
   var countHash: countWords is countHash.EMPTY_HASH;
   var array integer: countKeys is 0 times 0;
   var integer: index is 0;
   var integer: number is 0;
 begin
   OUT := STD_CONSOLE;
   inFile := openStrifile(getHttp("www.gutenberg.org/files/135/135-0.txt"));
   while hasNext(inFile) do
     aWord := lower(getSimpleSymbol(inFile));
     if aWord <> "" and aWord[1] in letter_char then
       if aWord in numberOfWords then
         incr(numberOfWords[aWord]);
       else
         numberOfWords @:= [aWord] 1;
       end if;
     end if;
   end while;
   countWords := flip(numberOfWords);
   countKeys := sort(keys(countWords));
   writeln("Word    Frequency");
   for index range length(countKeys) downto length(countKeys) - 9 do
     number := countKeys[index];
     for aWord range sort(countWords[number]) do
       writeln(aWord rpad 8 <& number);
     end for;
   end for;
 end func;</lang>

Output:

Word    Frequency
the     41036
of      19946
and     14940
a       14589
to      13939
in      11204
he      9645
was     8619
that    7922
it      6659

Sidef

<lang ruby>var count = Hash() var file = File(ARGV[0] \\ '135-0.txt')

file.open_r.each { |line|

   line.lc.scan(/[\pL]+/).each { |word|
       count{word} := 0 ++
   }

}

var top = count.sort_by {|_,v| v }.last(10).flip

top.each { |pair|

   say "#{pair.key}\t-> #{pair.value}"

}</lang>

Output:

the	-> 41088
of	-> 19949
and	-> 14942
a	-> 14596
to	-> 13951
in	-> 11214
he	-> 9648
was	-> 8621
that	-> 7924
it	-> 6661

Simula

<lang simula>COMMENT COMPILE WITH $ cim -m64 word-count.sim

BEGIN

   COMMENT ----- CLASSES FOR GENERAL USE ;

   ! ABSTRACT HASH KEY TYPE ;
   CLASS HASHKEY;
   VIRTUAL:
       PROCEDURE HASH IS
           INTEGER PROCEDURE HASH;;
       PROCEDURE EQUALTO IS
           BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;;
   BEGIN
   END HASHKEY;

   ! ABSTRACT HASH VALUE TYPE ;
   CLASS HASHVAL;
   BEGIN
       ! THERE IS NOTHING REQUIRED FOR THE VALUE TYPE ;
   END HASHVAL;

   CLASS HASHMAP;
   BEGIN
       CLASS INNERHASHMAP(N); INTEGER N;
       BEGIN

           INTEGER PROCEDURE INDEX(K); REF(HASHKEY) K;
           BEGIN
               INTEGER I;
               IF K == NONE THEN
                   ERROR("HASHMAP.INDEX: NONE IS NOT A VALID KEY");
               I := MOD(K.HASH,N);
           LOOP:
               IF KEYTABLE(I) == NONE OR ELSE KEYTABLE(I).EQUALTO(K) THEN
                   INDEX := I
               ELSE BEGIN
                   I := IF I+1 = N THEN 0 ELSE I+1;
                   GO TO LOOP;
               END;
           END INDEX;

           ! PUT SOMETHING IN ;
           PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
           BEGIN
               INTEGER I;
               IF V == NONE THEN
                   ERROR("HASHMAP.PUT: NONE IS NOT A VALID VALUE");
               I := INDEX(K);
               IF KEYTABLE(I) == NONE THEN BEGIN
                   IF SIZE = N THEN
                       ERROR("HASHMAP.PUT: TABLE FILLED COMPLETELY");
                   KEYTABLE(I) :- K;
                   VALTABLE(I) :- V;
                   SIZE := SIZE+1;
               END ELSE
                   VALTABLE(I) :- V;
           END PUT;

           ! GET SOMETHING OUT ;
           REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
           BEGIN
               INTEGER I;
               IF K == NONE THEN
                   ERROR("HASHMAP.GET: NONE IS NOT A VALID KEY");
               I := INDEX(K);
               IF KEYTABLE(I) == NONE THEN
                   GET :- NONE ! ERROR("HASHMAP.GET: KEY NOT FOUND");
               ELSE
                   GET :- VALTABLE(I);
           END GET;

           PROCEDURE CLEAR;
           BEGIN
               INTEGER I;
               FOR I := 0 STEP 1 UNTIL N-1 DO BEGIN
                   KEYTABLE(I) :- NONE;
                   VALTABLE(I) :- NONE;
               END;
               SIZE := 0;
           END CLEAR;

           ! DATA MEMBERS OF CLASS HASHMAP ;
           REF(HASHKEY) ARRAY KEYTABLE(0:N-1);
           REF(HASHVAL) ARRAY VALTABLE(0:N-1);
           INTEGER SIZE;

       END INNERHASHMAP;

       PROCEDURE PUT(K,V); REF(HASHKEY) K; REF(HASHVAL) V;
       BEGIN
           IF IMAP.SIZE >= 0.75 * IMAP.N THEN
           BEGIN
               COMMENT RESIZE HASHMAP ;
               REF(INNERHASHMAP) NEWIMAP;
               REF(ITERATOR) IT;
               NEWIMAP :- NEW INNERHASHMAP(2 * IMAP.N);
               IT :- NEW ITERATOR(THIS HASHMAP);
               WHILE IT.MORE DO
               BEGIN
                   REF(HASHKEY) KEY;
                   KEY :- IT.NEXT;
                   NEWIMAP.PUT(KEY, IMAP.GET(KEY));
               END;
               IMAP.CLEAR;
               IMAP :- NEWIMAP;
           END;
           IMAP.PUT(K, V);
       END;

       REF(HASHVAL) PROCEDURE GET(K); REF(HASHKEY) K;
           GET :- IMAP.GET(K);

       PROCEDURE CLEAR;
           IMAP.CLEAR;

       INTEGER PROCEDURE SIZE;
           SIZE := IMAP.SIZE;

       REF(INNERHASHMAP) IMAP;

       IMAP :- NEW INNERHASHMAP(16);
   END HASHMAP;

   CLASS ITERATOR(H); REF(HASHMAP) H;
   BEGIN
       INTEGER POS,KEYCOUNT;

       BOOLEAN PROCEDURE MORE;
           MORE := KEYCOUNT < H.SIZE;

       REF(HASHKEY) PROCEDURE NEXT;
       BEGIN
           INSPECT H DO
           INSPECT IMAP DO
           BEGIN
               WHILE KEYTABLE(POS) == NONE DO
                   POS := POS+1;
               NEXT :- KEYTABLE(POS);
               KEYCOUNT := KEYCOUNT+1;
               POS := POS+1;
           END;
       END NEXT;

   END ITERATOR;

   COMMENT ----- PROBLEM SPECIFIC CLASSES ;

   HASHKEY CLASS TEXTHASHKEY(T); VALUE T; TEXT T;
   BEGIN
       INTEGER PROCEDURE HASH;
       BEGIN
           INTEGER I;
           T.SETPOS(1);
           WHILE T.MORE DO
               I := 31*I+RANK(T.GETCHAR);
           HASH := I;
       END HASH;
       BOOLEAN PROCEDURE EQUALTO(K); REF(HASHKEY) K;
           EQUALTO := T = K QUA TEXTHASHKEY.T;
   END TEXTHASHKEY;

   HASHVAL CLASS COUNTER;
   BEGIN
       INTEGER COUNT;
   END COUNTER;

   REF(INFILE) INF;
   REF(HASHMAP) MAP;
   REF(TEXTHASHKEY) KEY;
   REF(COUNTER) VAL;
   REF(ITERATOR) IT;
   TEXT LINE, WORD;
   INTEGER I, J, MAXCOUNT, LINENO;
   INTEGER ARRAY MAXCOUNTS(1:10);
   REF(TEXTHASHKEY) ARRAY MAXWORDS(1:10);

   WORD :- BLANKS(1000);
   MAP :- NEW HASHMAP;
 
   COMMENT MAP WORDS TO COUNTERS ;

   INF :- NEW INFILE("135-0.txt");
   INF.OPEN(BLANKS(4096));
   WHILE NOT INF.LASTITEM DO
   BEGIN
       BOOLEAN INWORD;

       PROCEDURE SAVE;
       BEGIN
           IF WORD.POS > 1 THEN
           BEGIN
               KEY :- NEW TEXTHASHKEY(WORD.SUB(1, WORD.POS - 1));
               VAL :- MAP.GET(KEY);
               IF VAL == NONE THEN
               BEGIN
                   VAL :- NEW COUNTER;
                   MAP.PUT(KEY, VAL);
               END;
               VAL.COUNT := VAL.COUNT + 1;
               WORD := " ";
               WORD.SETPOS(1);
           END;
       END SAVE;

       LINENO := LINENO + 1;
       LINE :- COPY(INF.IMAGE).STRIP; INF.INIMAGE;

       COMMENT SEARCH WORDS IN LINE ;
       COMMENT A WORD IS ANY SEQUENCE OF LETTERS ;

       INWORD := FALSE;
       LINE.SETPOS(1);
       WHILE LINE.MORE DO
       BEGIN
           CHARACTER CH;
           CH := LINE.GETCHAR;
           IF CH >= 'a' AND CH <= 'z' THEN
               CH := CHAR(RANK(CH) - RANK('a') + RANK('A'));
           IF CH >= 'A' AND CH <= 'Z' THEN
           BEGIN
               IF NOT INWORD THEN
               BEGIN
                   SAVE;
                   INWORD := TRUE;
               END;
               WORD.PUTCHAR(CH);
           END ELSE
           BEGIN
               IF INWORD THEN
               BEGIN
                   SAVE;
                   INWORD := FALSE;
               END;
           END;
       END;
       SAVE; COMMENT LAST WORD ;
   END;
   INF.CLOSE;

   COMMENT FIND 10 MOST COMMON WORDS ;

   IT :- NEW ITERATOR(MAP);
   WHILE IT.MORE DO
   BEGIN
       KEY :- IT.NEXT;
       VAL :- MAP.GET(KEY);
       FOR I := 1 STEP 1 UNTIL 10 DO
       BEGIN
           IF VAL.COUNT >= MAXCOUNTS(I) THEN
           BEGIN
               FOR J := 10 STEP -1 UNTIL I + 1 DO
               BEGIN
                   MAXCOUNTS(J) := MAXCOUNTS(J - 1);
                   MAXWORDS(J) :- MAXWORDS(J - 1);
               END;
               MAXCOUNTS(I) := VAL.COUNT;
               MAXWORDS(I) :- KEY;
               GO TO BREAK;
           END;
       END;
   BREAK:
   END;

   COMMENT OUTPUT 10 MOST COMMON WORDS ;

   FOR I := 1 STEP 1 UNTIL 10 DO
   BEGIN
       IF MAXWORDS(I) =/= NONE THEN
       BEGIN
           OUTINT(MAXCOUNTS(I), 10);
           OUTTEXT(" ");
           OUTTEXT(MAXWORDS(I) QUA TEXTHASHKEY.T);
           OUTIMAGE;
       END;
   END;

END </lang>

Output:

     41089 THE
     19949 OF
     14942 AND
     14608 A
     13951 TO
     11214 IN
      9648 HE
      8621 WAS
      7924 THAT
      6661 IT

6 garbage collection(s) in 0.2 seconds.

TMG

progrm: readn/error

       table(freq) table(chain) [firstword = ~0]

loop: not(!<<>>) output

   |   [j=777] batch/loop loop;                   /* Main loop */

/* To use less stack, divide input into batches. */ /* (Avoid interpreting entire input as a single "sentence".) */ batch: [j<=0?] succ

    |  word/skip [j--] skip batch;

skip: string(other); not: params(1) (any($1) fail | ()); readn: string(!<<0123456789>>) readint(n) skip; error: diag(( ={ <ERROR: input must start with a number> * } ));

/* Process a word */ word: smark any(letter) string(letter) scopy

       locate/new
       [freq[k]++] newmax;

locate: find(freq, k); new: enter(freq, k)

       [freq[k] = 1] newmax
       [firstword = firstword==~0 ? k : firstword]
       enter(chain, i) [chain[i]=prevword] [prevword=k];

newmax: [max = max<freq[k] ? freq[k] : max];

/* Output logic */ output: [next=max] outmax: [max=next] [next=0] [max>0?] [j = prevword] cycle/outmax; cycle: [i = j] [k = freq[i]] [n>0?]

       ( [max==freq[i]?] parse(wn)
       | [(freq[i]<max) & (next<freq[i])?] [next = freq[i]]
       | ())
       [i != firstword?] [j = chain[i]] cycle;

wn: getnam(freq, i) [k = freq[i]] decimal(k) [n--]

       = { 2 < > 1 * };

/* Reads decimal integer */ readint: proc(n;i) ignore(<<>>) [n=0] inta int1: [n = n*12+i] inta\int1; inta: char(i) [i<72?] [(i =- 60)>=0?];

/* Variables */ prevword: 0; /* Head of the linked list */ firstword: 0; /* First word's index to know where to stop output */ k: 0; i: 0; j: 0; n: 0; /* Number of most frequent words to display */ max: 0; /* Current highest number of occurrences */ next: 0; /* Next highest number of occurrences */

/* Tables */ freq: 0; chain: 0;

/* Character classes */ letter: <<abcdefghijklmnopqrstuvwxyz>>; other: !<<abcdefghijklmnopqrstuvwxyz>>;</lang>

Unix TMG didn't have tolower builtin. Therefore, you would use it together with tr: <lang bash>cat file | tr A-Z a-z > file1; ./a.out file1</lang>

Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e): <lang bash>cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</lang>

UNIX Shell

Works with: Bash

Works with: zsh

This is derived from Doug McIlroy's original 6-line note in the ACM article cited in the task. <lang bash>#!/bin/sh cat ${1} | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${2}q</lang>

Output:

$ ./wordcount.sh 135-0.txt 10 
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it

VBA

In order to use it, you have to adapt the PATHFILE Const.

<lang vb> Option Explicit

Private Const PATHFILE As String = "C:\HOME\VBA\ROSETTA"

Sub Main() Dim arr Dim Dict As Object Dim Book As String, temp As String Dim T# T = Timer

  Book = ExtractTxt(PATHFILE & "\les miserables.txt")
  temp = RemovePunctuation(Book)
  temp = UCase(temp)
  arr = Split(temp, " ")
  Set Dict = CreateObject("Scripting.Dictionary")
  FillDictionary Dict, arr
  Erase arr
  SortDictByFreq Dict, arr
  DisplayTheTopMostUsedWords arr, 10

Debug.Print "Words different in this book : " & Dict.Count Debug.Print "-------------------------" Debug.Print "" Debug.Print "Optionally : " Debug.Print "Frequency of the word MISERABLE : " & DisplayFrequencyOf("MISERABLE", Dict) Debug.Print "Frequency of the word DISASTER : " & DisplayFrequencyOf("DISASTER", Dict) Debug.Print "Frequency of the word ROSETTA_CODE : " & DisplayFrequencyOf("ROSETTA_CODE", Dict) Debug.Print "-------------------------" Debug.Print "Execution Time : " & Format(Timer - T, "0.000") & " sec." End Sub

Private Function ExtractTxt(strFile As String) As String 'http://rosettacode.org/wiki/File_input/output#VBA Dim i As Integer

  i = FreeFile
  Open strFile For Input As #i
      ExtractTxt = Input(LOF(1), #i)
  Close #i

End Function

Private Function RemovePunctuation(strBook As String) As String Dim T, i As Integer, temp As String Const PUNCT As String = """,;:!?."

  T = Split(StrConv(PUNCT, vbUnicode), Chr(0))
  temp = strBook
  For i = LBound(T) To UBound(T) - 1
     temp = Replace(temp, T(i), " ")
  Next
  temp = Replace(temp, "--", " ")
  temp = Replace(temp, "...", " ")
  temp = Replace(temp, vbCrLf, " ")
  RemovePunctuation = Replace(temp, "  ", " ")

End Function

Private Sub FillDictionary(d As Object, a As Variant) Dim L As Long

  For L = LBound(a) To UBound(a)
     If a(L) <> "" Then _
        d(a(L)) = d(a(L)) + 1
  Next

End Sub

Private Sub SortDictByFreq(d As Object, myArr As Variant) Dim K Dim L As Long

  ReDim myArr(1 To d.Count, 1 To 2)
  For Each K In d.keys
     L = L + 1
     myArr(L, 1) = K
     myArr(L, 2) = CLng(d(K))
  Next
  SortArray myArr, LBound(myArr), UBound(myArr), 2

End Sub

Private Sub SortArray(a, Le As Long, Ri As Long, Col As Long) Dim ref As Long, L As Long, r As Long, temp As Variant

  ref = a((Le + Ri) \ 2, Col)
  L = Le
  r = Ri
  Do
        Do While a(L, Col) < ref
           L = L + 1
        Loop
        Do While ref < a(r, Col)
           r = r - 1
        Loop
        If L <= r Then
           temp = a(L, 1)
           a(L, 1) = a(r, 1)
           a(r, 1) = temp
           temp = a(L, 2)
           a(L, 2) = a(r, 2)
           a(r, 2) = temp
           L = L + 1
           r = r - 1
        End If
  Loop While L <= r
  If L < Ri Then SortArray a, L, Ri, Col
  If Le < r Then SortArray a, Le, r, Col

End Sub

Private Sub DisplayTheTopMostUsedWords(arr As Variant, Nb As Long) Dim L As Long, i As Integer

  i = 1
  Debug.Print "Rank Word    Frequency"
  Debug.Print "==== ======= ========="
  For L = UBound(arr) To UBound(arr) - Nb + 1 Step -1
     Debug.Print Left(CStr(i) & "    ", 5) & Left(arr(L, 1) & "       ", 8) & " " & Format(arr(L, 2), "0 000")
     i = i + 1
  Next

End Sub

Private Function DisplayFrequencyOf(Word As String, d As Object) As Long

  If d.Exists(Word) Then _
     DisplayFrequencyOf = d(Word)

End Function</lang>

Output:

Words different in this book : 25884
-------------------------
Rank Word    Frequency
==== ======= =========
1    THE      40 831
2    OF       19 807
3    AND      14 860
4    A        14 453
5    TO       13 641
6    IN       11 133
7    HE       9 598
8    WAS      8 617
9    THAT     7 807
10   IT       6 517

Optionally : 
Frequency of the word MISERABLE : 35
Frequency of the word DISASTER : 12
Frequency of the word ROSETTA_CODE : 0
-------------------------
Execution Time : 7,785 sec.

zkl

<lang zkl>fname,count := vm.arglist; // grab cammand line args

  // words may have leading or trailing "_", ie "the" and "_the"

File(fname).pump(Void,"toLower", // read the file line by line and hash words

  RegExp("[a-z]+").pump.fp1(Dictionary().incV))  // line-->(word:count,..)

.toList().copy().sort(fcn(a,b){ b[1]<a[1] })[0,count.toInt()] // hash-->list .pump(String,Void.Xplode,"%s,%s\n".fmt).println();</lang>

Output:

$ zkl bbb ~/Documents/Les\ Miserables.txt 10
the,41089
of,19949
and,14942
a,14608
to,13951
in,11214
he,9648
was,8621
that,7924
it,6661