Letter frequency
Open a text file and count the occurrences of each letter.
Some of these programs count all characters (including punctuation), but some only count letters A to Z.
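The difference between the two counting styles can be sketched in a few lines of Python. This is only an illustration, not one of the task solutions; the function name `count_chars` and its `letters_only` flag are made up for this example.

```python
from collections import Counter
import string

def count_chars(text, letters_only=False):
    """Count characters in text; optionally fold case and keep only A-Z."""
    if letters_only:
        # Fold to upper case and drop everything outside ASCII A-Z.
        text = (ch for ch in text.upper() if ch in string.ascii_uppercase)
    return Counter(text)

sample = "Hello, world!"
# Every character, including the space and punctuation.
print(sorted(count_chars(sample).items()))
# Letters only, case-insensitive.
print(sorted(count_chars(sample, letters_only=True).items()))
```

With `letters_only=True` the comma, space and exclamation mark disappear from the result and `l` and `L` are counted together.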
Ada
<lang Ada>with Ada.Text_IO;
procedure Letter_Frequency is
   Counters: array (Character) of Natural := (others => 0); -- initialize all Counters to 0
   C:        Character;
   File:     Ada.Text_IO.File_Type;

begin
   Ada.Text_IO.Open(File, Mode => Ada.Text_IO.In_File, Name => "letter_frequency.adb");
   while not Ada.Text_IO.End_Of_File(File) loop
      Ada.Text_IO.Get(File, C);
      Counters(C) := Counters(C) + 1;
   end loop;

   for I in Counters'Range loop
      if Counters(I) > 0 then
         Ada.Text_IO.Put_Line("'" & I & "':" & Integer'Image(Counters(I)));
      end if;
   end loop;
end Letter_Frequency;</lang>
Sample Output (counting the characters of its own source code):
>./letter_frequency
' ': 122
'"': 6
'&': 3
... [a lot of lines omitted]
'x': 7
'y': 5
'z': 1
Aikido
<lang aikido>import ctype
var letters = new int [26]
var s = openin (args[0])
while (!s.eof()) {
    var ch = s.getchar()
    if (s.eof()) {
        break
    }
    if (ctype.isalpha (ch)) {
        var n = cast<int>(ctype.tolower(ch) - 'a')
        ++letters[n]
    }
}
foreach i letters.size() {
println (cast<char>('a' + i) + " " + letters[i])
}</lang>
C
<lang c>#include <stdio.h>

int main(void)
{
    /* declare array */
    int frequency[26];
    int ch;
    FILE* txt_file = fopen ("a_text_file.txt", "rt");
    if (txt_file == NULL)
        return 1; /* could not open the file */

    /* init the freq table: */
    for (ch = 0; ch < 26; ch++)
        frequency[ch] = 0;

    while (1) {
        ch = fgetc(txt_file);
        if (ch == EOF) break; /* end of file or read error. EOF is typically -1 */

        /* assuming ASCII; "letters" means "a to z" */
        if ('a' <= ch && ch <= 'z')      /* lower case */
            frequency[ch-'a']++;
        else if ('A' <= ch && ch <= 'Z') /* upper case */
            frequency[ch-'A']++;
    }
    fclose(txt_file);

    /* print the table */
    for (ch = 0; ch < 26; ch++)
        printf("%c: %d\n", 'a' + ch, frequency[ch]);
    return 0;
}</lang>
C++
<lang cpp>#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream input("filename.txt", std::ios_base::binary);
    if (!input)
    {
        std::cerr << "error: can't open file\n";
        return -1;
    }

    std::size_t count[256];
    std::fill_n(count, 256, 0);

    for (char c; input.get(c); ++count[std::uint8_t(c)]) // process input file
        ; // empty loop body

    for (std::size_t i = 0; i < 256; ++i)
    {
        if (count[i] && std::isgraph(static_cast<int>(i))) // non-zero counts of printable characters
        {
            std::cout << char(i) << " = " << count[i] << '\n';
        }
    }
}</lang>
Example output when file contains "Hello, world!" (without quotes):
! = 1
, = 1
H = 1
d = 1
e = 1
l = 3
o = 2
r = 1
w = 1
D
<lang d>import std.stdio, std.ascii, std.algorithm;
void main() {
    int[26] frequency;

    foreach (ubyte[] buffer; File("data.txt").byChunk(2 ^^ 15))
        foreach (c; filter!isAlpha(buffer))
            frequency[toLower(c) - 'a']++;

    writeln(frequency);
}</lang>
J
The example task is the same: open a text file and compute letter frequency.
Input is a directory-path with filename. Result is 26 integers.
<lang j>require 'files'       NB. define fread
ltrfreq=: 3 : 0
  letters=. u: 65+i.26  NB. upper case letters
  <: #/.~ (toupper fread y) (,~ -. -.) letters
)</lang>
Example use (based on a configuration file from another task):
<lang j>   ltrfreq 'config.file'
88 17 17 24 79 18 19 19 66 0 2 26 26 57 54 31 1 53 43 59 19 6 2 0 8 0</lang>
Java
<lang java5>import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class LetterFreq {
    public static int[] countLetters(String filename) throws IOException{
        int[] freqs = new int[26];
        BufferedReader in = new BufferedReader(new FileReader(filename));
        String line;
        while((line = in.readLine()) != null){
            line = line.toUpperCase();
            for(char ch:line.toCharArray()){
                // only A to Z: any other letter would index outside freqs
                if(ch >= 'A' && ch <= 'Z'){
                    freqs[ch - 'A']++;
                }
            }
        }
        in.close();
        return freqs;
    }

    public static void main(String[] args) throws IOException{
        System.out.println(Arrays.toString(countLetters("filename.txt")));
    }
}</lang>
In Java 7, we can use try with resources. The countLetters method would look like this:
<lang java5>public static int[] countLetters(String filename) throws IOException{
    int[] freqs = new int[26];
    try(BufferedReader in = new BufferedReader(new FileReader(filename))){
        String line;
        while((line = in.readLine()) != null){
            line = line.toUpperCase();
            for(char ch:line.toCharArray()){
                // only A to Z: any other letter would index outside freqs
                if(ch >= 'A' && ch <= 'Z'){
                    freqs[ch - 'A']++;
                }
            }
        }
    }
    return freqs;
}</lang>
Liberty BASIC
Un-rem a line to convert to all-upper-case. Letter frequency is printed as percentages.
<lang lb>
    open "text.txt" for input as #i
        txt$ =input$( #i, lof( #i))
        Le =len( txt$)
    close #i

    dim LetterFreqy( 255)

    '   txt$ =upper$( txt$)

    for i =1 to Le
        char =asc( mid$( txt$, i, 1))
        if char >=32 then LetterFreqy( char) =LetterFreqy( char) +1
    next i

    for j =32 to 255
        if LetterFreqy( j) <>0 then print " Character #"; j, "("; chr$( j);_
         ") appeared "; using( "##.##", 100 *LetterFreqy( j) /Le); "% of the time."
    next j

    end
</lang>
Lua
<lang lua>-- Open the file named on the command line
local file = assert(io.open(arg[1]))
-- Keep a table counting the instances of each letter
local instances = {}
local function tally(char)
  -- normalize case
  char = string.upper(char)
  -- add to the count of the found character (a missing entry counts as 0)
  instances[char] = (instances[char] or 0) + 1
end
-- For each line in the file
for line in file:lines() do
  line:gsub(
    '%a', -- For each letter (%a) on the line,
    tally) --increase the count for that letter
end
-- Print letter counts
for letter, count in pairs(instances) do
  print(letter, count)
end
</lang>
Objective-C
<lang objc>#import <Foundation/Foundation.h>
int main (int argc, const char *argv[]) {
  NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

  NSData *data = [NSData dataWithContentsOfFile:[NSString stringWithCString:argv[1] encoding:NSUTF8StringEncoding]];
  NSString *string = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
  NSCountedSet *countedSet = [[NSCountedSet alloc] init];
  NSUInteger len = [string length];
  for (NSUInteger i = 0; i < len; i++) {
    unichar c = [string characterAtIndex:i];
    if ([[NSCharacterSet letterCharacterSet] characterIsMember:c])
      [countedSet addObject:[NSNumber numberWithInteger:c]];
  }
  [string release];
  for (NSNumber *chr in countedSet) {
    NSLog(@"%C => %lu", (unichar)[chr integerValue], [countedSet countForObject:chr]);
  }
  [countedSet release];

  [pool release];
  return 0;
}</lang>
OCaml
We open a text file and compute letter frequency. Characters other than [a-z] and [A-Z] are ignored, and upper case letters are converted to lower case before the letter frequency is computed.
<lang ocaml>let () =
  let ic = open_in Sys.argv.(1) in
  let base = int_of_char 'a' in
  let arr = Array.make 26 0 in
  try while true do
    (* Char.lowercase_ascii replaces the deprecated Char.lowercase *)
    let c = Char.lowercase_ascii (input_char ic) in
    let ndx = int_of_char c - base in
    if ndx < 26 && ndx >= 0 then
      arr.(ndx) <- succ arr.(ndx)
  done
  with End_of_file ->
    close_in ic;
    for i=0 to 25 do
      Printf.printf "%c -> %d\n" (char_of_int(i + base)) arr.(i)
    done</lang>
Perl
Counts letters in files given on command line or piped to stdin. Case insensitive.
<lang perl>while (<>) { $cnt{lc chop}++ while length }
print "$_: ", $cnt{$_}//0, "\n" for 'a' .. 'z';</lang>
Perl 6
<lang perl6>(my %count){$_}++ for lines.comb;
.say for %count.sort;</lang>
The lines function automatically opens the file supplied on the command line. This program does not count newlines.

The following should work by spec, but nobody implements the Bag type yet:
<lang perl6>.say for slurp.comb.Bag.pairs.sort;</lang>
PHP
<lang php><?php print_r(array_count_values(str_split(file_get_contents($argv[1])))); ?></lang>
PicoLisp
<lang PicoLisp>(let Freq NIL
   (in "file.txt"
      (while (char) (accu 'Freq @ 1)) )
   (sort Freq) )</lang>
For a "file.txt":
abcd
cdef
Output:
-> (("^J" . 2) ("a" . 1) ("b" . 1) ("c" . 2) ("d" . 2) ("e" . 1) ("f" . 1))
Prolog
Works with SWI-Prolog.
Only alphabetic characters are counted, after conversion to uppercase.
Uses packList/2 defined here: http://rosettacode.org/wiki/Run-length_encoding#Prolog
<lang Prolog>frequency(File) :-
	read_file_to_codes(File, Code, []),

	% we only keep alphabetic codes
	include(my_code_type, Code, LstCharCode),

	% we translate char_codes into uppercase atoms.
	maplist(my_upcase, LstCharCode, LstChar),

	% sort and pack the list
	msort(LstChar, SortLstChar),
	packList(SortLstChar, Freq),
	maplist(my_write, Freq).


my_write([Num, Atom]) :-
	swritef(A, '%3r', [Num]),
	writef('Number of %w :%w\n', [Atom, A]).


my_code_type(Code) :-
	code_type(Code, alpha).

my_upcase(CharCode, UpChar) :-
	char_code(Atom, CharCode),
	upcase_atom(Atom, UpChar).

:- use_module(library(clpfd)).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ?- packList([a,a,a,b,c,c,c,d,d,e], L).
%  L = [[3,a],[1,b],[3,c],[2,d],[1,e]] .
%
% ?- packList(R,  [[3,a],[1,b],[3,c],[2,d],[1,e]]).
% R = [a,a,a,b,c,c,c,d,d,e] .
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
packList([],[]).

packList([X],[[1,X]]) :- !.

packList([X|Rest],[XRun|Packed]):-
    run(X,Rest, XRun,RRest),
    packList(RRest,Packed).

run(Var,[],[1,Var],[]).

run(Var,[Var|LRest],[N1, Var],RRest):-
    N #> 0,
    N1 #= N + 1,
    run(Var,LRest,[N, Var],RRest).

run(Var,[Other|RRest], [1,Var],[Other|RRest]):-
    dif(Var,Other).
</lang>
Output for this file:
Number of A : 63
Number of B : 7
Number of C : 53
Number of D : 29
Number of E : 65
...
Number of T : 52
Number of U : 20
Number of V : 10
Number of W : 8
Number of X : 6
Number of Y : 12
true .
Python
Using collections.Counter
<lang python>import collections, sys
def filecharcount(openfile):
    c = collections.Counter()
    for line in openfile:
        c.update(line)
    return sorted(c.items())

f = open(sys.argv[1])
print(filecharcount(f))</lang>
Not using collections.Counter
<lang python>import string
import sys

if hasattr(string, 'ascii_lowercase'):
    letters = string.ascii_lowercase       # Python 2.2 and later
else:
    letters = string.lowercase             # Earlier versions
offset = ord('a')

def countletters(file_handle):
    """Traverse a file and compute the number of occurrences of each letter;
    return results as a simple 26 element list of integers."""
    results = [0] * len(letters)
    for line in file_handle:
        for char in line:
            char = char.lower()
            if char in letters:
                # Ordinal of any lowercase ASCII letter minus ordinal of 'a' -> 0..25
                results[ord(char) - offset] += 1
    return results

if __name__ == "__main__":
    sourcedata = open(sys.argv[1])
    lettercounts = countletters(sourcedata)
    for i in xrange(len(lettercounts)):
        print "%s=%d" % (chr(i + ord('a')), lettercounts[i]),</lang>
This example defines the function and provides a sample usage. The if ... __main__... line allows it to be cleanly imported into any other Python code while also allowing it to function as a standalone script. (A very common Python idiom).
Using a numerically indexed array (list) for this is artificial and clutters the code somewhat.
Using defaultdict
<lang python>...
from collections import defaultdict
def countletters(file_handle):
    """Count occurrences of letters and return a dictionary of them
    """
    results = defaultdict(int)
    for line in file_handle:
        for char in line:
            if char.lower() in letters:
                c = char.lower()
                results[c] += 1
    return results</lang>
This eliminates the ungainly fiddling with ordinal values and offsets in the countletters function of the previous example. More importantly, it allows the results to be printed more simply:
<lang python>lettercounts = countletters(sourcedata)
for letter,count in lettercounts.iteritems():
    print "%s=%s" % (letter, count),</lang>
This again eliminates all fussing with the details of converting letters into list indices.
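The dictionary-based idea can be taken one step further with collections.Counter restricted to letters, which also gives sorting by frequency for free. The following is a minimal Python 3 sketch, not part of the examples above; the name `count_letters` and the sample lines are made up for illustration, and any iterable of lines (such as an open file) could be passed instead.

```python
from collections import Counter
import string

def count_letters(lines):
    """Return a Counter of lower-cased ASCII letters found in an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Keep only a-z after folding case; everything else is skipped.
        counts.update(ch for ch in line.lower() if ch in string.ascii_lowercase)
    return counts

# Hypothetical sample input standing in for the lines of a file.
sample_lines = ["Hello, world!\n", "Letter frequency\n"]
for letter, count in count_letters(sample_lines).most_common(3):
    print("%s=%d" % (letter, count))
```

`most_common` returns the (letter, count) pairs ordered from most to least frequent, so no separate sort step is needed.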
Ruby
<lang ruby>def letter_frequency(file)
  letters = 'a' .. 'z'
  File.read(file) .
       split(//) .
       group_by {|letter| letter.downcase} .
       select   {|key, val| letters.include? key} .
       collect  {|key, val| [key, val.length]}
end

letter_frequency(ARGV[0]).sort_by {|key, val| -val}.each {|pair| p pair}</lang>
Example output, using the program file as input:
$ ruby letterFrequency.rb letterFrequency.rb
["e", 34]
["l", 20]
["t", 17]
["r", 14]
["a", 12]
["y", 9]
["c", 8]
["i", 7]
["v", 6]
["n", 6]
["f", 6]
["s", 6]
["d", 5]
["p", 5]
["k", 5]
["u", 4]
["o", 4]
["g", 3]
["b", 2]
["h", 2]
["q", 2]
["z", 1]
["w", 1]
SIMPOL
Example: open a text file and compute letter frequency.
<lang simpol>constant iBUFSIZE 500

function main(string filename)
  fsfileinputstream fpi
  integer e, i, aval, zval, cval
  string s, buf, c
  array chars

  e = 0
  fpi =@ fsfileinputstream.new(filename, error=e)
  if fpi =@= .nul
    s = "Error, file """ + filename + """ not found{d}{a}"
  else
    chars =@ array.new()
    aval = .charval("a")
    zval = .charval("z")
    i = 1
    while i <= 26
      chars[i] = 0
      i = i + 1
    end while
    buf = .lcase(fpi.getstring(iBUFSIZE, 1))
    while not fpi.endofdata and buf > ""
      i = 1
      while i <= .len(buf)
        c = .substr(buf, i, 1)
        cval = .charval(c)
        if cval >= aval and cval <= zval
          chars[cval - aval + 1] = chars[cval - aval + 1] + 1
        end if
        i = i + 1
      end while
      buf = .lcase(fpi.getstring(iBUFSIZE, 1))
    end while

    s = "Character counts for """ + filename + """{d}{a}"
    i = 1
    while i <= chars.count()
      s = s + .char(aval + i - 1) + ": " + .tostr(chars[i], 10) + "{d}{a}"
      i = i + 1
    end while
  end if
end function s</lang>
As this was being created I realized that in SIMPOL I wouldn't have done it this way (in fact, I wrote it differently the first time and had to go back and change it to use an array afterward). In SIMPOL we would have used the set object. It acts similarly to a single-dimensional array, but can also use various set operations, such as difference, unite, intersect, etc. One of the interesting things is that each unique value is stored only once, and the number of duplicates is stored with it. The sample then looks a little cleaner:
<lang simpol>constant iBUFSIZE 500
function main(string filename)
  fsfileinputstream fpi
  integer e, i, aval, zval
  string s, buf, c
  set chars

  e = 0
  fpi =@ fsfileinputstream.new(filename, error=e)
  if fpi =@= .nul
    s = "Error, file """ + filename + """ not found{d}{a}"
  else
    chars =@ set.new()
    aval = .charval("a")
    zval = .charval("z")
    buf = .lcase(fpi.getstring(iBUFSIZE, 1))
    while not fpi.endofdata and buf > ""
      i = 1
      while i <= .len(buf)
        c = .substr(buf, i, 1)
        if .charval(c) >= aval and .charval(c) <= zval
          chars.addvalue(c)
        end if
        i = i + 1
      end while
      buf = .lcase(fpi.getstring(iBUFSIZE, 1))
    end while

    s = "Character counts for """ + filename + """{d}{a}"
    i = 1
    while i <= chars.count()
      s = s + chars[i] + ": " + .tostr(chars.valuecount(chars[i]), 10) + "{d}{a}"
      i = i + 1
    end while
  end if
end function s</lang>
The final stage simply reads the totals for each character. One caveat: if a character does not occur in the file, it will not show up at all in this second implementation.
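The same caveat applies to any counted-set approach, and one way around it is to walk the alphabet instead of the stored values when reporting. As a minimal sketch in Python rather than SIMPOL (the function name `letter_counts_with_zeros` is made up for illustration, and the input string stands in for the file contents):

```python
from collections import Counter
import string

def letter_counts_with_zeros(text):
    """Count ASCII letters, reporting 0 for letters that never occur."""
    seen = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # A Counter returns 0 for missing keys, so indexing every letter fills the gaps.
    return [(letter, seen[letter]) for letter in string.ascii_lowercase]

for letter, count in letter_counts_with_zeros("abcd cdef"):
    print(letter, count)
```

The result always has 26 entries, so letters that are absent from the input are listed with a count of 0 rather than being omitted.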
Tcl
<lang tcl>proc letterHistogram {fileName} {
    # Initialize table (in case of short texts without every letter)
    for {set i 97} {$i<=122} {incr i} {
        set frequency([format %c $i]) 0
    }
    # Iterate over characters in file
    set f [open $fileName]
    foreach c [split [read $f] ""] {
        # Count them if they're alphabetic
        if {[string is alpha $c]} {
            incr frequency([string tolower $c])
        }
    }
    close $f
    # Print the histogram
    parray frequency
}
letterHistogram the/sample.txt</lang>
VBA
<lang VBA>
Public Sub LetterFrequency(fname)
'count number of letters in text file "fname" (ASCII-coded)
'note: we count all characters but print only the letter frequencies

Dim Freqs(255) As Long
Dim abyte As Byte
Dim ascal as Byte 'ascii code for lowercase a
Dim ascau as Byte 'ascii code for uppercase a

'try to open the file
On Error GoTo CantOpen
Open fname For Input As #1
On Error GoTo 0

'initialize
For i = 0 To 255
  Freqs(i) = 0
Next i

'process file byte-per-byte
While Not EOF(1)
  abyte = Asc(Input(1, #1))
  Freqs(abyte) = Freqs(abyte) + 1
Wend
Close #1

'add lower and upper case together and print result
Debug.Print "Frequencies:"
ascal = Asc("a")
ascau = Asc("A")
For i = 0 To 25
  Debug.Print Chr$(ascal + i), Freqs(ascal + i) + Freqs(ascau + i)
Next i
Exit Sub

CantOpen:
  Debug.Print "can't find or read the file "; fname
  Close
End Sub
</lang>
Output:
LetterFrequency "d:\largetext.txt"
Frequencies:
a 24102
b 4985
c 4551
d 19127
e 61276
f 2734
g 10661
h 8243
i 21589
j 4904
k 7186
l 12026
m 7454
n 31963
o 19021
p 4960
q 37
r 21166
s 13403
t 21090
u 6117
v 8612
w 5017
x 168
y 299
z 4159
Vedit macro language
<lang vedit>File_Open("c:\txt\a_text_file.txt")
Update()

for (#1='A'; #1<='Z'; #1++) {
    Out_Reg(103) Char_Dump(#1,NOCR) Out_Reg(CLEAR)
    #2 = Search(@103, BEGIN+ALL+NOERR)
    Message(@103) Num_Type(#2)
}</lang>
Example output:
A 76
B 23
C 51
D 64
E 192
F 51
G 32
H 59
I 146
J 1
K 9
L 73
M 34
N 94
O 113
P 27
Q 1
R 92
S 89
T 138
U 63
V 26
W 35
X 16
Y 16
Z 2