Word frequency: Difference between revisions

(11 intermediate revisions by 6 users not shown)
Line 2,812:
 
=={{header|Java}}==
This is relatively simple in Java.<br />
I used a ''URL'' class to download the content, a ''BufferedReader'' class to read the text line by line, a ''Pattern'' and ''Matcher'' to identify words, and a ''Map'' to hold the counts.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
</syntaxhighlight>
 
<syntaxhighlight lang="java">
void printWordFrequency() throws URISyntaxException, IOException {
    URL url = new URI("https://www.gutenberg.org/files/135/135-0.txt").toURL();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        Pattern pattern = Pattern.compile("(\\w+)");
        Matcher matcher;
        String line;
        String word;
        Map<String, Integer> map = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while (matcher.find()) {
                word = matcher.group().toLowerCase();
                if (map.containsKey(word)) {
                    map.put(word, map.get(word) + 1);
                } else {
                    map.put(word, 1);
                }
            }
        }
        /* print out top 10 */
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        list.sort(Map.Entry.comparingByValue());
        Collections.reverse(list);
        int count = 1;
        for (Map.Entry<String, Integer> value : list) {
            System.out.printf("%-20s%,7d%n", value.getKey(), value.getValue());
            if (count++ == 10) break;
        }
    }
}
</syntaxhighlight>
<pre>
the 41,043
of 19,952
and 14,938
a 14,539
to 13,942
in 11,208
he 9,646
was 8,620
that 7,922
it 6,659
</pre>
<br />
An alternate demonstration
{{trans|Kotlin}}
<syntaxhighlight lang="java">import java.io.IOException;
Line 2,929 ⟶ 2,993:
"he" │ 6816
"had" │ 6140</pre>
 
=={{header|K}}==
{{works with|ngn/k}}<syntaxhighlight lang=K>common:{+((!d)o)!n@o:x#>n:#'.d:=("&"\`c$"&"|_,/0:y)^,""}
{(,'!x),'.x}common[10;"135-0.txt"]
(("the";41019)
("of";19898)
("and";14658)
(,"a";14517)
("to";13695)
("in";11134)
("he";9405)
("was";8361)
("that";7592)
("his";6446))</syntaxhighlight>
 
(The relatively easy to read output format here is arguably less useful than the table produced by <code>common</code> but it would have been more concise to have <code>common</code> generate it directly.)
 
=={{header|KAP}}==
Line 3,325 ⟶ 3,405:
=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang="perl">use strict;
use warnings;
use utf8;

my $top = 10;

open my $fh, '<', 'ref/word-count.txt'
    or die "Can't open 'ref/word-count.txt': $!\n";
(my $text = join '', <$fh>) =~ tr/A-Z/a-z/;

my @matcher = (
qr/[a-z]+/, # simple 7-bit ASCII
qr/\w+/, # word characters with underscore
Line 3,337 ⟶ 3,420:
);
 
for my $reg (@matcher) {
    print "\nTop $top using regex: " . $reg . "\n";
    my @matches = $text =~ /$reg/g;
    my %words;
    for my $w (@matches) { $words{$w}++ };
    my $c = 0;
    for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
        printf "%-7s %6d\n", $w, $words{$w};
        last if ++$c >= $top;
Line 3,350 ⟶ 3,433:
 
{{out}}
<pre>
Top 10 using regex: (?^:[a-z]+)
the 41089
of 19949
Line 3,384 ⟶ 3,468:
was 8621
that 7924
it 6661
</pre>
 
=={{header|Phix}}==
Line 3,978 ⟶ 4,063:
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2022.07}}
Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.
 
This is slightly trickier than it appears initially. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/ , but \w includes underscore, which is not a letter but a punctuation connector; and this text is '''full''' of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually parts of the words though, they are markup.
 
We might try /A-Za-z/ as a matcher but this text is bursting with French words containing various [[wp:diacritic|diacritic]]s. Those '''are''' letters, so words will be incorrectly split up (Misérables will be counted as 'mis' and 'rables', probably not what we want).
 
Actually, in this case /A-Za-z/ returns '''very nearly''' the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text and gets incorrectly split into "al" and "the", so this matcher incorrectly reports 41089 occurrences of "the".
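The effect of the matcher choice can be sketched in Java (the language of the first solution shown on this page). This is only an illustration on a made-up sample sentence, not a translation of the Raku program: the class name <code>MatcherComparison</code> and the helper <code>count</code> are hypothetical, and <code>\p{L}+</code> is used as a rough stand-in for "word characters without underscore" (unlike that class, it also excludes digits). The <code>UNICODE_CHARACTER_CLASS</code> flag is set so that Java's <code>\w</code> covers accented letters, as assumed for the discussion above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherComparison {
    // Count how often each match of the regex occurs in the lower-cased text.
    static Map<String, Integer> count(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
        Matcher matcher = pattern.matcher(text.toLowerCase(Locale.ROOT));
        Map<String, Integer> counts = new TreeMap<>();
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: Gutenberg-style _italics_ plus an accented name.
        // Accented letters are written as escapes to avoid source-encoding issues.
        String sample = "The _Mis\u00e9rables_ and Al\u00e8the met the bishop.";
        for (String regex : new String[] {"[a-z]+", "\\w+", "\\p{L}+"}) {
            System.out.println(regex + " -> " + count(regex, sample));
        }
    }
}
```

On this sample, <code>[a-z]+</code> counts "the" three times because "Alèthe" is split into "al" and "the", <code>\w+</code> keeps the underscores as part of "_misérables_", and <code>\p{L}+</code> leaves both "misérables" and "alèthe" intact.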
Line 3,991 ⟶ 4,076:
 
Here is a sample that shows the result when using various different matchers.
<syntaxhighlight lang="raku" line>sub MAIN ($filename, UInt $top = 10) {
    my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
    my @matcher = (
        rx/ <[a..z]>+ /, # simple 7-bit ASCII
        rx/ \w+ /, # word characters with underscore
        rx/ <[\w]-[_]>+ /, # word characters without underscore
        rx/ [<[\w]-[_]>+]+ % < ' - '- > / # word characters without underscore but with hyphens and contractions
    );
    for @matcher -> $reg {
        say "\nTop $top using regex: ", $reg.raku;
        my @words = $file.comb( $reg ).Bag.sort(-*.value)[^$top];
        my $length = max @words».key».chars;
        printf "%-{$length}s %d\n", .key, .value for @words;
    }
}</syntaxhighlight>
Line 4,956 ⟶ 5,043:
6 garbage collection(s) in 0.2 seconds.
</pre>
 
=={{header|Smalltalk}}==
The ASCII text file is from https://www.gutenberg.org/files/135/old/lesms10.txt.
 
===Cuis Smalltalk, ASCII===
{{works with|Cuis|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream new open: 'lesms10.txt' forWrite: false)
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>an OrderedCollection(40543 -> 'the' 19796 -> 'of' 14448 -> 'and' 14380 -> 'a' 13582 -> 'to' 11006 -> 'in' 9221 -> 'he' 8351 -> 'was' 7258 -> 'that' 6420 -> 'his') </pre>
 
===Squeak Smalltalk, ASCII===
{{works with|Squeak|6.0}}
<syntaxhighlight lang="smalltalk">
(StandardFileStream readOnlyFileNamed: 'lesms10.txt')
contents asLowercase substrings asBag sortedCounts first: 10.
</syntaxhighlight>
{{Out}}<pre>{40543->'the' . 19796->'of' . 14448->'and' . 14380->'a' . 13582->'to' . 11006->'in' . 9221->'he' . 8351->'was' . 7258->'that' . 6420->'his'} </pre>
 
=={{header|Swift}}==
Line 5,100 ⟶ 5,206:
Additionally, because 1972 TMG only understood ASCII characters, you might want to strip down the diacritics (e.g., é → e):
<syntaxhighlight lang="bash">cat file | uni2ascii -B | tr A-Z a-z > file1; ./a.out file1</syntaxhighlight>
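For comparison, a similar preprocessing step can be sketched in Java with the standard <code>java.text.Normalizer</code> class. This is a minimal sketch, not a replacement for <code>uni2ascii -B</code>: it only removes combining marks after canonical decomposition, so characters without such a decomposition (for example "œ") are left unchanged. The class name <code>AsciiFold</code> and the method <code>fold</code> are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AsciiFold {
    // Decompose accented letters (NFD), drop the combining marks, then lower-case.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is written as an escape
        // to avoid source-encoding issues.
        System.out.println(fold("Les Mis\u00e9rables"));
    }
}
```

Here <code>fold("Les Misérables")</code> yields "les miserables", which corresponds to the é → e mapping mentioned above followed by the <code>tr A-Z a-z</code> step.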
 
=={{header|Transd}}==
<syntaxhighlight lang="Scheme">#lang transd
 
MainModule: {
_start: (λ locals: cnt 0
(with fs FileStream() words String()
(open-r fs "/mnt/text/Literature/Miserables.txt")
(textin fs words)
 
(with v ( -|
(split (tolower words))
(group-by)
(regroup-by (λ v Vector<String>() -> Int() (size v))))
 
(for i in v :rev do (lout (get (get (snd i) 0) 0) ":\t " (fst i))
(+= cnt 1) (if (> cnt 10) break))
)))
}</syntaxhighlight>
{{out}}
<pre>
the: 40379
of: 19869
and: 14468
a: 14278
to: 13590
in: 11025
he: 9213
was: 8347
that: 7249
his: 6414
had: 6051
</pre>
 
=={{header|UNIX Shell}}==
Line 5,298 ⟶ 5,437:
I've taken the view that 'letter' means either a letter or digit for Unicode codepoints up to 255. I haven't included underscore, hyphen nor apostrophe as these usually separate compound words.
 
Not very quick (runs in about 15 seconds on my system) though this is partially due to Wren not having regular expressions and the string pattern matching module being written in Wren itself rather than C.
 
If the Go example is re-run today (17 February 2024), then the output matches this Wren example precisely though it appears that the text file has changed since the former was written more than 5 years ago.
<syntaxhighlight lang="wren">import "io" for File
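The "letter" rule described above (letters or digits with code points up to 255, with underscore, hyphen and apostrophe acting as separators) can be illustrated with a small Java sketch. It is an assumption-laden illustration rather than a port of the Wren program: the class name <code>Latin1Words</code> and both methods are hypothetical, and <code>Character.isLetterOrDigit</code> may classify a few Latin-1 code points differently from the Wren pattern module.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Latin1Words {
    // Assumed rule: a "letter" is any letter or digit with code point <= 255.
    static boolean isWordChar(int codePoint) {
        return codePoint <= 255 && Character.isLetterOrDigit(codePoint);
    }

    // Split lower-cased text into maximal runs of word characters.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lower.length(); ) {
            int codePoint = lower.codePointAt(i);
            if (isWordChar(codePoint)) {
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                // Any other character (underscore, hyphen, apostrophe, space...) ends a word.
                result.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is written as an escape
        // to avoid source-encoding issues.
        System.out.println(words("Jean-Valjean's _1815_ caf\u00e9"));
    }
}
```

With this rule, "Jean-Valjean's _1815_ café" is split into "jean", "valjean", "s", "1815" and "café". The Wren solution itself follows.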
import "./str" for Str
import "./sort" for Sort
import "./fmt" for Fmt
import "./pattern" for Pattern
 
var fileName = "135-0.txt"