Word frequency: Difference between revisions

Line 3,823:
 
=={{header|R}}==
===='''Version 1'''====
I chose to strip an apostrophe only when it is followed by an s, so "mom" and "mom's" are counted as the same word, but "they" and "they're" are not. I also chose to keep hyphens, so hyphenated words are counted as single words.
<lang R>
Line 3,856 ⟶ 3,857:
9 it 2308
10 i 1845
</pre>
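
The apostrophe rule described for Version 1 can be illustrated in isolation. This is only a minimal sketch, not the Version 1 code itself; the helper name <code>normalize_token</code> is hypothetical, and it assumes straight ASCII apostrophes and tokens that have already been split into words.
<lang R>
# Hypothetical helper: lower-case a token and drop a trailing "'s".
# Other apostrophes (e.g. "they're") and hyphens are left untouched.
normalize_token <- function(x) {
  sub("'s$", "", tolower(x))
}

normalize_token(c("mom", "Mom's", "they", "they're", "well-known"))
# [1] "mom"        "mom"        "they"       "they're"    "well-known"
</lang>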
 
===='''Version 2'''====
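This version is written as a single functional pipeline using the native pipe operator (<code>|></code>), which requires R 4.1 or later. It depends on the vroom and stringi packages and ran in less than a second on the test file shown below.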
<lang R>
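word_frequency_pipeline <- function(file, n = 10) {
  file |>
    vroom::vroom_lines() |>                 # read the file as a vector of lines
    stringi::stri_split_boundaries(         # split each line at word boundaries
      type = "word",
      skip_word_none = TRUE,                # drop spaces and punctuation
      skip_word_number = TRUE               # drop numbers
    ) |>
    unlist() |>
    tolower() |>
    table() |>
    sort(decreasing = TRUE) |>
    (\(counts) counts[seq_len(min(n, length(counts)))])() |>  # top n words
    data.frame()
}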
</lang>
{{Out}}
<pre>
> word_frequency_pipeline("~/../Downloads/135-0.txt")
Var1 Freq
1 the 41042
2 of 19952
3 and 14938
4 a 14526
5 to 13942
6 in 11208
7 he 9605
8 was 8620
9 that 7824
10 it 6533
</pre>
 
Anonymous user