I before E except after C: Difference between revisions

 
Note that if we looked at frequency of use for words, instead of considering all words to have equal weights, we might come up with a different answer.
 
=== stretch goal ===
 
After downloading 1_2_all_freq to /tmp (saved as 1_2_all_freq.txt), we can read it into J and break out the first column as words and the third column as numbers:
 
<lang J>allfreq=: |:}.<;._1;._2]1!:1<'/tmp/1_2_all_freq.txt'
 
words=: >0 { allfreq
freqs=: 0 {.@".&>2 { allfreq</lang>
 
With these definitions, we can define a prevalence verb which tells us how often words containing a particular substring appear in use:
 
<lang J>prevalence=:verb define
(y +./@E."1 words) +/ .* freqs
)</lang>
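
For readers less familiar with J, the same idea can be sketched in Python: add up the frequencies of every word that contains the substring, counting each word once. This is only an illustrative analogue. The word list and counts below are made-up placeholders, not values from 1_2_all_freq, and the function name simply mirrors the J verb.

```python
# Hypothetical toy data standing in for the J nouns `words` and `freqs`.
# These counts are invented for illustration only.
words = ["believe", "receive", "science", "their", "cat"]
freqs = [10, 6, 4, 20, 7]

def prevalence(sub, words=words, freqs=freqs):
    """Sum the frequencies of all words that contain `sub`.

    A word contributes its frequency once, however many times
    the substring occurs inside it.
    """
    return sum(f for w, f in zip(words, freqs) if sub in w)

# Frequency-weighted ratio of "ie" words to "ei" words for the toy data.
ratio = prevalence("ie") / prevalence("ei")
```

With real data, `words` and `freqs` would instead be filled from the first and third columns of the frequency file.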
 
Investigating our original proposed rules:
 
<lang J> 'ie' %&prevalence 'ei'
1.76868</lang>
 
A generic "i before e" rule is not looking quite as good now - words that have i before e are used less than twice as much as words which use e before i.
 
<lang J> 'cei' %&prevalence 'cie'
0.328974</lang>
 
An "except after c" variant is looking awful now - words that use the cie sequence are used about three times as much as words that use the cei sequence. So, of course, if we modified our original rule with this exception, it would weaken the original rule:
 
<lang J> ('ie' -&prevalence 'cie') % ('ei' -&prevalence 'cei')
1.68255</lang>
 
Note that we might also want to consider non-adjacent matches (the regular expression 'i.*e' instead of 'ie', or perhaps 'c.*ie' or 'c.*i.*e' instead of 'cie') - this would be straightforward to check, but doing so would bulk up the page.
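
As a rough idea of what such a check could look like, here is a minimal Python sketch (not J) that swaps the substring test for a regular-expression search. The word list and counts are made-up placeholders rather than values from 1_2_all_freq, and `prevalence_re` is a hypothetical name chosen for this illustration.

```python
import re

# Hypothetical toy data; the counts are invented for illustration only.
words = ["believe", "receive", "science", "their", "cat"]
freqs = [10, 6, 4, 20, 7]

def prevalence_re(pattern, words=words, freqs=freqs):
    """Sum the frequencies of all words in which `pattern` matches anywhere."""
    rx = re.compile(pattern)
    return sum(f for w, f in zip(words, freqs) if rx.search(w))

# Non-adjacent variants of the rules discussed above.
i_then_e = prevalence_re("i.*e")      # an i followed, somewhere later, by an e
e_then_i = prevalence_re("e.*i")      # an e followed, somewhere later, by an i
c_i_e = prevalence_re("c.*i.*e")      # c, then i, then e, not necessarily adjacent
```

Note that a single word can match both 'i.*e' and 'e.*i', so the two totals overlap more than the adjacent 'ie' and 'ei' totals do.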
 
=={{header|Python}}==