I before E except after C

The phrase "I before E, except after C" is a widely known mnemonic which is supposed to help when spelling English words.

Task Description

Using the word list from http://www.puzzlers.org/pub/wordlists/unixdict.txt, check if the two sub-clauses of the phrase are plausible individually:

"I before E when not preceded by C"
"E before I when preceded by C"

If both sub-phrases are plausible then the original phrase can be said to be plausible.
Something is plausible if the number of words having the feature is more than two times the number of words having the opposite feature (where feature is 'ie' or 'ei' preceded or not by 'c' as appropriate).
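
The plausibility test reduces to a single comparison. A minimal sketch in Python, where the function name `is_plausible` and its `ratio` parameter are illustrative names that do not come from the task; the counts used are the ones reported in the outputs on this page:

```python
def is_plausible(feature_count, opposite_count, ratio=2):
    """Return True if feature_count is more than `ratio` times opposite_count."""
    return feature_count > ratio * opposite_count


# Word counts for unixdict.txt, as reported by the solutions on this page.
print(is_plausible(465, 213))  # "ie" vs "ei" when not preceded by "c"
print(is_plausible(13, 24))    # "cei" vs "cie"
```

Note that the comparison is strict: a count of exactly twice the opposite count does not qualify.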

Stretch goal

As a stretch goal, use the entries from the table of Word Frequencies in Written and Spoken English: based on the British National Corpus (selecting only those rows with three space- or tab-separated words) to see if the phrase is plausible when word frequencies are taken into account.

Show your output here as well as your program.
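
One way to approach the stretch goal is to add each word's frequency to the running totals instead of adding 1. The following Python sketch assumes that each usable row holds a word, a part-of-speech tag, and an integer frequency, in that order; that layout is an assumption about the corpus table, not something stated in the task. The sample rows and their frequencies are invented for illustration and are not taken from the corpus, and `weighted_counts` is a hypothetical helper name.

```python
import re

# Invented sample rows (word, part-of-speech tag, frequency); NOT real corpus data.
SAMPLE = """\
Word PoS Freq
believe Verb 10
receive Verb 20
their Det 30
science NoC 40
a very long row with too many fields
"""

PATTERNS = {
    "ie": re.compile(r"(?:^|[^c])ie"),   # "ie" not preceded by "c"
    "ei": re.compile(r"(?:^|[^c])ei"),   # "ei" not preceded by "c"
    "cie": re.compile(r"cie"),
    "cei": re.compile(r"cei"),
}


def weighted_counts(text):
    """Sum the frequencies of the words matching each pattern."""
    totals = dict.fromkeys(PATTERNS, 0)
    for line in text.splitlines():
        fields = line.split()
        # Keep only rows with exactly three fields and a numeric frequency;
        # this also skips the header row.
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        word, freq = fields[0].lower(), int(fields[2])
        for name, pattern in PATTERNS.items():
            if pattern.search(word):
                totals[name] += freq
    return totals


print(weighted_counts(SAMPLE))
```

The resulting totals can then be compared with the same "more than two times" test used for the plain word counts.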

cf.

Schools to rethink 'i before e' - BBC news, 20 June 2009
I Before E Except After C - QI Series 8 Ep 14, (humorous)
Companion website for the book: "Word Frequencies in Written and Spoken English: based on the British National Corpus".

J

After downloading unixdict to /tmp:

<lang J> dict=:tolower fread '/tmp/unixdict.txt'</lang>

Investigating the rules:

<lang J>   +/'cie' E. dict
24
   +/'cei' E. dict
13
   +/'ie' E. dict
490
   +/'ei' E. dict
230</lang>

So, based on unixdict.txt, the "I before E" rule seems plausible (490 > 230 by more than a factor of 2), but the exception does not hold up: "cie" occurs 24 times against 13 for "cei", so i before e after a c is almost twice as common as e before i after a c.

Note that if we looked at frequency of use for words, instead of considering all words to have equal weights, we might come up with a different answer.

Python

<lang python>import urllib.request
import re

PLAUSIBILITY_RATIO = 2

def plausibility_check(comment, x, y):

   print('\n  Checking plausibility of: %s' % comment)
   if x > PLAUSIBILITY_RATIO * y:
       print('    PLAUSIBLE. As we have counts of %i vs %i words, a ratio of %4.1f times'
             % (x, y, x / y))
   else:
       if x > y:
           print('    IMPLAUSIBLE. As although we have counts of %i vs %i words, a ratio of %4.1f times does not make it plausible'
                 % (x, y, x / y))
       else:
           print('    IMPLAUSIBLE, probably contra-indicated. As we have counts of %i vs %i words, a ratio of %4.1f times'
                 % (x, y, x / y))
   return x > PLAUSIBILITY_RATIO * y

words = set(urllib.request.urlopen(

   'http://www.puzzlers.org/pub/wordlists/unixdict.txt'
   ).read().decode().lower().split())

cie = sum('cie' in word for word in words)
cei = sum('cei' in word for word in words)
not_c_ie = sum(bool(re.search(r'(^ie|[^c]ie)', word)) for word in words)
not_c_ei = sum(bool(re.search(r'(^ei|[^c]ei)', word)) for word in words)

print('Checking plausibility of "I before E except after C":')
if ( plausibility_check('I before E when not preceded by C', not_c_ie, not_c_ei)
     and plausibility_check('E before I when preceded by C', cei, cie) ):
   print('\nOVERALL IT IS PLAUSIBLE!')
else:
   print('\nOVERALL IT IS IMPLAUSIBLE!')

print('\n(To be plausible, one word count must exceed another by %i times)' % PLAUSIBILITY_RATIO)</lang>

Output:

Checking plausibility of "I before E except after C":

  Checking plausibility of: I before E when not preceded by C
    PLAUSIBLE. As we have counts of 465 vs 213 words, a ratio of  2.2 times

  Checking plausibility of: E before I when preceded by C
    IMPLAUSIBLE, probably contra-indicated. As we have counts of 13 vs 24 words, a ratio of  0.5 times

OVERALL IT IS IMPLAUSIBLE!

(To be plausible, one word count must exceed another by 2 times)

REXX

The following assumptions were made about the (default) dictionary:

there could be leading and/or trailing blanks or tabs
the dictionary words are in mixed case
there could be blank lines
there may be more than one occurrence of a target string within a word [einsteinium]
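
The last assumption means that occurrences are counted rather than words, which can give totals that differ from a per-word count. A minimal Python sketch of that idea (not a translation of the REXX program; `count_occurrences` is a hypothetical helper name):

```python
def count_occurrences(word, target):
    """Count occurrences of target in word, split by whether "c" precedes each one.

    Returns a tuple (after_c, not_after_c).
    """
    after_c = not_after_c = 0
    start = word.find(target)
    while start != -1:
        if start > 0 and word[start - 1] == "c":
            after_c += 1
        else:
            not_after_c += 1
        # Resume the search one character later to pick up further occurrences.
        start = word.find(target, start + 1)
    return after_c, not_after_c


# Blanks are stripped and case is folded first, mirroring the assumptions above.
print(count_occurrences("einsteinium", "ei"))
print(count_occurrences("  Science ".strip().lower(), "ie"))
```

Here "einsteinium" contributes two occurrences of "ei", neither preceded by "c", whereas a per-word count would register the word only once.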

<lang rexx>/*REXX pgm shows plausibility of I before E when not preceded by C, and*/
/*────────────────────────────── E before I when     preceded by C.    */
#.=0                                   /*zero out various word counters.*/
parse arg iFID .;  if iFID==''  then iFID='UNIXDICT.TXT'  /*use default?*/

  do r=1  while lines(iFID)\==0;    _=linein(iFID)  /*get a single line.*/
  u=translate(space(_,0))              /*elide superfluous blanks & tabs*/
  if u==''           then iterate      /*if a blank line, then ignore it*/
  #.words=#.words+1                    /*keep a running count of #words.*/
  if pos('EI',u)\==0 & pos('IE',u)\==0 then #.both=#.both+1  /*has both.*/
  call find 'ie'
  call find 'ei'
  end   /*r*/

L=length(#.words)                      /*use this to align the output #s*/
say 'words in the  '   ifid   ' dictionary: '            #.words
say 'words with "IE" and "EI" (in same word): '    right(#.both,L)
say 'words with "IE" and     preceded by "C": '    right(#.ie.c ,L)
say 'words with "IE" and not preceded by "C": '    right(#.ie.z ,L)
say 'words with "EI" and     preceded by "C": '    right(#.ei.c ,L)
say 'words with "EI" and not preceded by "C": '    right(#.ei.z ,L)
say;                 mantra='The spelling mantra  '
p1=#.ie.z/max(1,#.ei.z);   phrase='"I before E when not preceded by C"'
say mantra phrase ' is '  word("im", 1+(p1>2))'plausible.'
                                       /*EI after C must outnumber IE.  */
p2=#.ei.c/max(1,#.ie.c);   phrase='"E before I when preceded by C"'
say mantra phrase ' is '  word("im", 1+(p2>2))'plausible.'
po=p1>2 & p2>2;   say 'Overall, it is' word("im",1+po)'plausible.'
exit                                   /*stick a fork in it, we're done.*/
/*──────────────────────────────────FIND subroutine─────────────────────*/
find: arg x;  s=1;  do forever;  _=pos(x,u,s);   if _==0  then leave
                    if substr(u,_-1+(_==1)*999,1)=='C'  then #.x.c=#.x.c+1
                                                        else #.x.z=#.x.z+1
                    s=_+1              /*handle case of multiple finds. */
                    end   /*forever*/
return</lang>

Output when using the default dictionary:

words in the   UNIXDICT.TXT  dictionary:  25104
words with "IE" and "EI" (in same word):      4
words with "IE" and     preceded by "C":     24
words with "IE" and not preceded by "C":    465
words with "EI" and     preceded by "C":     13
words with "EI" and not preceded by "C":    213

The spelling mantra   "I before E when not preceded by C"  is  plausible.
The spelling mantra   "E before I when preceded by C"  is  implausible.
Overall, it is implausible.

Seed7

<lang seed7>$ include "seed7_05.s7i";

 include "gethttp.s7i";
 include "float.s7i";

const integer: PLAUSIBILITY_RATIO is 2;

const func boolean: plausibilityCheck (in string: comment, in integer: x, in integer: y) is func

 result
   var boolean: plausible is FALSE;
 begin
   writeln("  Checking plausibility of: " <& comment);
   if x > PLAUSIBILITY_RATIO * y then
     writeln("    PLAUSIBLE. As we have counts of " <& x <& " vs " <& y <&
             " words, a ratio of " <& flt(x) / flt(y) digits 1 lpad 4 <& " times");
   elsif x > y then
     writeln("    IMPLAUSIBLE. As although we have counts of " <& x <& " vs " <& y <&
             " words, a ratio of " <& flt(x) / flt(y) digits 1 lpad 4 <& " times does not make it plausible");
   else
     writeln("    IMPLAUSIBLE, probably contra-indicated. As we have counts of " <& x <& " vs " <& y <&
             " words, a ratio of " <& flt(x) / flt(y) digits 1 lpad 4 <& " times");
   end if;
   plausible := x > PLAUSIBILITY_RATIO * y;
 end func;

const func integer: count (in string: stri, in array string: words) is func

 result
   var integer: count is 0;
 local
   var integer: index is 0;
 begin
   for key index range words do
     if pos(words[index], stri) <> 0 then
       incr(count);
     end if;
   end for;
 end func;

const proc: main is func

 local
   var array string: words is 0 times "";
   var integer: cie is 0;
   var integer: cei is 0;
   var integer: not_c_ie is 0;
   var integer: not_c_ei is 0;
 begin
   words := split(lower(getHttp("www.puzzlers.org/pub/wordlists/unixdict.txt")), "\n");
   cie := count("cie", words);
   cei := count("cei", words);
   not_c_ie := count("ie", words) - cie;
   not_c_ei := count("ei", words) - cei;
   writeln("Checking plausibility of \"I before E except after C\":");
   if plausibilityCheck("I before E when not preceded by C", not_c_ie, not_c_ei) and
       plausibilityCheck("E before I when preceded by C", cei, cie) then
     writeln("OVERALL IT IS PLAUSIBLE!");
   else
     writeln("OVERALL IT IS IMPLAUSIBLE!");
     writeln("(To be plausible, one word count must exceed another by " <& PLAUSIBILITY_RATIO <& " times)");
   end if;
 end func;</lang>

Output:

Checking plausibility of "I before E except after C":
  Checking plausibility of: I before E when not preceded by C
    PLAUSIBLE. As we have counts of 465 vs 213 words, a ratio of  2.2 times
  Checking plausibility of: E before I when preceded by C
    IMPLAUSIBLE, probably contra-indicated. As we have counts of 13 vs 24 words, a ratio of  0.5 times
OVERALL IT IS IMPLAUSIBLE!
(To be plausible, one word count must exceed another by 2 times)

Tcl

Translation of: Python

<lang tcl>package require http

variable PLAUSIBILITY_RATIO 2.0
proc plausible {description x y} {
   variable PLAUSIBILITY_RATIO
   puts "  Checking plausibility of: $description"
   if {$x > $PLAUSIBILITY_RATIO * $y} {
      set conclusion "PLAUSIBLE"
      set fmt "As we have counts of %i vs %i words, a ratio of %.1f times"
      set result true
   } elseif {$x > $y} {
      set conclusion "IMPLAUSIBLE"
      set fmt "As although we have counts of %i vs %i words,"
      append fmt " a ratio of %.1f times does not make it plausible"
      set result false
   } else {
      set conclusion "IMPLAUSIBLE, probably contra-indicated"
      set fmt "As we have counts of %i vs %i words, a ratio of %.1f times"
      set result false
   }
   puts [format "    %s.\n    $fmt" $conclusion $x $y [expr {double($x)/$y}]]
   return $result
}

set t [http::geturl http://www.puzzlers.org/pub/wordlists/unixdict.txt]
set words [split [http::data $t] "\n"]
http::cleanup $t
foreach {name pattern} {ie (?:^|[^c])ie ei (?:^|[^c])ei cie cie cei cei} {
   set count($name) [llength [lsearch -nocase -all -regexp $words $pattern]]
}

puts "Checking plausibility of \"I before E except after C\":"
if {
   [plausible "I before E when not preceded by C" $count(ie) $count(ei)] &&
   [plausible "E before I when preceded by C" $count(cei) $count(cie)]
} then {
   puts "\nOVERALL IT IS PLAUSIBLE!"
} else {
   puts "\nOVERALL IT IS IMPLAUSIBLE!"
}
puts "\n(To be plausible, one word count must exceed another by\
      $PLAUSIBILITY_RATIO times)"</lang>

Output:

Checking plausibility of "I before E except after C":
  Checking plausibility of: I before E when not preceded by C
    PLAUSIBLE.
    As we have counts of 465 vs 213 words, a ratio of 2.2 times
  Checking plausibility of: E before I when preceded by C
    IMPLAUSIBLE, probably contra-indicated.
    As we have counts of 13 vs 24 words, a ratio of 0.5 times

OVERALL IT IS IMPLAUSIBLE!

(To be plausible, one word count must exceed another by 2.0 times)

UNIX Shell

<lang bash>#!/bin/sh

matched() {
  egrep "$1" unixdict.txt | wc -l
}

check() {
  if [ $(expr $(matched $3) \> $(expr 2 \* $(matched $2))) = '0' ]; then
    echo clause $1 not plausible
    exit 1
  fi
}

check 1 \[^c\]ei \[^c\]ie && check 2 cie cei && echo plausible</lang>

Output:

clause 2 not plausible