NYSIIS: Difference between revisions

From Rosetta Code
Content added Content deleted
(→‎{{header|Tcl}}: Add Python)
m (Modified task description for clarity)
Line 2: Line 2:
{{wikipedia}}
{{wikipedia}}
The [[wp:New York State Identification and Intelligence System|New York State Identification and Intelligence System phonetic code]], commonly known as NYSIIS, is a phonetic algorithm for creating indices for words based on their pronunciation. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
The [[wp:New York State Identification and Intelligence System|New York State Identification and Intelligence System phonetic code]], commonly known as NYSIIS, is a phonetic algorithm for creating indices for words based on their pronunciation. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The task here is to implement the original NYSIIS algorithm, shown in Wikipedia, rather than any other subsequent modification. Also, before the algorithm is applied the input string should be converted to upper case with all white space removed.

An optional step is to handle multiple names, including double-barrelled names or double surname (e.g. 'Hoyle-Johnson' or 'Vaughan Williams') and unnecessary suffixes/honours that are not required for indexing purposes (e.g. 'Jnr', 'Sr', 'III', 'CBE'). The original implementation is also restricted to six characters, but this is not a requirement.

;See also
;See also
* [[Soundex]]
* [[Soundex]]
Line 12: Line 17:
{
{


ClassMethod Encode(pAlgorithm As %String = "", pName As %String = "", ByRef pCode As %String, pSuffixRem As %Boolean = 1, pTruncate As %Integer = 0) As %Status
ClassMethod Encode(pAlgorithm As %String = "", pName As %String = "", Output pCode As %String, pSuffixRem As %Boolean = 1, pTruncate As %Integer = 0) As %Status
{
{
// check algorithm and name
// check algorithm and name
Line 34: Line 39:
"GCVO", "KCVO", "DCVO", "CVO", "LVO", "MVO", "OM", "ISO", "GBE", "KBE", "DBE", "CBE", "OBE", "MBE", "CH")
"GCVO", "KCVO", "DCVO", "CVO", "LVO", "MVO", "OM", "ISO", "GBE", "KBE", "DBE", "CBE", "OBE", "MBE", "CH")
Set decs=$ListBuild("VC", "GC", "CGC", "RRC", "DSC", "MC", "DFC", "AFC", "ARRC", "OBI", "IOM")
Set decs=$ListBuild("VC", "GC", "CGC", "RRC", "DSC", "MC", "DFC", "AFC", "ARRC", "OBI", "IOM")
Set regexp="( )(SNR$|SR$|JNR$|JR$|ESQ$|"_$ListToString(ords, "$|")_"$|"_$ListToString(decs, "$|")_"$|[IVX]+$)"
Set regexp="( )(SNR|SR|JNR|JR|ESQ|"_$ListToString(ords, "|")_"|"_$ListToString(decs, "|")_"|[IVX]+)"
Set rem=##class(%Regex.Matcher).%New(regexp, pName)
Set rem=##class(%Regex.Matcher).%New(regexp, pName)
Set pName=rem.ReplaceAll("")
Set pName=rem.ReplaceAll("")
Line 46: Line 51:
Set pCode=""
Set pCode=""
For piece=1:1:$Length(pName, " ") {
For piece=1:1:$Length(pName, " ") {
If pAlgorithm="nysiis" Set pCode=pCode_..NYSIIS(pName)
If pAlgorithm="nysiis" Set pCode=pCode_..NYSIIS($Piece(pName, " ", piece))
}
}
If pTruncate {
If pTruncate {
Line 67: Line 72:
*/
*/
// create regexp matcher instance, remove punctuation and convert all to upper case
// create regexp matcher instance, remove punctuation and convert all to upper case
Set rem=##class(%Regex.Matcher).%New(" ")
Set rem=##class(%Regex.Matcher).%New(" ")
Set rem.Text=$ZConvert($ZStrip(pName, "*P"), "U")
Set rem.Text=$ZConvert($ZStrip(pName, "*P"), "U")
Line 159: Line 164:
Wheeler -> WHALAR
Wheeler -> WHALAR
Louis XVI -> L
Louis XVI -> L
Hoyle-Johnson -> HAYLJA[NSAN]
Vaughan Williams -> VAGANW[ALAN]
D'Souza -> DSAS
de Sousa -> DSAS
</pre>
</pre>



Revision as of 12:48, 24 March 2013

NYSIIS is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
This page uses content from Wikipedia. The original article was at NYSIIS. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)

The New York State Identification and Intelligence System phonetic code, commonly known as NYSIIS, is a phonetic algorithm for creating indices for words based on their pronunciation. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The task here is to implement the original NYSIIS algorithm, shown in Wikipedia, rather than any other subsequent modification. Also, before the algorithm is applied the input string should be converted to upper case with all white space removed.

An optional step is to handle multiple names, including double-barrelled names or double surname (e.g. 'Hoyle-Johnson' or 'Vaughan Williams') and unnecessary suffixes/honours that are not required for indexing purposes (e.g. 'Jnr', 'Sr', 'III', 'CBE'). The original implementation is also restricted to six characters, but this is not a requirement.

See also

Caché ObjectScript

Refactored code based on other examples to reduce footprint.

<lang cache> Class Utils.Phonetic [ Abstract ] {

ClassMethod Encode(pAlgorithm As %String = "", pName As %String = "", Output pCode As %String, pSuffixRem As %Boolean = 1, pTruncate As %Integer = 0) As %Status { // check algorithm and name Set pAlgorithm=$ZConvert(pAlgorithm, "l") If pAlgorithm="" Quit $$$ERROR($$$GeneralError, "No algorithm specified.") If $Case(pAlgorithm, "nysiis":1, :0)=0 Quit $$$ERROR($$$GeneralError, "Unknown algorithm specified.") If $Match(pName, ".*\d.*# no numbers") Quit $$$ERROR($$$GeneralError, "Name cannot contain numerics.")

// remove apostrophes, find punctuation and replace with spaces (exclude hyphens) Set pName=$Translate(pName, "'") Set pun=$ZStrip(pName, "*E'P", "-") Set pName=$Translate(pName, pun, $Justify(" ", $Length(pun)))

// convert name(s) to uppercase and remove all white space Set pName=$ZStrip($ZConvert(pName, "U"), "<=>W")

// remove suffixes (e.g. 'Jnr', 'OBE', 'DSC', etc), including roman numerals (e.g. 'II', 'VIII') // - http://en.wikipedia.org/wiki/List_of_post-nominal_letters_(United_Kingdom) If pSuffixRem { Set ords=$ListBuild("KG", "LG", "KT", "LT", "GCB", "KCB", "DCB", "CB", "GCMG", "KCMG", "DCMG", "CMG", "DSO", "GCVO", "KCVO", "DCVO", "CVO", "LVO", "MVO", "OM", "ISO", "GBE", "KBE", "DBE", "CBE", "OBE", "MBE", "CH") Set decs=$ListBuild("VC", "GC", "CGC", "RRC", "DSC", "MC", "DFC", "AFC", "ARRC", "OBI", "IOM") Set regexp="( )(SNR|SR|JNR|JR|ESQ|"_$ListToString(ords, "|")_"|"_$ListToString(decs, "|")_"|[IVX]+)" Set rem=##class(%Regex.Matcher).%New(regexp, pName) Set pName=rem.ReplaceAll("") }

// replace hyphen and white space, plus some final validation Set pName=$ZStrip($Translate(pName, "-", " "), "<=>W") If $Length($Piece(pName, " "))<2 Quit $$$ERROR($$$GeneralError, "Invalid name.")

// begin algorithm and truncate result, if necessary Set pCode="" For piece=1:1:$Length(pName, " ") { If pAlgorithm="nysiis" Set pCode=pCode_..NYSIIS($Piece(pName, " ", piece)) } If pTruncate { Set pName=pCode Set pCode=$Extract(pCode, 1, pTruncate) Set $Extract(pName, 1, pTruncate)="" If $Length(pName) Set pCode=pCode_"["_pName_"]" }

// finished Quit $$$OK }

ClassMethod NYSIIS(pName As %String) As %String { /* New York State Identification and Intelligence System (NYSIIS) Phonetic Encoder - http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System - http://www.dropby.com/indexLF.html?content=/NYSIIS.html */

// create regexp matcher instance, remove punctuation and convert all to upper case Set rem=##class(%Regex.Matcher).%New(" ") Set rem.Text=$ZConvert($ZStrip(pName, "*P"), "U")

// translate first characters of name: // => MAC->MCC, KN->N, K->C, PH/PF->FF, SCH->SSS For rule="^MAC->MCC", "^KN->N", "^K->C", "^(PH|PF)->FF", "SCH->SSS" { Set rem.Pattern=$Piece(rule, "->") If rem.Locate() Set rem.Text=rem.ReplaceFirst($Piece(rule, "->", 2)) Quit }

// translate last characters of name: // => EE/IE->Y, DT/RT/RD/NT/ND->D For rule="(EE|IE)$->Y", "(DT|RT|RD|NT|ND)$->D" { Set rem.Pattern=$Piece(rule, "->") If rem.Locate() Set rem.Text=rem.ReplaceFirst($Piece(rule, "->", 2)) Quit }

// first character of key = first character of name Set pName1=$Extract(rem.Text, 1), rem.Text=$Extract(rem.Text, 2, *)

// translate remaining characters by following rules, incrementing by one character each time: // => EV->AF else A,E,I,O,U->A // => Q->G, Z->S, M->N // => KN->N else K->C // => SCH->SSS, PH->FF // => H->if previous or next is non-vowel, previous // => W->if previous is vowel, A (A is the only vowel left) // => add current to key if current is not same as the last key character Set ptr=0, rules=$ListBuild("EV->AF", "(A|E|I|O|U)->A", "Q->G", "Z->S", "M->N", "KN->N", "K->C", "SCH->SSS", "PH->FF", "H[^A]", "[^A]H", "AW->A") While $ListNext(rules, ptr, rule) { Set rem.Pattern=$Piece(rule, "->") If $Piece(rule, "->", 2)="", rem.Locate() { Set $Piece(rule, "->", 2)=$Translate(rem.Group, "H") } Set rem.Text=rem.ReplaceAll($Piece(rule, "->", 2)) } Set pName=$ZStrip(rem.Text, "=U") // remove duplicates

// if last character is S, remove it If $Extract(pName, *)="S" Set pName=$Extract(pName, 1, *-1)

// if last characters are AY, replace with Y If $Extract(pName, *-1, *)="AY" Set pName=$Extract(pName, 1, *-2)_"Y"

// if last character is A, remove it If $Extract(pName, *)="A" Set pName=$Extract(pName, 1, *-1)

// append translated key to removed first character Quit pName1_pName }

} </lang>

Examples:
USER>For  { Read !, name Quit:name=""  Set sc=##class(Utils.Phonetic).Encode("nysiis", name, .code,, 6) If sc Write " -> ", code }

knight -> NAGT
mitchell -> MATCAL
o'daniel -> ODANAL
brown sr -> BRAN
browne III -> BRAN
browne IV -> BRAN
O'Banion -> OBANAN
Mclaughlin -> MCLAGL[AN]
McCormack -> MCARNA[C]
Chapman -> CHAPNA[N]
Silva -> SALV
McDonald -> MCDANA[LD]
Lawson -> LASAN
Jacobs -> JACAB
Greene -> GRAN
O'Brien -> OBRAN
Morrison -> MARASA[N]
Larson -> LARSAN
Willis -> WAL
Mackenzie -> MCANSY
Carr -> CAR
Lawrence -> LARANC
Matthews -> MAT
Richards -> RACARD
Bishop -> BASAP
Franklin -> FRANCL[AN]
McDaniel -> MCDANA[L]
Harper -> HARPAR
Lynch -> LYNC
Watkins -> WATCAN
Carlson -> CARLSA[N]
Wheeler -> WHALAR
Louis XVI -> L
Hoyle-Johnson -> HAYLJA[NSAN]
Vaughan Williams -> VAGANW[ALAN]
D'Souza -> DSAS
de Sousa -> DSAS

Perl 6

This implementation removes common name suffixes similar to the reference implementation, even though it is not specified in the task description or on the linked NYSIIS page. This algorithm isn't too friendly to certain French kings. :)

<lang perl6>sub no_suffix ($name) {

   $name.uc.subst: /\h (<[JS]>R) | (<[IVX]>+) $/, ;

}

sub nysiis ($name is copy) {

   given $name .= uc {
       s:g/<-[A..Z]>//;
       s/^MAC/MCC/;
       s/^P<[FH]>/FF/;
       s/^SCH/SSS/;
       s/^KN/N/;
       s/<[IE]>E$  /Y/;
       s/<[DRN]>T$ /D/;
       s/<[RN]>D$  /D/;
       s:c(1):g/EV/AF/;
       s:c(1):g/<[AEIOU]>+/A/;
       s:c(1):g/Q/G/;
       s:c(1):g/Z/S/;
       s:c(1):g/M/N/;
       s:c(1):g/KN/N/;
       s:c(1):g/K/C/;
       s:c(1):g/SCH/S/;
       s:c(1):g/PF/F/;
       s:c(1):g/K/C/;
       s:c(1):g/H(<-[AEIOU]>)/$0/;
       s:g/(<-[AEIOU]>)H/$0/;
       s:g/(<-[AEIOU]>)W/$0/;
       s/ AY$ /Y/;
       s/  S$ //;
       s/  A$ //;
       s:g/ (.)$0+ /$0/;
    }

}


for «

   knight      mitchell        "o'daniel"      "brown  sr"     "browne III"
   "browne IV" "O'Banion"      Mclaughlin      McCormack       Chapman
   Silva       McDonald        Lawson          Jacobs          Greene
   "O'Brien"   Morrison        Larson          Willis          Mackenzie
   Carr        Lawrence        Matthews        Richards        Bishop
   Franklin    McDaniel        Harper          Lynch           Watkins
   Carlson     Wheeler         "Louis XVI"

» {

   my $nysiis = nysiis no_suffix $_;
   if $nysiis.chars > 6 {
       $nysiis .= subst: rx/ <after .**6> .+ /, -> $/ { "[$/]" };
   }
   printf "%10s,  %s\n", $_, $nysiis;

}</lang>

Output:

    knight,  NAGT
  mitchell,  MATCAL
  o'daniel,  ODANAL
  brown sr,  BRAN
browne III,  BRAN
 browne IV,  BRAN
  O'Banion,  OBANAN
Mclaughlin,  MCLAGL[AN]
 McCormack,  MCARNA[C]
   Chapman,  CAPNAN
     Silva,  SALV
  McDonald,  MCDANA[LD]
    Lawson,  LASAN
    Jacobs,  JACAB
    Greene,  GRAN
   O'Brien,  OBRAN
  Morrison,  MARASA[N]
    Larson,  LARSAN
    Willis,  WAL
 Mackenzie,  MCANSY
      Carr,  CAR
  Lawrence,  LARANC
  Matthews,  MAT
  Richards,  RACARD
    Bishop,  BASAP
  Franklin,  FRANCL[AN]
  McDaniel,  MCDANA[L]
    Harper,  HARPAR
     Lynch,  LYNC
   Watkins,  WATCAN
   Carlson,  CARLSA[N]
   Wheeler,  WALAR
 Louis XVI,  L

Python

A litteral translation of the algorithm from the Wikipedia article. <lang python>import re

_vowels = 'AEIOU'

def replace_at(text, position, fromlist, tolist):

   for f, t in zip(fromlist, tolist):
       if text[position:].startswith(f):
           return .join([text[:position],
                           t,
                           text[position+len(f):]])
   return text

def replace_end(text, fromlist, tolist):

   for f, t in zip(fromlist, tolist):
       if text.endswith(f):
           return text[:-len(f)] + t
   return text

def nysiis(name):

   name = re.sub(r'\W', , name).upper()
   name = replace_at(name, 0, ['MAC', 'KN', 'K', 'PH', 'PF', 'SCH'],
                              ['MCC', 'N',  'C', 'FF', 'FF', 'SSS'])
   name = replace_end(name, ['EE', 'IE', 'DT', 'RT', 'RD', 'NT', 'ND'],
                            ['Y',  'Y',  'D',  'D',  'D',  'D',  'D'])
   key, key1 = name[0], 
   i = 1
   while i < len(name):
       #print(i, name, key1, key)
       n_1, n = name[i-1], name[i]
       n1_ = name[i+1] if i+1 < len(name) else 
       name = replace_at(name, i, ['EV'] + list(_vowels), ['AF'] + ['A']*5)
       name = replace_at(name, i, 'QZM', 'GSN')
       name = replace_at(name, i, ['KN', 'K'], ['N', 'C'])
       name = replace_at(name, i, ['SCH', 'PH'], ['SSS', 'FF'])
       if n == 'H' and (n_1 not in _vowels or n1_ not in _vowels):
           name = .join([name[:i], n_1, name[i+1:]])
       if n == 'W' and n_1 in _vowels:
           name = .join([name[:i], 'A', name[i+1:]])
       if key and key[-1] != name[i]:
           key += name[i]
       i += 1
   key = replace_end(key, ['S', 'AY', 'A'], [, 'Y', ])
   return key1 + key

if __name__ == '__main__':

   names = ['Bishop', 'Carlson', 'Carr', 'Chapman', 'Franklin',
            'Greene', 'Harper', 'Jacobs', 'Larson', 'Lawrence',
            'Lawson', 'Louis, XVI', 'Lynch', 'Mackenzie', 'Matthews',
            'McCormack', 'McDaniel', 'McDonald', 'Mclaughlin', 'Morrison',
            "O'Banion", "O'Brien", 'Richards', 'Silva', 'Watkins',
            'Wheeler', 'Willis', 'brown, sr', 'browne, III', 'browne, IV',
            'knight', 'mitchell', "o'daniel"]
   for name in names:
       print('%15s: %s' % (name, nysiis(name)))</lang>
Output:
         Bishop: BASAP
        Carlson: CARLSAN
           Carr: CAR
        Chapman: CAPNAN
       Franklin: FRANCLAN
         Greene: GRAN
         Harper: HARPAR
         Jacobs: JACAB
         Larson: LARSAN
       Lawrence: LARANC
         Lawson: LASAN
     Louis, XVI: LASXV
          Lynch: LYNC
      Mackenzie: MCANSY
       Matthews: MATA
      McCormack: MCARNAC
       McDaniel: MCDANAL
       McDonald: MCDANALD
     Mclaughlin: MCLAGLAN
       Morrison: MARASAN
       O'Banion: OBANAN
        O'Brien: OBRAN
       Richards: RACARD
          Silva: SALV
        Watkins: WATCAN
        Wheeler: WALAR
         Willis: WALA
      brown, sr: BRANSR
    browne, III: BRAN
     browne, IV: BRANAV
         knight: NAGT
       mitchell: MATCAL
       o'daniel: ODANAL

Tcl

<lang tcl>proc nysiis {name {truncate false}} {

   # Normalize to first word, uppercased, without non-letters
   set name [regsub -all {[^A-Z]+} [string toupper [regexp -inline {\S+} $name]] ""]
   # Prefix map
   foreach {from to} {MAC MCC KN N K C PH FF PF FF SCH SSS} {

if {[regsub ^$from $name $to name]} break

   }
   # Suffix map
   foreach {from to} {EE Y IE Y DT D RT D NT D ND D} {

if {[regsub $from$ $name $to name]} break

   }
   # Split
   regexp (.)(.*) $name -> name rest
   # Reduce suffix
   regsub -all {[AEIOU]} [regsub -all EV $rest AF] A rest
   set rest [string map {Q G Z S M N KN N K C SCH SSS PH FF} $rest]
   regsub -all {([^A])H|(.)H(?=[^A])} $rest {\1\2} rest
   regsub -all AW $rest A rest
   regsub -all {(.)\1+} $rest {\1} rest
   regsub {S$} $rest "" rest
   regsub {A(Y?)$} $rest {\1} rest
   append name $rest
   # Apply truncation if needed
   if {$truncate} {

set name [string range $name 0 5]

   }
   return $name

}</lang> Demonstrating: <lang tcl>foreach name {

   knight      mitchell        "o'daniel"      "brown  sr"     "browne III"
   "browne IV" "O'Banion"      Mclaughlin      McCormack       Chapman
   Silva       McDonald        Lawson          Jacobs          Greene
   "O'Brien"   Morrison        Larson          Willis          Mackenzie
   Carr        Lawrence        Matthews        Richards        Bishop
   Franklin    McDaniel        Harper          Lynch           Watkins
   Carlson     Wheeler         "Louis XVI"

} {

   puts "$name -> [nysiis $name]"

}</lang>

Output:
knight -> NAGT
mitchell -> MATCAL
o'daniel -> ODANAL
brown  sr -> BRAN
browne III -> BRAN
browne IV -> BRAN
O'Banion -> OBANAN
Mclaughlin -> MCLAGLAN
McCormack -> MCARNAC
Chapman -> CHAPNAN
Silva -> SALV
McDonald -> MCDANALD
Lawson -> LASAN
Jacobs -> JACAB
Greene -> GRAN
O'Brien -> OBRAN
Morrison -> MARASAN
Larson -> LARSAN
Willis -> WAL
Mackenzie -> MCANSY
Carr -> CAR
Lawrence -> LARANC
Matthews -> MAT
Richards -> RACARD
Bishop -> BASAP
Franklin -> FRANCLAN
McDaniel -> MCDANAL
Harper -> HARPAR
Lynch -> LYNC
Watkins -> WATCAN
Carlson -> CARLSAN
Wheeler -> WHALAR
Louis XVI -> L