Jaro-Winkler distance: Difference between revisions

Content added Content deleted

Inline

Revision as of 15:15, 31 July 2020

The Jaro-Winkler distance is a metric for measuring the edit distance between words. It is similar to the more basic Levenstein distance but the Jaro distance also accounts for transpositions between letters in the words. With the Winkler modification to the Jaro metric, the Jaro-Winkler distance also adds an increase in similarity for words which start with the same letters (prefix).

The Jaro-Winkler distance is a modification of the Jaro similarity metric, which measures the similarity between two strings. The Jaro similarity is 1.0 when strings are identical and 0 when strings have no letters in common. Distance measures such as the Jaro distance or Jaro-Winkler distance, on the other hand, are 0 when strings are identical and 1 when they have no letters in common.

The Jaro similarity between two strings s1 and s2, sim_j, is defined as

sim_j = 0 if m is 0.

sim_j = ( (m / length of s1) + (m / length of s2) + (m - t) / m ) / 3 otherwise.

Where:

$m$ is the number of matching characters (the same character in the same position);
$t$ is half the number of transpositions (a shared character placed in different positions).

The Winkler modification to Jaro is to check for identical prefixes of the strings.

If we define the number of initial (prefix) characters in common as:

l = the length of a common prefix between strings, up to 4 characters

and, additionally, select a multiplier (Winkler suggested 0.1) for the relative importance of the prefix for the word similarity:

p = 0.1

The Jaro-Winkler similarity can then be defined as

sim_w = sim_j + lp(1 - sim_j)

Where:

sim_j is the Jaro similarity.
l is the number of matching characters at the beginning of the strings, up to 4.
p is a factor to modify the amount to which the prefix similarity affects the metric.

Winkler suggested this be 0.1.

The Jaro-Winkler distance between strings, which is 0.0 for identical strings, is then defined as

d_w = 1 - sim_w

String metrics such as Jaro-Winkler distance are useful in applications such as spelling checkers, because letter transpositions are common typing errors and humans tend to misspell the middle portions of words more often than their beginnings. This may help a spelling checker program to generate better alternatives for misspelled word replacement.

The task

Using a dictionary of your choice and the following list of 9 commonly misspelled words:

"accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich"

Calculate the Jaro-Winkler distance between the misspelled word and words in the dictionary.

Use this distance to list close alternatives (at least two per word) to the misspelled words.

Show the calculated distances between the misspelled words and their potential replacements.

See also

Wikipedia page: Jaro–Winkler distance.
Comparing string similarity algorithms. Comparison of algorithms on Medium

Python

<lang python>""" Test Jaro-Winkler distance metric. linuxwords.txt is from http://users.cs.duke.edu/~ola/ap/linuxwords """

WORDS = [s.strip() for s in open("linuxwords.txt").read().split()] MISSPELLINGS = [

   "accomodate",
   "definately",
   "goverment",
   "occured",
   "publically",
   "recieve",
   "seperate",
   "untill",
   "wich",

]

def jaro_winkler_distance(st1, st2):

   """
   Compute Jaro-Winkler distance between two strings.
   """
   if len(st1) < len(st2):
       st1, st2 = st2, st1
   len1, len2 = len(st1), len(st2)
   if len2 == 0:
       return 0.0
   delta = max(0, len2 // 2 - 1)
   flag = [False for _ in range(len2)]  # flags for possible transpositions
   ch1_match = []
   for idx1, ch1 in enumerate(st1):
       for idx2, ch2 in enumerate(st2):
           if idx2 <= idx1 + delta and idx2 >= idx1 - delta and ch1 == ch2 and not flag[idx2]:
               flag[idx2] = True
               ch1_match.append(ch1)
               break

   matches = len(ch1_match)
   if matches == 0:
       return 1.0
   transpositions, idx1 = 0, 0
   for idx2, ch2 in enumerate(st2):
       if flag[idx2]:
           transpositions += (ch2 != ch1_match[idx1])
           idx1 += 1

   jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0
   commonprefix = 0
   for i in range(min(4, len2)):
       commonprefix += (st1[i] == st2[i])

   return 1.0 - (jaro + commonprefix * 0.1 * (1 - jaro))

def within_distance(maxdistance, stri, maxtoreturn):

   """
   Find words in WORDS of closeness to stri within maxdistance, return up to maxreturn of them.
   """
   arr = [w for w in WORDS if jaro_winkler_distance(stri, w) <= maxdistance]
   arr.sort(key=lambda x: jaro_winkler_distance(stri, x))
   return arr if len(arr) <= maxtoreturn else arr[:maxtoreturn]

for STR in MISSPELLINGS:

   print('\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to "',
         STR, '" are:\n        Word   |  Distance')
   for w in within_distance(0.15, STR, 5):
       print('{:>14} | {:6.4f}'.format(w, jaro_winkler_distance(STR, w)))

</lang>

Output:

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " accomodate " are:
        Word   |  Distance
   accommodate | 0.0364
  accommodated | 0.0515
  accommodates | 0.0515
 accommodating | 0.0979
 accommodation | 0.0979

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " definately " are:
        Word   |  Distance
    definitely | 0.0564
     defiantly | 0.0586
        define | 0.0909
      definite | 0.0977
       defiant | 0.1013

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " goverment " are:
        Word   |  Distance
    government | 0.0733
        govern | 0.0800
   governments | 0.0897
      movement | 0.0992
  governmental | 0.1033

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " occured " are:
        Word   |  Distance
      occurred | 0.0250
         occur | 0.0571
      occupied | 0.0786
        occurs | 0.0905
      accursed | 0.0917

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " publically " are:
        Word   |  Distance
      publicly | 0.0400
        public | 0.0800
     publicity | 0.1044
   publication | 0.1327
    biblically | 0.1400

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " recieve " are:
        Word   |  Distance
       receive | 0.0625
      received | 0.0917
      receiver | 0.0917
      receives | 0.0917
       relieve | 0.0917

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " seperate " are:
        Word   |  Distance
     desperate | 0.0708
      separate | 0.0917
     temperate | 0.1042
     separated | 0.1144
     separates | 0.1144

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " untill " are:
        Word   |  Distance
         until | 0.0571
         untie | 0.1257
      untimely | 0.1321

Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " wich " are:
        Word   |  Distance
         witch | 0.1067
         which | 0.1200

Rust

Translation of: Python

<lang rust>use std::fs::File; use std::io::{self, BufRead};

fn load_dictionary(filename: &str) -> std::io::Result<Vec<String>> {

   let file = File::open(filename)?;
   let mut dict = Vec::new();
   for line in io::BufReader::new(file).lines() {
       dict.push(line?);
   }
   Ok(dict)

}

fn jaro_winkler_distance(string1: &str, string2: &str) -> f64 {

   let mut st1 = string1;
   let mut st2 = string2;
   if st1.len() < st2.len() {
       std::mem::swap(&mut st1, &mut st2);
   }
   let len1 = st1.len();
   let len2 = st2.len();
   if len2 == 0 {
       return 0.0;
   }
   let delta = std::cmp::max(0, len2 / 2 - 1);
   let mut flag = vec![false; len2];
   let mut ch1_match = vec![];
   for (idx1, ch1) in st1.chars().enumerate() {
       for (idx2, ch2) in st2.chars().enumerate() {
           if idx2 <= idx1 + delta && idx2 + delta >= idx1 && ch1 == ch2 && !flag[idx2] {
               flag[idx2] = true;
               ch1_match.push(ch1);
               break;
           }
       }
   }
   let matches = ch1_match.len();
   if matches == 0 {
       return 1.0;
   }
   let mut transpositions = 0;
   let mut idx1 = 0;
   for (idx2, ch2) in st2.chars().enumerate() {
       if flag[idx2] {
           transpositions += (ch2 != ch1_match[idx1]) as i32;
           idx1 += 1;
       }
   }
   let m = matches as f64;
   let jaro =
       (m / (len1 as f64) + m / (len2 as f64) + (m - (transpositions as f64) / 2.0) / m) / 3.0;
   let mut commonprefix = 0;
   for (c1, c2) in st1.chars().zip(st2.chars()).take(std::cmp::min(4, len2)) {
       commonprefix += (c1 == c2) as i32;
   }
   1.0 - (jaro + commonprefix as f64 * 0.1 * (1.0 - jaro))

}

fn within_distance<'a>(

   dict: &'a Vec<String>,
   max_distance: f64,
   stri: &str,
   max_to_return: usize,

) -> Vec<(&'a String, f64)> {

   let mut arr: Vec<(&String, f64)> = dict
       .iter()
       .map(|w| (w, jaro_winkler_distance(stri, w)))
       .filter(|x| x.1 <= max_distance)
       .collect();
   // The trait std::cmp::Ord is not implemented for f64, otherwise
   // we could just do this:
   // arr.sort_by_key(|x| x.1);
   let compare_distance = |d1, d2| {
       use std::cmp::Ordering;
       if d1 < d2 {
           Ordering::Less
       } else {
           if d1 > d2 {
               Ordering::Greater
           } else {
               Ordering::Equal
           }
       }
   };
   arr.sort_by(|x, y| compare_distance(x.1, y.1));
   arr[0..std::cmp::min(max_to_return, arr.len())].to_vec()

}

fn main() {

   match load_dictionary("linuxwords.txt") {
       Ok(dict) => {
           for word in &[
               "accomodate",
               "definately",
               "goverment",
               "occured",
               "publically",
               "recieve",
               "seperate",
               "untill",
               "wich",
           ] {
               println!("Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to '{}' are:", word);
               println!("        Word   |  Distance");
               for (w, dist) in within_distance(&dict, 0.15, word, 5) {
                   println!("{:>14} | {:6.4}", w, dist)
               }
               println!();
           }
       }
       Err(error) => eprintln!("{}", error),
   }

}</lang>

Output:

The output is slightly different from that of the Python example because I've removed a trailing zero-width space character from some of the test strings.

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are:
        Word   |  Distance
   accommodate | 0.0182
  accommodated | 0.0333
  accommodates | 0.0333
 accommodating | 0.0815
 accommodation | 0.0815

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'definately' are:
        Word   |  Distance
    definitely | 0.0400
     defiantly | 0.0422
        define | 0.0800
      definite | 0.0850
     definable | 0.0872

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'goverment' are:
        Word   |  Distance
    government | 0.0533
        govern | 0.0667
   governments | 0.0697
      movement | 0.0810
  governmental | 0.0833

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'occured' are:
        Word   |  Distance
      occurred | 0.0250
         occur | 0.0571
      occupied | 0.0786
        occurs | 0.0905
      accursed | 0.0917

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'publically' are:
        Word   |  Distance
      publicly | 0.0400
        public | 0.0800
     publicity | 0.1044
   publication | 0.1327
    biblically | 0.1400

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'recieve' are:
        Word   |  Distance
       receive | 0.0333
      received | 0.0625
      receiver | 0.0625
      receives | 0.0625
       relieve | 0.0667

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'seperate' are:
        Word   |  Distance
     desperate | 0.0708
      separate | 0.0917
     temperate | 0.1042
     separated | 0.1144
     separates | 0.1144

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'untill' are:
        Word   |  Distance
         until | 0.0333
         untie | 0.1067
      untimely | 0.1083
      Antilles | 0.1264
        untidy | 0.1333

Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'wich' are:
        Word   |  Distance
         witch | 0.0533
         which | 0.0600
       witches | 0.1143
          rich | 0.1167
          wick | 0.1167