Jaro-Winkler distance
The Jaro-Winkler distance is a metric for measuring the edit distance between words. It is similar to the more basic Levenstein distance but the Jaro distance also accounts for transpositions between letters in the words. With the Winkler modification to the Jaro metric, the Jaro-Winkler distance also adds an increase in similarity for words which start with the same letters (prefix).
The Jaro-Winkler distance is a modification of the Jaro similarity metric, which measures the similarity between two strings. The Jaro similarity is 1.0 when strings are identical and 0 when strings have no letters in common. Distance measures such as the Jaro distance or Jaro-Winkler distance, on the other hand, are 0 when strings are identical and 1 when they have no letters in common.
The Jaro similarity between two strings s1 and s2, simj, is defined as
- simj = 0 if m is 0.
- simj = ( (m / length of s1) + (m / length of s2) + (m - t) / m ) / 3 otherwise.
Where:
- is the number of matching characters (the same character in the same position);
- is half the number of transpositions (a shared character placed in different positions).
The Winkler modification to Jaro is to check for identical prefixes of the strings.
If we define the number of initial (prefix) characters in common as:
l = the length of a common prefix between strings, up to 4 characters
and, additionally, select a multiplier (Winkler suggested 0.1) for the relative importance of the prefix for the word similarity:
p = 0.1
The Jaro-Winkler similarity can then be defined as
simw = simj + lp(1 - simj)
Where:
- simj is the Jaro similarity.
- l is the number of matching characters at the beginning of the strings, up to 4.
- p is a factor to modify the amount to which the prefix similarity affects the metric.
Winkler suggested this be 0.1.
The Jaro-Winkler distance between strings, which is 0.0 for identical strings, is then defined as
dw = 1 - simw
String metrics such as Jaro-Winkler distance are useful in applications such as spelling checkers, because letter transpositions are common typing errors and humans tend to misspell the middle portions of words more often than their beginnings. This may help a spelling checker program to generate better alternatives for misspelled word replacement.
- The task
Using a dictionary of your choice and the following list of 9 commonly misspelled words:
"accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich"
- Calculate the Jaro-Winkler distance between the misspelled word and words in the dictionary.
- Use this distance to list close alternatives (at least two per word) to the misspelled words.
- Show the calculated distances between the misspelled words and their potential replacements.
- See also
- Jaro–Winkler distance entry on Wikipedia.
- [of algorithms] Comparing string similarity algorithms.
Python
<lang python>""" Test Jaro-Winkler distance metric. linuxwords.txt is from http://users.cs.duke.edu/~ola/ap/linuxwords """
WORDS = [s.strip() for s in open("linuxwords.txt").read().split()] MISSPELLINGS = [
"accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich",
]
def jaro_winkler_distance(st1, st2):
""" Compute Jaro-Winkler distance between two strings. """ if len(st1) < len(st2): st1, st2 = st2, st1 len1, len2 = len(st1), len(st2) if len2 == 0: return 0.0 delta = max(0, len2 // 2 - 1) flag = [False for _ in range(len2)] # flags for possible transpositions ch1_match = [] for idx1, ch1 in enumerate(st1): for idx2, ch2 in enumerate(st2): if idx2 <= idx1 + delta and idx2 >= idx1 - delta and ch1 == ch2 and not flag[idx2]: flag[idx2] = True ch1_match.append(ch1) break
matches = len(ch1_match) if matches == 0: return 1.0 transpositions, idx1 = 0, 0 for idx2, ch2 in enumerate(st2): if flag[idx2]: transpositions += (ch2 != ch1_match[idx1]) idx1 += 1
jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0 commonprefix = 0 for i in range(min(4, len2)): commonprefix += (st1[i] == st2[i])
return 1.0 - (jaro + commonprefix * 0.1 * (1 - jaro))
def within_distance(maxdistance, stri, maxtoreturn):
""" Find words in WORDS of closeness to stri within maxdistance, return up to maxreturn of them. """ arr = [w for w in WORDS if jaro_winkler_distance(stri, w) <= maxdistance] arr.sort(key=lambda x: jaro_winkler_distance(stri, x)) return arr if len(arr) <= maxtoreturn else arr[:maxtoreturn]
for STR in MISSPELLINGS:
print('\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to "', STR, '" are:\n Word | Distance') for w in within_distance(0.15, STR, 5): print('{:>14} | {:6.4f}'.format(w, jaro_winkler_distance(STR, w)))
</lang>
- Output:
Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " accomodate " are: Word | Distance accommodate | 0.0364 accommodated | 0.0515 accommodates | 0.0515 accommodating | 0.0979 accommodation | 0.0979 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " definately " are: Word | Distance definitely | 0.0564 defiantly | 0.0586 define | 0.0909 definite | 0.0977 defiant | 0.1013 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " goverment " are: Word | Distance government | 0.0733 govern | 0.0800 governments | 0.0897 movement | 0.0992 governmental | 0.1033 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " occured " are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " publically " are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " recieve " are: Word | Distance receive | 0.0625 received | 0.0917 receiver | 0.0917 receives | 0.0917 relieve | 0.0917 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " seperate " are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " untill " are: Word | Distance until | 0.0571 untie | 0.1257 untimely | 0.1321 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " wich " are: Word | Distance witch | 0.1067 which | 0.1200