Levenshtein Distance
Levenshtein distance measures how many single-character edits are needed to turn one string into another.
Allowed edits:
- insert one character
- delete one character
- substitute one character for another
Example:
kitten -> sitten substitute k with ssitten -> sittin substitute e with isittin -> sitting insert gDistance: 3
Dynamic Programming
Section titled “Dynamic Programming”Let d[i][j] be the edit distance between:
- the first
icharacters of stringa - the first
jcharacters of stringb
Base cases:
d[0][j] = jd[i][0] = iThose mean an empty string needs j insertions to become a prefix of length
j, and a prefix of length i needs i deletions to become empty.12
Recurrence:
cost = 0 if a[i - 1] == b[j - 1] else 1
d[i][j] = min( d[i - 1][j] + 1, # delete d[i][j - 1] + 1, # insert d[i - 1][j - 1] + cost # substitute or match)The answer is d[len(a)][len(b)].
Complexity
Section titled “Complexity”For strings of length m and n:
- time:
O(mn) - memory:
O(mn)with the full table - memory:
O(min(m, n))if only the previous row is kept
Use the full table when you need the edit script. Use row compression when you only need the distance.
Use Case
Section titled “Use Case”Levenshtein distance is useful for:
- spell correction
- fuzzy search
- typo tolerance
- approximate matching
- comparing biological or symbolic sequences
Footnotes
Section titled “Footnotes”-
V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals”, 1966 English translation of the 1965 Russian paper. https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf ↩
-
Robert A. Wagner and Michael J. Fischer, “The String-to-String Correction Problem”, Journal of the ACM, 1974. ↩