Home - Contacts - Terms -

Mi Islita

Levenshtein Edit Distance

Classroom Demonstration Tool for the Levenshtein Edit Distance Algorithm


About the Levenshtein Edit Distance


A Local Similarity Measure for the Masses

SUGGESTED CONVERSIONS
CONVERT THISINTO THIS
DemocratsRepublicans
GoogleYahoo!
GoodEvil
passworduserID
JesusSatan
BritneySpears
Lotto No.Quick Pick No.
Definition: The Levenshtein Edit Distance (LED) is the number of edits (insertions, deletions, and substitutions) required to transform a string (A) into another string (B). Among other levels, edit distances can be computed at the level of letters, words, phrases, or even passages.

Instructions: This demonstration tool has been adapted from Lukasz Stilger's (1) script. For those interested in other implementations, Michael Gilleland has a comprehensive list of LED resources, with all kind of programming flavors (2).

Our tool displays edit distance values at the bottom-right corner of the Levenshtein matrix. The matrix is generated using a dynamic programming routine.

To see the algorithm in action, type in string A in the left textarea and string B in the right textarea. These can be single terms, phrases, text windows or combination of these. As you type, watch the algorithm reaching the LED (colored in red). String similarity increases as LED decreases.

For example, the edit distance between Democrats and Republicans is 8 because this is the number of edits required to change Democrats into Republicans. What would be the number of edits necessary to change Democrats and Republicans into Independents? In the real world, all these changes may not be possible :)

Applications: This algorithm can be used with filtering algorithms designed to detect near duplicates. In addition, the LED algorithm can be incorporated into spell checker (3) applications. Additional resources, with applications, are available elsewere (4). A 08/18/2007 search in Google for the Levenshtein Edit Distance Algorithm Tutorial returns thousand of resources.

Known issues: Long text windows may take time to process with this tool. Text windows that are too large can stop your browser. Thus, this tool should be used for demonstration purposes, like in a classroom setting.

Historical Note: The Levenshtein edit distance measure was named after his inventor, Vladimir Levenshtein, who is known as the father of the coding theory in Russia. He is the recipient of the 2006 Richard W. Hamming Medal. An IEEE Fellow, he is a member of the Moscow Mathematical Society (5). See some pictures of Vladimir Levenshtein.

New Similarity Feature: In the paper An Information-Theoretic Definition of Similarity (6), Dekang Lin, Department of Computer Science, University of Manitoba, has stated that edit distances can be converted into similarity values as follows:

simedit(x, y) = 1/(1 + editDist(x, y))

We have added this conversion capability to our tool. This allows users to reproduce Table 1 of Lin's paper, Top-10 Most Similar Words to "grandiloquent".

Applications of the Levenshtein Edit Distance Algorithm

A list of applications or references suggested by readers is given below.

The Levenshtein Edit Distance can be or has been used:

  1. in the implementation of a matrix distance calculator for reverse compliment DNA code design.
  2. for the perceptive evaluation of dialect distance measurements.
  3. for automatic marking of musical dictations.
  4. for regular expressions approximate matching.
  5. to identify if two genetic sequences have similar functions.
  6. to filter blocks of email lists (candidate spam addresses) within a LED threshold value.
  7. as the ultimate baby name explorer.
  8. to name products and services like domains, brands, etc.
  9. to conduct fuzzy search matches in EXCEL or your preferred environment.
  10. for spamdexing search engines - by randomly converting text into gibberish.
  11. for spam stemming search engines - by systematically appending edits to valid stems.
  12. as part of a spell checker routine.
  13. to identify duplicated content and plagiarism.
  14. as a spelling correction for search engine queries.

Got an idea, suggestion, or reference?

References

  1. Lukasz Stilger, LD - Lukasz Stilger.
  2. Michael Gilleland, Levenshtein Distance, in Three Flavors.
  3. Wikipedia, Spell Checker.
  4. Levenshtein, The Levenshtein-Algorithm
  5. IEEE, 2006 Richard W. Hamming Medal.
  6. Dekang Lin, An Information-Theoretic Definition of Similarity.

Have questions about this tool? Drop us an email.

Thank you for using this site.
Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -