Different Approaches to Name Matching
Fuzzy Name Matching
This is one of the most commonly used approaches. The basic idea behind fuzzy matching is to measure the edit distance between two strings: what does it take to convert source string A into destination string B?
There are many approaches to fuzzy matching, but the two most common are Jaro-Winkler distance and Levenshtein distance. Let us understand how each of them works.
Jaro-Winkler Distance
Here is the more formal definition of the Jaro-Winkler distance from Wikipedia:
The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences.
The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. The Jaro–Winkler similarity is the inversion, (1 − Jaro–Winkler distance).
The formula to calculate the Jaro-Winkler similarity is:
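In the usual notation (the symbol names here are a convention, not something fixed by this article), m is the number of matching characters, t is the number of transpositions (half the number of matched characters that appear in a different order), ℓ is the length of the common prefix (capped at 4), and p is a scaling factor, commonly 0.1. Two characters count as matching only if they are equal and no farther apart than ⌊max(|s1|, |s2|) / 2⌋ − 1 positions.

```latex
\[
sim_j =
\begin{cases}
0 & \text{if } m = 0 \\[4pt]
\dfrac{1}{3}\left( \dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m} \right) & \text{otherwise}
\end{cases}
\]
\[
sim_w = sim_j + \ell \, p \, (1 - sim_j)
\]
```

Here sim_j is the plain Jaro similarity and sim_w is the Jaro-Winkler similarity; the second term is what rewards a shared prefix.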
Let us understand this with an example.
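As a rough illustration, here is a minimal Python sketch of the Jaro and Jaro-Winkler similarity. The function names, the default scaling factor `p=0.1`, and the prefix cap of 4 are assumptions chosen to follow the common textbook definition; this is not a reference implementation, and a tested library is a better choice in production.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # positions where they differ; transpositions are half that count.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, p: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim_j = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim_j + prefix * p * (1 - sim_j)


# "MARTHA" vs "MARHTA": 6 matches, 1 transposition (T/H), prefix "MAR".
print(round(jaro_similarity("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 3))  # 0.961
```

For "MARTHA" and "MARHTA", all six characters match and the swapped T/H pair counts as one transposition, so the Jaro similarity is (1 + 1 + 5/6) / 3 ≈ 0.944. The three-character common prefix "MAR" then lifts the score to about 0.961, which corresponds to a Jaro-Winkler distance of about 0.039.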
Levenshtein Distance
Here is the formal definition of the Levenshtein distance from Wikipedia:
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
The formula to calculate the Levenshtein distance is:
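One common way to write it is the recursive definition below, where head(x) is the first character of a string x and tail(x) is everything after it (these helper names are notation only):

```latex
\[
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0 \\
|b| & \text{if } |a| = 0 \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } \operatorname{head}(a) = \operatorname{head}(b) \\
1 + \min
\begin{cases}
\operatorname{lev}(\operatorname{tail}(a), b) \\
\operatorname{lev}(a, \operatorname{tail}(b)) \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))
\end{cases}
& \text{otherwise}
\end{cases}
\]
```

The three options inside the minimum correspond to a deletion, an insertion, and a substitution respectively.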
As always, formulas are complicated to interpret. Let us understand with an example.
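A small sketch that follows the recursive definition of Levenshtein distance directly may help. The function name `levenshtein_recursive` is hypothetical, and memoization via `functools.lru_cache` is added only so the recursion does not recompute the same suffix pairs; it is meant for short strings, not as an optimized implementation.

```python
from functools import lru_cache


def levenshtein_recursive(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn `a` into `b`."""

    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        # Distance between the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # insert the rest of b
        if j == len(b):
            return len(a) - i          # delete the rest of a
        if a[i] == b[j]:
            return lev(i + 1, j + 1)   # first characters agree, no edit
        return 1 + min(
            lev(i + 1, j),             # delete a[i]
            lev(i, j + 1),             # insert b[j]
            lev(i + 1, j + 1),         # substitute a[i] with b[j]
        )

    return lev(0, 0)


# kitten -> sitten (substitute k/s) -> sittin (substitute e/i)
#        -> sitting (insert g): three edits in total.
print(levenshtein_recursive("kitten", "sitting"))  # 3
```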
Dynamic Programming Approach
The most common way to calculate the edit distance between two strings is dynamic programming. You can watch this video to understand how it works. The table below depicts the computation.
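The dynamic programming table can be sketched in Python as follows. The function name `levenshtein_table` and the example strings are illustrative assumptions; the cell `dp[i][j]` holds the edit distance between the first `i` characters of the source and the first `j` characters of the target.

```python
def levenshtein_table(source: str, target: str) -> list[list[int]]:
    """Build the full DP table; the bottom-right cell is the distance."""
    rows, cols = len(source) + 1, len(target) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string takes one deletion per
    # character; building a prefix from nothing takes one insertion each.
    for i in range(rows):
        dp[i][0] = i
    for j in range(cols):
        dp[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp


table = levenshtein_table("saturday", "sunday")
for row in table:
    print(" ".join(f"{cell:2d}" for cell in row))
print("distance:", table[-1][-1])  # distance: 3
```

Each cell depends only on its left, upper, and upper-left neighbours, so the table is filled row by row in O(m × n) time for strings of length m and n. If only the final distance is needed, keeping just the previous row is enough.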
Jaro-Winkler vs. Levenshtein: Which one to use?
This is a question many developers run into when they decide to use fuzzy matching. Jaro-Winkler takes into account only the matching characters and any required transpositions (swapping of characters). It also gives more weight to prefix similarity. Levenshtein counts the number of edits needed to convert one string into another. So it's really a choice based on the use case, and there is no perfect answer of one versus the other. As a general guideline, I have seen Jaro-Winkler work well and be more dependable for single-word comparisons. In my experience, Jaro-Winkler also tends to run faster than Levenshtein. But I would go with Levenshtein distance for longer string comparisons, since it tells me how different the strings really are in terms of character edits.