Levenshtein Distance

Introduction

Levenshtein distance between two strings is defined as the minimum number of single-digit edit operations ( Insertion, Deletion, Substitution ) required to convert one string to another.

Single Digit Operations

  1. Insertion : “Sprk” -> “Spark”
  2. Deletion : “Hello” -> “Hllo”
  3. Replace : “Carry” -> “Larry”

For example,

Example 1
s1 = "gone"
s2 = "cone"

gone -> cone ( replace 'g' with 'c' )

Leventshein Distance = 1

Example 2
s1 = "slap"
s2 = "splash"

slap -> splap ( insert 'p' between 's' and 'l' )
splap -> splas ( substitute 'p' with 's' )
splas -> splash ( insert 'h' at the end )

Leventshein Distance = 3

Levenshtein Distance Algorithm

The simplest way to find Levenshtein distance is to use the brute force approach. Suppose we need to convert string S1 to S2. Then for each character of S1, we have 3 choices. We can either replace it, delete it or insert a character after it.

In the brute force approach, we basically use a recursive function to calculate the levenshtein distance between prefixes of S1 and S2. The recursive function calculates levenshtein distance between smaller prefixes then uses that information during backtracking to calculate levenshtein distance between longer prefixes and so on.

Let S1, S2 be two strings of length N, M respectively. Then the recurrence relation for Levenshtein distance is

dp(N, M) is the Levenshtein distance between prefix of length N of S1 and prefix of length M of S2.

Read the pseudo code for better understanding.

Pseudo Code ( Recursive Top Down Implementation )

Note : We are assuming string to be zero indexed
S1: String 1
S2: String 2
N: Length of S1
M: Length of S2
i: length of prefix of S1
j: length of prefix of S2

/*
recursive function to calculate Levenshtein Distance between two strings
LD(S1, S2, i, j) returns the Levenshtein Distance between prefix of length i of S1
and prefix of length j of S2
To find Levenshtein distance between S1 and S2, we call function as LD( S1, S2, N, M)
*/

function LD (String S1, String S2, i, j)
	
	/* 
	i == 0 means that prefix of S1 is empty
	The only way to convert prefix of S1 to prefix of S2 is to insert all character of S2 prefix to S1 prefix
	number of characters in S2 prefix = j
	So, we simply return j
	*/
	
	if i == 0
		return j
	
	/* 
	j == 0 means that S2 prefix is empty
	The only way to convert S1 to S2 is to delete all character of S1 prefix
	Number of characters in S1 prefix = i
	So, we simply return i
	*/
	
	if j == 0
		return i
		
	/* 
	S1[i] is the last character of the current prefix of S1
	S2[j] is the last character of the current prefix of S2
	*/
	
	if S1[i] == S2[j]
	
		/* 
		If S1[i] == S2[j] then we simply move to next set of prefix
		We do not need to perform any single digit operation
		*/
	
		return LD ( S1, S2, i-1, j-1)
	
	else
	
		/* 
		If S1[i] is not equal to S2[j] then we have 2 choices
		1. Replace S1[i] with S[j] and return LD( S1, S2, i-1, j-1).
		2. Delete S1[i] and return LD( S1, S2, i-1, j).
		3. Insert S2[j] after S1[i] in S1 and return LD( S1, S2, i, j-1).
		We take the minimum of the above 3 choices
		*/
		
		return 1 + minimum(
			LD( S1, S2, i-1, j-1),
			LD( S1, S2, i-1, j),
			LD( S1, S2, i, j-1)
			) 

From above, we can see that for each character, we have 3 options. Therefore, the time complexity of the brute force method is O(3max(N,M)) in the worst case.

Dynamic Programming Approach

The recursive tree of the above recursive function is

From the recursion tree, we see that the recursive function is dividing the problem into many subproblems and solving them. But there are many subproblems that are called multiple times. Therefore, instead of calculating the same subproblem, again and again, we can use an array to save the result and re-use it. Let the array be dp[N+1][M+1].

In the dynamic programming approach, we basically compare all prefixes of S1 and S2 using a nested loop. In dp[i][j], i and j denote the length of prefixes of S1 and S2 respectively.

Example

Suppose we want to find levenshtein distance between “bravo” and “raven”.

STEP 1

Initialize an array of size dp[6][6].

STEP 2

Compare prefix of length i = 0 of “bravo” with all prefix of “raven”.

STEP 3

Compare prefix of length i = 1 of “bravo” with all prefix of “raven”

STEP 3

Compare prefix of length i = 2 to i = 5 of “bravo” with all prefix of “raven”

The levenshtein distance between “bravo” and “raven” is given by dp[5][5].

Pseudocode using Dynamic Programming approach

S1: String 1
S2: String 2
N: Length of S1
M: Length of S2
  
Function LD(String S1, String S2, Int N, Int M)
        
       Boolean dp[N+1][M+1]
        
       FOR i = 0 to i = N
             For j = 0 to j = M
                     
                    If i == 0
                           dp[i][j] = j
                     
                    Else If j == 0
                           dp[i][j] = i
                            
                    Else If S1[i-1] == S2[j-1]
                           dp[i][j] = dp[i-1][j-1]
                     
                    Else 
                           dp[i][j] = 1 + minimum( dp[i-1][j-1], dp[i-1][j], dp[i][j-1] )
                            
             END FOR
       END FOR
        
       Return dp[N][M]
        
END Function

Implementation

#include <iostream>
#include <string>
#include <vector>
using namespace std;
 
// function to calculate levenshtein distance between string s1 and s2
// n : length of s1
// m : length of s2
int Levenshtein_Distance(string s1, string s2, int n,int m){
	
	// initializing 2d vector dp[n+1][m+1] to store
	// subproblem solution
	vector<vector<int>> dp(n+1,vector<int>(m+1,0));
	
	// the nested loop below calculates
	// levenshtein distance between all prefixes of s1 ans s2
	// i is the length of prefix of s1
	// j is the length of prefix of s2
	for(int i=0;i<=n;++i)
	{
		for(int j=0;j<=m;++j)
		{
			
			if(i==0){
				// if i=0, then it means prefix of s1 is empty string
				// the only way to make s2 prefix equal to empty string
				// is to delete all characters of s2 prefix
				dp[i][j] = j;
			}
			else if(j==0){
				// if j=0, then it means prefix of s2 is empty string
				// the only way to make s1 prefix equal to empty string
				// is to delete all characters of s1 prefix
				dp[i][j] = i;
			}
			else if(s1[i-1]==s2[j-1]){
				// s1[i-1] is the last character of current prefix of s1
				// s2[j-1] is the last character of current prefix of s2
				// if s1[i-1] = s2[i-1] then we don't have to perform any operation
				dp[i][j] = dp[i-1][j-1];
			}
			else{
				// if s1[i-1] is not equal to s2[j-1] then we need to perform single-digit operation
				// to make prefixes equal
				// we have 2 options
				// 1. Replace s1[i-1] with s2[j-1]. In this case, dp[i][j] = 1 + dp[i-1][j-1].
				// 2. Delete s1[i-1]. In this case, dp[i][j] = 1 + dp[i-1][j]
				// 2. Insert s2[j-1] after s1[i-1] in s1. In this case, dp[i][j] = 1 + dp[i][j-1]
				// We take the minimum of the above 3 options
				dp[i][j] = 1 + min(dp[i-1][j-1], min(dp[i-1][j], dp[i][j-1]));
			}
			
		}
	}
	
	return dp[n][m];
	
}
 

int main()
{
   
    string s1 = "bravo";
    string s2 = "raven";
 
    cout<<"Levenshtein Distance between \""<<s1<<"\" and \""<<s2<<"\" : "<<Levenshtein_Distance(s1, s2, s1.length(), s2.length())<<endl;
 
    return 0;
}

Output

Levenshtein Distance between "bravo" and "raven" : 3

Time Complexity: O(N*M)
Space Complexity: O(N*M)

Optimization

From the above code, you must have observed that to calculate levenshtein distance for ith prefix of S1, we only need information about (i-1)th prefixes. Therefore, we can solve the above problem using 2 arrays of size M instead of using an array of size N*M.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// function to calculate levenshtein distance between string s1 and s2
// n : length of s1
// m : length of s2
int Levenshtein_Distance(string s1, string s2, int n, int m) {

	vector<int> dp0(m + 1);
	vector<int> dp1(m + 1);

	for (int i = 0; i <= n; ++i)
	{
		for (int j = 0; j <= m; ++j)
		{

			if (i == 0) {
				dp1[j] = j;
			}
			else if (j == 0) {
				dp1[j] = i;
			}
			else if (s1[i - 1] == s2[j - 1]) {
				dp1[j] = dp0[j - 1];
			}
			else {
				dp1[j] = 1 + min(dp0[j - 1], min(dp0[j], dp1[j - 1]));
			}

		}
		dp0 = dp1;
	}

	return dp1[m];

}


int main()
{

	string s1 = "bravo";
	string s2 = "raven";

	cout << "Levenshtein Distance between \"" << s1 << "\" and \"" << s2 << "\" : " << Levenshtein_Distance(s1, s2, s1.length(), s2.length()) << endl;

	return 0;
}

Output

Levenshtein Distance between "bravo" and "raven" : 3

Applications

  • String Matching
  • Spell Correction
  • Plagiarism Detection
  • DNA analysis

References

Leave a Comment

Your email address will not be published. Required fields are marked *