In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets. Variables, especially names, are not always exactly the same in all sources of your data. When matching data, you need to be able to programmatically determine if ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately within SAS, there are several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then I will show you an example that uses my favorite from this list.
The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:
if strip(upcase(name1))=strip(upcase(name2)) then do;
The COMPARE function can be used to make the same comparison.
if compare(name1,name2,'il')=0 then do;
The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of operations (deletions, insertions, or replacement) of a single characters that are required to transform string-1 into string-2.
Each operation basically ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the greater the cost. The COMPGED will return the total cost for all operations that occur. The costs returned by COMPGED can be altered by using CALL COMPCOST so that the cost are specific to your needs. A common use I have seen for using the COMPGED function is using it to compare email addresses.
email1='JohnDoe@abc.com'; email2='John_Doe@abc.com'; cost=compged(email1,email2);
The value of COST will be 30 which is the cost of adding punctuation to a string.
The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance that is computed by COMPLEV is a special case of the generalized edit distance that is computed by COMPGED. The result is the COMPLEV executes much more quickly than COMPGED. However, the COMPLEV function is not as powerful or versatile as the COMPGED function. The COMPLEV function is generally most useful when comparing simple strings and when speed of comparison is important.
The SPEDIS function is the oldie-but-a-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to the each operation such as swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument. It is important to remember this because this makes the order of the arguments important. This means that SPEDIS(X,Y) does not always equal SPEDIS(Y,X).
Summary: Review your data
Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.
- If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
- If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.
The bottom line is that you will have to review your data and what differences are important to you. The results returned by these functions are subjective. You have to determine what an acceptable difference is. I suggest that you simply test out each function to see which one works best for you.
COMPGED and SOUNDEX Example
Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
The below example encodes each of the first names that is to be compared with the SOUNDEX function and then evaluates the results of using SOUNDEX with the COMPGED function. The goal of this example is to determine if the value of FNAME is the same as ‘Johnathan’ or ‘William’.
/* Sample data that contains names and the class being taken */ data class; input fname : $12. lname $ class : $9.; datalines; Jon Smith Math John Smith Math Johnny Smith Math James Smith Math Will Miller Chemistry Willy Miller Chemistry Willie Miller Chemistry Bonny Davis Gym Milly Wilson Biology ; /* Data set that contains the name and the grade for the class */ data grade; input fname : $12. lname $ grade $; datalines; Johnathan Smith A William Miller B ; /* This code uses both the SOUNDEX and COMPGED functions. I have found */ /* that by comparing the strings produced by SOUNDEX, I have been able */ /* to get a tighter comparison. */ data c; set class; tmp1=soundex(fname); do i=1 to nobs; set grade(rename=(fname=fname2)) point=i nobs=nobs; tmp2=soundex(fname2); dif=compged(tmp1,tmp2); if dif<=50 then do; possible_match='Yes'; drop i tmp1 tmp2 fname2; output; end; end; run; proc print; run;
One of the great things about SAS is the number of different ways you can accomplish a given goal. If you have a technique to make fuzzy comparisons that you like to use, please share. I would love to see some of the techniques being used!