“Here’s Johnny!” Well, sometimes it’s John, and sometimes Jonathan, and sometimes Jon.

In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets: variables, especially names, are not always recorded the same way in every source of your data. When matching data, you need a way to programmatically determine whether ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately, SAS provides several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then walk through an example that uses my favorite from the list.

COMPARE Function

The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:

```
if strip(upcase(name1))=strip(upcase(name2)) then do;
```

The COMPARE function can be used to make the same comparison.

```
if compare(name1,name2,'il')=0 then do;
```
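To see what COMPARE actually returns, here is a small sketch (the strings are illustrative). With the 'i' and 'l' modifiers, case and leading blanks are ignored; without modifiers, a nonzero result's magnitude is the position of the first differing character, and its sign indicates which string sorts first:

```
data _null_;
   /* 0: strings match once case ('i') and leading blanks ('l') are ignored */
   r1=compare('  JOHN DOE','John Doe','il');
   /* nonzero: magnitude is the position of the first difference */
   r2=compare('John','Joan');
   put r1= r2=;
run;
```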

COMPGED Function

The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of single-character operations (deletions, insertions, or replacements) required to transform string-1 into string-2.

Each operation ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, the operation has a cost of 10. The more dramatic the operation, the greater the cost. The COMPGED function returns the total cost of all operations that occur. The default costs can be altered with CALL COMPCOST so that the costs are specific to your needs. A common use I have seen for the COMPGED function is comparing email addresses.

```
email1='JohnDoe@abc.com';
email2='John_Doe@abc.com';
cost=compged(email1,email2);
```

The value of COST will be 30, which is the default cost of inserting a punctuation character into a string.
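If the default costs do not fit your data, CALL COMPCOST can change them before COMPGED is called. A minimal sketch (the cost value of 5 is an arbitrary choice for illustration; see the CALL COMPCOST documentation for the full list of operation keywords):

```
data _null_;
   /* lower the penalty for punctuation so the two addresses */
   /* score as a closer match than the default cost of 30    */
   call compcost('punctuation=',5);
   cost=compged('JohnDoe@abc.com','John_Doe@abc.com');
   put cost=;
run;
```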

COMPLEV Function

The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance computed by COMPLEV is a special case of the generalized edit distance computed by COMPGED. As a result, COMPLEV executes much more quickly than COMPGED, but it is not as powerful or versatile. The COMPLEV function is generally most useful when you are comparing simple strings and when speed of comparison is important.
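A quick way to see the difference is to run both functions on the same pair of strings; COMPLEV counts raw operations, while COMPGED weights them:

```
data _null_;
   ged=compged('Johnny','Jonny');  /* weighted generalized edit distance */
   lev=complev('Johnny','Jonny');  /* plain Levenshtein distance: one deletion */
   put ged= lev=;
run;
```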

SPEDIS Function

The SPEDIS function is the oldie-but-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to each operation, such as swap, append, and delete. SPEDIS sums the costs and then divides the sum by the length of the first argument. This makes the order of the arguments important: SPEDIS(X,Y) does not always equal SPEDIS(Y,X).
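The asymmetry is easy to demonstrate. Because the summed cost is divided by the length of the first argument, swapping the arguments can change the result:

```
data _null_;
   d1=spedis('Johnathan','John');  /* cost divided by length of 'Johnathan' */
   d2=spedis('John','Johnathan');  /* cost divided by length of 'John'      */
   put d1= d2=;
run;
```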

Summary: Review your data

Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.

• If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
• If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.

The bottom line is that you will have to review your data and decide which differences matter to you. What counts as an acceptable difference is subjective, and you have to determine it for your application. I suggest that you simply test each function to see which one works best for your data.

COMPGED and SOUNDEX Example

Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The example below encodes each of the first names to be compared with the SOUNDEX function and then evaluates the SOUNDEX results with the COMPGED function. The goal of this example is to determine whether the value of FNAME is the same as ‘Johnathan’ or ‘William’.

```
/* Sample data that contains names and the class being taken */
data class;
   input fname : $12. lname $ class : $9.;
   datalines;
Jon Smith Math
John Smith Math
Johnny Smith Math
James Smith Math
Will Miller Chemistry
Willy Miller Chemistry
Willie Miller Chemistry
Bonny Davis Gym
Milly Wilson Biology
;

/* Data set that contains the name and the grade for the class */
data grade;
   input fname : $12. lname $ grade $;
   datalines;
Johnathan Smith A
William Miller B
;

/* This code uses both the SOUNDEX and COMPGED functions.  I have found */
/* that by comparing the strings produced by SOUNDEX, I have been able  */
/* to get a tighter comparison.                                         */
data c;
   set class;
   tmp1=soundex(fname);
   do i=1 to nobs;
      set grade(rename=(fname=fname2)) point=i nobs=nobs;
      tmp2=soundex(fname2);
      dif=compged(tmp1,tmp2);
      if dif<=50 then do;
         possible_match='Yes';
         output;
      end;
   end;
   drop i tmp1 tmp2 fname2;
run;

proc print; run;
```

One of the great things about SAS is the number of different ways you can accomplish a given goal.  If you have a technique to make fuzzy comparisons that you like to use, please share.  I would love to see some of the techniques being used!


SAS Technical Support Engineer, CAS and Open Source Languages

Kevin Russell is a Technical Support Engineer in the CAS and Open Source Languages group in Technical Support. He has been a SAS user since 1994. His main area of expertise is the macro language, but he also provides general support for the DATA step and Base procedures. He has written multiple papers and presented them at various SAS conferences and user events.

1. Vincent Rejany on

You can also have a look at the very powerful SAS Data Quality functions for matching customer information, especially the DQMATCH function.

The DQMATCH function parses the input character value and creates a match code. The match code represents a condensed version of the character value. The amount of information in the match code is determined by the sensitivity level. For higher sensitivities, two values must be very similar to produce the same match codes. At lower sensitivities, two values produce the same match codes despite their dissimilarities.

http://support.sas.com/documentation/cdl/en/dqclref/70016/HTML/default/viewer.htm#p0ccmxri1e463hn15g6uxn5dz5cc.htm
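For readers who have the SAS Data Quality Server licensed, a DQMATCH call looks roughly like the following sketch (the data set name CUSTOMERS, the sensitivity of 85, and the locale are illustrative choices, not requirements):

```
data matched;
   set customers;   /* hypothetical input data set */
   /* 'Name' match definition, sensitivity 85, English (US) locale */
   mcode=dqmatch(name,'Name',85,'ENUSA');
run;
```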

2. gaurav manchanda on

Hi,
I have a dataset Customer with variables customer name, id, email, and address. How can I find customers with the same address?

3. Have you or anyone commenting had experience utilizing SAS (along with any of these functions) to match up 2 data sets each with many values based on a fuzzy logic match?

• Wren Nessle Buck on

proc sql allows you to use functions in join conditions, which has been eminently useful for me when the linking fields on my two (or more) datasets are not exact matches, but have quantifiable similarity. You can get all the possible matches with liberal join criteria, and then select the best one with a "having" clause.
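The approach described above might look something like this sketch (the data set names MASTER and LOOKUP, the variables, and the threshold of 100 are all hypothetical):

```
proc sql;
   /* pair every row in MASTER with every row in LOOKUP, score each pair, */
   /* then keep only the closest candidate per ID under a threshold       */
   create table best_match as
      select a.id, a.name, b.name as candidate,
             compged(a.name, b.name) as dist
      from master as a, lookup as b
      group by a.id
      having calculated dist = min(calculated dist)
         and calculated dist <= 100;
quit;
```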

• Lysbet,

Read my white paper on "Fuzzy Merge / Match" which I published back in 1998 - It was published in SAS Observation Magazine - I still have the paper on my website under SAS Tips and the actual SAS code is included with the paper.

I use the routine to match records between two files using such fields as “Name”, “Address”, “City”, “State”, and possibly “Zipcode”. It all depends on what information/data each file has that can be used to match on. The code does not care if Name=”Charles Patridge” is in file One and Name in file Two has it as “Partridge, Chuck” – i.e., the order in which the data is stored is NOT important.

HTH,
Chuck Patridge

4. Matching names is a common application for fuzzy matching, but it also happens in other areas. Take for instance a situation in the airline industry. I want to match last year's flights with this year's flights. What are the matching elements: flight number, flight leg (from-to), flight date, departure and arrival time. Quite likely one or more of those elements cannot be matched: new flight number, same flight number but new departure time, and so on. Even matching on flight date is problematic: match on calendar date, or shift a day to match on day of week (to analyse weekly patterns)?

For these situations I have developed a 'fuzzy merge' that takes e.g. this year's flight data set and steps through last year's data set to find the closest match.
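A minimal sketch of that kind of closest-match step-through, using the same POINT= technique as the example in the post (the data set and variable names here are hypothetical, and a real version would also restrict candidates to the same flight leg):

```
data closest;
   set this_year;
   best=.;                                   /* smallest distance seen so far */
   do i=1 to nobs;
      set last_year(rename=(dep_time=dep_time2)) point=i nobs=nobs;
      dist=abs(dep_time-dep_time2);          /* e.g. minutes between departures */
      if best=. or dist<best then do;
         best=dist;
         match_time=dep_time2;               /* remember the closest candidate */
      end;
   end;
   drop i dep_time2 dist;
run;
```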

5. Kevin,

Here is a paper I published in the SAS Observation back in 1998 or so - thought you would enjoy the various techniques I used / use in fuzzy matching

http://www.sconsig.com/sastips/obswww15.pdf

When using the SOUNDEX function, I sometimes will modify the resulting value and then do my test for "fuzziness". I, too, have used these four functions - I especially like the COMPLEV function. It is very handy in finding strings that are close in nature/meaning to the target string, and can be used to expand your searches for strings you may not have thought of - especially in the area of Text Mining.

6. Fuzzy matching is used by many modern electronic devices to autocorrect your (mis)typed words. For more on finding "nearby words," see the article "Distance between Words".
