SAS author's tip: SPEDIS and fuzzy matching

This week’s featured SAS author’s tip comes from SAS user extraordinaire Ron Cody.  Honestly, because Ron has written so many SAS books, I could probably feature a year’s worth of tips from his work alone. To find something useful in any of Ron’s books, one merely needs to let the book fall open to a random page. Although I thought about trying this, I decided to select an excerpt from his bestselling book Learning SAS by Example: A Programmer’s Guide.

The following excerpt is from SAS Press author Ron Cody's book Learning SAS by Example: A Programmer's Guide, Copyright © 2007, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. (Please note that results may vary depending on your version of SAS software.)

Performing a Fuzzy Match

The SPEDIS function (stands for spelling distance) is used for fuzzy matching, which is comparing character values that may be spelled differently. The logic is a bit complicated, but using this function is quite easy. As an example, suppose you want to search a list of names to see if the name Friedman is in the list. You want to look for an exact match or names that are similar. Here is such a program:

 Program 12-18 Using the SPEDIS function to perform a fuzzy match

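data fuzzy;
   input Name $20.;
   /* Spelling distance between each name and 'Friedman'; 0 means an exact match */
   Value = spedis(Name,'Friedman');
   /* The input data lines (the list of names) are not reproduced in this excerpt */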

Here is a listing of data set Fuzzy:

[Image: listing of data set Fuzzy, from Ron Cody's Learning SAS by Example]

The SPEDIS function returns a 0 if the two arguments match exactly. The function assigns penalty points for each type of spelling error. For example, getting the first letter wrong is assigned more points than misspelling other letters. Interchanging two letters is a relatively small error, as is adding an extra letter to a word.
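To see how the different kinds of errors are scored, a minimal sketch along the following lines can help. The misspellings are made up for illustration and are not taken from the book's data; the values written to the log may vary with your version of SAS.

```sas
data _null_;
   /* Each call compares one illustrative misspelling with 'Friedman' */
   Exact       = spedis('Friedman',  'Friedman');  /* identical spelling          */
   FirstLetter = spedis('Xriedman',  'Friedman');  /* wrong first letter          */
   OtherLetter = spedis('Freedman',  'Friedman');  /* wrong letter elsewhere      */
   Swapped     = spedis('Freidman',  'Friedman');  /* two letters interchanged    */
   ExtraLetter = spedis('Friedmann', 'Friedman');  /* extra letter added          */
   /* Write the five values to the SAS log for comparison */
   put Exact= FirstLetter= OtherLetter= Swapped= ExtraLetter=;
run;
```

If the description above holds, Exact should be 0 and FirstLetter should be the largest of the remaining values.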

Once the total number of penalty points has been computed, the resulting value is computed as a percentage of the length of the first argument. This makes sense because getting one letter wrong in a 3-letter word would be a more serious error than getting one letter wrong in a 10-letter word.
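The effect of word length can be checked with a short sketch like this one. The two word pairs are hypothetical examples, each with a single substituted letter, chosen only to contrast a 3-letter and a 10-letter first argument.

```sas
data _null_;
   /* One substituted letter in a 3-letter first argument */
   ShortWord = spedis('cot', 'cat');
   /* One substituted letter in a 10-letter first argument */
   LongWord  = spedis('dictianary', 'dictionary');
   /* Write both values to the SAS log */
   put ShortWord= LongWord=;
run;
```

Because the penalty is expressed relative to the length of the first argument, ShortWord would be expected to come out noticeably larger than LongWord.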

Notice that the two character values evaluated by the SPEDIS function are case-sensitive (look at the last observation in the listing). If case may be a problem, use the UPCASE or LOWCASE function before testing the value with SPEDIS.
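A case-insensitive comparison might look like the following sketch. It assumes the data set Fuzzy created by Program 12-18 already exists; the names fuzzy_nocase and Value_nocase are made up for this example.

```sas
data fuzzy_nocase;
   set fuzzy;
   /* Uppercase the name and compare it with an uppercase target,
      so that differences in case carry no penalty */
   Value_nocase = spedis(upcase(Name), 'FRIEDMAN');
run;
```

Keeping the original Value alongside Value_nocase makes it easy to see which names differ from Friedman only in case.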

To identify any name that is similar to Friedman, you could extract all names where the value returned by the SPEDIS function is less than some predetermined value. In the program here, values less than 15 or 20 would identify some reasonable misspellings of the name.
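One way to extract the similar names is a subsetting IF statement, sketched below. It assumes the data set Fuzzy from Program 12-18; the data set name possible_matches is hypothetical, and the cutoff of 20 is simply one of the values suggested above, so adjust it for your own data.

```sas
data possible_matches;
   set fuzzy;
   /* Keep only names whose spelling distance is below the chosen cutoff */
   if Value lt 20;
run;

title "Names similar to Friedman";
proc print data=possible_matches noobs;
run;
```

A lower cutoff returns fewer false matches but may miss some misspellings, so it is worth listing the results and reviewing them by eye.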

To learn more about Ron Cody and all of his books, visit his author page. There you can read free book chapters, see reviews from other SAS users, listen to interviews, and find out more about his soon-to-be-published book SAS Statistics by Example. To receive notification about the availability of this new book, sign up here.
