How to perform a fuzzy match using SAS functions

SAS Technical Support Problem Solvers“Here’s Johnny!!!” and well sometimes John and sometimes Jonathan and sometimes Jon.

In the real world, you sometimes need to make matching character strings more flexible. This is especially common when merging data sets. Variables, especially names, are not always exactly the same in all sources of your data. When matching data, you need to be able to programmatically determine if ‘John Doe’ is the same as ‘Johnny Doe’. The term most often associated with this type of matching is ‘fuzzy matching’. Fortunately within SAS, there are several functions that allow you to perform a fuzzy match. I’ll show you the most common of these functions and then I will show you an example that uses my favorite from this list.

COMPARE Function

The COMPARE function returns the position of the leftmost character by which two strings differ, or returns 0 if there is no difference. This function is most useful when comparing two strings that should be the same, but may only differ in case or due to leading or trailing blanks. This function takes the place of the following code:

if strip(upcase(name1))=strip(upcase(name2)) then do;

The COMPARE function can be used to make the same comparison.

if compare(name1,name2,’il’)=0 then do;

COMPGED Function

The COMPGED function returns the generalized edit distance between two strings. Specifically, the COMPGED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of operations (deletions, insertions, or replacement) of a single characters that are required to transform string-1 into string-2. 

Each operation basically ‘costs’ a certain value. For example, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the greater the cost.  The COMPGED will return the total cost for all operations that occur. The costs returned by COMPGED can be altered by using CALL COMPCOST so that the cost are specific to your needs. A common use I have seen for using the COMPGED function is using it to compare email addresses.

email1='JohnDoe@abc.com';                                                                                                        
email2='John_Doe@abc.com';                                                                                                          
cost=compged(email1,email2);

The value of COST will be 30 which is the cost of adding punctuation to a string.

COMPLEV Function

The COMPLEV function is very similar to the COMPGED function. The difference is that the Levenshtein edit distance that is computed by COMPLEV is a special case of the generalized edit distance that is computed by COMPGED. The result is the COMPLEV executes much more quickly than COMPGED. However, the COMPLEV function is not as powerful or versatile as the COMPGED function. The COMPLEV function is generally most useful when comparing simple strings and when speed of comparison is important.

SPEDIS Function

The SPEDIS function is the oldie-but-a-goodie of the bunch. The SPEDIS function determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words. SPEDIS is similar to COMPGED in that it assigns a cost to the each operation such as swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument. It is important to remember this because this makes the order of the arguments important. This means that SPEDIS(X,Y) does not always equal SPEDIS(Y,X).

Summary: Review your data

Any process that is described as ‘fuzzy’ is obviously not an exact science. With each of these functions using different algorithms, each is going to have its own strengths.

  • If you are comparing complex strings and need the most control, then I would recommend looking at the COMPGED function.
  • If you are comparing fairly simple strings and within large data sets, then COMPLEV may be a better choice.

The bottom line is that you will have to review your data and what differences are important to you. The results returned by these functions are subjective. You have to determine what an acceptable difference is. I suggest that you simply test out each function to see which one works best for you.

COMPGED and SOUNDEX Example

Of the above functions, COMPGED is the one I tend to use most often because, for the scenarios brought to me by various customers, COMPGED has produced the most precise results. I have been able to get the best results by combining the COMPGED function with the SOUNDEX function. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

The below example encodes each of the first names that is to be compared with the SOUNDEX function and then evaluates the results of using SOUNDEX with the COMPGED function. The goal of this example is to determine if the value of FNAME is the same as ‘Johnathan’ or ‘William’.

/* Sample data that contains names and the class being taken */                                                                         
data class;                                                                                                                             
  input fname : $12. lname $ class : $9.;                                                                                               
  datalines;                                                                                                                            
Jon Smith Math                                                                                                                          
John Smith Math                                                                                                                         
Johnny Smith Math                                                                                                                       
James Smith Math                                                                                                                        
Will Miller Chemistry                                                                                                                   
Willy Miller Chemistry                                                                                                                  
Willie Miller Chemistry                                                                                                                 
Bonny Davis Gym                                                                                                                         
Milly Wilson Biology                                                                                                                    
;                                                                                                                                       
                                                                                                                                        
/*  Data set that contains the name and the grade for the class */                                                                      
data grade;                                                                                                                             
  input fname : $12. lname $ grade $;                                                                                                   
  datalines;                                                                                                                            
Johnathan Smith A                                                                                                                       
William Miller B                                                                                                                        
;                                                                                                                                       
                                                                                                                                        
                                                                                                                                        
/* This code uses both the SOUNDEX and COMPGED functions.  I have found */                                                              
/* that by comparing the strings produced by SOUNDEX, I have been able */                                                               
/* to get a tighter comparison.                                        */                                                               
data c;                                                                                                                                 
  set class;                                                                                                                            
  tmp1=soundex(fname);                                                                                                                  
  do i=1 to nobs;                                                                                                                       
    set grade(rename=(fname=fname2)) point=i nobs=nobs;                                                                                 
    tmp2=soundex(fname2);                                                                                                               
    dif=compged(tmp1,tmp2);                                                                                                             
    if dif<=50 then do;                                                                                                                 
      possible_match='Yes';                                                                                                             
      drop  i tmp1 tmp2  fname2;                                                                                                        
      output;                                                                                                                           
    end;                                                                                                                                
  end;                                                                                                                                  
run;                                                                                                                                    
proc print; run;

One of the great things about SAS is the number of different ways you can accomplish a given goal.  If you have a technique to make fuzzy comparisons that you like to use, please share.  I would love to see some of the techniques being used!

Post a Comment

Need test data? Use SAS to extract it from these surprising sources

Everyone is always looking for test data. Business analysts want it for demos and prototypes. Software developers want it for development and unit testing. Testers want it for system and integration testing.

I’ve written many programs to generate test data over the years, as have many other SAS users. Generated data can be great, but it’s always nice to have real data when you can get it. It demos better. It highlights issues better (Believe me, real people will enter things you never dreamed of including in your generated data). It even helps you be a better software designer since you can actually look at real data and react to what you’re seeing.

So how do you get real data when individuals, businesses and organizations are increasingly protective of it?

Well, actually, in some areas of business that seems to be changing. In an effort to increase transparency and visibility, many organizations make some data available for use by outsiders. For example, Amazon, Google Shopping, and BestBuy all make their product and review data available for integration into other web sites.

This product data is meant for web integration so PROC HTTP is the key to downloading, and it could hardly be easier!

Let’s look at getting some BestBuy product review data. Looking at the documentation for the Reviews API, we are told to format our request like this:

http://api.remix.bestbuy.com/v1/reviews(sku=1780275)?apiKey=YourAPIKey&s

Here are the key parameters:

  • sku parameter denotes the particular product being reviewed
  • apiKey parameter denotes your authentication key (you must register to get it)
  • show parameter denotes which columns to return

From there, downloading the data is just a matter of putting the request into PROC HTTP and changing the parameters to meet your needs. Here’s some code I ran to get iPad 2 Air reviews:

filename out temp;

PROC HTTP out=out url=’http://api.remix.bestbuy.com/v1/reviews(sku=3315023)?apiKey=your-authentication-key&page=1&pageSize=100&show=id,sku,comment,rating’
method=”get” ;
run;

libname out xmlv2;

proc append base=iPadReviews data=out.review; run;

BestBuy places a limited of 100 records (pagesize parameter) per request, so I wrote the code so it could be used iteratively. To do so, simply create a SAS macro from this example and increment the page parameter via macro substitution to read through all the reviews.

Now, let’s look our new iPad 2 Air product review test data!

BestBuydata

Post a Comment

SAS timer - the key to writing efficient SAS code

New Year to me is always a stark reminder of the inexorability of Time. In a day-to-day life, time is measured in small denominations - minutes, hours, days… But come New Year, and this inescapable creature – Time – makes its decisive leap – and in a single instant, we become officially older and wiser by the entire year’s worth.

What’s a better time to re-assess ourselves, personally and professionally! What’s a better time to Resolve to improve your SAS programming skills, as skillfully crafted by Michael A. Raithel in his recent blog post.

I thought I could write a post showing how to be efficient and kill two birds with one stone.  The birds here are two New Year’s Raithel’s proposed resolutions:

#2 Volunteer to help junior SAS programmers.

#12 Reduce processing time by writing more efficient programs.

To combine the two, I could have titled this post “Helping junior SAS programmers to reduce processing time by writing more efficient programs”. However, I am not going to “teach” you efficient coding techniques which are a subject deserving of a multi-volume treatise. I will just give you a simple tool that is a must-have for any SAS programmer (not just junior) who considers writing efficient SAS code important. This simple tool has been the ultimate judge of any code’s efficiency and it is called timer.

What is efficient?

Setting aside hardware constraints and limitations (which are increasingly diminishing nowadays), efficient means fast or at least fast enough not to exceed ever-shrinking user tolerance of wait time.

Of course, if you are developing a one-time run code to generate some ad-hoc report or produce results for uniquely custom computations, your efficiency criteria might be different, such as “as long as it ends before the deadline” or at least “does not run forever”.

However, in most cases, SAS code is developed for some applications, in many cases interactive applications, where many users run the code over and over again. It may run behind the scenes of a web application with a user waiting (or rather not wanting to wait) for results. In these cases, SAS code must be really fast, and any improvement in its efficiency is multiplied by the number of times it is run.

What is out there?

SAS provides the following SAS system options to measure the efficiency of SAS code:

STIMER. You may not realize that you use this option every time you run a SAS program. This option is turned on by default (NOSTIMER to turn it off) and controls information written to the SAS Log by each SAS step. Each step of a SAS program by default generates the following sample NOTE in SAS Log:

NOTE: DATA statement used (Total process time):
      real time           1.31 seconds
      cpu time            1.10 seconds

FULLSTIMER. This option (NOFULLSTIMER to turn it off) provides much more information on used resources for each step. A sample Log output of a FULLSTIMER option for a SAS Data Step is listed below:

NOTE: DATA statement used:
real time                   0.06 seconds
user cpu time               0.02 seconds
system cpu time             0.00 seconds
Memory                      88k
Page Faults                  10
Page Reclaims                 0
Page Swaps                    0
Voluntary Context Switches   22
Involuntary Context Switches  0
Block Input Operations       10
Block Output Operations      12

While the FULLSTIMER option provides plenty of information for SAS code optimization, in many cases it is more than you really need. On the other hand, STIMER may provide quite valuable information about each step, thus identifying the most critical steps of your SAS program.

Get your own SAS timer

If your efficiency criteria is how fast your SAS program runs as a whole, than you need an old-fashioned timer, with start and stop events and time elapsed between them. To achieve this in SAS programs, I use the following technique.

  1. At the very beginning of your SAS program, place the following line of code that effectively starts the timer and remembers the start time:
  2. /* Start timer */
    %let _timer_start = %sysfunc(datetime());
    

  3. At the end of your SAS program place the following code snippet that captures the end time, calculates duration and outputs it to the SAS Log:
  4. /* Stop timer */
    data _null_;
      dur = datetime() - &_timer_start;
      put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
    run;
    

    The resulting output in the SAS log will look like this:

    ------------------------------
     TOTAL DURATION:   0:01:31.02
    ------------------------------
    

    Despite its utter simplicity, this little timer is a very convenient little tool to improve your SAS code efficiency. You can use it to compare or benchmark your SAS programs in their entirety.

    Warning. In the above timer, I used the datetime() function, and I insist on using it instead of the time() function as I saw in many online resources. Keep in mind that the time() function resets to 0 at midnight. While time() will work just as well when start and stop times are within the same date, it will produce completely meaningless results when start time falls within one date and stop time falls within another date. You can easily trap yourself in when you submit your SAS program right before midnight while it ends after midnight, which will result in an incorrect, even negative, duration.

    I hope using this SAS timer will help you writing more efficient SAS programs.

Post a Comment

Cloud: 4 deployment models

cloud4modelsThis is the last of my series of posts on the NIST definition of cloud computing. As you can see from this Wikipedia definition, calling anything a “cloud” is likely to be the fuzziest way of describing it.

In meteorology, a cloud is a visible mass of liquid droplets or frozen crystals made of water or various chemicals suspended in the atmosphere above the surface of a planetary body. These suspended particles are also known as aerosols and are studied in the cloud physics branch of meteorology.

Not that there is anything wrong with the label “cloud”--it’s a shortcut that allows us to quickly convey an idea. But for anything beyond that, when talking about functionality, we would be well advised to define and describe “cloud” in as much detail as possible so that all people involved have the same picture in their mind, and not whatever it is they think of when they think of “cloud”.

The NIST definitions help us narrow down features, functionality and models, but those are still only broad categories that leave certain gaps in which misunderstandings can easily sprout. I encourage you to use these definitions, but also to go further and describe cloud architectures by using terms that are as precise as possible.

In recent posts, I talked about the five characteristics of cloud, as well as the three service models. In this final installment of the series, I will discuss the four cloud deployment models. Read More »

Post a Comment

SAS Grid Manager, shared-nothing storage and Hadoop: Part 1

When designing a SAS Grid Manager architecture, there is a requirement that has always been a critical component: a clustered file system. Over the years, vendors have released versions of these systems that are more robust and SAS has increased the minimum IO requirements, but the basic design has never changed—until now.

Any guess who the driver of this change could be? I heard a yellow elephant somewhere? Yes, Hadoop, but not only! File systems are now available that support SAS Grid computing in other shared-nothing storage architectures.

Let’s take a step back to understand how new file system options can facilitate your SAS Grid deployment. In this post, I’ll start with a quick review of storage architectures for SAS Grid Manager and what other vendors are doing. In a subsequent blog post, I’ll dive more specifically into the interaction of Hadoop and SAS Grid Manager. Read More »

Post a Comment

When to use NFS with SAS

With the growing use of SAS on commodity hardware, many organizations are running lots of SAS servers on separate instances of operating system in a SAS infrastructure. This configuration is great for optimizing resources, but when these SAS servers have to share data, then SAS recommends the use of a clustered file system.

This recommendation presents an issue for some companies. Because clustered file systems are not part of their standard operating system, it is an additional expense. So, to avoid driving up the cost of the hardware infrastructure for SAS, some IT administrators are proposing the use of NFS to share files among the SAS servers running on different instances of an operating system. Let’s look in more detail at the pros and cons for NFS as a shared file system with SAS. Read More »

Post a Comment

SAS Global Forum: I want to volunteer!

Most of us have volunteered in some capacity -- as a guinea pig in a magic show, as a helping hand with a charity, or perhaps as a coach for a child’s sports team. Why do we do it?

I think the cliché holds true. We get as much, or more, benefit from the experience as the people we are helping. We feel better about ourselves, we make new friends, and we expand our comfort zone or learn something new. Sometimes we are “giving back” to an organization that previously helped us--we “pay it forward”. At other times, we simply want to spend time with other volunteers.

Volunteers advance much of society’s support services. But what about volunteering for professional organizations, societies, associations or communities? They rely on volunteers too. Read More »

Post a Comment

SAS Backup and Recovery Tool

In researching material for an upcoming project, I investigated the SAS Deployment Backup and Recovery Tool available in SAS 9.4. Here are some of my findings on identifying what directories are included in a Backup and Recovery session and how to add custom directories to a backup configuration.

The SAS Backup and Recovery Tool is designed to provide an integrated method for backing up and recovering your SAS content across multiple tiers and machines. Note that this utility is a system-wide tool. It’s intended to back up an entire environment as opposed to a single machine in a multi-machine deployment. Read More »

Post a Comment

SAS Global Forum: Give back, become more

Life is a gift, and it offers us the privilege, opportunity, and responsibility to give something back by becoming more.
- Anthony Robbins

Every year at SAS Global Forum, SAS professionals from all over the globe gather, reconnect and learn from each other. And wherever they gather, SAS Global Forum attendees demonstrate their appreciation for the opportunity to become more by giving back to the host community.

December is the season of giving. In that spirit, let's take a look at how SAS Global Forum 2015 will be giving back to the community of Dallas. Read More »

Post a Comment

Have you created Scalable Vector Graphics with SAS?

SAS Technical Support Problem SolversIf you haven’t tried them for your web applications and other graphics needs, you’ll want to read further!

Scalable Vector Graphics (SVG) output is vector graphics output you can display with most (if not all) modern web browsers. Because SVG graphic output is scalable, you can zoom in on the graphics output without losing resolution. Unlike bit-mapped images such as PNG or GIF output, they can be resized or transformed without compromising the clarity, eliminating the need to produce multiple versions of the same image! There are other advantages for using Scalable Vector Graphics like their ability to zoom in to view details, their smaller output file size and their usefulness for producing graphics for a range of display sizes and types. Read More »

Post a Comment