A poor man's approach to fuzzy data matching

4

Today at IFSUG, Chuck Patridge presented a wonderful talk about how to complete fuzzy matching using BASE SAS tools. Chuck has been programming SAS since 1979 and has been tasked multiple times with coming up with in-house solutions to address business needs without the software costs that are typically associated with solving these complex data problems.

When specifically addressing fuzzy data matching, Chuck suggests that we consider it an algebric equation. Standardize/cleanse the left side and then standardize/cleanse the right side - then begin the process of matching the two sides together.

Here is the high-level strategy:

  1. Cleanse: Up/Low Case all the data, use the ZIPSTATE function to find the true ZIP code, drop street types (Ct, Ln, St, etc), drop other 'noise'
  2. Normalize: Create standard tables to translate or normalize data and use HASH, Perl regular expressions, custom formats, etc to normalize or standardize the data table
  3. Compare: Chuck recommends using the POINT= option to compare 1 record in the 1st table with all the records in the 2nd. To improve processing he also recommends that you restrict the sizes of at least one of the tables (such as by birth year, state, etc).
  4. Further Comparisons: There are various ways to further improve the matching. Chuck suggests removing vowels and using standard SAS functions such as SOUNDEX to eliminate data differences due to typos. You can also use SUBSTR to compare smaller portions of an attribute, such as the last 7 digits of the telephone number (to reduce impact of area code changes).
  5. Score & Rank Results: Finally when running through the comparisons of various attributes, score results based on % of similarities.

As you build this out, understanding the data is absolutely required. Foreign addresses, common abbreviations, and what constitutes as noise depends on the industry or source of information. Chuck also recommends that the routine is built off of test data (or copies of production data) and are routinely checked by the ETL team for continually improvement of the match %.

Looking for more information? Chuck actually published a white paper on this topic years ago & it is still very useful for SAS programmers to this day. Thanks Chuck for this great resource!
http://sconsig.com/sastips/obswww15.pdf

Share

About Author

Angela Hall

Senior Technical Architect

Angela offers tips on using the SAS Business Intelligence solutions. She manages a team of SAS Fraud Framework implementers within the SAS Solutions On-Demand organization. Angela also has co-written two books, 'Building BI using SAS, Content Development Examples' & 'The 50 Keys to Learning SAS Stored Processes'.

Related Posts

4 Comments

  1. Pingback: Managing big data at the speed of risk - SAS Users Groups

  2. Hi Angela

    The SPEDIS function will also assist with trying to improve matching - use an IF statement to set a cut-off, say at 30.

  3. I think the article "A poor man's approach to fuzzy data matching" is very apt, and perhaps it might event get 80% with 20% of the effort. A few thoughts though:

    Cleanse & Normalize: This is where the rubber meets the road. It requires a series of (type of data) based transformations to be applied using curated and categorized reference data. Using our reference data, we elaborate, abbreviate or exclude to bring client data to reduce it to a common denominator or elemental equivalence.

    Compare: Not sure I like the idea of comparing every record to every other. It's an exponential problem, better to compare only those records considered matches by using other techniques.

    Further Comparisons: I categorically suggest not using SOUNDEX. It only works on spoken English words, it takes no account of first letter similarity and delivers too many false positives.

    I'd suggest Double Metaphone as an improvement on Soundex. We use our Fonetix algorithm developed in house to manage sounds like issues in the 5 major spoken languages.

    I like the SUBSTR to compare smaller portions of an attribute, such as the last 7 digits of the telephone number (to reduce impact of area code changes).

    Score & Rank Results: the challenge surrounding % scoring is comparing Rob and Bob might be close, but Bob and Robert are not similar at all, hence better strategies are required.

Back to Top