A new year’s resolution that sounds like more fun than a spinning class

Last December I taught a SAS Programming 1: Essentials class at Statistics Canada (Statcan). My students could barely contain their mirth as I valiantly struggled to find the semicolon on the French keyboard. It was a far cry from my first move to Canada (a bilingual country), when I was excited about practicing the French I had learned at Alliance Française. Clearly my French has gotten rusty, as my recent keyboard experience proved. So when this January rolled along, I’m sure you can picture the resolution that first jumped to mind. Thanks to my humbling Statcan experience, my lofty new year’s work aspiration is to teach a SAS course in French, or maybe present to a user group in French!

Meanwhile, here is a SAS resolution to consider for an easier data life:

Scrub data with the SOUNDS-LIKE operator

I have to confess this is a personal favourite. Having worked at DeVry Institute of Technology, I’ve seen first-hand just how messy student registration data can get. So how did I go about finding all clients from the suburb of Mississauga without complex WHERE clauses? Take a look at my data with its many misspellings and the code I wrote to capture them.

The clever SOUNDS-LIKE operator (=*) uses the SOUNDEX algorithm to test whether a character variable contains a spelling variation of a word. In a WHERE expression it compares two character expressions, the value being searched and the value to match, and selects the rows where the two sound alike, so phonetic variations of the spelling are picked up as well. So you too can find misspelled data easily, with no complicated coding involved. You can find additional information on the SOUNDS-LIKE operator in this SUGI 29 paper.
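
To make this concrete, here is a minimal sketch of the kind of query described above. It is not the original data or code from this post: the data set name, the variable names, and the misspellings are all made up for illustration.

```sas
/* Hypothetical sample data: the data set name, variable names, */
/* and misspellings are invented for this illustration.         */
data work.students;
    input name :$10. city :$20.;
    datalines;
Alice Mississauga
Bob Missisauga
Chen Mississaga
Dana Misissauga
Eli Toronto
;
run;

/* =* is the sounds-like operator. It is valid in WHERE expressions */
/* and compares the SOUNDEX encoding of both sides.                  */
proc print data=work.students;
    where city =* 'Mississauga';
run;
```

With this sample, the four spelling variants of Mississauga should be selected and the Toronto row left out, since the variants differ only in doubled letters and vowels, which SOUNDEX ignores or collapses.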

As a musician I am constantly listening to other forms and rhythms. I’d like to leave you with a catchy beat played at parties over the holidays. From the island of Mauritius in the Indian Ocean, here’s the Sega. Sounds like fun, doesn’t it? Did you find the SOUNDS-LIKE operator useful? Isn’t it a more fun resolution to keep than that spinning class? I’d love to hear any resolutions, SAS or otherwise, that you might have made!

6 Comments

  1. statgirl20
    Posted January 13, 2011 at 9:10 am | Permalink

    Thanks. This was a good tip.

  2. charu
    Posted January 18, 2011 at 10:58 am | Permalink

    Glad you found it useful. Thanks for reading!

  3. Anonymous
    Posted February 8, 2011 at 3:22 pm | Permalink

    Many years ago, I had to write a phonetic search program in COBOL. It would be a very easy task if I did it in SAS.

  4. Andrew Karp
    Posted February 8, 2011 at 6:32 pm | Permalink

    I think the SOUNDS-LIKE operator and its underlying SOUNDEX algorithm are not very effective and often lead to improper results.
    I would caution against using it, especially on family (last) names that are transliterations into English from languages that do not use the Latin character set. Its limitations with non-English words are well documented in the literature.
    SAS has implemented much more powerful functions for computing measures of similarity/dissimilarity between two character strings that should be used instead of SOUNDEX.
    These include the SPEDIS (spelling distance) function added in SAS 8, and the COMPLEV and COMPGED functions added in SAS 9. COMPLEV calculates the Levenshtein edit distance and COMPGED the generalized edit distance between two strings. And the CALL COMPCOST routine allows the user to override the default penalty costs embedded in COMPGED, if needed.

  5. charu
    Posted February 11, 2011 at 3:57 pm | Permalink

    Thanks, Andrew, for pointing out other functions to deal with non-English words. This may be a topic for a future post. I appreciate your comments.

  6. charu
    Posted March 12, 2011 at 3:13 pm | Permalink

    Having worked with COBOL, I know the feeling. Luckily, there's SAS now to help out with quick & easy searches through reams & reams of data!
