Jedi SAS Tricks: These aren't the droids... Episode 1


How the power of the Force makes ridding yourself of problematic characters so much easier! I recently was invited to become an alternate instructor for Ron Cody’s SAS Business Knowledge Series class, "SAS Functions by Example", and had the privilege of taking the class as a student under Ron’s tutelage. As Ron was introducing the advanced features of the COMPRESS function, I remembered how I've often needed to rid myself of problematic characters in a SAS program.  I sure wish I’d know about the advanced features of COMPRESS sooner! And my next thought was: I must share this incredible power with all the SAS Jedi out there, and what better place to do it than right here on the Jedi SAS Tricks blog!

Consider the following data subset from the academy records:

Data table with garbage punctuation in Names, various formats for phone numbers and height and weight data entered in mixed metric and imperial units.

Data set MixedUp

As you can see, no one seems to have gotten the memo on how to enter a phone number. Mr. Vader seems a bit disgruntled at having to fill in his data (note the tirade in the Name column), and has insisted on entering his height and weight in Imperial units instead of the universal metric standard (we'll clean up the height and weight data in Episode 2 - using tricks I learned in Ron Cody's other BKS course, Data Cleaning Techniques)

In this episode, we'll use the COMPRESS function to easily clean up the input data in the Name and Phone variables and a custom PICTURE format to uniformly display phone numbers (which we will store as numeric). After both episodes, our finished data set will look like this:



Cleaned up data table with standardized Names, phone numbers and height and weight data in metric units.

Data set: CleanedUp

Many of you may have used the COMPRESS function to remove all of the blanks from some text, especially handy when converting character data to numeric. For example:

data _null_;
   Text='1234 5678 9123 4567';
   Num=input(COMPRESS(Text),17.);
   put text= num=17.;
run;

From the SAS Log:
Text=1234 5678 9123 4567 Num=1234567891234567

The more adventuresome may have noticed that, if you don't want to remove only spaces, you can specify which characters should be removed by adding the optional second parameter chars to the function:

data _null_;
   Text='1234 5678-9123 4567';
   Num=input(COMPRESS(Text,' -'),17.);
   put text= num=17.;
run;

From the SAS Log:
Text=1234 5678-9123 4567 Num=1234567891234567

And just like that, SAS removes all of the offending characters from your text! But the list of unsavory characters can be long, and the typing of them tedious and error-prone. The key to making COMPRESS sit up and do SAS Tricks is the optional third parameter, modifers. There are many useful modifers, but for this job we'll focus on just two:

  • The 'k' modifier, which makes COMPRESS keep the characters on your list, instead of deleting them
  • The 'd' modifier which adds all digits to your list without having to type them yourself

For example:

data _null_;
   length text $100;
   do Text='123.555.4567'
         , '(123) 555-4567'
         ,'Area Code: 123 Exchange:555 Last four is: 4567';
      Num=input(COMPRESS(Text,,'kd'),12.);
      put "Text was: " text= ;
      PUT "Number is: " num=12.;
   end;
run;

From the SAS Log:
Text was: text=123.555.4567
Number is: Num=1235554567

Text was: text=(123) 555-4567
Number is: Num=1235554567

Text was: text=Area Code: 123 Exchange:555 Last four is: 4567
Number is: Num=1235554567

What manner of wizardry is this? All extraneous characters have been removed! The SAS is strong in this one...

To finish things up, we'll need a custom picture format to display the phone numbers. We'll want an open parenthesis as the prefix for every number displayed, a closing parenthesis after the third digit (area code) and a dash between the exchange number and the last four digits. We'll also want the default width for our format to be 13 characters - the 10 digits of the phone number plus one for each parenthesis and dash. This is easy to define in PROC FORMAT:

proc format;
   picture phone (default=13) low-high='000)000-0000' (prefix='(');
run;

So, to wrap up Episode 1 - here is a program that creates the original data set, the custom format, and cleans up the Name and Phone variables:

  /* Jedi SAS Tricks: These aren't the droids... Episode 1 */
data MixedUp;
   infile datalines dlm='*' dsd truncover;
   input Name:$20. ph:$15. ht:$15. wt:$15.;
   LABEL Name='Student Name'
         Ph='Primary Phone Number'
         Ht='Height'
         Wt='Weight';
datalines;
M. Jordan*(123)555-1234*178cm*81 kg
O. Kenobi*123.555.9999*175cm*75 kg
D. Vader,&#%%!!*1235556666*82in*200 lbs
;
run;
 
proc sql;
   select * from MixedUp;
quit;
 
proc format;
   picture phone (default=13) low-high='000)000-0000' (prefix='(');
run;
 
data CleanedUp1;
   LENGTH Name $ 20 Phone 8;
   LABEL Name='Student Name'
         Phone='Primary Phone Number';
   keep Name Phone ht wt;
   format Phone phone. ;
   set MixedUp;
   Name=Compress(Name,'. ','ka');
   Phone=input(compress(ph,,'kd'),15.);
run;
 
proc sql;
   select * from CleanedUp1;
quit;

Data set: CleanedUp1

In Episode 2 we'll automate cleanup for those pesky height and weight entries!

Until next time, may the SAS be with you.
Mark

tags: data step, programming, tips and tricks

12 Comments

  1. Posted January 14, 2012 at 11:53 am | Permalink

    Great tip. I often use ANYDIGIT or ALPHA with COMPRESS to determine if there is an issue. So something like this:

    if anydigit(char_var) = 1 then char_var=compress(char_var, '1234567890')

    It's helpful if you have a large dataset but there may be a small percentage of offending entries.

    Mark - Check out the Thumb Wars if you have not already - it's worth some girly giggles. ;-)

    • Mark Jordan Mark Jordan
      Posted January 14, 2012 at 8:51 pm | Permalink

      Nice! The "ANY" (ANYDIGIT, ANYALPHA, ANYPUNCT, etc.) and "NOT" functions (NOTDIGIT, NOTALNUM, etc.) were also covered in Ron's class - and I found several very useful ones in that bunch, too. SAS has so many great functions. Thanks for sharing your expereinces.

      PS: I checked out Thumb Wars, and I will admit to a guffaw or two - but no "girly giggles" :-D

  2. Jenni Elion
    Posted January 14, 2012 at 12:20 pm | Permalink

    I use the COMPRESS most often to strip values of unwanted hex characters. When I wasn't getting the results I expected with a 'if VAR=' ' then [do stuff]', I had a hard time finding the root cause. It looked like a lot of records should have satisfied the condition, but none did. The records actually contained an insidious hex character, invisible to the naked eye, but not undectable to SAS. COMPRESS to the rescue!

    Since then, it's become one of the most used tools in my tester 'toolbox'.

    • Mark Jordan Mark Jordan
      Posted January 14, 2012 at 8:24 pm | Permalink

      Thanks for sharing another great example of using COMPRESS to rid yourself of "unsavory" characters!

      • Ginger R.
        Posted January 16, 2012 at 1:45 pm | Permalink

        Sorry to be so lazy, but is there a parameter to strip hex characters or do you need to specify them all?

        I've got several things where I go stringing through stuff in search of hex characters and would love to replace what I have with a quick function.

        • Mark Jordan Mark Jordan
          Posted January 16, 2012 at 3:58 pm | Permalink

          Well, all characters are HEX characters, aren't they? For example, an ASCII 'B' is actually represented by '42'x when stored in a variable. Perhaps you just want to get rid of non-printable characters? If so, you're in luck :-) COMPRESS uses a 'w' in the third argument to reference all printable characters. Combine that with 'k' for keep - and voila!!

          data _null_;
             Text=cat('ABCD','07'x,'07'x,'07'x,'EFG');
             Text2=compress(text,,'kw');
             PUTLOG 'NOTE: ' text= text2=;
          run;
          
  3. Edward Ballard
    Posted February 15, 2012 at 8:22 pm | Permalink

    If there is any chance that your data contains internation numbers you may still have problems. I had some experience with client data that included phone numbers and had to perform similar clean up. However some of the internation codes and subsequent digits matched the number of digits in a US phone number.

    • Mark Jordan Mark Jordan
      Posted February 16, 2012 at 9:06 am | Permalink

      True dat. If expecting "mixed" phone numbers (numbers from more than one country), we'd have to put a little more thought into constructing our picture format for displaying them.

  4. jerry d
    Posted February 18, 2012 at 6:00 pm | Permalink

    Just one question concerning your datalines - If I copy and paste your code I run into problems until removing the semi-colon from the D. Vader line.

    Doesn't work:
    D. Vader,$& #%%!!*1235556666*82in*200 lbs

    Works:
    D. Vader,$&amp #%%!!*1235556666*82in*200 lbs
    I'm using SAS 9.3. Is a semicolon not allowed in a dataline when creating data? thanks.

    • Mark Jordan Mark Jordan
      Posted February 18, 2012 at 6:15 pm | Permalink

      Jerry,
      You are correct - in a DATALINES statement, a line of data can't contain a semicolon. The semicolon ends the DATALINES statement! I'm not sure how a literal ampersand character (&) in my original code got replaced with the HTML equivalent (&) during the blog posting process, but I've fixed the code in the original post. Try copying out the modified code.
      Mark

      • jerry d
        Posted February 18, 2012 at 10:37 pm | Permalink

        Great. Thank you.

        • Mark Jordan Mark Jordan
          Posted February 20, 2012 at 1:27 pm | Permalink

          And, as Jan (a SAS colleague in Technical Support) pointed out, if I had actually WANTED to include semicolons in the data I could have used a DATALINES4 statement instead.

One Trackback

  1. [...] started out in Episode 1 cleaning and standardizing data from the Academy.  As you recall, Mr. Vader apparently had a [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>