Jedi SAS Tricks: These aren't the droids... Episode 1

13

In the Star Wars movie, Obi-wan could just wave his hand, mutter a few words, and the stormtroopers would "move along".  How the power of the Force makes ridding yourself of problematic characters so much easier! I recently was invited to become an alternate instructor for Ron Cody’s SAS Business Knowledge Series class, "SAS Functions by Example", and had the privilege of taking the class as a student under Ron’s tutelage. As Ron was introducing the advanced features of the COMPRESS function, I remembered how I've often needed to rid myself of problematic characters in a SAS program.  I sure wish I’d know about the advanced features of COMPRESS sooner! And my next thought was: I must share this incredible power with all the SAS Jedi out there, and what better place to do it than right here on the Jedi SAS Tricks blog!

Consider the following data subset from the academy records:

Data table with garbage punctuation in Names, various formats for phone numbers and height and weight data entered in mixed metric and imperial units.
Data set MixedUp

As you can see, no one seems to have gotten the memo on how to enter a phone number. Mr. Vader seems a bit disgruntled at having to fill in his data (note the tirade in the Name column), and has insisted on entering his height and weight in Imperial units instead of the universal metric standard (we'll clean up the height and weight data in Episode 2 - using tricks I learned in Ron Cody's other BKS course, Data Cleaning Techniques)

In this episode, we'll use the COMPRESS function to easily clean up the input data in the Name and Phone variables and a custom PICTURE format to uniformly display phone numbers (which we will store as numeric). After both episodes, our finished data set will look like this:

Cleaned up data table with standardized Names, phone numbers and height and weight data in metric units.
Data set: CleanedUp

Many of you may have used the COMPRESS function to remove all of the blanks from some text, especially handy when converting character data to numeric. For example:

data _null_;
   Text='1234 5678 9123 4567';
   Num=input(COMPRESS(Text),17.);
   put text= num=17.;
run;

From the SAS Log:
Text=1234 5678 9123 4567 Num=1234567891234567

The more adventuresome may have noticed that, if you don't want to remove only spaces, you can specify which characters should be removed by adding the optional second parameter chars to the function:

data _null_;
   Text='1234 5678-9123 4567';
   Num=input(COMPRESS(Text,' -'),17.);
   put text= num=17.;
run;

From the SAS Log:
Text=1234 5678-9123 4567 Num=1234567891234567

And just like that, SAS removes all of the offending characters from your text! But the list of unsavory characters can be long, and the typing of them tedious and error-prone. The key to making COMPRESS sit up and do SAS Tricks is the optional third parameter, modifers. There are many useful modifers, but for this job we'll focus on just two:

  • The 'k' modifier, which makes COMPRESS keep the characters on your list, instead of deleting them
  • The 'd' modifier which adds all digits to your list without having to type them yourself

For example:

data _null_;
   length text $100;
   do Text='123.555.4567'
         , '(123) 555-4567'
         ,'Area Code: 123 Exchange:555 Last four is: 4567';
      Num=input(COMPRESS(Text,,'kd'),12.);
      put "Text was: " text= ;
      PUT "Number is: " num=12.;
   end;
run;

From the SAS Log:
Text was: text=123.555.4567
Number is: Num=1235554567

Text was: text=(123) 555-4567
Number is: Num=1235554567

Text was: text=Area Code: 123 Exchange:555 Last four is: 4567
Number is: Num=1235554567

What manner of wizardry is this? All extraneous characters have been removed! The SAS is strong in this one...

To finish things up, we'll need a custom picture format to display the phone numbers. We'll want an open parenthesis as the prefix for every number displayed, a closing parenthesis after the third digit (area code) and a dash between the exchange number and the last four digits. We'll also want the default width for our format to be 13 characters - the 10 digits of the phone number plus one for each parenthesis and dash. This is easy to define in PROC FORMAT:

proc format;
   picture phone (default=13) low-high='000)000-0000' (prefix='(');
run;

So, to wrap up Episode 1 - here is a program that creates the original data set, the custom format, and cleans up the Name and Phone variables:

  /* Jedi SAS Tricks: These aren't the droids... Episode 1 */
data MixedUp;
   infile datalines dlm='*' dsd truncover;
   input Name:$20. ph:$15. ht:$15. wt:$15.;
   LABEL Name='Student Name'
         Ph='Primary Phone Number'
         Ht='Height'
         Wt='Weight';
datalines;
M. Jordan*(123)555-1234*178cm*81 kg
O. Kenobi*123.555.9999*175cm*75 kg
D. Vader,&#%%!!*1235556666*82in*200 lbs
;
run;
 
proc sql;
   select * from MixedUp;
quit;
 
proc format;
   picture phone (default=13) low-high='000)000-0000' (prefix='(');
run;
 
data CleanedUp1;
   LENGTH Name $ 20 Phone 8;
   LABEL Name='Student Name'
         Phone='Primary Phone Number';
   keep Name Phone ht wt;
   format Phone phone. ;
   set MixedUp;
   Name=Compress(Name,'. ','ka');
   Phone=input(compress(ph,,'kd'),15.);
run;
 
proc sql;
   select * from CleanedUp1;
quit;
Data set: CleanedUp1

In Episode 2 we'll automate cleanup for those pesky height and weight entries!

Until next time, may the SAS be with you.
Mark

Share

About Author

SAS Jedi

Principal Technical Training Consultant

Mark Jordan (a.k.a. SAS Jedi) grew up in northeast Brazil as the son of Baptist missionaries. After 20 years as a US Navy submariner pursuing his passion for programming as a hobby, in 1994 he retired, turned his hobby into a dream job, and has been a SAS programmer ever since. Mark writes and teaches a broad spectrum of SAS programming classes, and his book, "Mastering the SAS® DS2 Procedure: Advanced Data Wrangling Techniques" is in its second edition. When he isn’t writing, teaching, or posting “Jedi SAS Tricks”, Mark enjoys playing with his grand and great-grandchildren, hanging out at the beach, and reading science fiction novels. His secret obsession is flying toys – kites, rockets, drones – and though he usually tries to convince Lori that they are for the grandkids, she isn't buying it. Mark lives in historic Williamsburg, VA with his wife, Lori, and Stella, their cat. To connect with Mark, check out his SAS Press Author page, follow him on Twitter @SASJedi or connect on Facebook or LinkedIn.

Related Posts

13 Comments

  1. Pingback: Jedi SAS Tricks: These aren’t the droids… Episode 2 - The SAS Training Post

  2. Just one question concerning your datalines - If I copy and paste your code I run into problems until removing the semi-colon from the D. Vader line.

    Doesn't work:
    D. Vader,$& #%%!!*1235556666*82in*200 lbs

    Works:
    D. Vader,$&amp #%%!!*1235556666*82in*200 lbs
    I'm using SAS 9.3. Is a semicolon not allowed in a dataline when creating data? thanks.

    • SAS Jedi

      Jerry,
      You are correct - in a DATALINES statement, a line of data can't contain a semicolon. The semicolon ends the DATALINES statement! I'm not sure how a literal ampersand character (&) in my original code got replaced with the HTML equivalent (&) during the blog posting process, but I've fixed the code in the original post. Try copying out the modified code.
      Mark

  3. Edward Ballard on

    If there is any chance that your data contains internation numbers you may still have problems. I had some experience with client data that included phone numbers and had to perform similar clean up. However some of the internation codes and subsequent digits matched the number of digits in a US phone number.

    • SAS Jedi

      True dat. If expecting "mixed" phone numbers (numbers from more than one country), we'd have to put a little more thought into constructing our picture format for displaying them.

  4. I use the COMPRESS most often to strip values of unwanted hex characters. When I wasn't getting the results I expected with a 'if VAR=' ' then [do stuff]', I had a hard time finding the root cause. It looked like a lot of records should have satisfied the condition, but none did. The records actually contained an insidious hex character, invisible to the naked eye, but not undectable to SAS. COMPRESS to the rescue!

    Since then, it's become one of the most used tools in my tester 'toolbox'.

    • SAS Jedi

      Thanks for sharing another great example of using COMPRESS to rid yourself of "unsavory" characters!

      • Sorry to be so lazy, but is there a parameter to strip hex characters or do you need to specify them all?

        I've got several things where I go stringing through stuff in search of hex characters and would love to replace what I have with a quick function.

        • SAS Jedi

          Well, all characters are HEX characters, aren't they? For example, an ASCII 'B' is actually represented by '42'x when stored in a variable. Perhaps you just want to get rid of non-printable characters? If so, you're in luck :-) COMPRESS uses a 'w' in the third argument to reference all printable characters. Combine that with 'k' for keep - and voila!!

          data _null_;
             Text=cat('ABCD','07'x,'07'x,'07'x,'EFG');
             Text2=compress(text,,'kw');
             PUTLOG 'NOTE: ' text= text2=;
          run;
          
  5. Great tip. I often use ANYDIGIT or ALPHA with COMPRESS to determine if there is an issue. So something like this:

    if anydigit(char_var) = 1 then char_var=compress(char_var, '1234567890')

    It's helpful if you have a large dataset but there may be a small percentage of offending entries.

    Mark - Check out the Thumb Wars if you have not already - it's worth some girly giggles. ;-)

    • SAS Jedi

      Nice! The "ANY" (ANYDIGIT, ANYALPHA, ANYPUNCT, etc.) and "NOT" functions (NOTDIGIT, NOTALNUM, etc.) were also covered in Ron's class - and I found several very useful ones in that bunch, too. SAS has so many great functions. Thanks for sharing your expereinces.

      PS: I checked out Thumb Wars, and I will admit to a guffaw or two - but no "girly giggles" :-D

Back to Top