How the power of the Force makes ridding yourself of problematic characters so much easier! I recently was invited to become an alternate instructor for Ron Cody’s SAS Business Knowledge Series class, "SAS Functions by Example", and had the privilege of taking the class as a student under Ron’s tutelage. As Ron was introducing the advanced features of the COMPRESS function, I remembered how I've often needed to rid myself of problematic characters in a SAS program. I sure wish I’d know about the advanced features of COMPRESS sooner! And my next thought was: I must share this incredible power with all the SAS Jedi out there, and what better place to do it than right here on the Jedi SAS Tricks blog!
Consider the following data subset from the academy records:
As you can see, no one seems to have gotten the memo on how to enter a phone number. Mr. Vader seems a bit disgruntled at having to fill in his data (note the tirade in the Name column), and has insisted on entering his height and weight in Imperial units instead of the universal metric standard (we'll clean up the height and weight data in Episode 2 - using tricks I learned in Ron Cody's other BKS course, Data Cleaning Techniques)
In this episode, we'll use the COMPRESS function to easily clean up the input data in the Name and Phone variables and a custom PICTURE format to uniformly display phone numbers (which we will store as numeric). After both episodes, our finished data set will look like this:
Many of you may have used the COMPRESS function to remove all of the blanks from some text, especially handy when converting character data to numeric. For example:
data _null_; Text='1234 5678 9123 4567'; Num=input(COMPRESS(Text),17.); put text= num=17.; run;
From the SAS Log: Text=1234 5678 9123 4567 Num=1234567891234567
The more adventuresome may have noticed that, if you don't want to remove only spaces, you can specify which characters should be removed by adding the optional second parameter chars to the function:
data _null_; Text='1234 5678-9123 4567'; Num=input(COMPRESS(Text,' -'),17.); put text= num=17.; run;
From the SAS Log: Text=1234 5678-9123 4567 Num=1234567891234567
And just like that, SAS removes all of the offending characters from your text! But the list of unsavory characters can be long, and the typing of them tedious and error-prone. The key to making COMPRESS sit up and do SAS Tricks is the optional third parameter, modifers. There are many useful modifers, but for this job we'll focus on just two:
- The 'k' modifier, which makes COMPRESS keep the characters on your list, instead of deleting them
- The 'd' modifier which adds all digits to your list without having to type them yourself
data _null_; length text $100; do Text='123.555.4567' , '(123) 555-4567' ,'Area Code: 123 Exchange:555 Last four is: 4567'; Num=input(COMPRESS(Text,,'kd'),12.); put "Text was: " text= ; PUT "Number is: " num=12.; end; run;
From the SAS Log: Text was: text=123.555.4567 Number is: Num=1235554567 Text was: text=(123) 555-4567 Number is: Num=1235554567 Text was: text=Area Code: 123 Exchange:555 Last four is: 4567 Number is: Num=1235554567
What manner of wizardry is this? All extraneous characters have been removed! The SAS is strong in this one...
To finish things up, we'll need a custom picture format to display the phone numbers. We'll want an open parenthesis as the prefix for every number displayed, a closing parenthesis after the third digit (area code) and a dash between the exchange number and the last four digits. We'll also want the default width for our format to be 13 characters - the 10 digits of the phone number plus one for each parenthesis and dash. This is easy to define in PROC FORMAT:
proc format; picture phone (default=13) low-high='000)000-0000' (prefix='('); run;
So, to wrap up Episode 1 - here is a program that creates the original data set, the custom format, and cleans up the Name and Phone variables:
/* Jedi SAS Tricks: These aren't the droids... Episode 1 */ data MixedUp; infile datalines dlm='*' dsd truncover; input Name:$20. ph:$15. ht:$15. wt:$15.; LABEL Name='Student Name' Ph='Primary Phone Number' Ht='Height' Wt='Weight'; datalines; M. Jordan*(123)555-1234*178cm*81 kg O. Kenobi*123.555.9999*175cm*75 kg D. Vader,&#%%!!*1235556666*82in*200 lbs ; run; proc sql; select * from MixedUp; quit; proc format; picture phone (default=13) low-high='000)000-0000' (prefix='('); run; data CleanedUp1; LENGTH Name $ 20 Phone 8; LABEL Name='Student Name' Phone='Primary Phone Number'; keep Name Phone ht wt; format Phone phone. ; set MixedUp; Name=Compress(Name,'. ','ka'); Phone=input(compress(ph,,'kd'),15.); run; proc sql; select * from CleanedUp1; quit;
In Episode 2 we'll automate cleanup for those pesky height and weight entries!
Until next time, may the SAS be with you.