We started out in Episode 1 cleaning and standardizing data from the Academy. As you recall, Mr. Vader apparently had a attitude when making his entry in the database. The original data looked like this:
You can find the code for creating the original data set back in Episode 1. There, we demonstrated the power of the COMPRESS function which, when combined with a nice little picture format, did the initial data cleanup:
In this episode, we'll finish cleaning up the last problem: Mr Vader insisted on entering his height and weigh data in Imperial vs. metric units (our standard). Once again, we'll leverage the power of the COMPRESS function, but two additional functions will help make our code succinct and efficient: FIND and IFN.
The FIND function searches for a specific substring of characters within a character string, and can do case-sensitive searches or ignore case. It returns a 0 (boolean false) if the text is not found, or a number indicating the position in the larger string where the text of the smaller substring starts (boolean true).
The IFN function returns a numeric value based on whether an expression is true, false, or missing. The first argument is the expression to evaluate for true or false, the second is the expression producing the value to return if true, the third the expression producing the value to return if false.
To fix the height data, we will COMPRESS out everything except the digits, just like we did for the phone number. But before storing the value in the Height variable, we'll test to see if it was entered in inches by trying to FIND 'in' in the original text value. If found, we'll multiply the value by 2.54 (there are 2.54 cm/in) before storing it in the Height variable. We'll use a similar process for weight. If the weight was given in pounds, we'll divide the value by 2.2046 before storing it in the Weight variable.
Here is the syntax to determine if the text variable ht contains the text 'in' using the FIND function:
find(ht,'in','i') |
Now, when we process Mr. Vader's height information, it DOES contain the string 'in' so our code must use conditional processing to decide to do the math. We'll use the IFN function for this.
Here is the syntax which conditionally determines a multiplier value based on the results of the FIND function:
IFN(find(ht,'in','i'),2.54,1) |
Using similar syntax to determine the divisor for weight, and adding new picture formats for our height and weight values, we get this final "cleanup code":
proc format; picture phone (default=13) low-high='000)000-0000' (prefix='('); picture kg (default=7) low-high='0000 kg' ; picture cm (default=7) low-high='0000 cm' ; run; data CleanedUp; LENGTH Name $ 20 Phone Height Weight 8; LABEL Name='Student Name' Phone='Primary Phone Number' Height='Height' Weight='Weight'; keep Name--Weight; format Phone phone. height weight 5.; set MixedUp; /* Compress to keep only alpha characters */ Name=Compress(Name,'. ','ka'); /* Compress to keep only digits. */ Phone=input(compress(ph,,'kd'),15.); /* Compress to keep digits. If originally in inches, convert value */ Height=input(compress(ht,,'kd'),15.)*IFN(find(ht,'in','i'),2.54,1); /* Compress to keep digits. If originally in lbs, convert value */ Weight=input(compress(wt,,'kd'),15.)/ifn(find(Wt,'lbs','i'),2.20462262,1); format Weight kg. Height cm.; run; |
And that did the trick! Our final, cleaned up data looks great:
You may be wondering about my recent fascination with data cleaning... did I mention that I also recently started teaching Ron Cody's "Data Cleaning Techniques" class? I've been preparing to teach it for the first time in March - and thought I'd share a few of the things I'd learned with you.
That's all for this episode. Until the next time, may the SAS be with you!
Mark