Wait! Don't close this window. I understand that regular expressions can be very complicated (yes, there are many books on the subject), but some basic expressions to test patterns such as zip codes or telephone numbers are not that difficult. In addition, you can sometimes use Google to search for a regular expression to fit your requirements.
I added a new chapter to my recent book, the third edition of Cody's Data Cleaning Techniques Using SAS, to cover this topic. To show how a regular expression can test for a valid pattern, the program later in this post checks Canadian postal codes. These codes have the form:
LDLDLD or LDL DLD
where 'L' is a letter and 'D' is a digit. There are some rules about which letters can be used in the first, third, and fifth positions, but we will ignore that detail for now.
A regular expression starts and ends with a delimiter (usually a forward slash /). For example, a regular expression to match the word 'cat' would be: /cat/. Pretty simple. Of course, if you are looking for a cat (we couldn't find our cat Mickey the other day but we finally found him without using a regular expression) you would probably use the FIND function that searches for strings. The power of a regular expression is that you can specify classes of character data such as all letters or all digits. The regular expression to match a Canadian postal code (without the space in the middle) is:
/([A-Z]\d){3}/
The range in the square brackets, [A-Z], matches any single uppercase letter, and the expression '\d' matches any digit. The {3} is a repetition operator that says to repeat the group in parentheses, a letter followed by a digit, three times.
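The pattern above covers only the form without a space. As a rough sketch, and not the version from the book, one way to accept both LDLDLD and LDL DLD is to spell out the six positions and make the middle space optional with ' ?'. The anchors ^ and $, the trailing \s* (to allow for the blanks SAS pads onto character values), and the $7. informat are assumptions added for this illustration.

```sas
* Sketch: accept both LDLDLD and LDL DLD. Messages go to the SAS log;
data _null_;
   * Formatted input reads 7 columns, so an embedded blank is kept;
   input Code $7.;
   * ' ?' makes the middle space optional, \s* allows trailing blanks;
   if prxmatch("/^[A-Z]\d[A-Z] ?\d[A-Z]\d\s*$/", Code) eq 0 then
      put "Error: the code " Code "is not valid.";
   else put Code "matches the pattern.";
datalines;
A1B2C3
A1B 2C3
A1B-2C3
;
```

With these three values, only A1B-2C3 should be reported as an error, because a hyphen is neither a space nor a digit.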
How do you use a regular expression in a SAS program? Take a look:
title "Testing for Valid Canadian Postal Codes";
data _null_;
   input Code $;
   file print;
   if prxmatch("/([A-Z]\d){3}/",Code) eq 0 then
      put "Error: the code " Code "is not valid.";
datalines;
A1B2C3
123456
B2N3M4
X7X6S5
ABCDEF
;
The PRXMATCH function is similar to the FIND function. The first argument is a regular expression, and the second argument is the string you are testing. If the function finds a pattern that matches the regular expression, it returns the starting position of that pattern. If it does not find one, it returns 0. Running this DATA step reports two of the five codes as invalid: 123456 (no letters) and ABCDEF (no digits).
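To see the return value of PRXMATCH directly, here is a small sketch that writes it to the log. It also illustrates one consequence of "returns the starting position": the pattern used above is not anchored, so it succeeds whenever six matching characters appear anywhere in the string. The second pattern, with ^ and \s*$ added, is my own assumption about how you might require the whole value to match; the variable names are made up for this example.

```sas
* Sketch: compare an unanchored and an anchored pattern;
data _null_;
   length Text $ 12;
   input Text $;
   * Position of the first match anywhere in Text, or 0 if none;
   Pos_unanchored = prxmatch("/([A-Z]\d){3}/", Text);
   * Same pattern, but the whole value must match (trailing blanks allowed);
   Pos_anchored = prxmatch("/^([A-Z]\d){3}\s*$/", Text);
   put Text= Pos_unanchored= Pos_anchored=;
datalines;
A1B2C3
XXA1B2C3
A1B2C3D4
123456
;
```

For XXA1B2C3 the unanchored pattern returns 3, the position where A1B2C3 begins, while the anchored pattern returns 0. If stray characters before or after the code should count as errors, the anchored form is probably the one you want.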
Keep in mind that if the pattern you are checking is all digits or all letters, SAS has a group of functions, which I call the NOT functions, that offer tests of this kind. For example, you can use the NOTDIGIT function to test whether a string contains a non-digit. Likewise, two other useful NOT functions are NOTALPHA, which searches for non-letters, and NOTALNUM, which searches for characters that are not alphanumeric.
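Here is a short sketch of the three NOT functions side by side. Each returns the position of the first character that is not in the class, or 0 if every character qualifies. The variable names and sample values are made up for illustration. Note the TRIM call: without it, the trailing blanks that pad a character variable would themselves count as non-digits or non-letters.

```sas
* Sketch: position of the first offending character, 0 if none;
data _null_;
   length Value $ 10;
   input Value $;
   * TRIM removes trailing blanks so they are not flagged;
   First_nondigit = notdigit(trim(Value));
   First_nonalpha = notalpha(trim(Value));
   First_nonalnum = notalnum(trim(Value));
   put Value= First_nondigit= First_nonalpha= First_nonalnum=;
datalines;
12345
ABCDE
A1B2C3
12-45
;
```

A result of 0 from NOTDIGIT means the value is all digits, so a test such as "if notdigit(trim(Value)) then ..." flags any value containing a non-digit.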
2 Comments
Hi Ron,
I too have found the regular expression capabilities within SAS to be very helpful. Do you have any advice or suggestions on when it's better to compile the expression using prxparse versus letting the expression be compiled each iteration of the data step?
-Kevin
Ron,
Great introduction to the regular expression! Here is one more resource your readers might find useful - Introduction to Regular Expressions in SAS® by Matthew Windham.