Using regular expressions to verify the pattern of character data

Wait! Don't close this window. I understand that regular expressions can be very complicated (yes, there are many books on the subject), but some basic expressions to test patterns such as zip codes or telephone numbers are not that difficult. In addition, you can sometimes use Google to search for a regular expression to fit your requirements.

I added a new chapter to my recent book, the third edition of Cody's Data Cleaning Techniques Using SAS, to cover this topic. To show how a regular expression can test for a valid pattern, the program below checks Canadian postal codes. These codes take one of two forms:

LDLDLD or LDL DLD

where 'L' is a letter and 'D' is a digit.  There are some rules about what letters can be used in the first, third, and fifth position, but we will ignore that detail for now.

A regular expression starts and ends with a delimiter (usually a forward slash /).  For example, a regular expression to match the word 'cat' would be: /cat/.  Pretty simple.  Of course, if you are looking for a cat (we couldn't find our cat Mickey the other day but we finally found him without using a regular expression) you would probably use the FIND function that searches for strings.  The power of a regular expression is that you can specify classes of character data such as all letters or all digits.  The regular expression to match a Canadian postal code (without the space in the middle) is:

/([A-Z]\d){3}/

The [A-Z] in square brackets matches any one uppercase letter, and \d matches any digit. The parentheses group the letter and the digit into a sequence, and {3} is a repetition operator that says to repeat that sequence three times.

How do you use a regular expression in a SAS program?  Take a look:

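title "Testing for Valid Canadian Postal Codes";
data _null_;
   input Code $;
   /* Send PUT output to the output destination instead of the log */
   file print;
   /* PRXMATCH returns 0 when the pattern is not found in Code */
   if prxmatch("/([A-Z]\d){3}/",Code) eq 0 then
      put "Error: the code " Code "is not valid.";
datalines;
A1B2C3
123456
 B2N3M4
X7X6S5
ABCDEF
;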

 

The PRXMATCH function is similar to the FIND function. The first argument is a regular expression, and the second argument is the string you are testing. If the string contains a pattern that matches the regular expression, the function returns the starting position of the match; otherwise, it returns 0. Running this DATA step flags two of the five codes, 123456 and ABCDEF, as invalid.
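One thing to watch: the pattern /([A-Z]\d){3}/ is not anchored, so PRXMATCH finds it anywhere in the string, and a value such as A1B2C3X passes the test. It also does not accept the LDL DLD form with a space in the middle. The following is a minimal sketch of a stricter test, not part of the original program. The anchors (^ and $), the optional space, the STRIP call, and the sample data lines are all assumptions added for illustration.

```sas
title "Stricter Test for Canadian Postal Codes";
data _null_;
   length Code $ 10;
   infile datalines truncover;
   /* Formatted input keeps the embedded space in values like A1B 2C3 */
   input Code $char10.;
   file print;
   /* ^ and $ anchor the match to the whole value;           */
   /* " ?" allows one optional space between the two halves. */
   /* STRIP removes leading and trailing blanks so that the  */
   /* padding in Code does not defeat the $ anchor.          */
   if prxmatch("/^[A-Z]\d[A-Z] ?\d[A-Z]\d$/", strip(Code)) eq 0 then
      put "Error: the code " Code "is not valid.";
datalines;
A1B2C3
A1B 2C3
A1B2C3X
XA1B2C3
A1B  2C3
;
```

With this pattern, the first two values pass and the last three are reported as errors: one has a trailing extra character, one has a leading extra character, and one has two spaces in the middle. This sketch still ignores the rules about which letters are allowed in each position.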

Keep in mind that if the pattern you are checking is all digits or all letters, SAS has a group of functions that I call the NOT functions that offer tests of this kind. For example, you can use the NOTDIGIT function to test whether a string contains a non-digit. Two other useful NOT functions are NOTALPHA, which searches for non-letters, and NOTALNUM, which searches for characters that are not alphanumeric.
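As a rough illustration of a NOT function at work, here is a small sketch that uses NOTDIGIT to flag values that are not all digits. The variable name Value and the data lines are made up for this example. Note that a blank counts as a non-digit, so the sketch applies TRIM first; without it, the trailing blanks that pad a character variable would cause shorter values to be flagged.

```sas
title "Testing for All-Digit Values with NOTDIGIT";
data _null_;
   input Value $;
   file print;
   /* NOTDIGIT returns the position of the first character */
   /* that is not a digit, or 0 if every character is one. */
   /* TRIM drops the trailing blanks that pad Value, which */
   /* would otherwise count as non-digits.                 */
   if notdigit(trim(Value)) gt 0 then
      put "Error: the value " Value "contains a non-digit.";
datalines;
12345
12A45
9876543
;
```

Only the second value, 12A45, is reported. NOTALPHA and NOTALNUM can be used in the same way.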

About Author

Ron Cody

Private Consultant

Ron Cody, EdD is a retired professor from the Robert Wood Johnson Medical School. He now works as a private consultant and a national instructor for SAS Institute Inc. A SAS user since 1977, Ron's extensive knowledge and innovative style have made him a popular presenter at local, regional, and national SAS conferences. He has authored or co-authored numerous books, as well as countless articles in medical and scientific journals.

2 Comments

  1. Kevin DeBruhl on

    Hi Ron,

    I too have found the regular expression capabilities within SAS to be very helpful. Do you have any advice or suggestions on when it's better to compile the expression using prxparse versus letting the expression be compiled each iteration of the data step?

    -Kevin
