Using regular expressions to verify the pattern of character data


Wait! Don't close this window. I understand that regular expressions can be very complicated (yes, there are many books on the subject), but basic expressions that test patterns such as zip codes or telephone numbers are not that difficult. In addition, you can sometimes use Google to find a regular expression that fits your requirements.

I added a new chapter to my recent book, the third edition of Cody's Data Cleaning Techniques Using SAS, to cover this topic. To show how a regular expression can test for a valid pattern, the program later in this post checks Canadian postal codes. These codes take one of two forms:

LDLDLD or LDL DLD

where 'L' is a letter and 'D' is a digit. There are some rules about which letters can be used in the first, third, and fifth positions, but we will ignore that detail for now.

A regular expression starts and ends with a delimiter (usually a forward slash /).  For example, a regular expression to match the word 'cat' would be: /cat/.  Pretty simple.  Of course, if you are looking for a cat (we couldn't find our cat Mickey the other day but we finally found him without using a regular expression) you would probably use the FIND function that searches for strings.  The power of a regular expression is that you can specify classes of character data such as all letters or all digits.  The regular expression to match a Canadian postal code (without the space in the middle) is:

/([A-Z]\d){3}/

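The range in the square brackets, [A-Z], matches any single uppercase letter, and the expression '\d' matches any digit. The {3} is a repetition operator that says to repeat the sequence of a letter and a digit three times. Because the expression has no anchors, it is satisfied whenever these six characters appear anywhere in the string being tested.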
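The same building blocks extend to the second form listed earlier, LDL DLD. What follows is only a sketch under a few assumptions of my own: at most one blank may separate the two halves, the ^ and $ anchors are added so that nothing else may surround the code, and \s* is placed before the $ because SAS pads character values with trailing blanks. The variable name Position, the $10. informat, and the sample values are made up for illustration.

```sas
data _null_;
   /* Formatted input keeps an embedded blank; list input would stop at it */
   input Code $10.;
   /* Letter-digit-letter, optional blank, digit-letter-digit, */
   /* followed only by the trailing blanks that pad the value  */
   Position = prxmatch("/^[A-Z]\d[A-Z] ?\d[A-Z]\d\s*$/", Code);
   if Position eq 0 then put "Not valid: " Code;
   else put "Valid:     " Code;
datalines;
A1B2C3
A1B 2C3
A1B2C3D
A1B  2C3
;
```

Under these assumptions, only the first two values should be reported as valid: the third has an extra trailing letter and the fourth has two blanks in the middle.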

How do you use a regular expression in a SAS program?  Take a look:

title "Testing for Valid Canadian Postal Codes";
data _null_;
   input Code $;
   file print;
   if prxmatch("/([A-Z]\d){3}/",Code) eq 0 then 
      put "Error: the code " Code "is not valid.";
datalines;
A1B2C3
123456
 B2N3M4
X7X6S5
ABCDEF
;

 

The PRXMATCH function is similar to the FIND function. The first argument is a regular expression, and the second argument is the string you are testing. If part of the string matches the regular expression, the function returns the starting position of the match; otherwise, it returns 0. Running this DATA step reports an error for two of the five values, 123456 and ABCDEF, which are the only ones that do not alternate letters and digits.

Keep in mind that if the pattern you are checking is all digits or all letters, SAS has a group of functions, which I call the NOT functions, that offer tests of this kind. For example, you can use the NOTDIGIT function to test whether a string contains a non-digit. Two other useful NOT functions are NOTALPHA, which searches for non-letters, and NOTALNUM, which searches for characters that are not alphanumeric.
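As a rough illustration of the NOT functions, the sketch below applies all three to a few made-up values. Each function returns the position of the first character that fails its test, or 0 if every character passes. The variable names and sample data are hypothetical, and TRIM is used on the assumption that the trailing blanks padding each value should not count as invalid characters.

```sas
data _null_;
   input Value $10.;
   /* Position of the first character that is not a digit (0 if none) */
   FirstNonDigit = notdigit(trim(Value));
   /* Position of the first character that is not a letter (0 if none) */
   FirstNonAlpha = notalpha(trim(Value));
   /* Position of the first character that is neither a letter nor a digit */
   FirstNonAlnum = notalnum(trim(Value));
   put Value= FirstNonDigit= FirstNonAlpha= FirstNonAlnum=;
datalines;
123456
ABCDEF
A1B2C3
12-345
;
```

A result of 0 means the value passed that particular test, so a check such as "if notdigit(trim(Value)) ne 0" flags any value that is not made up entirely of digits.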


About Author

Ron Cody

Private Consultant

Dr. Ron Cody was a Professor of Biostatistics at the Rutgers Robert Wood Johnson Medical School in New Jersey for 26 years. During his tenure at the medical school, he taught biostatistics to medical students as well as students in the Rutgers School of Public Health. While on the faculty, he authored or co-authored over a hundred papers in scientific journals. His first book, Applied Statistics and the SAS Programming Language, was first published by Prentice Hall in 1985 and is now in its fifth edition. Since then, he has published over a dozen books on SAS programming and statistical analysis using SAS. His latest book, A Gentle Introduction to Statistics Using SAS Studio, was published this year. Ron has presented numerous papers at SAS Global forums, regional conferences, and local user groups. He is presently a contract instructor for SAS Institute and continues to write books on SAS and statistical topics.


2 Comments

  1. Kevin DeBruhl on

    Hi Ron,

    I too have found the regular expression capabilities within SAS to be very helpful. Do you have any advice or suggestions on when it's better to compile the expression using prxparse versus letting the expression be compiled each iteration of the data step?

    -Kevin
