Using a regular expression to validate a SAS variable name

Regular expressions provide a powerful method to find patterns in a string of text. However, the syntax for regular expressions is somewhat cryptic and difficult to devise. This is why, by my reckoning, approximately 97% of the regular expressions used in code today were copied and pasted from somewhere else. (Who knows who the original authors were? Some experts believe they were copied from ancient cave paintings in New Mexico.)

You can use regular expressions in your SAS programs, via the PRX* family of functions. These include PRXPARSE and PRXMATCH, among others. The classic example for regular expressions is to validate and standardize data values that might have been entered in different ways, such as a phone number or a zip code (with or without the plus-4 notation).

In this post I'll present another, less-trodden example. It's a regular expression that validates the syntax of a SAS variable name. (Now, I'm talking about the regular old traditional SAS variable names, and not those abominations that you can use with OPTIONS VALIDVARNAME=ANY.)

SAS variable names, as you know, can be 1 to 32 characters long, begin with a letter or underscore, and then contain letters, numbers, or underscores in any combination after that. If all you need is a way to validate such names, stop reading here and go learn about the NVALID function, which does exactly this. But if you want to learn a little bit about regular expressions, read on.

The following program shows the regular expression used in context:

data vars (keep=varname valid);
  length varname $ 50;
  input varname 1-50 ;
  re = prxparse('/^(?=.{1,32}$)([a-zA-Z_][a-zA-Z0-9_]*)$/');
  pos = prxmatch(re, trim(varname));
  valid = ifc(pos>0, "YES","NO");
cards;
var1
no space
1var
_temp
thisVarIsOver32charactersInLength
thisContainsFunkyChar!
_
yes_underscore_is_valid_name
run;

And the results:

Here's a closer look at the regular expression in "diagram" form, lightly annotated. (This reminds me a little of how we used to diagram sentences in 4th grade. Does anybody do that anymore?)

Among the more interesting aspects of this expression is the lookahead portion, which checks that the value is between 1 and 32 characters right out of the gate. If that test fails, the expression fails to match immediately. Of course, you could use the LENGTHN function to check that, but it's nice to include all of the tests in a single expression. I mean, if you're going to write a cryptic expression, you might as well go all the way.

Unless you live an alternate reality where normal text looks like cartoon-style swear words, there really isn't much that's "regular" about regular expressions. But they are extremely powerful as a data validation and cleansing tool, and they exist in just about every programming language, so it's a transferable skill. If you take the time to learn how to use them effectively, it's sure to pay off.

12 Comments

Eric Jackson on August 24, 2012 8:09 am

Any suggestions about resources to help learn more about regular expressions?
- Chris Hemedinger on August 27, 2012 1:26 pm
  
  Eric, regular-expressions.info is a great resource with lots of information and examples.
Edward Ballard on September 11, 2012 1:08 pm

And the regular expressions can be used to search or search and replace in the enhanced editor to clean up some code layout or datalines content.
- Chris Hemedinger on September 11, 2012 1:11 pm
  
  Yes, they sure can!
  
  And with SAS 9.3, regular expressions can also be used within SAS formats. They're popping up everywhere, so it's a great skill to learn!
Toby Dunn on September 11, 2012 2:24 pm

Chris,

One has to ask why do the many who don't know copy from the few who know how to write a RegEx? One possibility is the lack of good material on the specific flavor of RegEx in the language they are using.

Since the example is trying to validate you can skip the PrxParse, and jump right into a PrxMatch since the pattern doesnt vary from one observation to the next it will only be compiled once .

One can lose the look-ahead as that only waste cpu cycles. The RegEx optimizer will kick in when you use the beggining and end of line metacharacters and negate the need for the look-ahead.

Lose the caputure parens will speed it up as it wont need to store text in memory and not use it.

It can further be shortened down if you also use the /i pattern modifer to do a case insensitive match.

PrXMatch( '/^[a-z]\w{0,31}$/i' , Var )

Works faster on my unix box running 9.1.3 but should be faster on all boxes and its easier to type.
- Chris Hemedinger on September 11, 2012 2:36 pm
  
  Thanks Toby! I admit it: I don't know how to "think" like the regex parser -- clearly, you have a gift.
  
  Your suggested improvement left out the underscore as a possible starting character, but this fixes it, I think:
  
  prxmatch('/^[a-z_]\w{0,31}$/i', Var);
  
  Also, you make a good point about calling PrxParse too many times. The usual pattern I've seen in SAS is to run that statement only IF _N_ = 1 and then RETAIN the "re" result from the parse operation for use in subsequent observations.
  - Toby Dunn on September 14, 2012 5:02 pm
    
    Chris,
    
    Before my book deal got canceled I had written about this... When the SAS Perl RegEx's were introduced you had to use the If _N_ tehcnique... however, in the second generation of SAS Perl RegEx's Jason added the /o or compile once option as well as a good many of the Prx Functions if you hard code the Pattern it will only compile once.
    - Chris Hemedinger on September 17, 2012 4:47 pm
      
      Thanks Toby! That's great information. I hope you continue to share your RegExp expertise via SAS-L and via conference papers.
Chris (blog visitor) on September 11, 2012 6:26 pm

Toby's simplified syntax works fine as:
pos = prxmatch('/^[a-z_]\w{0,31}$/i', trim(varname));
Fmi, any way to remove the trim() and have the regexp handle the trimmimg? Should we?
Also why/when use PrxParse at all if PrxMatch accepts strings as the first parameter?
- Toby Dunn on September 14, 2012 5:05 pm
  
  I personally use the Strip Function.... that being said if you simply add a \s* to the end of your pattern it will specify 0 or more spaces... so regardless if one is there or not it will ensure a match....
  
  pos = prxmatch('/^[a-z_]\w{0,31}\s*$/i', varname ) ;
  
  The only down fall is it might add time onto your RegEx match.
Rick Wicklin on September 12, 2012 6:18 am

Did you know that approximately 97% of statistics are made up by blog authors? :-) If you use twitter and are interested in learning one regular expression tip a day, I encourage you to follow @RegexTip. If you don't use twitter, you can still view recent tips.
Pingback: Convert a text-based measurement to a number in SAS - The SAS Dummy

Blogs

Blogs

Using a regular expression to validate a SAS variable name

About Author

12 Comments