Since we added the new "Recommended by SAS" widget in the SAS Support Communities, I often find myself diverted to topics that I probably would not have found otherwise. This is how I landed on this question and solution from 2011 -- "How to convert 5ft. 4in. (Char) into inches (Num)". While the question was deftly answered by my friend (and SAS partner) Jan K. from the Netherlands, the topic inspired me to take it a step further here.
Jan began his response by throwing a little shade on the USA:
Short of moving to a country that has a decent metric system in place, I suggest using a regular expression.
On behalf of my nation I just want say that, for the record, we tried. But we did not get very far with our metrication, so we are stuck with the imperial system for most non-scientific endeavors.
Matching patterns with a regular expression
Regular expressions are a powerful method for finding specific patterns in text. The syntax of regular expressions can be a challenge for those getting started, but once you've solved a few pattern-recognition problems by using them, you'll never go back to your old methods.
Beginning with the solution offered by Jan, I extended this program to read in a "ft. in." measurement, convert to the component number values, express the total value in inches, and then convert the measurement to centimeters. I know that even with my changes, we can think of patterns that might not be matched. But allow me to describe the updates:
- Long-time users of PRXPARSE in SAS will notice that I did not apply the common practice of wrapping PRXPARSE in an "IF _N_=1" condition. For several releases of SAS, it's been the case that when using a constant parse pattern the DATA step will compile the regular expression just once. Parsing the expression hundreds or thousands of times across each DATA step iteration would be a performance concern -- but since in our case the pattern never changes, SAS compiles it just once.
- The PRXPOSN function returns the nth "capture buffer" from the pattern match. Capture buffers are identified in the pattern by parentheses -- you count each open parenthesis to arrive at the expected buffer. So the first buffer matches on the first sequence of digits: (\d*). The second buffer is the optional whitespace between ft. and in.: (\s*). Buffer 3 is the entire pattern for "n in.": ((\d?\.?\d?)in.). Finally, buffer 4 is an "inner" capture group of buffer 3, containing just the sequence of digits with optional decimal for inches: (\d?\.?\d?)
- The PRXPOSN function returns the text value of the match, so we have to use the INPUT function to convert that to a SAS numeric value.
- Finally, I added the calculations to convert to total inches, and then centimeters.
Here's my program, followed by the result:
data measure; length original $ 25 feet 8 inches 8 total_inches 8 total_cm 8; /* constant regex is parsed just once */ re = prxparse('/(\d*)ft.(\s*)((\d?\.?\d?)in.)?/'); input; original = _infile_; if prxmatch(re, original) then do; feet = input ( prxposn(re, 1, original), best12.); inches = input ( prxposn(re, 4, original), best12.); if missing(inches) and not missing(feet) then inches=0; end; else original = "[NO MATCH] " || original; total_inches = (feet*12) + inches; total_cm = total_inches * 2.54; drop re; cards; 5ft. 4in. 4ft 0in. 6ft. 10in. 3ft.2in. 4ft. 6ft. 1.5in. 20ft. 11in. 25ft. 6.5in. Soooooo big ; |
Other tools to help with regular expressions
The Internet offers a plethora of online tools to help developers build and test regular expression syntax. Here's a screenshot from RegExr.com, which I used to test some aspects of my program.
Tools like these provide wonderful insight into the capture groups and regex directives that will influence the pattern matching. They are part tutorial, part workbench for regex crafters at all levels.
Many of these tools also include community contributions for matching common patterns. For example, many of us often need to match/parse data with e-mail addresses, phone numbers, and other tokens as data fields. Sites like RegExr.com include the syntax for many of these common patterns that you can copy and use in your SAS programs.
See Also
- Using a regular expression to validate a SAS variable name
- In his book High-Performance SAS Coding, Christian Graffeuille (ChrisNZ on the communities) devotes a chapter to using regular expressions in SAS. (He also answers a TON of regex questions on the community.)
- Ron Cody also teaches about regular expressions for data cleansing -- see his blog post and even more in his book, Cody's Data Cleansing Techniques using SAS.
9 Comments
Thanks for the mention Chris! Good article. And I learned a new thing: you no longer need the "if _n_=1" construct any longer which will simplify my code from now on. It also shows that, to be a good coder, you need to be a life-long learner. Otherwise your knowledge will go stale without you realizing it. I am confronted with other peoples code on a daily basis and there is no shortage of examples where regular expressions would eliminate complex and lengthy sequences of other text functions. Spread the word!
I agree -- once you begin seeing solutions in "regex" -- it's difficult to stop seeing the possibilities. Pro tip: always pepper your regex code with comments. Deciphering even a moderately complex pattern later is challenging on its own.
In case the code is copied in order to convert heights in a productive envrionment, please note that the expression is not optimal. Consider the following cases...
6ft10in
6ft10.5in
6ft 10.5in
I knew there were some holes -- would love to see an improved version!
Maybe you can put that REGEX "stuff" into a user-defined function.
My attempt at a function to read fractions, no REGEX though ...
http://www.sascommunity.org/wiki/Create_an_Informat_from_a_User-Defined_Function
As well as user defined functions, Rock Langston also packaged regex so that these could be at the heart of a user defined informat.
That would look impressive - to create informats for feet and inches and pounds and ounces.....
For Rock, read Rick
Although Rick was always a rockstar
Excellent; thank you. 🙂
Now I need to extend the code to work with 5'10", 5 feet 10 inches, 5 ft 10 in, and the various abbreviations for feet, inches, and centimeters. And, of course do similarly for weights. 🤔
I leave that to you -- share it back when you have it working!