The Amazing COMPRESS Function

3

In the past, the COMPRESS function was useful. Since SAS version 9, it has become a blockbuster, and you might not have noticed. The major change was the addition of a new optional parameter called MODIFIERS.

The traditional use of the COMPRESS function was to remove blanks or a list of selected characters from a character string. The addition of a MODIFIER argument does two things. First, you can specify classes of characters to remove, such as all letters, all punctuation marks, or all digits. That is extremely useful, but the addition of the 'k' modifier is why I used the term blockbuster in my description. The 'k' modifier flips the function from one that removes characters from a string to one that keeps a list of characters and removes everything else. Let me show you some examples.

This first example stems from a real problem I encountered while trying to read values that contained units. My data looked something like this:

ID     Weight 
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg

My goal was to create a variable called Wt that represented the person's weight in pounds as a numeric value.

First, let’s look at the code. Then, I’ll give an explanation.

data Convert;
   length ID $3 Weight $8;
   input ID Weight;
 
   Wt = input(compress(Weight,,'kd'),8.);
   /* The COMPRESS function uses two modifiers, 'k' and 'd'.  This means
      keep the digits, remove anything else.  The INPUT function does the
      character-to-numeric conversion.
   */
 
   If findc(Weight,'k','i') then Wt = Wt * 2.2;
 
   /* the FINDC function is looking for an upper or lowercase 'k' in the
      original character string.  If found, it converts the value in
      kilograms to pounds (note: 1 kg = 2.2 pounds).
   */
 
datalines;
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg
;
title "Listing of Data Set Convert";
footnote "This program was run using SAS OnDemand for Academics";
proc print data=Convert noobs;
run;

The program reads the value of Weight as a character string. The COMPRESS function uses 'k' and 'd' as modifiers. Notice the two commas in the list of arguments. A single comma would interpret 'kd' as the second argument (the list of characters to remove). Including two commas notifies the function that 'kd' is the third argument (modifiers). You can list these modifiers in any order, but I like to use 'kd', and I think of it as "keep the digits." What remains is the string of digits. The INPUT function does the character-to-numeric conversion.

Your next step is to figure out if the original value of Weight contained an upper or lowercase 'k'. The FINDC function can take three arguments: the first is the string that you are examining, the second is a list of characters that you are searching for, and the third argument is the 'i' modifier that says, "ignore case" (very useful).

If the original character string (Weight) contains an uppercase or lowercase 'k', you convert from kilograms to pounds.

Here is the output:

There is one more useful application of the COMPRESS function that I want to discuss. Occasionally, you might have a text file in ASCII or EBCDIC that contains non-printing characters (usually placed there in error). Suppose you want just the digits, decimal points (periods), blanks, and commas. You need to read the original value as a text string. Let's call the original string Contains_Junk. All you need to convert these values is one line of code like this:

Valid = compress(Contains_Junk,'.,','kdas');

In this example, you are using all three arguments of the COMPRESS function. As in pre-9 versions of SAS, the second argument is a list of characters that you want to remove. However, because the third argument (modifiers) contains a 'k', the second argument is a list of characters that you want to keep. In addition to periods and commas, you use modifiers to include all digits, uppercase and lowercase letters (the 'a' modifier - 'a' for alpha), and space characters (these include spaces, tabs, and a few others such as carriage returns and linefeeds). If you did not want to include tabs and other "white space" characters, you could rewrite this line as:

Valid = compress(Contains_Junk,'., ','kda');

Here you are including a blank in the second argument and omitting the 's' in the modifier list.

You can read more about the COMPRESS function in any of the following books, available from SAS Press as an e-book or from Amazon in print form:

Or my latest programming book:

 

Questions and/or comments are welcome.

Share

About Author

Ron Cody

Private Consultant

Dr. Ron Cody was a Professor of Biostatistics at the Rutgers Robert Wood Johnson Medical School in New Jersey for 26 years. During his tenure at the medical school, he taught biostatistics to medical students as well as students in the Rutgers School of Public Health. While on the faculty, he authored or co-authored over a hundred papers in scientific journals. His first book, Applied Statistics and the SAS Programming Language, was first published by Prentice Hall in 1985 and is now in its fifth edition. Since then, he has published over a dozen books on SAS programming and statistical analysis using SAS. His latest book, A Gentle Introduction to Statistics Using SAS Studio was published this year. Ron has presented numerous papers at SAS Global forums, regional conferences, as well as local user groups. He is presently a contract instructor for SAS Institute and continues to write books on SAS and statistical topics.

3 Comments

  1. Kathleen Hanrahan on

    Ron,

    Helpful article. Thank you!

    Shouldn't the second code snippet include an "a" in the third argument - like so:
    Valid = compress(Contains_Junk,'., ','kda');

    Kathleen

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top