Need case-insensitive string comparisons? UPCASE!


String comparisons in SAS software are case-sensitive. For example, the uppercase letter "F" and the lowercase letter "f" are treated as distinct characters. When these two letters represent the same condition (for example, a female patient), the strings need to be handled in a case-insensitive manner, and a SAS programmer might write the following compound IF statement:

if sex="F" | sex="f" then do;
   /** person is female **/
end;

However, if you plan on comparing many strings (or long strings with mixed case), it is often more efficient to use the UPCASE function to convert the entire string variable to uppercase characters, as shown in the following statements:

s = upcase(sex);
if s="F" then do;
   /** person is female **/
end;

This blog post presents two tips that enable you to handle inconsistent string data in a uniform manner. You can use the tips in the DATA step and in the SAS/IML language.
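
As a minimal sketch of the DATA step usage, the following program applies UPCASE directly inside a subsetting IF statement. The data set names Patients and Females, the variable ID, and the data values are hypothetical and chosen only for illustration:

```sas
/* Hypothetical sample data: Sex is recorded with inconsistent case */
data Patients;
input ID Sex $;
datalines;
1 F
2 m
3 f
4 M
;
run;

/* Keep only the female patients, regardless of case */
data Females;
set Patients;
if upcase(Sex) = "F";   /* subsetting IF: matches "F" and "f" */
run;

proc print data=Females;
run;
```

With these data, the Females data set should contain the observations for ID 1 and ID 3. Calling UPCASE inside the condition avoids creating a temporary variable, which is convenient when the uppercase value is needed only once.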

How to UPCASE in SAS/IML Software

The UPCASE function is part of Base SAS software, and functions in Base SAS software can be called from SAS/IML software. If you call UPCASE on a SAS/IML matrix, the function converts every element in the matrix to uppercase.

I recently needed to use the UPCASE function to process data related to parameter estimates for a set of statistical models. I had some data, similar to those generated by the following DATA step, which I read into SAS/IML vectors:

data Models;
input Distribution $12. Param;
datalines;
ChiSquare   19
gamma       10
chisq       28
t           9
GAMMA       12
;
run;

proc iml;
use Models;
read all var {Distribution Param};
close Models;

I needed to extract all of the parameters for gamma distributions, but I knew that the values of the Distribution variable were inconsistent: some were indicated by "gamma" (lowercase), others by "GAMMA" (uppercase), and there might also be "Gamma" (mixed case).

Tip: Use the UPCASE function to compare strings in a case-insensitive manner.

You can use the UPCASE function to handle all these cases in a single IF clause or in a single call to the LOC function. The following statements convert the names of distributions to uppercase for easy comparison, and use the LOC function to extract the parameters for the rows that correspond to the gamma distribution:

/** simpler and more robust **/
d = upcase(Distribution);
idx = loc(d = "GAMMA");
print idx, (Param[idx])[label="Gamma Parameters"];
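
The LOC function returns an empty matrix when no element satisfies the condition, so it can be prudent to check the result before using it as a subscript. The following self-contained sketch types in the same values as matrix literals (an assumption made so that the example does not depend on the Models data set) and guards the PRINT statement with the NCOL function:

```sas
proc iml;
/* Same values as the Models data, entered directly for illustration */
Distribution = {"ChiSquare", "gamma", "chisq", "t", "GAMMA"};
Param = {19, 10, 28, 9, 12};

/* Case-insensitive search for the gamma models */
idx = loc(upcase(Distribution) = "GAMMA");

/* LOC returns an empty matrix when nothing matches, so test first */
if ncol(idx) > 0 then
   print (Param[idx])[label="Gamma Parameters"];
else
   print "No gamma models found";
quit;
```

For these values, the matching rows are the second and fifth, so the printed parameters should be 10 and 12.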

Use SUBSTR to Compare Truncated Values

There is a related trick that you can use to handle truncated values.

Tip: Use the SUBSTR function to compare the first few characters of strings.

The Distribution variable contains the values "ChiSquare" and "chisq", and both of these values need to be handled in a uniform way. You can use the SUBSTR function to truncate the uppercase values of Distribution to, say, five characters, as shown in the following statements:

/** find chi-square models **/
/** truncate to first 5 chars **/
d2 = substr(d, 1, 5); 
idx = loc(d2 = "CHISQ");
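
The two tips can also be combined in the DATA step. The following sketch re-creates the sample data with list input (an assumption made to keep the example self-contained and insensitive to column alignment) and keeps the chi-square models by comparing the first five characters of the uppercase names. The data set name ChiSqModels is hypothetical:

```sas
/* Sample data, read with list input so that spacing does not matter */
data Models;
length Distribution $12;
input Distribution $ Param;
datalines;
ChiSquare 19
gamma 10
chisq 28
t 9
GAMMA 12
;
run;

/* Uppercase the name, keep the first 5 characters, then compare */
data ChiSqModels;
set Models;
if substr(upcase(Distribution), 1, 5) = "CHISQ";
run;

proc print data=ChiSqModels;
run;
```

With these data, ChiSqModels should contain the two observations whose parameters are 19 and 28.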

These two tips enable you to robustly handle abbreviations and mixed-case spellings of data values.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
