The WHERE clause in SAS/IML

8

In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements.

  • On the USE statement, the WHERE clause acts as a global filter. The where clause applies to all subsequent READ statements that read from the open data set.
  • On the READ statement, the WHERE clause further filters the data. Because you can use multiple READ statements, you can easily create matrices that contain disjoint or overlapping subsets of the data. However, be aware multiple READ statements might result in SAS reading the data multiple times, depending on the syntax that you specify.

An interesting fact about the WHERE clause in SAS/IML is that you can specify run-time expressions for the WHERE clause, which makes it a very powerful tool for data analysis.

The WHERE clause in SAS/IML: filter data with run-time expressions #SAStip Click To Tweet

The WHERE clause in the USE statement

The USE statement opens a SAS data set for read access. When you need only one filter, specify a WHERE clause on the USE statement. For example, suppose you want a matrix that contains the age, weight, and height of all females in the Sashelp.Class data. The following program reads the female observations into the matrix X and prints the average age, weight, and height:

proc iml;
varNames = {"Age" "Weight" "Height"};
use Sashelp.Class where(sex='F');       /* restrict to females */
   read all var varNames into X;
close Sashelp.Class;
 
avg = mean(X);
print avg[L="Mean Values for Females" colname=varNames format=5.2];
whereiml1

The WHERE clause in the READ statement

You can also put the WHERE clause in the READ statement. This technique is useful if you intend to read the data several times. For example, the following program reads data for females into the X matrix and data for males into the Y matrix:

use Sashelp.Class;
   read all var varNames into X where(sex='F');
   read all var varNames into Y where(sex='M');
close Sashelp.Class;

If you use a WHERE clause on both the USE and READ statements, the SAS log will include the NOTE
NOTE: WHERE clause has been augmented
to inform you that the data filter combines both WHERE clauses by using a "logical AND" operator.

Expressions in the WHERE clause

Beginning with SAS/IML 13.1 (released with SAS 9.4m1), you can use expressions in WHERE clauses. This means that you can call the READ statement in loop. During each iteration, you can read and analyze various subsets of the data during each iteration.

For example, suppose you have several grouping variables and you want to conduct a BY-group analysis. You can use the UNIQUEBY technique to conduct a BY-group analysis with several variables. However, the UNIQUEBY technique requires that the data be sorted and fit in RAM. It also requires a bit of "bookkeeping" because you need to keep track of indices. If you don't mind the inefficiency of reading the data multiple times, a WHERE clause approach is conceptually easier to program.

As an example, suppose you want to analyze the MPG_City variable in the Sashelp.Cars data set for each combinations of the Origin and Type variables. To keep it simple, suppose you want to compute the mean value of MPG_City for all pairwise combinations of the Origin and Type variables, excluding the observations for American-made vehicles. This analysis is simple by using PROC MEANS. (The output for PROC MEANS is not shown.)

proc means data=Sashelp.Cars Mean;
where Origin ^= "USA";
   class Origin Type;
   var MPG_City;
run;

In PROC IML, this computation requires looping over the valid combinations of Origin and Type. To make the analysis simpler, the following call to PROC FREQ writes the valid combinations to a SAS data set:

proc freq data=Sashelp.Cars noprint;
where Origin ^= "USA";
tables Origin*Type / nocum norow nocol nopercent
                     out=FreqOut;  /* unique combinations of Origin and Type */
run;

In PROC IML, you can read the FreqOut data to obtain the unique combinations of the Origin and Type variables. You can iterate over these combinations, reading the Sashelp.Cars data multiple times. During each iteration, you can analyze one of the BY groups, as follows:

proc iml;
use FreqOut;
   read all var {Origin Type};           /* read unique levels of BY groups */
close FreqOut;
NumGroups = nrow(Origin);
 
use Sashelp.Cars where(Origin ^= "USA"); /* open data set for reading */
Stats = j(NumGroups, 2);                 /* allocate vector for results */
do i = 1 to NumGroups;                   /* for each BY group... */
   /* read unsorted data to obtain the i_th BY group */
   /* Notice the EXPRESSIONS in the WHERE clause! */
   read all var {MPG_City} where(origin=(Origin[i]) & type=(Type[i])); 
   Stats[i, 1] = countn(MPG_City);       /* number of nonmissing obs */
   Stats[i, 2] = mean(MPG_City);         /* analyze this BY group */
end;
close Sashelp.Cars;
 
print Origin Type Stats[colname={"N" "Mean"}];
whereiml2

The result of the analysis is similar to the output from PROC MEANS. Notice the use of expressions in the WHERE clause in the READ statement. The expression origin=(Origin[i]) is interpreted as follows:

  • The left side of the equal sign (origin) specifies the name of a variable in the open data set.
  • The right side of the equal sign must be enclosed in parentheses unless it is a literal constant.
  • The expression inside the parentheses can be any matrix computation that results in a scalar value, including calls to built-in or user-defined functions.

The example program reads the data set 10 times, once for each unique combination of Origin and Type. Although re-reading data is inefficient, there are three advantages: the data set does not need to be sorted, only one BY group at a time is ever in RAM, and the program statements are easy to write. By using this method, you do not have to keep track of sorting, indexing, or extracting the data. The WHERE clause in SAS/IML does the work for you.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

8 Comments

  1. Peter Clemmensen on

    Hi Rick, thank you for another great post! In your first block of code, isn't the label suppose to say "Mean Values for Females", since you restrict to looking at only females using where(sex='F') in the use statement?

    Regards Peter.

  2. Rick,
    Why not use my start-end index skill which avoid to read table many times .

    proc sort data=sashelp.cars out=cars;
     by Origin Type;
    run;
     
    proc iml;
    use cars nobs nobs;
    read all var {Origin Type MPG_City};  
    close ;
    x=catx(' ',Origin,Type);
    start_end=t(loc(x^=t({' '}||remove(x,nobs)))) ||
              t(loc(x^=t(remove(x,1)||{' '})));
     
    O=Origin[start_end[,1]]; 
    T=Type[start_end[,1]];
    Stats=j(nrow(start_end),2,.);
    Stats[,1]=start_end[,2]-start_end[,1]+1;
    do i=1 to nrow(start_end);
    Stats[i,2]=mean(MPG_City[start_end[i,1]:start_end[i,2]]);
    end;
     
    print O T Stats[colname={"N" "Mean"}];
    quit;
    • Rick Wicklin

      Yes. I am not recommending that you read the data multiple times for this example. In the paragraph about "bookkeeping" I say that there are ways to analyze the BY groups without reading the data multiple times. I link to the UNIQUEBY technique, which is essentially what your code does. The point of the current example is to show that the right hand side of a WHERE clause can be a dynamic variable that changes during the course of the program.

    • I can see the start-end technique is quite useful, but I still feel the original blog post is an indispensable step in illustrating the straightforward usage of where-clause in use and read statements within SAS/IML. After the basic techniques introduced in Rick's post, the start-end technique comes in really handy. Both methods are great but come into play at different stages of learning. I can learn, and do enjoy, from both techniques.

  3. Rick,
    Rather than using 'print' statement in Proc IML, how to create a result saving as the sas dataset for further data manipulation just like "data;set;" method?

  4. Pingback: WHERE operators in SAS: Multiple comparisons and fuzzy matching - The DO Loop

Leave A Reply

Back to Top