Many people know that you can use "WHERE processing" in SAS to filter observations. A typical use is to process only observations that match some criterion. For example, the following WHERE statement processes only observations for male patients who have high blood pressure:
WHERE Sex='Male' & Systolic > 140;
In this statement, a character variable (SEX) is tested for equality with a specified value, and a numerical variable (SYSTOLIC) is tested for inequality. Most people know that you can use operators (such as '=' and '>') to filter observations, but did you know that SAS also supports using functions as part of a WHERE statement? This article gives a few quick examples and discusses a neat trick that you can use when you want to filter observations, but you cannot use a WHERE statement directly.
Where can you use WHERE?
In SAS, there are four ways to perform WHERE processing:
- The WHERE= data set option: This option is places after the name of the data set when you use the SET statement the DATA step or the DATA= option in a procedure. The WHERE= option reads only the observations that satisfy the criteria.
- The WHERE statement: This global statement filters observations that have been read into the DATA step or procedure. The WHERE statement filters those observations. Only the observations that satisfy the criteria are processed by the procedure.
- In a WHERE clause in the SQL procedure or in PROC IML. A WHERE clause might not support the full range of operators and functions that are supported in the WHERE statement.
- Use a DATA step view to create an indicator variable. In a subsequent call to a procedure, use a WHERE statement to process observations for which the indicator variable has a specific value. Technically, this is not truly WHERE-processing, but it is a very useful technique, so I want to include it in this list.
The WHERE statement and the WHERE= data set option support similar options. The examples in this article use the WHERE statement, but the examples work equally well if you use the WHERE= data set option. For brevity, I will primarily refer to "the WHERE statement" and will show only one example that uses the WHERE= data set option.
The SAS documentation includes a chapter about WHERE-expression processing, which I will refer to as "WHERE processing." The doc spends many pages describing many special-purpose operators that you can use in WHERE statements. However, it only briefly mentions that you can use many function calls as well.
Using operators and functions in a WHERE statement
Before discussing functions, I want to point out that you can use arithmetic operators: plus, minus, multiplication, and so forth) in a WHERE statement. For example, the following statements use arithmetic operators to include only certain observations:
data Have2; set sashelp.heart; where Systolic - Diastolic > 60 /* wide pulse pressure */ & weight / height**2 * 703 > 25; /* BMI > 25 */ run; |
The resulting data set contains only observations (patients in a heart study) for which the difference between the systolic and diastolic blood pressure is large, and for which the patient's body-mass index is also large.
The previous statements do not call any SAS functions in the WHERE statement. The most popular functions are probably string-manipulation functions such as the SUBSTR and SCAN functions. These functions enable you to filter observations according to the value of one or more character variables. The following example uses the UPCASE and SUBSTR functions to test whether the first two characters of the SEX variable indicate that the patient is male. The function calls will match strings like 'male', 'Male', and 'MALE'. For fun, I also show that you can use the SQRT function, which performs a square-root operation.
data Have3; set sashelp.heart; where upcase(substr(Sex, 1, 2))='MA' /* match first two characters of 'male' and 'Male' */ & sqrt(weight*703) / height > 5; /* alternative way to specify BMI */ run; |
Another popular character function is the SCAN function. The following example uses the WHERE= data set option to read only the observations for which the cause of death includes the word 'Disease' as the third word. Matching values include "Cerebral Vascular Disease" and "Coronary Heart Disease":
/* an example that uses the WHERE= data set option */ data Have4; set sashelp.heart( where=(scan(DeathCause, 3) = 'Disease') ); run; |
As mentioned earlier, the WHERE= data set option can be more efficient that the WHERE statement. The syntax requires additional parentheses but is otherwise similar to the WHERE statement.
WHERE clauses in SAS IML
A customer recently noticed that the WHERE clause in SAS IML supports only a small subset of the operators that the WHERE statement supports. Furthermore, it does not support calling functions like the SUBSTR or SQRT functions.
If you look at the documentation for the WHERE clause,
you will notice that the syntax is of the form
Variable operator Value
which means that you can only use variable names on the left side of the WHERE clause. You cannot use arithmetic operations on variables or functions of variables.
The following PROC IML statements show a valid syntax for a WHERE clause on the USE statement in PROC IML. An invalid syntax is shown in the comment:
proc iml; /* this WHERE clause works */ use sashelp.heart where( Sex='Male' & Systolic > 140 ); read all var {Systolic Diastolic}; close; /* The following statement is NOT supported in the WHERE clause: USE sashelp.heart WHERE(Systolic - Diastolic > 60 & weight / height**2 * 703 > 25); */ |
Although the WHERE clause on the USE statement does not support expressions that use arithmetic operations or function calls, there is a simple trick that enables you to use the power of the WHERE statement in the DATA step to filter data as you read observations into IML. The trick is to use the SUBMIT/ENDSUBMIT block to create a DATA step that filters the data. For example, the following statements create a DATA step view. The view is read by the USE and READ statements in the IML language. It filters the data at runtime while the observations are read:
submit; data _Filter / view=_Filter; set sashelp.heart; where upcase(substr(Sex, 1, 2))='MA' & sqrt(weight*703) / height > 5; run; endsubmit; use _Filter; read all var {Systolic Diastolic}; close; |
There might be times in which you want to subdivide data into two or more subsets but only process one of the subsets. In that case, you can use a DATA step view to create an indicator variable. You can then use the indicator variable in a WHERE clause in IML to read the data. For example, the following statements create a view that contains a binary indicator variable named _INCLUDE. The WHERE clause on the USE statement reads the observations for which _INCLUDE=1:
submit; data _Filter / view=_Filter; set sashelp.heart; _Include = (upcase(substr(Sex, 1, 2))='MA' & sqrt(weight*703) / height > 5); run; endsubmit; use _Filter WHERE(_Include=1); read all var {Systolic Diastolic}; close; |
Summary
This article serves as a reminder that the WHERE statement in SAS supports arithmetic operators and function calls, which means that you can use the WHERE statement to create sophisticated filters for your data. The WHERE= data set option is similar. However, some special-purpose languages in SAS support WHERE clauses that are more limited in their syntax. However, you can always use a DATA step view to filter data or to create an indicator variable that can be processed anywhere in SAS.
3 Comments
Hi Rick,
Thanks for good reminder about "the power of WHERE"!
Small add-on, if I may. The WHERE statement, as we all know, also supports index use. But, which is less popular, a function in WHERE may sometimes "break" index use.
For example, having dataset test:
1
2 options msglevel=I fullstimer;
3 data test(index=(date));
4 do date = 0 to 3e5;
5 output;
6 end;
7 run;
NOTE: The data set WORK.TEST has 300001 observations and 1 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index date has been defined.
we can write below and use index:
8
9 data test1;
10 set test;
11 where date = "25jul2022"d;
INFO: Index date selected for WHERE clause optimization.
12 run;
NOTE: There were 1 observations read from the data set WORK.TEST.
WHERE date='25JUL2022'D;
NOTE: The data set WORK.TEST1 has 1 observations and 1 variables.
But when we use function:
13
14 data test2;
15 set test;
16 where date = today();
17 run;
NOTE: There were 1 observations read from the data set WORK.TEST.
WHERE date=TODAY();
NOTE: The data set WORK.TEST2 has 1 observations and 1 variables.
index is not used. But(!) in that situation the %sysfunc() macro function may help:
18
19 data test3;
20 set test;
21 where date = %sysfunc(today());
INFO: Index date selected for WHERE clause optimization.
22 run;
NOTE: There were 1 observations read from the data set WORK.TEST.
WHERE date=22851;
NOTE: The data set WORK.TEST3 has 1 observations and 1 variables.
All the best
Bart
Hi Rick, I thought that using WHERE= and WHERE will have the same level of efficiency in a data step. Does WHERE= run faster?
Great question. If you are reading local data sets from your PC, it probably doesn't matter which option you use. However, if the data are on a remote server, in a database, or in the cloud, I recommend the WHERE= data set option. When you use the WHERE= data set option, SAS has the option to push the query to the server or database. Only those observations that satisfy the query are sent over the network to be processed. This is called "in-database processing."
In contrast, if you use the WHERE statement, all observations are sent across the network and those that do not satisfy the query are discarded. If the network is slow or transferring data is expensive, you should use the WHERE= data set option, which is more efficient.
In SAS Viya, this becomes even more important. If your data are in the cloud, you want the computations to occur in the cloud. You do not want to write a DATA step that brings a billion observations to the SAS client, only to discard many of them by using a WHERE statement.