Many programmers are familiar with "short-circuit" evaluation in an IF-THEN statement. Short circuit means that a program does not evaluate the remainder of a logical expression if the value of the expression is already logically determined. The SAS DATA step supports short-circuiting for simple logical expressions in IF-THEN statements and WHERE clauses (Polzin 1994, p. 1579; Gilsen 1997). For example, in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is false:
data _null_;
   set A end=eof;
   if x>0 & y>0 then   /* Y is evaluated only if X>0 */
      count + 1;
   if eof then
      put count=;
run;
Order the conditions in a logical statement by likelihood
SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:
- In a logical AND statement, put the least likely events first and the most likely events last. For example, suppose you want to find patients at a VA hospital that are male, over the age of 50, and have kidney cancer. You know that kidney cancer is a rare form of cancer. You also know that most patients at the VA hospital are male. To optimize a WHERE clause, you should put the least probable conditions first:
  WHERE Disease="Kidney Cancer" & Age>50 & Sex="Male";
- In a logical OR statement, put the most likely events first and the least likely events last. For example, suppose you want to find all patients at a VA hospital that are either male, or over the age of 50, or have kidney cancer. To optimize a WHERE clause, you should use
  WHERE Sex="Male" | Age>50 | Disease="Kidney Cancer";
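To see why the order matters, it can help to count how many conditions are evaluated on average. The following DATA step is only a back-of-the-envelope sketch: the three probabilities are made-up values (not estimates from real data), and the arithmetic assumes that the conditions are independent. For a three-term AND expression, the second condition is evaluated only when the first is true, and the third is evaluated only when the first two are true.

```sas
data _null_;
   /* hypothetical probabilities, chosen only for illustration */
   pCancer = 0.01;    /* P(Disease="Kidney Cancer") */
   pAge    = 0.60;    /* P(Age>50)                  */
   pMale   = 0.90;    /* P(Sex="Male")              */

   /* expected number of conditions evaluated per observation
      for a three-term AND, assuming independent conditions */
   rareFirst   = 1 + pCancer + pCancer*pAge;   /* Disease & Age & Sex */
   commonFirst = 1 + pMale   + pMale*pAge;     /* Sex & Age & Disease */
   put rareFirst= commonFirst=;
run;
```

Under these assumed values, the rare-first ordering evaluates about 1.016 conditions per observation, whereas the common-first ordering evaluates about 2.44. The same reasoning applies to an OR expression, with the roles of "true" and "false" reversed.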
The SAS documentation does not discuss the conditions under which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in the logical expression, SAS evaluates certain function calls that produce side effects, even if the value of the expression is already determined. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement. You can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
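The last two suggestions can be sketched in a small DATA step. This is a minimal illustration rather than a recommended template: the seed, the loop over X, and the counter names (nBoth, nPosLog) are all invented for the example. The first pattern calls RAND before the IF-THEN statement so that the call happens on every iteration. The second pattern uses a nested IF so that LOG is called only when its argument is positive, which avoids relying on short-circuit behavior to prevent an invalid-argument note in the log.

```sas
data _null_;
   call streaminit(1234);          /* arbitrary seed for the example */
   do x = -2 to 2;
      /* Force evaluation: call the function BEFORE the IF-THEN
         statement so that it runs on every iteration */
      u = rand("Uniform");
      if x > 0 & u < 0.5 then nBoth + 1;

      /* Guard evaluation: the nested IF ensures that LOG(x)
         is called only when x is positive */
      if x > 0 then do;
         if log(x) > 0 then nPosLog + 1;
      end;
   end;
   put nBoth= nPosLog=;
run;
```

Both patterns make the order of evaluation explicit in the program, so the result does not depend on whether a particular expression short circuits.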
Logical ligatures
The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appears anywhere in a vector, as shown in the following SAS/IML statement:
bAny = any(y = .);     /* TRUE if any element of Y is missing */
/* Equivalently, use bAll = all(y ^= .); */
The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)
You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon as it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.
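Conceptually, the one-pass evaluation behaves like a loop that stops at the first missing value. The following module is a sketch of that idea only. AnyMissing is a hypothetical name written for this illustration; it is not how SAS/IML implements the ANY function internally, and an interpreted loop like this is much slower than the built-in call.

```sas
proc iml;
/* AnyMissing is a hypothetical module, for illustration only */
start AnyMissing(v);
   n = nrow(v) * ncol(v);          /* number of elements in v */
   found = 0;
   i = 1;
   do while (i <= n & ^found);     /* stop at the first missing value */
      if v[i] = . then found = 1;
      i = i + 1;
   end;
   return(found);                  /* 0 if no element is missing */
finish;

y = {., 1, 2, 3};
b1 = AnyMissing(y);    /* the loop stops after examining y[1] */
b2 = any(y = .);       /* the built-in test gives the same answer */
print b1 b2;
quit;
```

The loop never allocates a binary matrix the size of Y, which is the essential difference from the two-step process.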
Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:
proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x; y[1] = .;   /* the first element of y is missing */

/* the ALL and ANY functions short-circuit when the
   argument is a simple logical expression */
/* these function calls examine only the first element */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);    /* TRUE because y[1] is missing */
   bAll = all(y ^= .);   /* FALSE because y[1] is missing */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:
/* two-step process: slower */
b1 = (y=.);        /* form the binary vector */
bAny = any(b1);    /* TRUE for y[1] */
In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.
Short circuits can be perplexing if you don't expect them. Equally confusing is a statement that you expect to short circuit but that doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.
6 Comments
Thank you for yet another nice article Rick.
"in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is true".
Wouldn't this hold true only in a logical-OR expression?
Thanks for the sharp eyes. Yes, the sentence should read "the variable Y does not need to be checked if the condition for variable X is FALSE".
I will make that change to the text and reread the article to see if I made any other mistakes.
It seems that using the IN() operator disables short-circuit evaluation, sadly.
See https://communities.sas.com/t5/SAS-Programming/in-affecting-missing-values/m-p/595470/
Many thanks, Rick, for picking up this delicate topic and for the links to related articles.
Regarding Chris Hemedinger's blog post "Pitfalls of the LAG function" (2012): In a 2018 discussion in the SAS Support Communities evidence was presented that the behavior of the LAG function described in that post can be explained without assuming Boolean short circuiting and that the results would have been different had short circuiting occurred: https://communities.sas.com/t5/SAS-Programming/If-statement-Short-circuiting-and-Lag-function/m-p/473693 and https://communities.sas.com/t5/SAS-Programming/If-statement-Short-circuiting-and-Lag-function/m-p/473832. This would be consistent with Polzin's statement about side effects preventing short circuiting.
The LAG (and DIF) functions are handled differently than most other SAS functions because they need to maintain a memory queue and therefore (to work correctly) need to be called on every observation. If someone is trying to understand short circuiting, I would suggest they start with simple regular functions, not LAG.
When I use the LAG function and complex logical conditions, I assign the lagged value to a variable (L1 = lag(X)) and then perform logical operations with the variable. If you do this, I think both LAG and short-circuiting perform as you expect them to.
I'm puzzled. In this statement: "if 0 & yrdif(.,19000) then a=1; else a=2;" I get errors on the yrdif function, which clearly should not be evaluated at all?