In the immortal words of Britney Spears: Oops! I did it again.
At least, I'm afraid that I did. I think I might have helped a SAS student with a homework assignment, or perhaps provided an answer in preparation for a SAS certification exam. Or maybe it was a legitimate work-related question; I'd like to think so, anyway.
This time, the question came to me via LinkedIn. (By the way, LinkedIn contains a rich network of SAS professionals; in her blog post, Tricia provides some helpful guidance for making use of that network.)
The question pertains to some confusing behavior of the LAG function. Within a DATA step, the LAG function is thought to provide a peek into the value of a variable within a previous observation. But in this program, the LAG function didn't seem to be doing its job:
data test; infile datalines dlm=',' dsd; input a b c; datalines; 4272451,17878,17878 4272451,17878,17878 4272451,17887,17887 4272454,17878,17878 4272454,17881,17881 4272454,17893,17893 4272455,17878,17878 4272455,17878,18200 run; data testLags; retain e f ( 1 1); set test; if a=lag(a) and b>lag(b) then e=e+1; else if a^=lag(a) or lag(a)=. then e=1; if a^=lag(a) or lag(a)=. then f=1; else if a=lag(a) and b>lag(b) then f=f+1; run; proc print data=testLags; run;
The questioner thought that the e and f variables should have the same values in each record of output, but they don't. The two variables are calculated using the exact same statements, but with the seemingly-exclusive IF/THEN conditions reversed. Here's the output:
Obs e f a b c 1 1 1 4272451 17878 17878 2 1 1 4272451 17878 17878 3 2 2 4272451 17887 17887 4 1 1 4272454 17878 17878 5 2 1 4272454 17881 17881 6 3 2 4272454 17893 17893 7 1 1 4272455 17878 17878 8 1 1 4272455 17878 18200
There is a SAS note that warns of the effect of using the LAG function conditionally. But in this example, each set of LAG functions are used unconditionally (before the THEN clause). Or are they?
Let's review how the LAG function works. It draws values from a queue of previous values, and within each DATA step iteration that you call the LAG function, it draws a previous value from the queue. The trick here is that this program does not call the LAG function for both A and B with each iteration of the DATA step! Because the IF statements combine two conditions with an AND, if the first condition resolves to false, the second condition is not evaluated. After all, in logic-speak, FALSE AND (ANY value) is always FALSE, so the DATA step can save work by not bothering to evaluate the remainder of the expression.
if a=lag(a) /* if false*/ and b>lag(b) /* then this is not evaluated*/
And then the next time around, when the LAG(b) function is called again, it's "behind" one on the queue for the value of b.
One way to solve the issue (and remove the logical ambiguity): set two temporary variables to LAG(a) and LAG(b) at the start of the DATA step, and use those variables in the subsequent comparisons. With the LAG function now being called with each iteration no matter what, the results are consistent. Here's an example of the modified program:
data testlags2(drop=laga lagb); retain e f ( 1 1); set test; laga = lag(a); lagb = lag(b); if a=laga and b>lagb then e=e+1; else if a^=laga or laga=. then e=1; if a^=laga or laga=. then f=1; else if a=laga and b>lagb then f=f+1; run;
Here are the new results when printed:
Obs e f a b c 1 1 1 4272451 17878 17878 2 1 1 4272451 17878 17878 3 2 2 4272451 17887 17887 4 1 1 4272454 17878 17878 5 2 2 4272454 17881 17881 6 3 3 4272454 17893 17893 7 1 1 4272455 17878 17878 8 1 1 4272455 17878 18200