Pitfalls of the LAG function

In the immortal words of Britney Spears: Oops! I did it again.

At least, I'm afraid that I did. I think I might have helped a SAS student with a homework assignment, or perhaps provided an answer in preparation for a SAS certification exam. Or maybe it was a legitimate work-related question; I'd like to think so, anyway.

This time, the question came to me via LinkedIn. (By the way, LinkedIn contains a rich network of SAS professionals; in her blog post, Tricia provides some helpful guidance for making use of that network.)

The question pertains to some confusing behavior of the LAG function. Within a DATA step, the LAG function is thought to provide a peek into the value of a variable within a previous observation. But in this program, the LAG function didn't seem to be doing its job:

data test;
  infile datalines dlm=',' dsd;
  input a b c;
  datalines;
4272451,17878,17878 
4272451,17878,17878 
4272451,17887,17887 
4272454,17878,17878 
4272454,17881,17881 
4272454,17893,17893 
4272455,17878,17878 
4272455,17878,18200 
run;
 
data testLags;
  retain e f ( 1 1);
  set test;
  if a=lag(a) and b>lag(b) then
    e=e+1;
  else if a^=lag(a) or lag(a)=. then
      e=1;
  if a^=lag(a) or lag(a)=. then
      f=1;
  else if a=lag(a) and b>lag(b) then
      f=f+1;
run;
 
proc print data=testLags;
run;

The questioner thought that the e and f variables should have the same values in each record of output, but they don't. The two variables are calculated using the exact same statements, but with the seemingly-exclusive IF/THEN conditions reversed. Here's the output:

Obs    e    f       a         b        c

 1     1    1    4272451    17878    17878
 2     1    1    4272451    17878    17878
 3     2    2    4272451    17887    17887
 4     1    1    4272454    17878    17878
 5     2    1    4272454    17881    17881
 6     3    2    4272454    17893    17893
 7     1    1    4272455    17878    17878
 8     1    1    4272455    17878    18200

There is a SAS note that warns of the effect of using the LAG function conditionally. But in this example, each set of LAG functions are used unconditionally (before the THEN clause). Or are they?

Let's review how the LAG function works. It draws values from a queue of previous values, and within each DATA step iteration that you call the LAG function, it draws a previous value from the queue. The trick here is that this program does not call the LAG function for both A and B with each iteration of the DATA step! Because the IF statements combine two conditions with an AND, if the first condition resolves to false, the second condition is not evaluated. After all, in logic-speak, FALSE AND (ANY value) is always FALSE, so the DATA step can save work by not bothering to evaluate the remainder of the expression.

  if a=lag(a) /* if false*/ and b>lag(b) /* then this is not evaluated*/

And then the next time around, when the LAG(b) function is called again, it's "behind" one on the queue for the value of b.

One way to solve the issue (and remove the logical ambiguity): set two temporary variables to LAG(a) and LAG(b) at the start of the DATA step, and use those variables in the subsequent comparisons. With the LAG function now being called with each iteration no matter what, the results are consistent. Here's an example of the modified program:

data testlags2(drop=laga lagb);
  retain e f ( 1 1);
  set test;
  laga = lag(a);
  lagb = lag(b);
  if a=laga and b>lagb then
    e=e+1;
  else if a^=laga or laga=. then
      e=1;
  if a^=laga or laga=. then
      f=1;
  else if a=laga and b>lagb then
      f=f+1;
run;

Here are the new results when printed:

Obs    e    f       a         b        c

 1     1    1    4272451    17878    17878
 2     1    1    4272451    17878    17878
 3     2    2    4272451    17887    17887
 4     1    1    4272454    17878    17878
 5     2    2    4272454    17881    17881
 6     3    3    4272454    17893    17893
 7     1    1    4272455    17878    17878
 8     1    1    4272455    17878    18200

3 Comments

Chang Y. Chung on January 4, 2012 11:40 am

This posting is mixing up two different topics: (1) LAG function being a simple queue and (2) short-circuiting evaluation of logical expressions.

The LAG function has been talked/written about a lot already. The short-circuiting evaluation has not been and should be, since this does not happen always and the conditions under which the short-circuiting happens is rather complicated.

For an in depth discussion, as of back in 2003, see the sas-l posting thread starting at this one. Would love to know anything has been changed since.
- Chris Hemedinger on January 4, 2012 2:30 pm
  
  Yes, you're correct. This combines two issues: nuances of the LAG function and the way logic can be "short-circuited" in an IF/THEN statement. Many programming languages have similar behaviors for only partially evaluating a logical statement, and knowing how it works can help you to create more efficient programs. However, it's often unwise to include a function that can *change* a value within a logical test (ex in C/C++/Java: if x==null && c++>3). LAG is one of those functions that changes the state of the queue by invoking it, and that's not always obvious, is it?
Pingback: Short-circuit evaluation and logical ligatures in SAS - The DO Loop

Blogs

Blogs

Pitfalls of the LAG function

About Author

3 Comments