How to simulate a LEAD function with DATA step processing

Within the SAS DATA step, the LAG function is provided to return a variable’s value from a previous data set observation. With certain data criteria, sometimes there is a need to look ahead at the next observation and you would expect to use a LEAD function, but this does not exist. Why is that?

All SAS data sets are read sequentially one observation at a time into the PDV (Program Data Vector) for each DATA step iteration. When processing the current observation, you cannot read in the next observation without losing or replacing the current data in the PDV. It is kind of like listening to music. You cannot listen to one song track and jump ahead to the next song track to listen to both songs at the same time. Therefore, with sequential processing it is not possible to design what would be called a LEAD function, but the good news is you can simulate this functionality by using the DATA step, the MERGE statement, and a few data set options.

The main trick to simulating a LEAD function is by merging a data set with itself and using the FIRSTOBS=2 data set option on the second read of the data set. This will offset the observations read by one observation, allowing you to have both the current observation value and the next observation value in the PDV at the same time. The next trick to having the simulation of a LEAD function work is to use the KEEP= and RENAME= data set options on the second read of the data set. When reading the same data set for the second time, you only want to keep the variables needed for processing, and you must rename those variables to have unique names. This allows all values needed for the current observation and from the next observation to be in the PDV during the same DATA step iteration.

There are times when processing data you not only need to retain a previous variable value, but also need to look ahead at the next observation’s value. This means you need to lag and lead at the same time. Let’s look at three example programs to help see how to set this up and how the processing works.

Example 1: Replacing a missing value

The first example shows how to replace a missing value with the average of the most recent (previous) non-missing value and the next non-missing value. The MERGE statement is used to simulate a LEAD function. In the MERGE statement the EXAMPLE1 data set is listed once, and then again using the FIRSTOBS=2 data set option and the RENAME= data set option to rename VALUE to NEXTVAL. There is no need to use a KEEP= data set option since only one variable is being processed and only needs to be renamed. The LAG function is used to return the previous observation’s value of the VALUE variable to a new variable called LAGVAL. The last statement is an IF statement to check if the variable VALUE is missing, and when the condition is true, VALUE is replaced with the mean of the LAGVAL and NEXTVAL variables using the MEAN function.

data example1;                                                 
input value;                                                  
datalines;                                                    
10.0                                                          
.                                                             
15.0                                                          
18.0                                                          
.                                                             
12.0                                                          
17.0                                                          
29.0                                                          
.                                                             
20.0                                                          
;                                                             
 
data example1;                                                 
merge example1 example1(firstobs=2 rename=(value=nextval));     
lagval=lag(value);                                            
if value=. then value=mean(lagval,nextval);                   
run;                                                          
 
proc print data=example1;                                                   
run;

The output of the PROC PRINT for the EXAMPLE1 data set is:

          The SAS System
 
 Obs    value    nextval    lagval
 
   1     10.0        .         .
   2     12.5       15        10
   3     15.0       18         .
   4     18.0        .        15
   5     15.0       12        18
   6     12.0       17         .
   7     17.0       29        12
   8     29.0        .        17
   9     24.5       20        29
  10     20.0        .         .

Example 2: Creating a previous date

The second shows how to create a previous date and a next date for patient appointments. Since the data set is sorted to group patient data together, two DATA steps are needed because BY processing cannot be used in the step to merge the data set with itself. The first DATA step creates the PREVIOUS_DATE with the LAG function. This additional step is needed to first create the lagged values and then to use BY group processing in order to set the PREVIOUS_DATE to missing on the first observation of each PATIENT. This will prevent one patient’s date from being lagged into the next patient.

The second DATA step uses the MERGE statement to simulate the LEAD function. The EXAMPLE2 data set is listed once in the MERGE statement, and then again using the FIRSTOBS=2, KEEP=, and RENAME= data set options. The KEEP= data set option is used only to read the PATIENT and DATE variables for the second read of the data set. The RENAME= data set option renames the PATIENT variable to PATIENT2 and the DATE variable to NEXT_DATE. To prevent the NEXT_DATE from containing another patient’s date, an IF statement is used to check when the PATIENT value is not equal to the PATIENT2 value and sets NEXT_DATE to missing. The PATIENT2 variable is not needed in the output data set and therefore is listed in a DROP statement.

data example2;                                             
input patient $ date : date9.;                            
format date date9.;                                       
datalines;                                                
A 06jan2015                                               
A 12feb2015                                               
A 03mar2015                                               
A 10apr2015                                               
B 05jan2015                                               
B 18feb2015                                               
B 04may2015                                               
C 06jan2015                                               
C 12feb2015                                               
C 15apr2015                                               
C 07jun2015                                               
;                                                         
 
data example2;                                             
set example2;                                              
by patient;                                               
previous_date=lag(date);                                  
if first.patient then previous_date=.;                    
format previous_date date9.;                              
run;                                                      
 
data example2;                                             
merge example2 example2(firstobs=2 keep=patient date        
                        rename=(patient=patient2 date=next_date));  
if patient ne patient2 then next_date=.;                      
drop patient2;                                                
run;                                                      
 
proc print data=example2;                                  
run;

The output of the PROC PRINT for the EXAMPLE2 data set is:

The SAS System
 
                               previous_
Obs    patient         date      date       next_date
 
  1       A       06JAN2015            .    12FEB2015
  2       A       12FEB2015    06JAN2015    03MAR2015
  3       A       03MAR2015    12FEB2015    10APR2015
  4       A       10APR2015    03MAR2015            .
  5       B       05JAN2015            .    18FEB2015
  6       B       18FEB2015    05JAN2015    04MAY2015
  7       B       04MAY2015    18FEB2015            .
  8       C       06JAN2015            .    12FEB2015
  9       C       12FEB2015    06JAN2015    15APR2015
 10       C       15APR2015    12FEB2015    07JUN2015
 11       C       07JUN2015    15APR2015            .

Example 3: Merging a data set

The third example shows how to merge a data set with itself twice to be able to calculate the average amount, compare the amount to the average, and calculate the number of day’s difference from a previous date within each BY group. All BY groups of the data set contain three observations and the calculated average needs to be shown on each observation within each BY group. To be able to look ahead at the next two amount values in order to create the average, the data set is merged with itself twice. Since the data set is sorted by the group and date, two DATA steps are needed because BY processing cannot be used in the step to merge the data set with itself twice.

The first DATA step creates the PREVIOUS_DATE with the LAG function. This additional step is needed to first create the lagged date values and then to use BY group processing in order to set the PREVIOUS_DATE to missing on the first observation of each GROUP. This will prevent one group’s date from being lagged into the next group.

The second DATA step uses the MERGE statement to simulate the LEAD function. The EXAMPLE3 data set is listed once in the MERGE statement, a second time using the FIRSTOBS=2, KEEP=, and RENAME= data set options, and then a third time using the FIRSTOBS=3, KEEP=, and RENAME= data set options. The KEEP= data set option is used only to read the GROUP and AMOUNT variables for the second and third read of the data set. The RENAME= data set option renames the GROUP variable to GROUP2 and GROUP2, and the AMT variable to AMT2 and AMT3.

To prevent any amount values from being carried into the next group, two IF statements are used to reset the AMT2 and AMT3 variables to missing. A FLAG variable is initialized with a 0 (zero) value and both the FLAG and new AVG_AMT variables are listed in a RETAIN statement to retain their values. When the FLAG is equal to 0, the AVG_AMT is calculated using the MEAN function on AMT, AMT2, and AMT3 variables and then the FLAG is set to 1. A series of IF and ELSE IF conditions are used to compare the AMT value to the AVG_AMT and create a new variable AMT_COMPARED_TO_AVG with the result of “More”, “Less”, or the “Same”. The final DAYS_DIFFERENCE variable is created by first being set to 0 (zero) at the top of the group when the PREVIOUS_DATE is missing. The remaining observations calculate the DAYS_DIFFERENCE by subtracting the PREVIOUS_DATE from the DATE. The DROP statement is used to drop the GROUP2, GROUP3, AMT2, AMT3, and FLAG variables since they are not necessary for the output data set.

data example3;                                          
input group $ date : yymmdd8. amt;                      
format date yymmdd10.;                                  
datalines;                                              
a 20150227 10                                           
a 20150305 5                                            
a 20150309 15                                           
b 20150310 10                                           
b 20150313 15                                           
b 20150316 20                                           
c 20150223 15                                           
c 20150304 5                                            
c 20150312 10                                           
d 20150305 15                                           
d 20150310 10                                           
d 20150312 5                                            
;                                                       
 
data example3;                                          
set example3;                                           
by group date;                                          
previous_date=lag(date);                                
if first.group then previous_date=.;                    
format previous_date yymmdd10.;                         
run;                                                    
 
data example3;                                          
merge example3                                          
      example3(firstobs=2 keep=group amt                
               rename=(group=group2 amt=amt2))          
      example3(firstobs=3 keep=group amt                
               rename=(group=group3 amt=amt3));         
if group2 ne group3 then amt3=.;                        
if group ne group2 then do;                             
  amt2=.;                                               
  amt3=.;                                               
  flag=0;                                               
end;                                                    
retain flag 0 avg_amt;                                  
if flag=0 and group = group2 then do;                   
  avg_amt=mean(amt,amt2, amt3);                         
  flag=1;                                               
end;                                                    
format avg_amt 6.2;                                     
if amt &gt; avg_amt then amt_compared_to_avg="More";       
else if amt &lt; avg_amt then amt_compared_to_avg="Less";  
else amt_compared_to_avg="Same";                        
if previous_date=. then days_difference=0;              
else days_difference=date-previous_date;                
drop group2 group3 amt2 amt3 flag;                      
 
proc print;                                             
run;

The output of the PROC PRINT for the EXAMPLE3 data set is:

The SAS System                           
 
                                      previous_               amt_compared_       days_
Obs    group          date    amt          date    avg_amt       to_avg        difference
 
  1      a      2015-02-27     10             .     10.00         Same              0
  2      a      2015-03-05      5    2015-02-27     10.00         Less              6
  3      a      2015-03-09     15    2015-03-05     10.00         More              4
  4      b      2015-03-10     10             .     15.00         Less              0
  5      b      2015-03-13     15    2015-03-10     15.00         Same              3
  6      b      2015-03-16     20    2015-03-13     15.00         More              3
  7      c      2015-02-23     15             .     10.00         More              0
  8      c      2015-03-04      5    2015-02-23     10.00         Less              9
  9      c      2015-03-12     10    2015-03-04     10.00         Same              8
 10      d      2015-03-05     15             .     10.00         More              0
 11      d      2015-03-10     10    2015-03-05     10.00         Same              5
 12      d      2015-03-12      5    2015-03-10     10.00         Less              2

Of course there are usually several programming methods to get to the same result, but when you have the need to look back at a previous value as well as look ahead to see the next value, the above examples can give you a good coding start to help with the type of logic needed to lag and lead at the same time.

For other examples of simulating a LEAD function and using the LAG function within BY-Group processing see:

Sample 36199: Simulating a LEAD function using the POINT= option on the SET statement

Sample 24666: Simulating the LEAD function by comparing the value of a variable to its value in the next observation

Sample 24694: Obtaining the previous value of a variable within a BY-Group

11 Comments

TJR on August 25, 2020 2:29 pm

When using the first method (merging data sets to obtain a 'lead' variable), my resulting data set keeps deleting the very first observation. I'm not sure why. Any ideas?
- TJR on August 25, 2020 2:32 pm
  
  I forgot to also mention that even though it seems like it is deleting the first observation, the total number of observations in the resulting data set matches the original, so I'm not sure where the observation is haha
  - Leonid Batkhan on August 25, 2020 3:35 pm
    
    The "first method" as it's described in the blog post works as expected. Are you sure your code is the same as described here? Would you share your code so I can see what is wrong?
    - TJR on August 25, 2020 3:53 pm
      
      Sure!
      This is the code that I am running:
      
      data data2; merge data1 data1(firstobs=2 rename=(Date1=lead1_Date)); run;
      
      Where "data1" consists of an ID variable and 'Date1'.
      In "data1", the first ID has 3 observations and when I perform the data step above, they are only showing 2 observations, though the total number of observations in the "data2" matches that of "data1". I'm not sure how. I'm sure I'm missing something very fundamental here haha.
      - Leonid Batkhan on August 26, 2020 9:34 am
        
        Hi TJR,
        Since ID variable is on both merged data sets it's value from the second one overwrites ID value from the first data set. You should either drop it from the second data set or rename to something else, e.g. merge data1 data1(firstobs=2 rename=(Date1=lead1_Date ID=lead_ID)); or merge data1 data1(firstobs=2 rename=(Date1=lead1_Date) drop=ID);
        Please note fundamental difference between your use case and the one described in this post: you have more than one variable. You might want to explore a different approach described in my blog on calculations across SAS dataset observations.
        Hope this helps.
Shraddha on April 16, 2020 11:28 am

Excellent article!!
It helped me a great deal in understanding the theory underlying and your well explained examples are simply awesome 🙂
Leonid Batkhan on January 5, 2018 10:43 am

This is a very robust way to access and combine previous as well as consecutive observations in a data step. Here is a complementary approach combining lag() and lead() into single hop(): Hopping for the best - calculations across SAS dataset observations.
Tom B on July 21, 2017 9:12 am

if you have a date or other ordering variable my method would be to:
sort your dataset by descending ordering variable/apply the lag function/sort by ordering variable.
So basically, your lag becomes your lead.
Mark Keintz on February 14, 2017 10:32 pm

Using the merge technique is an excellent way to simulate leads, although you can't successfully use a BY statement linked to the MERGE.

But you can continue to enjoy the benefits of last.group and first.group through use of a BY statement associated with a separate SET. Just make sure to use a "keep=" option limited to the needed by variable(s):
SET example3 (keep=group);
by group;
Sunil on July 15, 2015 2:36 pm

Nice tip to complement the LAG() function. Now we have the best of both worlds!
Art Carpenter on June 21, 2015 6:32 pm

Great tip! A similar technique utilizes multiple SET statements instead of the MERGE (less overhead for big data sets). Also if using the MERGE without a BY, consider including the system option MERGENOBY=NOWARN.

Blogs

Blogs

Can you Lag and Lead at the same time? If using the SAS DATA step, yes you can

About Author

11 Comments