Within the SAS DATA step, the LAG function is provided to return a variable’s value from a previous data set observation. With certain data criteria, sometimes there is a need to look ahead at the next observation and you would expect to use a LEAD function, but this does not exist. Why is that?
All SAS data sets are read sequentially one observation at a time into the PDV (Program Data Vector) for each DATA step iteration. When processing the current observation, you cannot read in the next observation without losing or replacing the current data in the PDV. It is kind of like listening to music. You cannot listen to one song track and jump ahead to the next song track to listen to both songs at the same time. Therefore, with sequential processing it is not possible to design what would be called a LEAD function, but the good news is you can simulate this functionality by using the DATA step, the MERGE statement, and a few data set options.
The main trick to simulating a LEAD function is by merging a data set with itself and using the FIRSTOBS=2 data set option on the second read of the data set. This will offset the observations read by one observation, allowing you to have both the current observation value and the next observation value in the PDV at the same time. The next trick to having the simulation of a LEAD function work is to use the KEEP= and RENAME= data set options on the second read of the data set. When reading the same data set for the second time, you only want to keep the variables needed for processing, and you must rename those variables to have unique names. This allows all values needed for the current observation and from the next observation to be in the PDV during the same DATA step iteration.
There are times when processing data you not only need to retain a previous variable value, but also need to look ahead at the next observation’s value. This means you need to lag and lead at the same time. Let’s look at three example programs to help see how to set this up and how the processing works.
Example 1: Replacing a missing value
The first example shows how to replace a missing value with the average of the most recent (previous) non-missing value and the next non-missing value. The MERGE statement is used to simulate a LEAD function. In the MERGE statement the EXAMPLE1 data set is listed once, and then again using the FIRSTOBS=2 data set option and the RENAME= data set option to rename VALUE to NEXTVAL. There is no need to use a KEEP= data set option since only one variable is being processed and only needs to be renamed. The LAG function is used to return the previous observation’s value of the VALUE variable to a new variable called LAGVAL. The last statement is an IF statement to check if the variable VALUE is missing, and when the condition is true, VALUE is replaced with the mean of the LAGVAL and NEXTVAL variables using the MEAN function.
data example1; input value; datalines; 10.0 . 15.0 18.0 . 12.0 17.0 29.0 . 20.0 ; data example1; merge example1 example1(firstobs=2 rename=(value=nextval)); lagval=lag(value); if value=. then value=mean(lagval,nextval); run; proc print data=example1; run;
The output of the PROC PRINT for the EXAMPLE1 data set is:
The SAS System Obs value nextval lagval 1 10.0 . . 2 12.5 15 10 3 15.0 18 . 4 18.0 . 15 5 15.0 12 18 6 12.0 17 . 7 17.0 29 12 8 29.0 . 17 9 24.5 20 29 10 20.0 . .
Example 2: Creating a previous date
The second shows how to create a previous date and a next date for patient appointments. Since the data set is sorted to group patient data together, two DATA steps are needed because BY processing cannot be used in the step to merge the data set with itself. The first DATA step creates the PREVIOUS_DATE with the LAG function. This additional step is needed to first create the lagged values and then to use BY group processing in order to set the PREVIOUS_DATE to missing on the first observation of each PATIENT. This will prevent one patient’s date from being lagged into the next patient.
The second DATA step uses the MERGE statement to simulate the LEAD function. The EXAMPLE2 data set is listed once in the MERGE statement, and then again using the FIRSTOBS=2, KEEP=, and RENAME= data set options. The KEEP= data set option is used only to read the PATIENT and DATE variables for the second read of the data set. The RENAME= data set option renames the PATIENT variable to PATIENT2 and the DATE variable to NEXT_DATE. To prevent the NEXT_DATE from containing another patient’s date, an IF statement is used to check when the PATIENT value is not equal to the PATIENT2 value and sets NEXT_DATE to missing. The PATIENT2 variable is not needed in the output data set and therefore is listed in a DROP statement.
data example2; input patient $ date : date9.; format date date9.; datalines; A 06jan2015 A 12feb2015 A 03mar2015 A 10apr2015 B 05jan2015 B 18feb2015 B 04may2015 C 06jan2015 C 12feb2015 C 15apr2015 C 07jun2015 ; data example2; set example2; by patient; previous_date=lag(date); if first.patient then previous_date=.; format previous_date date9.; run; data example2; merge example2 example2(firstobs=2 keep=patient date rename=(patient=patient2 date=next_date)); if patient ne patient2 then next_date=.; drop patient2; run; proc print data=example2; run;
The output of the PROC PRINT for the EXAMPLE2 data set is:
The SAS System previous_ Obs patient date date next_date 1 A 06JAN2015 . 12FEB2015 2 A 12FEB2015 06JAN2015 03MAR2015 3 A 03MAR2015 12FEB2015 10APR2015 4 A 10APR2015 03MAR2015 . 5 B 05JAN2015 . 18FEB2015 6 B 18FEB2015 05JAN2015 04MAY2015 7 B 04MAY2015 18FEB2015 . 8 C 06JAN2015 . 12FEB2015 9 C 12FEB2015 06JAN2015 15APR2015 10 C 15APR2015 12FEB2015 07JUN2015 11 C 07JUN2015 15APR2015 .
Example 3: Merging a data set
The third example shows how to merge a data set with itself twice to be able to calculate the average amount, compare the amount to the average, and calculate the number of day’s difference from a previous date within each BY group. All BY groups of the data set contain three observations and the calculated average needs to be shown on each observation within each BY group. To be able to look ahead at the next two amount values in order to create the average, the data set is merged with itself twice. Since the data set is sorted by the group and date, two DATA steps are needed because BY processing cannot be used in the step to merge the data set with itself twice.
The first DATA step creates the PREVIOUS_DATE with the LAG function. This additional step is needed to first create the lagged date values and then to use BY group processing in order to set the PREVIOUS_DATE to missing on the first observation of each GROUP. This will prevent one group’s date from being lagged into the next group.
The second DATA step uses the MERGE statement to simulate the LEAD function. The EXAMPLE3 data set is listed once in the MERGE statement, a second time using the FIRSTOBS=2, KEEP=, and RENAME= data set options, and then a third time using the FIRSTOBS=3, KEEP=, and RENAME= data set options. The KEEP= data set option is used only to read the GROUP and AMOUNT variables for the second and third read of the data set. The RENAME= data set option renames the GROUP variable to GROUP2 and GROUP2, and the AMT variable to AMT2 and AMT3.
To prevent any amount values from being carried into the next group, two IF statements are used to reset the AMT2 and AMT3 variables to missing. A FLAG variable is initialized with a 0 (zero) value and both the FLAG and new AVG_AMT variables are listed in a RETAIN statement to retain their values. When the FLAG is equal to 0, the AVG_AMT is calculated using the MEAN function on AMT, AMT2, and AMT3 variables and then the FLAG is set to 1. A series of IF and ELSE IF conditions are used to compare the AMT value to the AVG_AMT and create a new variable AMT_COMPARED_TO_AVG with the result of “More”, “Less”, or the “Same”. The final DAYS_DIFFERENCE variable is created by first being set to 0 (zero) at the top of the group when the PREVIOUS_DATE is missing. The remaining observations calculate the DAYS_DIFFERENCE by subtracting the PREVIOUS_DATE from the DATE. The DROP statement is used to drop the GROUP2, GROUP3, AMT2, AMT3, and FLAG variables since they are not necessary for the output data set.
data example3; input group $ date : yymmdd8. amt; format date yymmdd10.; datalines; a 20150227 10 a 20150305 5 a 20150309 15 b 20150310 10 b 20150313 15 b 20150316 20 c 20150223 15 c 20150304 5 c 20150312 10 d 20150305 15 d 20150310 10 d 20150312 5 ; data example3; set example3; by group date; previous_date=lag(date); if first.group then previous_date=.; format previous_date yymmdd10.; run; data example3; merge example3 example3(firstobs=2 keep=group amt rename=(group=group2 amt=amt2)) example3(firstobs=3 keep=group amt rename=(group=group3 amt=amt3)); if group2 ne group3 then amt3=.; if group ne group2 then do; amt2=.; amt3=.; flag=0; end; retain flag 0 avg_amt; if flag=0 and group = group2 then do; avg_amt=mean(amt,amt2, amt3); flag=1; end; format avg_amt 6.2; if amt > avg_amt then amt_compared_to_avg="More"; else if amt < avg_amt then amt_compared_to_avg="Less"; else amt_compared_to_avg="Same"; if previous_date=. then days_difference=0; else days_difference=date-previous_date; drop group2 group3 amt2 amt3 flag; proc print; run;
The output of the PROC PRINT for the EXAMPLE3 data set is:
The SAS System previous_ amt_compared_ days_ Obs group date amt date avg_amt to_avg difference 1 a 2015-02-27 10 . 10.00 Same 0 2 a 2015-03-05 5 2015-02-27 10.00 Less 6 3 a 2015-03-09 15 2015-03-05 10.00 More 4 4 b 2015-03-10 10 . 15.00 Less 0 5 b 2015-03-13 15 2015-03-10 15.00 Same 3 6 b 2015-03-16 20 2015-03-13 15.00 More 3 7 c 2015-02-23 15 . 10.00 More 0 8 c 2015-03-04 5 2015-02-23 10.00 Less 9 9 c 2015-03-12 10 2015-03-04 10.00 Same 8 10 d 2015-03-05 15 . 10.00 More 0 11 d 2015-03-10 10 2015-03-05 10.00 Same 5 12 d 2015-03-12 5 2015-03-10 10.00 Less 2
Of course there are usually several programming methods to get to the same result, but when you have the need to look back at a previous value as well as look ahead to see the next value, the above examples can give you a good coding start to help with the type of logic needed to lag and lead at the same time.
For other examples of simulating a LEAD function and using the LAG function within BY-Group processing see: