Hopping for the best - calculations across SAS dataset observations

48

The purpose of this blog post is to demonstrate a SAS coding technique that allows for calculations with multiple variables across a SAS dataset, whether or not their values belong to the same or different observations. We are going to use values from multiple observations in a calculation.

calculations across observations of a SAS data table

What do we want?

As illustrated in the picture on the right, we want to be able to hop, or jump, back and forth, up and down across observations of a data table in order to implement calculations not just on different variables, but with their values from different observations of a data table.

In essence, we want to access SAS dataset variable values similar to accessing elements of a matrix (aij), where rows represent dataset observations, and columns represent dataset variables.

Combine and Conquer

In the spirit of my earlier post Combine and Conquer with SAS, this technique combines the functionality of the LAG function, which allows us to retrieve the variable value of a previous observation from a queue, with an imaginary, non-existent in SAS, LEAD function that reads a subsequent observation in a data set while processing the current observation during the same iteration of the DATA step.

LAG function

LAG<n> function in SAS is not your usual, ordinary function. While it provides a mechanism of retrieving previous observations in a data table, it does not work “on demand” to arbitrarily read a variable value from a previous observation n steps back. If you want to use it conditionally in some observations of a data step, you still need to call it in every iteration of that data step. That is because it retrieves values from a queue that is built sequentially for each invocation of the LAG<n> function. In essence, in order to use the LAG function even just once in a data step, you need to call it every time in each data step iteration until that single use.

Moreover, if you need to use each of the LAG1, LAG2, . . . LAGn functions just once, in order to build these queues, you have to call each of them in every data step iteration even if you are going to use them in some subsequent iterations.

LEAD function

The LEAD function is implemented in Oracle SQL and it returns data from the next or subsequent row of a data table. It allows you to query more than one row in a table at a time without having to join the table to itself.

There is no such function in SAS. However, the POINT= option of the SET statement in a SAS data step allows retrieving any observation by its number from a data set using random (direct) access to read a SAS data set. This will allow us to simulate a LEAD function in SAS.

HOP function

But why do we need two separate functions like LAG and LEAD in order to retrieve non-current observations. In essence, these two functions do the same thing, just in opposite directions. Why can’t we get by with just one function that does both backwards and forwards “hopping?”

Let’s combine and conquer.

Ideally, we would like to construct a new single function - let’s call it HOP(x, j) - that combines the best qualities of both LAG and LEAD functions. The two arguments of the HOP function would be as follows:

x – SAS variable name (numeric or character) the value of which we are retrieving;

j – hop distance (numeric) – an offset from the current observation; negative values being lagging (hopping back), positive values being leading (hopping forward), and a zero-value meaning staying within the current observation.

The sign of the second argument defines whether we lag (minus) or lead (plus). The absolute value of this second argument defines how far from the current observation we hop.

Alternatively, we could have the first argument as a column number, and the second argument as a row/observation number, if we wanted this function to deal with the data table more like with a matrix. But relatively speaking, the method doesn’t really matter as long as we can unambiguously identify a data element or a cell. To stay within the data step paradigm, we will stick with the variable name and offset from the current observation (_n_) as arguments.

Let’s say we have a data table SAMPLE, where for each event FAIL_FLAG=1 we want to calculate DELTA as the difference between DATE_OUT, one observation after the event, and DATE_IN, two observations before the event:

Calculations across observations of a SAS data table

That is, we want to calculate DELTA in the observation where FAIL_FLAG = 1 as

26MAR2017 18JAN2017 = 67 (as shown in light-blue highlighting in the above figure).

With the HOP() function, that calculation in the data step would look like this:

data SAMPLE;
   set SAMPLE;
   if FAIL_FLAG then DELTA = hop(DATE_OUT,1) - hop(DATE_IN,-2);
run;

It would be reasonable to suggest that the hop() function should return a missing value when the second argument produces an observation number outside of the dataset boundary, that is when

_n_ + j < 0 or _n_ + j > num, where _n_ is the current observation number of the data step iteration; num is the number of observations in the dataset; j is the offset argument value.

Do you see anything wrong with this solution? I don’t. Except that the HOP function exists only in my imagination. Hopefully, it will be implemented soon if enough SAS users ask for it. But until then, we can use its surrogate in the form of a %HOP macro.

%HOP macro

The HOP macro grabs the value of a specified variable in an offset observation relative to the current observation and assigns it to another variable. It is used within a SAS data step, but it cannot be used in an expression; each invocation of the HOP macro can only grab one value of a variable across observations and assign it to another variable within the current observation.

If you need to build an expression to do calculations with several variables from various observations, you would need to first retrieve all those values by invoking the %hop macro as many times as the number of the values involved in the expression.

Here is the syntax of the %HOP macro, which has four required parameters:

%HOP(d,x,y,j)

d – input data table name;

x – source variable name;

y – target variable name;

j – integer offset relative to the current observation. As before, a negative value means a previous observation, a positive value means a subsequent observation, and zero means the current observation. The parameter j value can be either integer constant or a numeric variable name; in the latter case that variable's value at the current observation will be used as the offset.

Using this %HOP macro we can rewrite our code for calculating DELTA as follows:

 data SAMPLE (drop=TEMP1 TEMP2);
   set SAMPLE;
   if FAIL_FLAG then
   do;
      %hop(SAMPLE,DATE_OUT,TEMP1, 1)
      %hop(SAMPLE,DATE_IN, TEMP2,-2)
      DELTA = TEMP1 - TEMP2;
   end;
run;

Note that we should not have temporary variables TEMP1 and TEMP2 listed in a RETAIN statement, as this could mess up our calculations if the j-offset throws an observation number out of the dataset boundary.

Also, the input data table name (d parameter value) is the one that is specified in the SET statement, which may or may not be the same as the name specified in the DATA statement.

In case you are wondering where you can download the %HOP macro from, here it is in its entirety:

%macro hop(d,x,y,j);
   _p_ = _n_ + &j;
   if (1 le _p_ le _o_) then set &d(keep=&x rename=(&x=&y)) point=_p_ nobs=_o_;
%mend hop;

Of course, it is “free of charge” and “as is” for your unlimited use.

Your turn

Please provide your feedback and share possible use cases for the HOP function/macro in the Comment section below. This is your chance for your voice to be heard!

Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than twenty years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

48 Comments

  1. I'm not sure if I would ever use this HOP() function, simply because it needs too many restrictive conditions. For questions like this I use a "range" merge in PROC SQL, where a match can be made where, say, value (date?) in data set B is between date and date+14 in data set A. I can also calculate the difference in these dates to find the closest record in B to date in A. The PROC SQL code would not be restricted to sequential, or even neighbouring, records.

    Or maybe I've not understood how HOP() could be useful?

    • Leonid Batkhan

      Thank you, Philip, for your feedback. What restrictive conditions do see are needed? I only see that you must be within the data table boundaries to be able to grab any value of any variable for calculations. The HOP() function described in this post relates to calculations within a single data table.

      • I would still use PROC SQL to match records in the same data set. I can now see its benefit for SAS programmers without SQL experience, but would request a "health warning" be attached to HOP(), as I can envision some pretty horrendous speed issues if misused. LAG() works well within the Data Step model, because it is used within sequential record input. HOP() will require asynchronous file reads, which could bring processing of a large data set to its knees.

  2. Niiice! I was expecting something FAR more complex at the end. Was thinking 'fcmp + open()' whilst reading it but the macro approach is definitely easier to absorb.

  3. FYI in your %HOP() macro I'm concerned about POINT=, which can be very dependent on record buffering for fast processing. There is an example in my 1st SAS book "Saving Time and Money Using SAS" (SAS Press) where I reduced a mainframe Data Step's elapsed time from 7 hours to 3 minutes just by sorting the data set with POINT=, and that 3 minutes included the PROC SORT step time!

    • Leonid Batkhan

      Philip, your concern is well-taken. However, you are switching to a different topic. In this post the accent is on effective programming, while your concern is efficient processing. In many cases, take clinical trials for example, SAS programmers deal with relatively small data tables, while very complex calculations and very tight deadlines. In such cases they are not at all concerned with the processing speed, but are very much concerned about the speed of program development.

    • Hi Philip,

      I do understand your performance concerns. But they also say that "premature optimization is the root of all evil". For example, I work mostly in pharma statistical programming and the analysis programs here typically run in production mode only a few times during the project, most datasets are not huge so programming time > run time. And if it is not, it is typically statistics what consumes most cycles. So personally I found Leonid's little macro to be clever, thought-provoking and potentially useful. But if I encounter hard performance limits, I can always pick up your book you so courteously referred to.

  4. Hi Leonid,

    Thank you for this little gem. Actually, I clipped this article to my snippet library. You taught us again that SET statement is much more versatile than its typical usage - once per a DATA step. My own solution used a couple SQL clauses with MONOTONIC(). Your four lines of code beat it hands down. Right on!

  5. The index key lookup (double set statement) also allows random access to a SAS dataset. So does a hash object, but that doesn't work for memory constrained conditions. But it requires an index to be built.

    • Leonid Batkhan

      Hi Scott, thank you for your feedback. I am not sure I fully understand how "index key lookup" relates to "double set statement", and why "double". A single set statement with point= option does allow for direct (random) access. I would love to see your hash object implementation of the HOP functionality.

  6. Here's a version using point that 1) doesn't require an index, and 2) is closer to the actual problem statement.

    data test;
       do SK=1 to 20;
          fail_flag=SK in (2,9,13,15,18,20);
          date_in=SK;
          date_out=SK+int(ranuni(123)*10)+3;
          output;
       end;
       format date: date9.;
       drop sk;
    run;
    
    data test3;
       set test nobs=nobs;
       if fail_flag then do;
          i=_n_-2;
          j=_n_+1;
          if i gt 0 then set test (keep=date_in rename=(date_in=_date_in)) point=i;
          if j lt nobs then set test (keep=date_out rename=(date_out=_date_out)) point=j;
          delta=_date_out-_date_in;
       end;
       output;
       call missing(of _date:);
    run;
    

    • Leonid Batkhan

      Hi Scott, thanks for sharing your code sample. However, I don't see how your code sample is an alternative as it is practically identical to the solution described in this blog post using %hop macro with the exception of you setting temporary variables to missing values while I just drop them.

  7. Nice out of the box thinking. Yes, would vote for it. It would surely add some things as some spreadsheet (Static) or streaming thinking.
    Seeing the LAG function as a limited queue (may be renaming is sensible ) it has value in combination with a HOP().

    The power of SAS is not in being a follower, doing the same things as others do.
    The power of SAS is in doing things others haven't thought of yet.
    Tell Jim if you like.

    • To explain the possible usage.

      When processing a record knowing information needed somewhere several records before or expecting it somewhere several records later, then hopping to those (smart limited) would often make things much easier and with a far better performance behavior.
      It would be even possible processing a large table in a big queued approach while sending out new messages.
      For this kind of solutions I normally spend a lot of time in designing the needed behavior and test it until it works as desired.

      The alternative of grouped SQL processing is a serial slow approach only having nice with the best tools/hardware.
      Still having those needed delays in wall-clock time for the split in those several steps being planned.

  8. Bartosz Jenner on

    Smart. Very, very smart.
    It opened my eyes for the possibility of writing my own macro functions that can be used within datastep. Thank you.

  9. Brilliant! The value of j is even not necessarily a constant integer. It could be any integer variable. So how far to hop is really based on the value of the variable at current observation.

  10. Tathabbai Pakalapati on

    Hi Leonid,

    This is very helpful! I have one naive question. In the %hop, the _o_ which will have number of observations in a dataset, is assigned a value at the end of SET statement but is referenced in the if condition before the SET statement. Can you please help me to understand how this works i.e. how the if condition gets executed with _o_ before it is even defined?

    Thanks,
    Tata

    • Leonid Batkhan

      Thank you, Tata, for your comment. You asked one very good question! The key here is that the SET statement option nobs= assigns a value to the _o_ variable during data step compilation, that is before execution. Therefore, it is available in the IF statement's condition during the data step execution. Try running this perfectly legitimate code:

      data _null_;
         call symputx('num',n);
         stop;
         set sashelp.air nobs=n;
      run;
      %put &=num;
      

      Even though SET statement never gets executed (it is placed after STOP statement that terminates data step), macro variable num is assigned its value in the CALL SYMPUTX based on the value of n.

  11. Anders Sköllermo on

    Interesting! I am (was) very keen on testing facilities, but Very restrictive in using them in HEAVY production.

    It turned out the other day, that I have never used DO.. WHILE and DO.. UNTIL in HEAVY production.
    Of course I have tested them, used it as a SAS teacher. But never in heavy production - for good reasons.

    • Leonid Batkhan

      Anders, thank you for your comment. Could you please clarify what you mean by "heavy production" vs. light production or just production. From my experience, there are so many different production environments with different optimization criteria - from optimizing program development cycle (when program runs just once) to optimizing processing time (when program operates on big data in interactive applications). Some optimization is done programmatically, some may require hardware solutions...

    • Leonid Batkhan

      Thank you, Rick, for your comment and complementary references. While I did not mean SAS/IML software suggesting "access SAS dataset variable values similar to accessing elements of a matrix (Aij)", you bring up an excellent angle by injecting IML into this discussion.

  12. Very clever, and very well written Leonid. There are obviously alternatives, including 1) extracting and creating variables needed, then merging back to the main data set; 2) using an actual matrix, either with SAS arrays or with SAS/IML. However, there are always at least two ways to do anything in SAS, and I think this adds a nice alternative that provides ease of use, easy to understand, and programmer efficiency.

    So, I vote for including it! Nice work. -Dave

  13. I think this can be a useful function. It solves a problem I once had when I needed to use the lag function in a somehow dynamic way. Then I needed to loop through the lag functions but could not do it because I could not dynamically change the name of the function (from lag to lag2 and so on) based on my data (without using a macro). With the Hop function, this problem would have been solved.
    Of course the Hop function should be used with caution, but that goes for other functions as well.

  14. I would like to add a vote in favor of the HOP function. I have never used the LAG function. If I have a simple need for accessing a value in another observation, a simple RETAIN works well. On the one occasion when I needed extreme flexibility to access values in a matrix-like way, then POINT= got the job done. (SAS/IML probably would have been perfect for the job, but I didn't know it.) I don't seen myself ever using the LAG function, but the HOP function would provide more functionality (no pun intended) and I could see myself using a HOP function if it existed. In the meantime, thanks for a clever macro.

  15. Having a separate LAG and LEAD function does not make sense. So I vote for the HOP function. However I want to warn for a (little) shortcoming in the solution. You are using the _N_ automatic variable. In nine out of ten cases it will indeed correspond with the observation number. But it needs not to be the same. It is only the same IF you start to read at observation 1 and do not use a FIRSTOBS option.

    • Leonid Batkhan

      Erik, thank you for your input and the warning. Indeed, FIRSTOBS=X option can be a distortion as _N_ will be equal to 1 for the actual X-th observation. I guess, for the %HOP macro that relies on the _N_ that is the limitation to not use FIRSTOBS= option.

  16. My previous comment is not entirely complete. It should also include ...and IF each iteration is in fact reading an observation from the data set.
    Check as an example following code and compare save_n_ with n:

    data test;
       do n=1 to 1000;
          output;
       end;
    run;
    
    options firstobs=10;
    data ruin_n_;
       save_n_ = _n_;
       set test(where=(mod(n,5) ne 0));
    run;
    

      • May be we should ask for new automatic variable: _CURROBS_ that contains the actual observation number that is processed. May be such variable exists, but I am not aware of it.

        • Leonid Batkhan

          Erik, thank you for your input. I doubt that such automatic variable exists. However, if it did, it could be very confusing especially when used with such "trouble makers" as FIRSTOBS= option and/or WHERE statement. Without them, the data step iteration counter _N_ will work just fine as the "actual observation number that is processed".

Leave A Reply

Back to Top