How to achieve repeatable results with distributed DATA Step BY Groups

Prior to SAS Viya

With the creation of SAS Viya, the ability to run DATA Step code in a distributed manner became a reality. Prior to distributed DATA Step, DATA Step programmers never had to think about achieving repeatable results when SAS7BDAT datasets were the sources to their DATA Step code that contains a BY statement. This is because prior to SAS Cloud Analytics Services (CAS), DATA Step ran single-threaded and the source SAS7BDAT dataset was stored on disk. Every time one would run the code we obtained repeatable results because the sequence of rows within the BY group were preserved between runs. To illustrate this, review figures 1, 2, and 3.

Figure 1 is the source SAS7BDAT dataset WORK.TEST1. Notice the sequence of VAR2, especially on row 1 and 4 (i.e., _N_ =1 and 4).

_n_	VAR1	VAR2
1	1	N
2	1	Y
3	1	Y
4	2	Y
5	2	Y
6	2	N

Figure 1. WORK.TEST1 the original SAS7BDAT dataset

In figure 2, we see a BY statement with variable VAR1. This will ensure VAR1 is in ascending order. We are also using FIRST. processing to identify the first occurrence of the BY group. Because this data is stored on disk and because the DATA Step is executed using a single thread, the result table will be repeatable no matter how many times we run the DATA Step code.

Figure 2. Focus on the IF statement, especially VAR2

In figure 3, we see the output SAS7BDAT dataset WORK.TEST2.

_n_	VAR1	VAR2
1	1	N

Figure 3. WORK.TEST2 result dataset from running the code in Figure 2

In figure 4, we are running the same DATA Step but this time our source and target tables are CAS tables. The source table CASLIB.TEST1 was created by lifting the original SAS7BDAT dataset WORK.TEST1 (review figure 1) into CAS.

Figure 4. DATA Step executing in CAS

In figure 5, we see that the DATA Step logic is being respected in runs 1, 2 and 3; but we are not achieving repeatable results. This is due to CAS running on multiple threads. Note that the BY statement – which will group the data correctly for each BY group – is done on the fly. Also, the BY statement will not preserve the sequence of rows within the BY group between runs.

For some processes, this is not a concern but for others it could be. If you need to obtain repeatable results in DATA Step code that runs distributed in CAS as well as match your SAS 9 single-threaded DATA Step results, I suggest the following workaround be used.

Figure 5. DATA Step logic is respected but yields different results with each run

With SAS Viya

The workaround is very simplistic to understand and implement. For each SAS7BDAT dataset being lifted into a CAS table, see figure 6, we need to add a new variable ROW_ID.

_n_	VAR1	VAR2
1	1	N
2	1	Y
3	1	Y
4	2	Y
5	2	Y
6	2	N

Figure 6. Original SAS7BDAT dataset source WORK.TEST1

To accomplish this, we will leverage the automatic variable _N_ that is available to all DATA Step programmers. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated. In our case, the value for each row is the row sequence in the original SAS7BDAT dataset. Figure 7 contains the SAS code we ran on the SAS 9.4M5 workspace server or the SAS Viya compute server to add the new variable ROW_ID.

Figure 7. Creating the new variable ROW_ID

By reviewing figure 8 we can see the new variable ROW_ID in the SAS7BDAT dataset WORK.TEST1. Now that we have the new variable, we are ready to lift this dataset into CAS.

_N_	VAR1	VAR2	ROW_ID
1	1	N	1
2	1	Y	2
3	1	Y	3
4	2	Y	4
5	2	Y	5
6	2	N	6

Figure 8. WORK.TEST1 with the new variable ROW_ID

There are many ways to lift a SAS7BDAT dataset into CAS. One way is to use a DATA Step like we did in figure 9.

Figure 9. DATA Step code to create distributed CAS table CASLIB.TEST1

To obtain the repeatable results, we need to control the sequence of rows within each BY group. We accomplish this by adding the new variable ROW_ID as the last variable to the BY statement in our DATA Step code, see figure 10.

Figure 10. Add ROW_ID as last variable of the BY group

Figure 11 shows us the output CAS table created by the code in figure 10. By adding the new variable ROW_ID and using that variable as the last variable of the BY statement, we are controlling the sequencing of rows within the BY groups for all 3 runs.

VAR1	VAR2	ROW_ID
1	N	1

Figure 11. Distrusted CAS table CASLIB.TEST2

Conclusion

With distributed DATA Step comes great opportunities to improve runtimes. It also means we need to understand differences between single-threaded processing of SAS7BDAT datasets that are stored on disk and distributed processing of CAS tables store in-memory. To help you with that journey I suggest you read the SAS Global Forum paper, Parallel Programming with the DATA Step: Next Steps.

4 Comments

Rick Wicklin on November 19, 2018 8:36 am

I think there is a typo in your IF statements. The syntax "first.var1=" is not valid SAS syntax. I assume you meant to say
IF first.var1 AND var2 = 'N';

- Steven Sober on November 19, 2018 2:32 pm
  
  In reply to Rick Wicklin.
  
  Thank you for your comment, I have updated the post with the corrected syntax.
  
Thomas Billings on November 14, 2018 11:00 am

I did a 2018 SESUG paper on a related topic: Emulating FIRST. and LAST. SAS® DATA Step Processing in SQL? Concepts and Review
http://www.lexjansen.com/sesug/2018/SESUG2018_Paper-192_Final_PDF.pdf

SQL can be multi-threaded, and lacks the order relationships present in single-therad access of sequential data sets.

- Steven Sober on November 14, 2018 12:46 pm
  
  In reply to Thomas: Thank you for your comments and posting your SESUG 2018 paper on this topic. SQL programmers never bring this topic up, only DATA Step programmers notice this when running DATA Step in CAS.

Blogs