Cloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?
While there are certain things you need to do to get your code to run in CAS (which we’ll largely ignore in this article), the major difference between DATA Step in SAS and DATA Step in CAS is threads. CAS spreads its data and its processing over multiple machines. On each of those machines, the data is divided further into blocks, while the processing is divided further into threads. Each thread grabs a subset of the data blocks on its machine and performs the prescribed logic on that subset. The results from all of the threads are then gathered together and output.
So how does this distributed data and processing affect the way you need to write your programs? As usual, it’s best to look at a few examples to understand.
Unique ID with DATA Step
First, let’s create a unique ID with DATA Step.
SAS DATA Step
Creating a unique ID in SAS DATA Step is quite simple. We just use the _n_ automatic variable as shown below:
    DATA tableWithUniqueID;
        SET tableWithOutUniqueID;
        uniqueID = _n_;
    run;
CAS DATA Step
Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_, so if we just use _n_, we’ll get duplicate IDs: each thread produces uniqueID values of 1, 2, and so on. When the thread output is combined, we’ll have a bunch of records with a uniqueID of 1, a bunch with a uniqueID of 2, and so on. This is not useful.
To produce a truly unique ID, you need to augment _n_ with something else. CAS has its own set of automatic variables for such purposes. The _threadID_ automatic variable can help us get our unique ID as shown below:
    DATA tableWithUniqueID;
        SET tableWithOutUniqueID;
        uniqueID = put(_threadid_,8.) || '_' || put(_n_,8.);
    run;
While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.
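One of those other ways is a purely numeric ID. The sketch below is not from the original example. It assumes that _threadid_ takes the values 1 through _nthreads_ (another CAS automatic variable, holding the number of threads running the step) and that _nthreads_ is the same on every thread. Under those assumptions, each thread writes into its own "lane" of integers, so no two rows can collide.

```sas
/* Sketch: numeric unique ID built from CAS automatic variables.        */
/* Assumption: _threadid_ runs from 1 to _nthreads_, and _nthreads_ is  */
/* identical on every thread for the duration of this step.             */
data tableWithUniqueID;
    set tableWithOutUniqueID;
    /* Thread t emits t, t + T, t + 2T, ... where T = _nthreads_ */
    uniqueID = (_n_ - 1) * _nthreads_ + _threadid_;
run;
```

The resulting IDs are unique but not necessarily contiguous, because threads usually process different numbers of rows.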
Aggregation with DATA Step
Now, let’s look at “whole table” aggregation (no BY Groups).
SAS DATA Step
Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.
    DATA aggregatedTable;
        SET detailedTable end=eof;
        retain totSalesAmt 0;
        totSalesAmt = totSalesAmt + SaleAmt;
        keep totSalesAmt;
        if eof then output;
    run;
CAS DATA Step
While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).
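To see how the rows were split across threads, a small diagnostic step can help. This is only a sketch: the output table name threadRowCounts and the columns threadID and rowsRead are made up for illustration. It relies on the same per-thread end-of-file behavior described above.

```sas
/* Sketch: one output row per thread, showing how many rows it read.   */
/* threadRowCounts, threadID and rowsRead are hypothetical names.      */
data threadRowCounts;
    set detailedTable end=eof;
    if eof then do;
        threadID = _threadid_;   /* which thread produced this row     */
        rowsRead = _n_;          /* iterations = rows read by thread   */
        output;
    end;
    keep threadID rowsRead;
run;
```

The number of rows in the output should match the number of threads that actually received data.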
As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:
    DATA aggregatedTable;
        SET detailedTable end=eof;
        retain threadSalesAmt 0;
        threadSalesAmt = threadSalesAmt + SaleAmt;
        keep threadSalesAmt;
        if eof then output;
    run;

    DATA aggregatedTable / single=yes;
        SET aggregatedTable end=eof;
        retain totSalesAmt 0;
        totSalesAmt = totSalesAmt + threadSalesAmt;
        if eof then output;
    run;
In the first DATA Step in the above example, we run basically the same code as in the SAS DATA Step example, except that each thread now accumulates its own partial total (threadSalesAmt). We let CAS do its distributed, multi-threaded processing here because our table is large, and spreading the work over multiple threads makes the aggregation much quicker. We then execute a second DATA Step, but this time we force CAS to use only one thread with the single=yes option. Because only one thread runs, we get only one output row. Using a single thread in this case is fine because the input is tiny: one record per thread from the previous step.
BY-GROUP Aggregation
Using a BY statement in your DATA Step fundamentally changes how CAS processes data. When a BY statement is used, CAS automatically re-distributes the data among the CAS machines so that each individual BY-Group is wholly contained on a single worker. The data is still distributed across multiple machines but it’s done using the BY-Groups. Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.
Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).
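As a minimal sketch of what that looks like, the step below totals SaleAmt within each BY-Group. The grouping column region and the names aggregatedByRegion and regionSalesAmt are hypothetical and not from the original example. In single-engine SAS the input would need to be sorted by region first; in CAS, per the description above, the server groups the data itself.

```sas
/* Sketch: BY-Group aggregation. region, aggregatedByRegion and       */
/* regionSalesAmt are hypothetical names used for illustration.        */
data aggregatedByRegion;
    set detailedTable;
    by region;
    retain regionSalesAmt;
    if first.region then regionSalesAmt = 0;     /* reset per group    */
    regionSalesAmt = regionSalesAmt + SaleAmt;   /* accumulate         */
    if last.region then output;                  /* one row per group  */
    keep region regionSalesAmt;
run;
```

Because each BY-Group is handled by exactly one thread, no single=yes consolidation step is needed here.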
Concluding Thoughts
Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the DATA Step Basics section of the external documentation, as well as this SAS Global Forum paper.