Important caveats to consider when you run a DATA step on CAS tables

Even with all the new and powerful tools that are available with the release of SAS® Viya® and SAS® Cloud Analytic Services (CAS), the DATA step remains the same programming powerhouse that it has always been. However, there are some considerations that you need to be aware of when you run a DATA step on CAS tables. This post provides a brief overview of these considerations, which include the following topics:

  • Running the DATA step in your SAS® session versus running it in CAS
  • Using elements within the DATA step that are not available when you run the step in CAS
  • Using the CASDATALIMIT= system option and the DATALIMIT= data set option to prevent a size-limit error

Running the DATA step on the SAS® client versus running it in CAS

One of the most common questions asked of SAS Technical Support is this: "Is it always faster to run my DATA step in CAS?" The answer is no. CAS is designed to be used with big data. In general, the larger the table, the better the performance when you use CAS. CAS obtains this increased performance by running the code in multiple threads on multiple workers. When you execute a DATA step in CAS, the code and data are distributed across the different workers and threads. This strategy enables the code to run in parallel and to achieve better performance.

However, there is a cost that is associated with running code in parallel. A certain amount of overhead is required to distribute the code and data to multiple threads on multiple workers. The overhead that is required can result in your DATA step running slower in CAS than it does in SAS. The following basic example illustrates this concept.

The two code examples below contain a DATA step that creates a 5000-row sample data set and CAS table, respectively. The data is read into the DATA step for processing. In this case, the processing involves a simple multiplication of values.
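
A comparison of this kind can be sketched as follows. This is a minimal illustration, not the exact code behind the timings discussed here. The table names (WORK.SAMPLE, CASUSER.SAMPLE, and the RESULT tables), the variable names, and the multiplier are assumptions chosen for the sketch. The session name CASAUTO and the CASUSER libref match the names used in the log excerpts later in this post.

```sas
/* Assumption: a CAS server is reachable. Skip these two statements */
/* if a CAS session and a CASUSER libref already exist.             */
cas casauto;
libname casuser cas caslib=casuser;

/* Create a 5000-row SAS data set */
data work.sample;
   do id = 1 to 5000;
      x = id;
      output;
   end;
run;

/* Load the same rows into a CAS table */
data casuser.sample;
   set work.sample;
run;

/* Simple multiplication on the SAS data set */
data work.result;
   set work.sample;
   y = x * 2;
run;

/* The same processing on the CAS table */
data casuser.result;
   set casuser.sample;
   y = x * 2;
run;
```

Compare the real time that the log reports for the last two steps. To repeat the test on a larger table, raise the upper bound of the DO loop.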

I ran these steps successively. In that run, the DATA step ran faster in SAS than it did in CAS.

Now, let’s try an example that uses a much larger data source. This example is exactly the same as the previous example. However, it uses a table that contains 10 million rows.


A table that contains 10 million rows and 5 variables is not really considered big data, but a table of that size does show that CAS is faster when the data is larger. Just imagine what CAS can do with input tables that contain 100 million rows and 50 variables!

Using DATA step language elements that are not available in CAS

To take advantage of the faster processing times in CAS, you must run the DATA step directly in CAS. Certain criteria must be met in order for the DATA step to run on the CAS server. First, the input and output tables must be CAS tables. If one of your tables is a SAS data set, then the step runs in SAS. Second, all language elements that you use in the DATA step must be available in CAS. Most DATA step language elements are available in CAS, but a handful are not. For details, see Language Element Support in SAS® Cloud Analytic Services 3.5: DATA Step Programming.
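
The first criterion can be illustrated with a short sketch. It assumes an active CAS session and a CASUSER libref. The output table names and the derived variable are hypothetical; SASHELP.CLASS is the sample data set that ships with SAS.

```sas
/* Assumes an active CAS session and a CASUSER libref. */

/* The input is a SAS data set, so this step runs in SAS and the */
/* result is loaded into CAS.                                    */
data casuser.class;
   set sashelp.class;
run;

/* The input and output are both CAS tables, and only an         */
/* assignment with arithmetic is used, so this step is eligible  */
/* to run in CAS.                                                */
data casuser.class_metric;
   set casuser.class;
   height_cm = height * 2.54;   /* inches to centimeters */
run;
```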

What happens if my DATA step uses a CAS table for input and creates a CAS table, but my code contains an unsupported language element? In that case, the code and data are sent back to the SAS client to be processed. Once the processing is done, the data is moved back to the CAS server. All of this action takes place in the background. Moving the data back and forth hurts performance, and you also lose the ability to run your DATA step in parallel. In certain situations, you cannot avoid this issue. In these cases, there are a couple of methods that can help you determine whether your DATA step is running in CAS or in SAS, as explained in this section.

There is an unmistakable sign that your DATA step is running in CAS. The following note is always generated by a DATA step that is running in CAS.

NOTE: Running DATA step in Cloud Analytic Services.

If you do not see this note in the notes that are generated by your DATA step, then that step did not run in CAS. You can also use the MSGLEVEL= system option to help determine whether your DATA step ran in CAS. Setting this option provides more detail in the log about where the DATA step is running.

Here is an example. The RANUNI function is not supported in CAS, and it causes the step to run on the SAS client.

90   data casuser.b;
91      set casuser.a;
92      random=ranuni(1000);
93   run;
NOTE: There were 100 observations read from the data set CASUSER.A.
NOTE: The data set CASUSER.B has 100 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.32 seconds
      cpu time            0.09 seconds

The note indicating that this step ran in CAS is absent from the output above, but the log does not say specifically where the step ran.

If you add the MSGLEVEL= system option in your code, as shown below, a note is added in the output that explains where the DATA step ran.

89   options msglevel=I;
90   data casuser.b;
91      set casuser.a;
92      random=ranuni(1000);
93   run;
NOTE: Could not execute DATA step code in Cloud Analytic Services. Running DATA step in the SAS client.
NOTE: There were 100 observations read from the data set CASUSER.A.
NOTE: The data set CASUSER.B has 100 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.29 seconds
      cpu time            0.09 seconds

Unfortunately, the MSGLEVEL= system option does not tell you the nature of the problem. To find that out, you can use the SESSREF= option in the DATA statement. This option associates the DATA step with a CAS session. In addition, if the step cannot run in CAS, the option generates an error that names the language element that causes the issue. Here is an example.

90   data casuser.b / sessref=casauto;
91      set casuser.a;
92      random=ranuni(1000);
93   run;
NOTE: Running DATA step in Cloud Analytic Services.
ERROR: The function RANUNI is unknown, or cannot be accessed.
ERROR: The action stopped due to errors.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds
      cpu time            0.01 seconds

In this output, you can see why the step failed to execute: the RANUNI function is not available in CAS.

In case you are wondering, the RAND function is an improved version of RANUNI, and RAND is available in CAS.
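
As a minimal sketch, the failing step above could be rewritten with RAND. It assumes the same CASUSER.A input table and CASAUTO session as the earlier examples, and the choice of the uniform distribution is an assumption made to mirror RANUNI.

```sas
/* Assumes an active CAS session named CASAUTO and the CAS table */
/* CASUSER.A from the earlier examples.                          */
data casuser.b / sessref=casauto;
   set casuser.a;
   random = rand('uniform');   /* uniform value between 0 and 1 */
run;
```

The values will not match those that RANUNI(1000) produces, because RAND uses a different random-number stream.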

Using the CASDATALIMIT= system option and the DATALIMIT= data set option to override the default table size that can be moved to SAS

Both of these options perform the same task. They specify the number of bytes of data from a single CAS table that can be transferred from the CAS server to SAS. These options are important when these conditions are true:

  • You run a DATA step that uses a CAS table as input and that creates a CAS table as output.
  • The DATA step contains a language element that prevents the step from running in CAS.

In some programming tasks, the code must use a language element that is not available in CAS. In this case, the code and data are sent back to the SAS client. It is important to know that there is a limit on the size of the table that can be moved to the SAS client. The default limit is 100 MB. When a table is larger than the limit, the following error is generated by the DATA step:

ERROR: The maximum allowed bytes (nnnn) of data have been fetched from Cloud Analytic Services. Use the DATALIMIT option to increase the maximum value.

You can prevent this error by setting the CASDATALIMIT= system option in an OPTIONS statement or the DATALIMIT= option in a DATA statement. That is, the values for these options override the default limit of 100 MB. Here are examples of each option in their respective statements:

options casdatalimit=20g;
data casuser.test(datalimit=20g);
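
Put together with the earlier RANUNI example, the system option might be used as in the following sketch. It assumes the CASUSER.A and CASUSER.B tables from the earlier examples, and the 20g value is only illustrative.

```sas
/* Raise the transfer limit for the rest of the SAS session */
options casdatalimit=20g;

/* RANUNI is not available in CAS, so the rows of CASUSER.A are  */
/* fetched to the SAS client for processing.                     */
data casuser.b;
   set casuser.a;
   random = ranuni(1000);
run;

/* Restore the default limit of 100 MB */
options casdatalimit=100m;
```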

Summary

The DATA step continues to be a programming powerhouse in CAS. However, you must consider certain caveats when you execute the DATA step in CAS. The size of your data and the language elements that you use in your code are two of the most important considerations when you write a DATA step that uses a CAS table to create a new CAS table. Our online documentation provides the most complete discussion about these topics. For details, see SAS® Cloud Analytic Services 3.5: DATA Step Programming.

About Author

Kevin Russell

SAS Technical Support Engineer, CAS and Open Source Languages

Kevin Russell is a Technical Support Engineer in the CAS and Open Source Languages group in Technical Support. He has been a SAS user since 1994. His main area of expertise is the macro language, but he also provides general support for the DATA step and Base procedures. He has written multiple papers and presented them at various SAS conferences and user events.
