Challenge accepted: Learning data prep in CASL

As a long-time SAS 9 programmer, I typically accomplish my data preparation tasks through some combination of the DATA Step, Proc SQL, Proc Transpose and some housekeeping procs like Proc Contents and Proc Datasets. With the introduction of SAS Viya, SAS released a new scripting language called CASL – a language that interacts with SAS Cloud Analytics Services (CAS).

CASL statements include actions, logically organized into action sets based on common functionality. For example, the Table action set allows you to load a table in CAS, view table metadata, change table metadata such as drop or rename a column, fetch (print) sample rows, save or drop a table from CAS, among other things. Steven Sober provides a great overview of CASL in his 2019 SAS Global Forum paper.

Learning CASL is a good idea assuming you want to leverage the power of CAS, because CASL is the language of CAS. While you can continue to use Viya-enabled procs for many of your data processing needs, certain new functionality is only available through CASL. CAS actions also provide a more granular access to options which otherwise may not be available as procedure options. But old habits die hard, and for a while I found myself bouncing between SAS 9.4 and CASL. I'd pull the data down from CAS just to get it to process in the SAS Programming Runtime Environment because it took less effort than figuring out how to get it done properly in CAS.

Then I started a project with a seriously large data set and quickly hit the limit on how much data I could pull down to process in SAS Programming Runtime Environment. And although I could adjust the DATALIMIT option to retrieve more data than the default limit, I was wasting time and server resources unnecessarily moving the data between CAS and SAS Programming Runtime Environment. All this, just so I could process the data “old school.”

I decided to challenge myself to do ALL my data preparation in CASL. I love a good challenge! I started collecting various useful CASL code snippets. In this post, I am sharing the tidbits I’ve accumulated, along with some commentary. Note, you can execute CAS actions from multiple clients, including SAS, Python, R, Lua and Java. Since my objective was to transition from traditional SAS code to CASL, I’ll focus solely on CAS actions from the SAS client perspective. While I used SAS Viya 3.5 for this work, most of the code snippets should work on prior versions as well.

The sections below cover: how to submit CASL code; loading, saving, dropping and deleting data; exploring data; table metadata management; and data transformation. Feel free to jump ahead to any section of interest.

How do you submit CASL code?

You use PROC CAS to submit CASL code from a SAS client. For example:

proc cas;
   ;
   ;
   …;
quit;

Similarly to other interactive procs that use run-group processing, separate CAS actions by run; statements. For example:

proc cas;
   ;
   run;
   ;
   run;
quit;

In fact, you can have the entire data preparation and analysis pipeline wrapped inside a single PROC CAS, passing data and results in the form of CASL variables from one action to the next. It can really be quite elegant.

Moving Data Using PROC CAS

Loading SASHDAT data in CAS

Your data must be in the SASHDAT format for CAS to process it. To load a SASHDAT table into CAS, use the table.loadtable CAS action. The code below assumes your SASHDAT table is saved to a directory on disk associated with your current active caslib, and you are loading it into the same caslib. (This usually occurs when you already performed the conversion to SASHDAT format, but the data has been unloaded. If you are just starting out and are wondering how to get your data into the SASHDAT format in the first place, the next session covers it, so keep reading.)

proc cas; 
     table.loadtable / path="TABLE_NAME.sashdat" casOut="TABLE_NAME"; 
     table.promote /name="TABLE_NAME" drop=true; 
quit;

The table.promote action elevates your newly loaded CAS table to global scope, making it available to other CAS sessions, including any additional sessions you start, or to other users assuming they have the right privileges. I can’t tell you how many times I forgot to promote my data, only to find that my hard-earned output table disappeared because I took a longer coffee break than expected! Don’t forget to promote or save off your data (or both, to be safe).

If you are loading from a directory other than the one associated with your active caslib, modify the path= statement to include the relative path to the source directory – relative to your active caslib. If you are looking to load to a different caslib, modify the casOut= statement by placing the output table name and library in curly brackets. For example:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME" caslib="CASLIB2"};
    table.promote /name="TABLE_NAME" drop=true;
quit;

You can also place a promote=true option inside the casOut= curly brackets instead of calling the table.promote action, like so:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" 
                      casOut={name="TABLE_NAME" caslib="CASLIB2" promote=true};
quit;

Curly brackets are ubiquitous in CASL (and quite unusual for SAS 9.4). If you take away one thing from this post, make it “watch your curly brackets.”

Loading SAS7BDAT, delimited data, and other file formats in CAS

If you have a SAS7BDAT file already on disk, load it in CAS with this code:

proc cas;
    table.loadtable /path="TABLE_NAME.sas7bdat" casout="TABLE_NAME" 
                     importoptions={filetype="basesas"};
quit;

Other file formats load similarly – just use the corresponding filetype= option to indicate the type of data you are loading, such as CSV, Excel, Document (.docx, .pdf, etc.), Image, Video, etc. The impressive list of supported file types is available here.

proc cas;
    table.loadtable / path="TABLE_NAME.csv" casout="TABLE_NAME" 
                      importoptions={filetype="csv"};
    run;
quit;

You can include additional parameters inside the importOptions= curly brackets, which differ by the file type. If you don’t need any additional parameters, use the filetype=”auto” and let CAS determine the best way to load the file.

When loading a table in SAS7BDAT, delimited or some other format, the table.loadtable action automatically converts your data to SASHDAT format.

Loading data in CAS conditionally

Imagine you are building a script to load data conditionally – only if it’s not already loaded. This is handy if you have a reason to believe the data might already be in CAS. To check if the data exists in CAS and load conditionally, you can leverage the table.tableExists action in combination with if-then-else logic. For example:

proc cas;
    table.tableExists result =r / name="TABLE_NAME";
    if r=0  then do;
        table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME"};
        table.promote /name="YOUR_TABLE_NAME" drop=true;
    end;
    else print("Table already loaded");
quit;

Notice that the result=r syntax captures the result code from the tableExists action, which is evaluated before the loadtable and promote actions are executed. If the table is already loaded in CAS, “Table already loaded” is printed to the log. Otherwise, the loadtable and promote actions are executed.

The ability to output CAS action results to a CASL variable (such as result=r in this example) is an extremely powerful feature of CASL. I include another example of this further down, but you can learn more about this functionality from documentation or this handy blog post.

Saving your CAS data

Let’s pretend you’ve loaded your data, transformed it, and promoted it to global scope. You or your colleagues can access it from other CAS sessions. You finished your data preparation, right? Wrong. As the header of this section suggests, you also need to save your prepared CAS data. Why? Because up to this point, your processed and promoted data exists only in memory. You will lose your work if your SAS administrator reboots the server or restarts the CAS controller. If you need to quickly reload prepared data, you must back it up to a caslib’s data source. See the CAS data lifecycle for more details.

To save off CAS data, naturally, you use the table.save action. For example:

proc cas;
    table.save / table="TABLE_NAME" name="TABLE_NAME.sashdat" replace=true;
quit;

In this example, you save off the CAS table to disk as a SASHDAT file, defaulting to the location associated with your active caslib. You can modify the table.save parameters to save or export the data to an alternative data storage solution with full control over the file format (including but not limited to such popular options as HDFS, Oracle, SQL Server, Salesforce, Snowflake and Teradata), compression, partitioning and other options.

Dropping and deleting data

To drop a table from CAS, execute a table.droptable action. For example:

proc cas;
    table.droptable / name="TABLE_NAME" quiet=true;
quit;

The quiet=true option prevents CAS from generating an error if the table does not exist in CAS. Dropping a table deletes it from memory. It’s a good practice to drop tables you no longer need, particularly the one you have promoted. Local-scope tables disappear on their own when the session expires, whereas global tables will stay in memory until they are unloaded.

Dropping a table does not delete the underlying source data. To delete the source of a CAS table, use the table.deleteSource action. For example:

proc cas;
    table.deletesource / source="TABLE_NAME.sashdat" quiet=true;
quit;

Exploring Data Using PROC CAS

After taking a close look at moving the data using PROC CAS, let’s look at some useful ways to start exploring and manipulating CAS data.

Fetching sample data

When preparing data, I find it useful to look at sample data. The table.fetch action is conceptually similar to PROC PRINT and, by default, outputs the first 20 rows of a CAS table:

proc cas;
table.fetch / table="Table_Name";
quit;

You can modify the table.fetch options to control which observations and variables to display and how to display them. For example:

proc cas;
table.fetch / table={name="TABLE_NAME" where="VAR1 in ('value1','value2')"},              /*1*/
	      orderby={{name="VAR1"},                                                     /*2*/
                             {name="VAR2", order="descending"}
                             },	
	     fetchvars={{name="VAR1", label="Variable 1"},                                /*3*/
	                     {name="VAR2", label="Variable 2"}, 
 		             {name="VAR3", label="Variable 3", format=comma12.1}
                             },
	     to=50,                                                                       /*4*/
	     index=false;							          /*5*/
quit;

In the code snippet above:

#1 – where= statement limits the records to those meeting the where criteria.
#2 – orderby= option defines the sort order. Ascending is the default and is not required. If sorting by more than one variable, put them in a list inside curly brackets, as shown in this example. If a list item has a subparameter (such as order= here), encase each item in curly brackets.
#3 – fetchvars= option defines the variables to print as well as their display labels and formats. If you select more than one variable, put them in a list inside curly brackets, as shown here. And again, if a list item includes a subparmeter, then enclose each list item in curly brackets.
#4 – to= option defines the number of rows to print.
#5 – index= false option deactivates the index column in the output (the default is index=true). This is similar to the noobs option in PROC PRINT.

As mentioned earlier, make sure to watch your curly brackets!

Descriptive statistics and variable distributions

The next step in data exploration is looking at descriptive statistics and variable distributions. I would need a separate blog post to cover this in detail, so I only touch upon a few of the many useful CAS actions.

To look at statistics for numeric variables, use the simple.summary action, which computes standard descriptive statistics, such as minimum, maximum, mean, standard deviation, number missing, and so on. For example:

proc cas;
    simple.summary / table="TABLE_NAME";
quit;

Among its other features, the simple.summary action allows analysis by one or more group-by variables, as well as define the list of desired descriptive statistics. For example:

proc cas;
simple.summary / table={name="TABLE_NAME", groupBy="VAR1", vars={"NUMVAR1","NUMVAR2”}},
                 subSet={"MAX", "MIN", "MEAN", "NMISS"};
quit;

Another useful action is simple.topK, which selects the top K and bottom K values for variables in a data set, based on a user-specified ranking order. The example below returns the top 5 and bottom 5 values for two variables based on their frequency:

proc cas;
simple.topk / table="TABLE_NAME" 
              aggregator="N",                        
              inputs={"VAR1","VAR2"},
              topk=5,
              bottomk=5;
quit;

Simple is a rich action set with heaps of useful options covered in the documentation.

You may be wondering – what about crosstabs and frequency tables? The simple action set includes freq and crosstab actions. In addition, the action closely imitating the functionality of the beloved PROC FREQ is freqTab.freqTab from the freqTab action set. For example, the code snippet below creates frequency tables for VAR1, VAR2 and a crosstab of the two.

proc cas;
freqtab.freqtab / table="TABLE_NAME"
                  tabulate={"VAR1","VAR2",
                  {vars={"VAR1","VAR2"}}
                  };
quit;

Managing CAS Table Variables

Changing table metadata

One of the basic tasks after exploring your data is changing table metadata, such as dropping unnecessary variables, renaming tables and columns, and changing variable formats and labels. The table.altertable action helps you with these housekeeping tasks. For example, the code snippet below renames the table, drops two variables and renames and changes labels for two variables:

proc cas;
    table.altertable / table="TABLE_NAME" rename="ANALYTIC_TABLE"
                       drop={"VAR1",”VAR2”}
                       columns={{name="VAR3" rename="ROW_ID" label="Row ID"},
                                {name="VAR4" rename="TARGET" label="Outcome Variable"}
                                }
                       ;
quit;

Outputting variable list to a data set

Another useful trick I frequently use is extracting table columns as a SAS data set. Having a list of variables as values in a data set makes it easy to build data-driven scripts leveraging macro programming. The code snippet below provides an example. Here we encounter another example of capturing action result as a CASL variable and using it in further processing – I can’t stress enough how helpful this is!

proc cas;
    table.columninfo r=collinfo / table={name="TABLE_NAME"};       /*1*/
    collist=collinfo["ColumnInfo"];                                /*2*/
    saveresult collist casout="collist";                           /*3*/
quit;

In the snippet above:

#1 - the columninfo action collects column information. The action result is passed to a CASL variable collinfo. Notice, instead of writing out result=, I am using an alias r =.
#2 - the portion of the a CASL variable collinfo containing column data is extracted into another CASL variable collist.
#3 - the saveresult statement sends the data to a CAS table collist. If you want to send the results to a SAS7BDAT data set, replace casout= with dataout=, and provide the library.table_name information.

Transforming the Data

Lastly, let’s look at some ways to use CAS actions to transform your data. Proc SQL and DATA step are the two swiss-army knives in SAS 9 developers’ toolkit that take care of 90% of the data prep. The good news is you can execute both DATA Step and SQL directly from PROC CAS. In addition, call the transpose action to transpose your data.

Executing DATA Step code

The dataStep.runCode action enables you to run DATA step code directly inside PROC CAS. You must enclose your DATA step code in quotation marks after the code= statement. For example:

proc cas;
    dataStep.runCode /
    code="
        data table_name;
        set table_name;
        run;
        ";
quit;

Running DATA step code in CAS allows access to sophisticated group-by processing and the use of such popular programming techniques as first- and last-dot. Refer to the documentation for important nuances related to processing in a distributed, multi-threaded environment of CAS.

Executing FedSQL

To run SQL in CASL, use the fedSQL.execDirect action. Enclose the SQL query in quotation marks following the query= statement. Optionally, you can use the casout= statement to save the results to a CAS table. For example:

proc cas;
    fedsql.execDirect/
    query=
         "
          select  *
          from TABLE1 a  inner join TABLE2 b
          on a.VAR1 = b.VAR1
         "
    casout={name="TABLE3", replace=True};
quit;

Similarly to DATA step, be aware of the many nuances when executing SQL in CAS via FedSQL. Brian Kinnebrew provides an excellent overview of FedSQL in his SAS Communities article, and the documentation has up-to-date details on the supported functionality.

Transposing data

Transposing data in PROC CAS is a breeze. The example below uses transpose.transpose action to restructure rows into columns.

proc cas;
    transpose.transpose /
          table={name="TABLE_NAME", groupby={"VAR1"}}
          transpose={"VAR2"}
          id={"VAR3"}
          prefix="Prefix"
    casout={name="TRANSPOSED" replace=true};
run;

You can transpose multiple variables in the same transpose action. Simply place additional variables inside the curly brackets following transpose=, in quotes, separated by a comma.

Conclusion

PROC CAS is a wrapper procedure enabling you to leverage SAS’ new programming language - CASL. CASL enables you to submit CAS actions directly to SAS Cloud Analytic Services engine from a SAS client. This post provided examples of loading, managing, exploring and transforming your data through CAS actions. Certain new functionality in CAS is only available through CAS actions, so getting comfortable with CASL makes sense. Fear not, and let the curly brackets guide the way 😊.

Acknowledgement

I would like to thank Brian Kinnebrew for his thoughtful review and generous help on my journey learning CASL.

5 Comments

Jim Leone on November 20, 2020 5:10 pm

Hello, Daria. One of the first things you say in this post "Your data must be in the SASHDAT format for CAS to process it" does not appear to be accurate. Wouldn't it be more accurate to say "your data must be loaded to memory for CAS to process it"? This appears to be the case because later in the post you describe loading non-sashdat data to memory. In my personal experience trying to learn CASL, I found the single most confusing and frustrating thing to be the use of libnames, caslibs and libnames associated with caslibs. I'm surprised at how little commentary there is on this. Different portions of the language use different references. It appears to be nearly random. As I've helped people start using CASL, my recommendation is to make sure you have all 3 references defined for your data sources (libname, caslib, libname associated) and if one reference is not working, try the others.

I am interested in your final comment in reply to Ashok's question. As far as I have been able to determine, the data step merge is not really supported by CAS. As you point out, data in memory is not sortable, due to the distributed nature. Therefore, a merge requiring sorted data will almost definitely deliver unexpected/undesired results. Do you know of a situation/example where this is NOT the case? As much as I appreciate the power of CAS and CASL, I find the syntax bewildering, which makes it much harder to ramp up, but that's a rant for another time.

Thanks for the post. I know several people who have found it helpful, including me.

- Daria Rostovtseva on November 20, 2020 6:35 pm
  
  Hi Jim, thank you for your insightful comments! Yes you are correct that the data must be loaded in memory for CAS to process it. I will work on clarifying the language. You are right that the differences between a caslib and a traditional SAS library can be tricky to explain. Great idea for a future post.
  
  Regarding the data step merge, I have done a lot of experimentation with it since I wrote this post and I am seeing that as long as you are merging on a unique key, the results are predictable and as expected. The bonus is that you don't have to sort (and really, can't sort) the data in CAS. In fact, I am finding the data step merge to be much faster compared to FedSQL join. I will update my reply to Ashok's comment because I think it is misleading. Thank you for pointing it out.
  
Daria Rostovtseva on August 17, 2020 1:26 pm

Hi Ashok, thanks for your question. If you are using the fedsql.execDirect action, you can use the standard SQL joins like inner, left, right, full. For example:

proc cas;
fedsql.execDirect/
query=
"
select *
from TABLE1 a left join TABLE2 b
on a.VAR1 = b.VAR1
"
casout={name="TABLE3", replace=True};
quit;

You can also use the data step merge syntax inside Proc CAS, for example:
proc cas;
dataStep.runCode /
code="
data casuser.merged;
merge casuser.test1 casuser.test2;
by var1;
run;
";
quit;

Ashok on August 15, 2020 2:34 pm

Hi Daria,
Thanks for the information, can you also share code for all types of joins using proc cas.

johnny starling on March 17, 2020 8:28 am

great blog Daria!!

Johnny

Blogs