Read a CAS data table by using the iml action

5

A previous article compares a SAS/IML program that runs in PROC IML to the same program that runs in the iml action. (You can read an overview of the iml action.) The example in the previous article was very simple and did not read or write data. This article compares a PROC IML program that reads a SAS data set to a similar program that runs in the iml action and reads CAS tables. The iml action was introduced in SAS Viya 3.5.

PROC IML and SAS data sets

An important task for statistical programmers is reading data from a SAS data set into a SAS/IML matrix. SAS/IML programs are often used to pre-process or post-process data that are created by using other SAS procedures, and SAS data sets enable the output from one procedure to become the input to another procedure.

To simplify matters, let's focus on reading numerical data into a SAS/IML matrix. In PROC IML, the USE and READ statements are used to read data into a matrix. The following program is typical. Several numerical variables from the SasHelp.Cars data are read into the matrix X. The call to the CORR function then computes the correlation matrix for those variables:

proc iml;
varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
use Sashelp.Cars;              /* specify the libref and data set name */
read all var varNames into X;  /* read all specified variables into columns of X */
close;
 
corr = corr(X);
print corr[r=varNames c=varNames format=7.4];
quit;

The USE and READ statements read SAS data sets. However, there are no SAS data sets in CAS, so these statements are not supported in the iml action. Instead, the iml action supports reading a CAS table into a matrix. Whereas the READ statement in PROC IML reads data sequentially, the iml action reads data from a CAS table in parallel by using multiple threads.

The rows in a CAS table do not have a defined order

Before showing how to read a CAS table into a matrix, let's discuss a characteristic of CAS tables that might be unfamiliar to you. Namely, the order of observations in a CAS table is undefined.

Recall that one purpose of CAS is to process massive amounts of data. Large data tables might be stored in a distributed fashion across multiple nodes in a cluster. When rows of data are read, they are read by using multiple threads that execute in parallel. A consequence of this fact is that CAS data tables do not have an inherent order. If this sounds shocking, realize that the order of the observations does not matter for most data analysis. For example, order does not affect descriptive statistics such as the mean or percentiles. Furthermore, any analysis that uses the sum-of-squares crossproduct matrix (X`*X) is unaffected by reordering the observations. This includes correlation, regression, and a lot of multivariate statistics (for example, principal component analysis).

Of course, for some analyses (such as time series) and for some matrix computations, the order of rows is important. Therefore, the iml action has a way to ensure that the rows of a matrix are in a specific order. If the CAS table has a variable with the name _ROWID_ (or _ROWORDER_), then the data rows are sorted by that variable before the IML matrix is created. This happens automatically when you use the MatrixCreateFromCAS function, which is discussed in a subsequent section. (The _ROWID_ is used only to sort the rows; it does not become a column in the matrix.)

You can use the iml action to read any CAS table. If you read a table that does not contain a _ROWID_ variable, the order of rows in the data matrix might change from run to run.

Upload a data set from SAS to CAS

There are several ways to upload a SAS data set into a CAS table. This example uses the DATA step. By using the DATA step, you can also add a _ROWID_ variable to the data, in case you want the rows of the IML matrix to be in the same order as the data set.

I assume you have already established a CAS session. The following statements use the LIBNAME statement in SAS to create a libref to the active caslib in the current CAS session. You can use this libref to upload a data set into a CAS table:

libname mycaslib cas;    /* libref to the active caslib in the current CAS session */
data mycaslib.cars;      /* create CAS table named CARS in the active caslib */
   set Sashelp.Cars;
   _ROWID_ + 1;          /* add a sort variable */
run;
 
/* Optional: verify that the CAS table includes the _ROWID_ variable */
proc cas;
   columnInfo / table='cars';
run;

The output of the columnInfo action shows that the cars data table was created and includes the _ROWID_ variable.

How to read a CAS data table in the iml action

To read variables from a CAS data table into a matrix, you can use the CreateMatrixFromCAS function. The syntax of the function is
X = MatrixCreateFromCAS(caslib, TableName, options);
where

  • caslib is a caslib that specifies the location of the table. You can specify a blank string if you want to use the default caslib. Or you can specify a caslib as a string. For example, if the CAS table is in your personal caslib, specify 'CASUSER(userName)', where userName is your login name. For more information, see the CAS documentation about caslibs.
  • TableName is a string that specifies the name of the CAS table.
  • options is string that enables you to read only part of the data. The most common option is the KEEP statement, which you can use to specify all numeric values ('KEEP=_NUMERIC_', which is the default), all character variables ('KEEP=_CHARACTER_'), or to specify certain variables ('KEEP=X1 X1 Y Z').

Recall that matrices in the SAS/IML language are either numeric or character. If you want to read mixed-type data, you can use the TableCreateFromCAS function.

Read a CAS table in the iml action

In the previous sections, we created a CAS table ('cars') that contains the Sashelp.Cars data. You can use iml action and the MatrixCreateFromCAS function to read that data table into a matrix, as follows:

proc cas;
   loadactionset 'iml';    /* load the iml action set (only once per session) */
run;
 
proc cas;
source ReadData;
   KeepStmt = 'KEEP=EngineSize Horsepower MPG_City MPG_Highway Weight';
   X = MatrixCreateFromCas(' ', 'cars', KeepStmt);
 
   corr = corr(X);
   varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
   print corr[r=varNames c=varNames format=7.4];
endsource;
iml / code=ReadData;
run;

The output is identical to the output from PROC IML and is not shown.

The KEEP= option specifies five numeric variables to read. The first argument to the MatrixCreateFromCAS function is a blank string, which means "use the default caslib." Alternatively, you could specify a caslib such as 'CASUSER(userName)'. (For more about caslibs, see "Caslibs and librefs in SAS." The matrix X is the data matrix. Because the CARS table contains a _ROWID_ variable, the rows of X are in the same order as the rows of the Sashelp.Cars data set.

After you read the data, you can write standard SAS/IML programs to manipulate or analyze X. For example, I used the CORR function and the PRINT statement to reproduce the results of the previous PROC IML program.

In the next article, I will show how to use the iml action to write a CAS table that contains data in an IML matrix.

Further reading

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

5 Comments

  1. Adalbert Thomalla on

    Hi Rick,

    in the classic world of proc iml it is possible to read in the variable names of a sas dataset into matrix in the same time of reading the dataset:

    example:
    use work.Cashflows;
    read all var _NUM_ into CF_num_m[colname=CF_num_names_v];

    Then you can use the matrix columns through the variable names instead of column number:

    example:
    print(CF_num_m[, "CASHFLOW_AMT"]);
    instead of CF_num_m[, 4]);

    We are missing the same option in the function MatrixCreateFromCas.
    Do we have the same possibility using proc cas instead of proc iml?

    Best Regards,
    David Korinth and Adalbert Thomalla

    • Rick Wicklin

      Use the TableCreateFromCAS function, followed by a call to TableGetVarname. Here is an example and two links to the documentation:

      cas;
      proc casutil;
         load data=Sashelp.Iris casout="Iris" replace;
      run;
       
      /*
      TableCreateFromCAS function:
      https://go.documentation.sas.com/doc/en/pgmsascdc/v_051/casactiml/casactiml_langref_sect106.htm
      Example of reading CAS table and getting variable names:
      https://go.documentation.sas.com/doc/en/pgmsascdc/v_051/casactiml/casactiml_iml_examples36.htm
      */
      proc cas;
      loadactionset "iml";
      source tblcreate;
         /* Read numeric columns of CAS table */
         opt = "KEEP=_NUMERIC_";
         tbl = TableCreateFromCAS("", "Iris", opt);
         varNames = TableGetVarName(tbl);
         print varNames;
         call TablePrint(tbl) colheader="Names" numobs=5;
         /* extract some (or all) columns into a matrix */
         X = TableGetVarData(tbl, varNames);
         /* if you don't need the table, you can delete it and just keep the matrix */
         free tbl;
         print (X[1:5,])[colname=varNames];
      endsource;
      iml / code=tblcreate;
      run;
      quit;
  2. Adalbert Thomalla on

    Hi Rick, in Proc IML it is possible to STORE Matrices and Modules in each Proc IML Step in an additive way. So store Module1 in first Proc IML, Module2 in second Proc Iml. and load all Modules in third Proc IML in the same session without necessity to make load statement in the second proc iml. But using cas this leads to an error: Module1 not found in case of invoking Module1 in the third cas procedure. Ist there a solution to solve this without making load statement in every proc cas procedure (because it is resource intensive if we have a lot of modules in seperate files and big matrices and if we want to just to add another module into to overall store)? Thanks, David and Adalbert!

    Example Code:
    /*Store and Load with PROC IML*/

    proc iml;
    start func_1(m);
    measure = 2#m;
    return (measure);
    finish func_1;
    store ;*module = cashflow_v;
    quit;

    proc iml;
    start func_2(m);
    measure = 4#m;
    return (measure);
    finish func_2;
    store ;*module = cashflow_v;
    quit;

    proc iml;
    load;
    m1 = func_1({1 1});
    print(m1);
    quit;

    /*Store and Load with CAS loadactionset iml*/

    proc cas;
    loadactionset "iml";
    source iml_code;
    start func_1(m);
    measure = 2#m;
    return (measure);
    finish func_1;
    store ;*module = cashflow_v;
    endsource;
    iml / code=iml_code;
    run;
    quit;

    proc cas;
    loadactionset "iml";
    source iml_code;
    start func_2(m);
    measure = 4#m;
    return (measure);
    finish func_2;
    store ;*module = cashflow_v;
    endsource;
    iml / code=iml_code;
    run;
    quit;

    proc cas;
    loadactionset "iml";
    source iml_code;
    load;
    m1 = func_1({1 1});
    print(m1);
    endsource;
    iml / code=iml_code;
    run;
    quit;

    ERROR:

    135
    136 proc cas;
    137 loadactionset "iml";
    138 source iml_code;
    139 load;
    140 m1 = func_1({1 1});
    141 print(m1);
    142 endsource;
    143 iml / code=iml_code;
    144 run;
    NOTE: Active Session now MYSESSION.
    NOTE: Added action set 'iml'.
    ERROR: Invocation of unresolved module FUNC_1.
    statement : ASSIGN at line 2 column 2
    ----Source lines near the error----
    1: load;
    * 2: m1 = func_1({1 1});
    -----------------------------------
    ERROR: The action stopped due to errors.
    145 quit;
    NOTE: PROZEDUR CAS used (Total process time):
    real time 0.18 Sekunden
    cpu time 0.03 Sekunden

    • Rick Wicklin

      If you have programming questions, you should submit them to the SAS Support Communities. I cannot provide personal assistance.

      The documentation for the STORE statement states that "The iml action does not append to an existing storage table. If you open an existing table and write to it, all the previous contents are overwritten."

      Usually, define all modules in one file and use STORE MODULE=_ALL_ to store them. I can then use LOAD MODULE=_ALL_ to load the modules.

      If the module definitions must be spread across multiple files, you can try to use %INCLUDE statements to read all the definitions, then use a single STORE statement. Depending where you are running the program (in the cloud versus locally), the path to the files might be hard to specify, which is why I prefer the former solution.
      See also the article "How to use Git to share SAS programs."

Leave A Reply

Back to Top