How to use SAS® Viya® and R for dynamic data selection

0

When you begin to work within the SAS Viya ecosystem, you learn that the central piece is SAS Cloud Analytic Services (CAS). CAS allows all clients in the SAS Viya ecosystem to communicate and run analytic methods. The great part about SAS Viya is that the R client can drive CAS directly using familiar objects and constructs for R programmers.

The SAS Scripting Wrapper for Analytics Transfer (SWAT) package is an R interface to CAS. With this package, you can load data into memory and apply CAS actions to transform, summarize, model and score the data. You can still retain the ease-of-use of R on the client side to further post process CAS result tables.

But before you can do any analysis in CAS, you need some data to work with and a way to get to it. There are two data access components in CAS:

  1. Caslibs, definitions that give access to a resource that contains data.
  2. CASTables, for analyzing data from a caslib resource. You load the data into a CASTable, which contains information about the data in the columns.

Other references you may find of interest include this GitHub repository where you can find more information on installing and configuring CAS and SWAT. Also available is this article on using RStudio with SAS Viya.

The following excerpt from SAS® Viya® : the R Perspective, the book I co-authored with my SAS colleague Xiangxiang Meng, demonstrates the way the R client in SAS Viya allows you to select data with precision. The examples come from the iris flower data set, which is available in the SASHELP library, in all distributions of SAS. The CASTable object sorttbl is sorted by the Sepal.Width column.

Rather than using fixed values of rows and columns to select data, we can create conditions that are based on the data in the table to determine which rows to select. The specification of conditions is done using the same syntax as that used by data.frame objectsCASTable objects support R’s various comparison operators and build a filter that subsets the rows in the table. You can then use the result of that comparison to index into a CASTableIt sounds much more complicated than it is, so let’s look at an example.

This expression creates a computed column that is used in a where expression on the CASTable. This expression can then be used as an index value for a CASTable. Indexing this way essentially creates a Boolean mask. Wherever the expression values are true, the rows of the table are returned. Wherever the expression is false, the rows are filtered out.

> newtbl <- sorttbl[expr,] > head(newtbl) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.6          6.9         2.3 virginica 
2          7.7         2.8          6.7         2.0 virginica 
3          7.6         3.0          6.6         2.1 virginica 
4          7.7         3.8          6.7         2.2 virginica

These two steps are commonly entered on one line.

> newtbl <- sorttbl[sorttbl$Petal.Length > 6.5,]
> head(newtbl) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.6          6.9         2.3 virginica 
2          7.7         2.8          6.7         2.0 virginica 
3          7.6         3.0          6.6         2.1 virginica 
4          7.7         3.8          6.7         2.2 virginica

We can further filter rows out by indexing another comparison expression.

> newtbl2 <- newtbl[newtbl$Petal.Width < 2.2,] > head(newtbl2) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.8          6.7         2.0 virginica 
2          7.6         3.0          6.6         2.1 virginica

Comparisons can be joined using the bitwise comparison operators & (and) and | (or). You must be careful with these operators though due to operator precedence. Bitwise comparison has a lower precedence than comparisons such as greater-than and less-than, but it is still safer to enclose your comparisons in parentheses.

> newtbl3 <- sorttbl[(sorttbl$Petal.Length > 6.5) & (sorttbl$Petal.Width < 2.2),] > head(newtbl3) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         2.8          6.7         2.0 virginica 
2          7.6         3.0          6.6         2.1 virginica

In all cases, we are not changing anything about the underlying data in CAS. We are simply constructing a query that is executed with the CASTable when it is used as the parameter in a CAS action. You can see what is happening behind the scenes by displaying the attributes of the resulting CASTable objects.

> attributes(newtbl3) 
 
$conn 
CAS(hostname=server-name.mycompany.com, port=8777, username=username, session=11ed56e2-f9dd-9346-8d01-44a496e68880, protocol=http) 
 
$tname
[1] "iris" 
 
$caslib 
[1] "" 
 
$where 
[1] "((\"Petal.Length\"n > 6.5) AND (\"Petal.Width\"n < 2.2))" 
 
$orderby 
[1] "Sepal.Width" 
 
$groupby 
[1] "" 
 
$gbmode 
[1] "" 
 
$computedOnDemand 
[1] FALSE 
 
$computedVars 
[1] "" 
 
$computedVarsProgram 
[1] "" 
 
$XcomputedVarsProgram 
[1] "" 
 
$XcomputedVars 
[1] "" 
 
$names 
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  
[5] "Species"      
 
$class 
[1] "CASTable" 
attr(,"package") 
[1] "swat"

You can also do mathematical operations on columns with constants or on other columns within your comparisons.

> iris[(iris$Petal.Length + iris$Petal.Width) * 2 > 17.5,] 
 
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
118          7.7         3.8          6.7         2.2 virginica 
119          7.7         2.6          6.9         2.3 virginica

The list of supported operators is shown in the following table:

Operator Numeric Data Character Data
+ (add)
- (subtract)
* (multiply)
/ (divide)
%% (modulo)
%/% (integer division)
^ (power) [✔

The supported comparison operators are shown in the following table.

Operator Numeric Data Character Data
== (equality)
!= (inequality)
< (less than)
> (greater than)
<= (less than or equal to)
>= (greater than or equal to)

 

As you can see in the preceding tables, you can do comparisons on character columns as well. In the following example, all of the rows in which Species is equal to "virginica" are selected and saved to a new CASTable object virginica. Note that in this case, data is still not duplicated.

> tbl <- defCasTable(conn, 'iris') > virginica <- tbl[tbl$Species == 'virginica',] > dim(virginica) 
 
[1] 50  5 
 
> head(virginica) 
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species 
1          7.7         3.0          6.1         2.3 virginica 
2          6.3         3.4          5.6         2.4 virginica 
3          6.4         3.1          5.5         1.8 virginica 
4          6.0         3.0          4.8         1.8 virginica 
5          6.9         3.1          5.4         2.1 virginica 
6          6.7         3.1          5.6         2.4 virginica

It’s easy to create powerful filters that are executed in CAS while still using the R syntax. However, the similarities to dataframe don’t end there. CASTable objects can also create computed columns and by groups using similar techniques.

Want to learn more? Get your copy of SAS Viya: The R Perspective
Share

About Author

Kevin D. Smith

Sr Software Developer at SAS and SAS Press author

Kevin D. Smith has been a software developer at SAS since 1997. He has supported PROC TEMPLATE and other underlying ODS technologies for most of that time. Smith has spoken at numerous SAS Global Forums as well as at regional and local SAS users groups with the "From Scratch" series of presentations that were created to help users of any level master various ODS technologies. He is the author of ODS Techniques: Tips for Enhancing Your SAS Output and PROC TEMPLATE Made Easy: A Guide for SAS Users.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top