Getting Started with Python Integration to SAS® Viya® - Part 14 - CAS Table to DataFrame

0

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to bring a distributed CAS table back to your Python client as a DataFrame.

In this example, I'm using Python on my laptop (Python client) to connect to the CAS server. Bringing a distributed CAS table back to the Python client as a local DataFrame is typically appropriate for smaller data. If the data sets become larger, it is more efficient to process the data in the CAS server, or bring a smaller subset of data back to the client. The big question is, why would I want to bring a CAS table out of the CAS server's massively parallel processing (MPP) environment back to my client as a DataFrame?

 

The CAS server in SAS Viya is setup to connect to a variety of data sources throughout your organization like databases, folder paths, cloud data and more. Typically these connections are setup by your administrator. Having these connections setup makes it easy to access data. In this example, maybe the data you need is available to the CAS server and is small enough where you don't need to use the power of the CAS server's MPP environment. Maybe you simply want to pull a CAS table back to your client as a DataFrame and use a familiar Python package like numpy, pandas, seaborn, matplotlib, or scikit-learn for data exploration, preprocessing, visualization or machine learning.

Use caution when bringing data back to your Python client

With the to_frame SWAT method you can easily transfer a CAS table to your Python client as a DataFrame. When using the to_frame method YOU MUST USE CAUTION. It will attempt to pull all of the data down from the CAS server regardless of size. If the data is large, this can be time consuming or overwhelm your Python client depending on your available memory.

Load the demonstration data into memory

I've created a connection to my CAS server in the variable conn. I'll use my conn connection object and the SWAT read_csv method to load the cars.csv file from the Example Data Sets for the SAS® Viya® Platform website into memory on the CAS server. I'll add some formats and labels to the CAS table. This is a small table for demonstration purposes.

fileurl = 'https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/cars.csv'
castbl = conn.upload_file(fileurl, 
                          casout = {'name':'cars_cas_table', 
                                    'caslib':'casuser', 
                                    'replace':True},
                          importoptions = {
                              'fileType':'csv',
                              'guessRows':100,
                              'vars': {
                                  'Invoice':{'format':'dollar16.2','label':'Invoice Price'},
                                  'MSRP':{'format':'dollar16.2', 'label':'Manufacturer Suggested Retail Price'},
                                  'Weight':{'format':'comma16.'},
                                  'EngineSize':{'label':'Engine Size (L)'},
                                  'Length':{'label':'Length (IN)'},
                                  'MPG_City':{'label':'MPG (City)}'},
                                  'MPG_Highway':{'label':'MPG (Highway)'}
                              }
                          })
 
# and the results
NOTE: Cloud Analytic Services made the uploaded file available as table CARS_CAS_TABLE in caslib CASUSER(Peter).
NOTE: The table CARS_CAS_TABLE has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.

Next, I'll execute the tableInfo action to confirm the table was loaded into memory.

conn.tableInfo(caslib = 'casuser')


Lastly I'll execute the columnInfo action to view the column information of the CAS table.

castbl.columnInfo()

 

The results above show the CARS_CAS_TABLE has 15 columns. Some of the columns contain SAS labels and SAS formats.

Pull the entire CAS table to the client as a DataFrame

We saw earlier that the CARS_CAS_TABLE is a small in-memory table on the CAS server. Let's pretend the data was connected to some cloud data storage and was loaded into memory. I noticed that the data is small, and I want to use familiar Python packages to process it.

The first thing I can do is use the to_frame SWAT method on my castbl object to convert the CAS table to a DataFrame. I'll store the result in df and then view it's type.

df = castbl.to_frame()
display(type(df))
 
# and the results
swat.dataframe.SASDataFrame

 

The results show I now have a SASDataFrame. A SASDataFrame lives on the Python client and is a subclass of pandas.DataFrame. Therefore, anything you can do with a pandas.DataFrame will also work with a SASDataFrame. The only difference is that a SASDataFrame object contain extra metadata from the CAS table. For more information on CASTable vs DataFrame vs SASDataFrame check out the SWAT documentation.

You can view the extra metadata of a SASDataFrame with the colinfo attribute.

df.colinfo

The results show the SASDataFrame contains a variety of information about the columns like the data type, name and width. If the column contains a SAS label or format, that is also stored.

Once you have the SASDataFrame on your Python client, you can use pandas. For example, here I'll execute the pandas plot method to create a scatter plot of EngineSize by MPG_City.

df.plot.scatter(x = 'EngineSize', y = 'MPG_City', title = "Engine Size by MPG City");y');

The results show a simple scatter plot using pandas.

Apply column labels to a SASDataFrame

Sometimes you might want to use column labels, as they are more descriptive. Traditional pandas DataFrames don't have the concept of labels, but CAS tables do.  CAS enables you to add more descriptive labels to columns. Since a SASDataFrame stores information from the CAS table, you can use the SWAT apply_labels method to apply labels to the SASDataFrame. For more information on adding labels to a CAS table, check on my previous post Renaming columns.

df.apply_labels()

Notice the apply_labels method applies the SAS column labels (Manufacturer Suggested Retail Price, Invoice Price, Engine Size (L), Length (IN), MPG (City) and MPG (Highway)) to the SASDataFrame.

Pull a sample of the CAS table to the client as a DataFrame

What if the CAS table is big, but you want to work with a sample of the data using a specific Python package? Well, you can use the sample_pct parameter in the to_frame SWAT method to pull a smaller subset of data to your Python client.

Here I'll specify I want ten percent of the data from the CAS table. I'll also add the sample_seed parameter to pull the same random subset of data. Then I'll execute the shape method to count the number of rows and columns in the new SASDataFrame.

df = castbl.to_frame(sample_pct = .1, sample_seed = 99)
df.shape
 
# and the results
(43, 15)

The results show that only 43 rows of data were returned from the CAS table as a SASDataFrame on the Python client.

Apply SAS formats to the SASDataFrame

Lastly, what if you want to apply the SAS formats of a CAS table to your SASDataFrame? You can do that using the format parameter in the SWAT to_frame method, as seen in the code below. I'll also display five rows of the SASDataFrame and the data types of the columns.

df_formats = castbl.to_frame(format=True)
 
display(df_formats.head(), df_formats.dtypes)

The results show the DOLLAR format was applied to the MSRP and Invoice columns, and the COMMA format was applied to the Weight column.

The dtype method shows that when using the format parameter in the to_frame method all columns are returned to your Python client as objects, even if they do not contain a format.

Summary

The SWAT package blends the world of Pandas and CAS. It enables you to use the CAS server's massively parallel processing engine for data exploration, preprocessing and analytics. It also enables you to easily transfer your distributed CAS tables back to your Python client as SASDataFrames for additional processing using other Python packages.

The one thing you must remember when using the to_frame method is that transferring large data can be time consuming or take up all of your Python client's resources. USE IT WITH CAUTION.

Additional and related resources

Share

About Author

Peter Styliadis

Technical Training Consultant

Peter Styliadis is a Technical Training Consultant at SAS on the Foundations team in the Education division. The team focuses on course development and customer training. Currently Peter spends his time working with SAS Programming, Structured Query Language (SQL), SAS Visual Analytics, Programming in Viya and Python. Peter is from the Rochester/Buffalo New York area and does not miss the cold weather up north. After work, Peter can be found spending time with his wife, child and three dogs, as well as hiking and spending time at home learning something new. Prior to joining SAS, Peter worked at his family's restaurant (think My Big fat Greek Wedding), worked in customer service, then became a High School Math and Business teacher. To connect with Peter, feel free to connect on LinkedIn.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top