Getting Started with Python Integration to SAS® Viya® - Part 9 - Summarize Columns

0

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how to filter CAS tables. Now it's time to focus on how to summarize columns.

Load and explore data

Let's first load some data. In SAS Viya we can use the Samples caslib for demonstration data. I'll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I'll reference the CAS table in the tbl object and preview 5 rows using the SWAT head method. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
                            casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.head()

The water_cluster CAS table contains daily water usage for a variety properties. The Daily_W_C_M3 column displays the water used per location in cubic meters (m3) .

Simple summarizations

Next, let's execute methods for simple summarizations on CAS table columns. If you've used Pandas before these techniques will be very familiar.

I'll start by getting the total water consumption for all the data using the sum method in the SWAT package on the CAS table column Daily_W_C_M3.

(tbl
 .Daily_W_C_M3
 .sum()
)
# and the results
401407.88800000015

Next, I'll view the maximum and minimum water consumption value in the data using the max and min methods in the SWAT package.

(tbl
 .Daily_W_C_M3
 .max()
)
# and the results
11910.0
 
(tbl
 .Daily_W_C_M3
 .min()
)
# and the results
0.0

Lastly, I'll find the average water consumption value in the CAS table using the mean method.

(tbl
 .Daily_W_C_M3
 .mean()
)
# and the results
8.591778424657537

Instead of manually specifying each method to view summary information, I could of used the summary CAS action or the describe method from the SWAT package. For more information check out my previous post Descriptive Statistics.

Find the property with the max water consumption

What if I want to find the property or properties with the max water consumption? One technique is to find the max water consumption value, then filter the CAS table for the row (or rows) with that value. Here I will store the max water consumption value in the variable maxWaterConsumption. Then I'll use the variable in the query method to filter the CAS table. For more information on filtering CAS tables, check out my previous blog post Filtering CAS Tables.

## Store the max water consumption value
maxWaterConsumption = (tbl
                      .Daily_W_C_M3
                      .max()
)
 
## Filter the CAS table for the property with the max water usage
(tbl
 .query(f"Daily_W_C_M3 = {maxWaterConsumption }")
 .head()
)

The results show that the property address 1660 S DAIRY ASHFORD ST had the largest daily water consumption rate.

Find the top 10 daily water consumption values and properties

Lastly, to find the top 10 daily water consumption properties and values I can use the nlargest method in the SWAT package. This method is similar to the Pandas nlargest method.  One difference is the SWAT nlargest method does not implement the keep parameter.

I'll specify the SWAT nlargest method and store the results in the variable df_top10. Then I'll display the results.

df_top10 = tbl.nlargest(10, 'Daily_W_C_M3')
display(df_top10)

The CAS server will execute the request and return a SASDataFrame to the client. The SASDataFrame shows that a single property contains the top 10 largest water consumption values.

Remember, a SASDataFrame is a subclass of a Pandas DataFrame. As a result, you can work with them as you normally would a Pandas DataFrame. To learn more about working with CAS objects check out my previous post Working with CAS Actions and CASResults Objects.

Summary

The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas techniques to work with your distributed CAS tables. This enables you to utilize the massively parallel processing power of the CAS server in SAS Viya using Python. Here we learned about using familiar Pandas methods in the SWAT package like sum, min, mean, max and nlargest.

Additional and related resources

Share

About Author

Peter Styliadis

Technical Training Consultant

Peter Styliadis is a Technical Training Consultant at SAS on the Foundations team in the Education division. The team focuses on course development and customer training. Currently Peter spends his time working with SAS Programming, Structured Query Language (SQL), SAS Visual Analytics, Programming in Viya and Python. Peter is from the Rochester/Buffalo New York area and does not miss the cold weather up north. After work, Peter can be found spending time with his wife, child and three dogs, as well as hiking and spending time at home learning something new. Prior to joining SAS, Peter worked at his family's restaurant (think My Big fat Greek Wedding), worked in customer service, then became a High School Math and Business teacher. To connect with Peter, feel free to connect on LinkedIn.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top