Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to count missing values in a CAS table using the Python SWAT package.
Load and prepare data
First, I connect my Python client to the distributed CAS server and named my connection conn. Next, I create a small pandas DataFrame with missing values for the demonstration. Further, I load the DafaFrame to the CAS server using the upload_frame method and and name the CAS table MISSING_DATA and place it in the Casuser caslib. The upload_frame method returns a reference to the CAS table in the variable castbl.
For more information about uploading client-side data to the CAS server, check out my previous post.
conn = ## your CAS connection information ## Create a simple dataframe df = pd.DataFrame([ [np.nan, 2, 45, 0, 'A'], [3, 4, np.nan, 1,'A'], [np.nan, np.nan, 50, np.nan,'B'], [np.nan, 3, np.nan, 4,], [2, 2, np.nan, 0, 'A'], [3, 4, np.nan, 1,'A'], [np.nan, np.nan, 75, np.nan,'B'], [np.nan, 3, 60, 4,] ], columns=['col1','col2','col3','col4','col5']) ## Upload the dataframe to the CAS server as a CAS table castbl = conn.upload_frame(df, casout = {'name':'missing_data', 'caslib':'casuser', 'replace':True}) # and the results NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter). NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services. |
The results show that the MISSING_DATA CAS table has been loaded to the CAS server.
Lastly, I'll preview the distributed CAS table using the SWAT package head method.
castbl.head(10) |
Count missing values in a CAS table
There are a variety of ways to count the number missing values in a CAS table. Counting missing values in CAS tables is not exactly the same in the SWAT package as it is in pandas. However, it's just as easy. Let's look at a few different methods.
Using the SWAT package nmiss method
For users experienced with pandas, you traditionally run the isna and sum methods to find the number of missing values in each column of a DataFrame. For CAS tables it's even easier. You can use the SWAT package nmiss method. The nmiss method returns the total number of missing values in each column.
Here I'll specify the CAS table object castbl, then the SWAT nmiss method.
castbl.nmiss() |
The distributed CAS server counts the number of missing values in each column and returns a Series to the Python client. Once you have the Series object you can use traditional pandas for additional processing. For example, I'll chain the pandas plot method after the SWAT nmiss method to plot the series. The summarization occurs in the distributed CAS server and the plotting occurs on the client.
(castbl ## CAS table .nmiss() ## SWAT method .plot(kind = 'bar')) ## pandas plot |
You can also use the nmiss method to find the number of missing values in specific columns. Simply specify the CASTable object, the columns, then the method.
colNames = ['col1','col5'] castbl[colNames].nmiss() |
Using the Distinct CAS action
The distinct CAS action is one of my favorite actions. It not only finds the number of missing values, but it gives you the number of distinct values in each column. Now, depending on the size of your data and the number of distinct values, this one can be a bit more resource intensive since finding the number of distinct values in a column can be more time consuming.
To use the distinct CAS action specify the CASTable object, castbl, then the distinct CAS action.
castbl.distinct() |
The distinct action returns a CASResults object (dictionary) back to the Python client with the number of distinct and missing values.
In the distinct action you can use the inputs parameter to specify the columns you want to summarize.
castbl.distinct(inputs = colNames) |
The results show the number of distinct and missing values in the specified columns. Now with the summarized data form the CAS server in a CASResults object you can use your Python client to continue working with the data. For more information on working with CASResults objects, check out my previous post.
Using the summary CAS action
You can also use the summary CAS action to find the number of missing values for numeric columns. The summary action will also generate a variety of descriptive statistics such as the mean, variance, size, sum of squares and more.
castbl.summary() |
The results show the descriptive statistics. The NMiss column contains the number of missing values for the numeric columns.
Within the summary action you can also specify the columns to analyze with the inputs parameter, and the summary statistics to generate with the subSet parameter.
castbl.summary(inputs = ['col1','col2'], subSet = ['min','max','nmiss']) |
Summary
The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using specific SWAT methods and CAS actions to count the number of missing values in CAS table columns.
Additional and related resources
- Getting Started with Python Integration to SAS® Viya® - Index
- SWAT API Reference
- nmiss SWAT method
- simple.distinct CAS action
- simple.summary CAS action
- SAS® Cloud Analytic Services: Fundamentals
- CAS Action Documentation
- CAS Action! - a series on fundamentals
- SAS Course - SAS® Viya® and Python Integration Fundamentals
1 Comment
Thanks for sharing! Quite informative and useful.