Python Integration to SAS® Viya® - Part 21 - Impute Missing Values

0

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll show how to impute missing values in a distributed CAS table using the fillna method from the Pandas API in the SWAT package and the impute CAS action.

Load and prepare data in the CAS server

First, I'll create some fake data to use. This script will create a simple pandas DataFrame and then upload it to the distributed CAS server.  The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data. In the script below make sure to add your specific CAS connection information.

## Packages
import swat
import pandas as pd
import numpy as np
 
## CAS connection information
conn = swat.CAS(Enter your connection information)
 
## Create a simple pandas DataFrame
df = pd.DataFrame([
                [np.nan, 2, 45, 0, 'A'],
                [3, 4, np.nan, 1,'A'],
                [np.nan, np.nan, 50, np.nan,'B'],
                [np.nan, 3, np.nan, 4,],
                [2, 2, np.nan, 0, 'A'],
                [3, 4, np.nan, 1,'A'],
                [np.nan, np.nan, 75, np.nan,'B'],
                [np.nan, 3, 60, 4,]
            ],
            columns=['col1','col2','col3','col4','col5'])
 
## Upload the dataframe to the CAS server as a CAS table
castbl = conn.upload_frame(df,
                           casout = {'name':'missing_data', 
                                     'caslib':'casuser', 
                                     'replace':True})
 
## View the CAS table
castbl.head(20)
 
 
## and the results
NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter.Styliadis@sas.com).
NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter.Styliadis@sas.com) from binary data uploaded to Cloud Analytic Services.

The results show that the MISSING_DATA CAS table has missing values. For more information on counting missing values in a CAS table view my previous post.

Lastly, I'll run the tableInfo action to view available in-memory CAS tables.

conn.tableInfo(caslib = 'casuser')

The results show that the MISSING_DATA CAS table is loaded into memory in the Casuser caslib on the CAS server.

Impute missing values using the fillna method

One way to impute missing values is to use the fillna method from the SWAT package. This method is similar to using the fillna pandas method. However, not all parameters are implemented in the SWAT method.

In this example, I'll impute the missing values in col1 and col2 using the values 100 and 200 respectively. Then, I'll use the inplace=True parameter to update the CAS table. If you do not specify inplace=True a new CAS table will be created with a randomly generated name. Lastly, I'll execute the head method on the CASTable object to view the data.

castbl.fillna(value = {'col1':100, 'col2':200}, inplace = True)
castbl.head(10)

The results show that all missing values in col1 and col2 have been replaced with specified the values.

Impute missing values using the impute CAS action

You can also use the impute CAS action to fill missing values in a CAS table. The impute action provides a variety of additional parameters. To use the impute action you first have to load the dataPreprocessing CAS action set.

conn.loadActionSet('dataPreprocess')

After the action set is loaded you can use the impute action. In this example, I want to impute the numeric missing values in column col3 and col4 using the median, and the character values in col5 with the mode. I also want to create a new CAS table with my imputed columns. One thing to keep in mind with the impute action is that it creates a new imputed column, it does not overwrite the existing column.

To complete my task I'll use the following parameters:

  • The inputs parameter specifies the columns to impute (col3, col4 and col5).
  • The copyAllVars parameter will output the original columns and the imputed columns to a new CAS table. If you want to output specific columns from the original table use the copyVars parameter with a list of the columns to keep.
  • The methodInterval specifies the imputation technique for interval variables.
  • The methodNominal parameter specifies the imputation technique for nominal variables.
  • The casout parameter specifies output CAS table information. Here I'll create a new table named IMPUTED_DATA in the Casuser caslib. You can overwrite the existing CAS table if you choose to.
colsToImpute = ['col3', 'col4', 'col5']
 
castbl.impute(inputs = colsToImpute,
              copyAllVars = True,               
              methodInterval = "MEDIAN",       
              methodNominal = "MODE",          
              casout = {
                  'name':'imputed_data',
                  'caslib':'casuser',
                  'replace':True
              })

The impute action returns a dictionary of information to the client. The first DataFrame shows information about the imputed columns. The second DataFrame returns information about the newly created table.

Next, I'll run the tableInfo action to view available CAS tables.

conn.tableInfo(caslib = 'casuser')


The results show the original MISSING_DATA table and the new IMPUTED_DATA table are both in-memory on the CAS server.

Lastly, I'll reference the IMPUTED_DATA CAS table in the variable imp_tbl and run the head method to view the data on the client.

imp_tbl = conn.CASTable('imputed_data', caslib = 'casuser')
imp_tbl.head(20)

The results show that the new CAS table contains the new imputed columns and all of the original columns.

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example, I focused on showing both the Pandas API fillna method and the impute CAS action from the SWAT package to impute missing values in a distributed CAS table.

Additional and related resources

Share

About Author

Peter Styliadis

Technical Training Consultant

Peter Styliadis is a Technical Training Consultant at SAS on the Foundations team in the Education division. The team focuses on course development and customer training. Currently Peter spends his time working with SAS Programming, Structured Query Language (SQL), SAS Visual Analytics, Programming in Viya and Python. Peter is from the Rochester/Buffalo New York area and does not miss the cold weather up north. After work, Peter can be found spending time with his wife, child and three dogs, as well as hiking and spending time at home learning something new. Prior to joining SAS, Peter worked at his family's restaurant (think My Big fat Greek Wedding), worked in customer service, then became a High School Math and Business teacher. To connect with Peter, feel free to connect on LinkedIn.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top