Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll show how to impute missing values in a distributed CAS table using the fillna method from the Pandas API in the SWAT package and the impute CAS action.
Load and prepare data in the CAS server
First, I'll create some fake data to use. This script will create a simple pandas DataFrame and then upload it to the distributed CAS server. The data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data. In the script below make sure to add your specific CAS connection information.
## Packages import swat import pandas as pd import numpy as np ## CAS connection information conn = swat.CAS(Enter your connection information) ## Create a simple pandas DataFrame df = pd.DataFrame([ [np.nan, 2, 45, 0, 'A'], [3, 4, np.nan, 1,'A'], [np.nan, np.nan, 50, np.nan,'B'], [np.nan, 3, np.nan, 4,], [2, 2, np.nan, 0, 'A'], [3, 4, np.nan, 1,'A'], [np.nan, np.nan, 75, np.nan,'B'], [np.nan, 3, 60, 4,] ], columns=['col1','col2','col3','col4','col5']) ## Upload the dataframe to the CAS server as a CAS table castbl = conn.upload_frame(df, casout = {'name':'missing_data', 'caslib':'casuser', 'replace':True}) ## View the CAS table castbl.head(20) ## and the results NOTE: Cloud Analytic Services made the uploaded file available as table MISSING_DATA in caslib CASUSER(Peter.Styliadis@sas.com). NOTE: The table MISSING_DATA has been created in caslib CASUSER(Peter.Styliadis@sas.com) from binary data uploaded to Cloud Analytic Services. |
The results show that the MISSING_DATA CAS table has missing values. For more information on counting missing values in a CAS table view my previous post.
Lastly, I'll run the tableInfo action to view available in-memory CAS tables.
conn.tableInfo(caslib = 'casuser') |
The results show that the MISSING_DATA CAS table is loaded into memory in the Casuser caslib on the CAS server.
Impute missing values using the fillna method
One way to impute missing values is to use the fillna method from the SWAT package. This method is similar to using the fillna pandas method. However, not all parameters are implemented in the SWAT method.
In this example, I'll impute the missing values in col1 and col2 using the values 100 and 200 respectively. Then, I'll use the inplace=True parameter to update the CAS table. If you do not specify inplace=True a new CAS table will be created with a randomly generated name. Lastly, I'll execute the head method on the CASTable object to view the data.
castbl.fillna(value = {'col1':100, 'col2':200}, inplace = True) castbl.head(10) |
The results show that all missing values in col1 and col2 have been replaced with specified the values.
Impute missing values using the impute CAS action
You can also use the impute CAS action to fill missing values in a CAS table. The impute action provides a variety of additional parameters. To use the impute action you first have to load the dataPreprocessing CAS action set.
conn.loadActionSet('dataPreprocess') |
After the action set is loaded you can use the impute action. In this example, I want to impute the numeric missing values in column col3 and col4 using the median, and the character values in col5 with the mode. I also want to create a new CAS table with my imputed columns. One thing to keep in mind with the impute action is that it creates a new imputed column, it does not overwrite the existing column.
To complete my task I'll use the following parameters:
- The inputs parameter specifies the columns to impute (col3, col4 and col5).
- The copyAllVars parameter will output the original columns and the imputed columns to a new CAS table. If you want to output specific columns from the original table use the copyVars parameter with a list of the columns to keep.
- The methodInterval specifies the imputation technique for interval variables.
- The methodNominal parameter specifies the imputation technique for nominal variables.
- The casout parameter specifies output CAS table information. Here I'll create a new table named IMPUTED_DATA in the Casuser caslib. You can overwrite the existing CAS table if you choose to.
colsToImpute = ['col3', 'col4', 'col5'] castbl.impute(inputs = colsToImpute, copyAllVars = True, methodInterval = "MEDIAN", methodNominal = "MODE", casout = { 'name':'imputed_data', 'caslib':'casuser', 'replace':True }) |
The impute action returns a dictionary of information to the client. The first DataFrame shows information about the imputed columns. The second DataFrame returns information about the newly created table.
Next, I'll run the tableInfo action to view available CAS tables.
conn.tableInfo(caslib = 'casuser') |
The results show the original MISSING_DATA table and the new IMPUTED_DATA table are both in-memory on the CAS server.
Lastly, I'll reference the IMPUTED_DATA CAS table in the variable imp_tbl and run the head method to view the data on the client.
imp_tbl = conn.CASTable('imputed_data', caslib = 'casuser') imp_tbl.head(20) |
The results show that the new CAS table contains the new imputed columns and all of the original columns.
Summary
The SWAT package blends the world of pandas and CAS to process your distributed data. In this example, I focused on showing both the Pandas API fillna method and the impute CAS action from the SWAT package to impute missing values in a distributed CAS table.