Building Machine Learning Models by Integrating Python and SAS® Viya®

Image by 동철 이 from Pixabay

SAS® Scripting Wrapper for Analytics Transfer (SWAT), a powerful Python interface, enables you to integrate your Python code with SAS® Cloud Analytic Services (CAS). Using SWAT, you can execute CAS analytic actions, including feature engineering, machine learning modeling, and model testing, and then analyze the results locally.

This article demonstrates how you can predict the survival rates of Titanic passengers with a combination of both Python and CAS using SWAT. You can then see how well the models performed with some visual statistics.



To get started, you will need the following:

  1. 64-bit Python 2.7 or Python 3.4+
  2. SAS® Viya®
  3. Jupyter Notebook
  4. SWAT (if needed, see the Installation page in the documentation)

After you install and configure these resources, start a Jupyter Notebook session to get started!


Step 1. Initialize the Python packages

Before you can build models and test how they perform, you need to initialize the different Python libraries that you will use throughout this demonstration.

Submit the following code and insert the specific values for your environment where needed:

# Import SAS SWAT Library
import swat

# Import OS for Local File Paths
import os
for dirname, _, filenames in os.walk('Desktop/'):
     for filename in filenames:
          print(os.path.join(dirname, filename))

# Import Numpy Library for Linear Algebra
import numpy as np
from numpy import trapz

# Import Pandas Library for Panda Dataframe
import pandas as pd

# Import Seaborn & Matplotlib for Data Visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

Step 2. Create the connection between Python and CAS

After you import the libraries, connect to CAS by using the SWAT package. In this demonstration, SSL is enabled in the SAS Viya environment and the SSL certificate is stored locally. If you have any connection errors, see Encryption (SSL).

Use the command below to create a connection to CAS:

# Create a Connection to CAS
conn = swat.CAS("", PORT, "UserID", "Password")


This command confirms the connection status:

# Verify the Connection to CAS
connection_status = conn.serverstatus()


If the connection to CAS is working, the command returns status information about the server, and you can begin to import and explore the data.

Step 3. Import data into CAS

To gather the data needed for this analysis, run the following code in SAS and save the data locally.

This example saves the data locally in the Documents folder. With a CAS action, you can import this data on to the Viya CAS server.

# Import Titanic Data on to Viya CAS Server
titanic_cas = conn.read_csv(r"C:\Users\krstob\Documents\Titanic\titanic.csv", 
casout = dict(name="titanic", replace=True))

Step 4. Explore the data loaded in CAS

Now that the data is loaded into CAS memory, use the SWAT interface to interact with the data set. Using CAS actions, you can look at the shape, column information, records, and descriptions of the data. A machine learning engineer should review data before loading the data locally, in order to dive deeper on certain features.

If any of the SWAT syntax looks familiar to you, it is because SWAT is integrated with pandas. Here is a high-level look at the data:

  • The shape (rows, columns):
  • The column information:
  • The first three records:
  • An in-depth feature description:
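The four views listed above correspond to familiar pandas-style calls. Here is a minimal local sketch, assuming a small made-up data frame (the values below are hypothetical and are not the Titanic data). On a CASTable, shape, head(), and describe() should behave in much the same way, with columninfo() as the CAS-side counterpart of dtypes:

```python
import pandas as pd

# Hypothetical four-row sample; the real table has many more rows and columns.
sample = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, None, 35.0, 22.0],
})

shape = sample.shape                          # (rows, columns)
column_info = sample.dtypes                   # type of each column
first_three = sample.head(3)                  # first three records
description = sample.describe(include="all")  # summary statistics per feature

print(shape)
print(column_info)
print(first_three)
print(description)
```

The count row of the description already hints at missing entries: age has one fewer value than the other columns.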

Take a moment to think about this data set. If you were just using a combination of pandas and scikit-learn, you would need a good amount of data preprocessing. There are missing entries, character features that need to be converted to numeric, and data that needs to be distributed correctly for efficient and accurate processing.

Luckily, when SWAT is integrated with CAS, CAS does a lot of this work for you. CAS machine learning modeling can easily import character variables, impute missing values, normalize the data, partition the data, and much more.

The next step is to take a closer look at some of the data features.


Step 5. Explore the data locally

There is great information here from CAS about your 14 features. You know the data types, means, unique values, standard deviation, and others. Now, bring the data back locally into essentially a pandas data frame and create some graphs on what you believe might be variables to predict on.

# Use the SWAT to_frame() Method to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

With data loaded locally, examine the numerical distributions:
# How Is the Data Distributed? (Numerical)
distribution_plot = titanic_pandas_df.drop('survived', axis=1).hist(bins = 15, figsize = (12,12), alpha = 0.75)

The pclass variable represents the passenger class (first class, second class, third class). Does the passenger class have any effect on the survival rate? To look more into this, you can plot histograms that compare pclass, age, and the number who survived.

# For Seaborn Facet Grids, Create an Empty 3 by 2 Graph to Place Data On
pclass_survived = sns.FacetGrid(titanic_pandas_df, col='survived', row = 'pclass', height = 2.5, aspect = 3)

# Overlay a Histogram of Age on Each Panel
pclass_survived.map(plt.hist, 'age', alpha = 0.75, bins = 25)

# Add a Legend for Readability
pclass_survived.add_legend()
Note: 1 = survived, 0 = did not survive


As this graph suggests, the higher the class, the better the chance that someone survived. There is also a low survival rate for the approximately 18–35 age range for the lower class. This information is great, because you can build new features later that focus on pclass.

Another common predictor feature is someone’s sex. Were women and children saved first when the Titanic sank? The following code creates graphs to help answer this question:

# Create a Graph Canvas - One for Female Survival Rate - One for Male
Survived = 'Survived'
Not_Survived = 'Not Survived'
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,4))

# Initialize Women and Male Variables to the Data Set Value
Women = titanic_pandas_df[titanic_pandas_df['sex'] == 'female']
Male = titanic_pandas_df[titanic_pandas_df['sex'] == 'male']

# For the First Graph, Plot the Amount of Women Who Survived Dependent on Their Age
Female_vs_Male = sns.distplot(Women[Women['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[0], kde = False)

# For the First Graph, Layer the Amount of Women Who Did Not Survive Dependent on Their Age
Female_vs_Male = sns.distplot(Women[Women['survived']==0].age.dropna(), 
               bins=25, label = Not_Survived, ax = axes[0], kde = False)

# Display a Legend for the First Graph
Female_vs_Male.legend()

# For the Second Graph, Plot the Amount of Men Who Survived Dependent on Their Age
Female_vs_Male = sns.distplot(Male[Male['survived']==1].age.dropna(), 
               bins=25, label = Survived, ax = axes[1], kde = False)

# For the Second Graph, Layer the Amount of Men Who Did Not Survive Dependent on Their Age
Female_vs_Male = sns.distplot(Male[Male['survived']==0].age.dropna(), 
                bins=25, label = Not_Survived, ax = axes[1], kde = False)

# Display a Legend for the Second Graph
Female_vs_Male.legend()

This graph confirms that both women and children had a better chance at survival.

These feature graphs are nice visuals, but a correlation matrix is another great way to compare numerical features and the correlation to survival rate:

# Heatmap Correlation Matrix That Compares Numerical Values and Survived
Heatmap_Matrix = sns.heatmap(
              titanic_pandas_df.select_dtypes(include = 'number').corr(),
              annot = True,
              fmt = ".3f",
              cmap = "coolwarm",
              center = 0,
              linewidths = 0.1)
The heat map shows that the fare, surprisingly, has a significant correlation with survival rate (or seems to, at least). Keep this in mind when you build the models.

You now have great knowledge of the features in this data set. Age, fare, and sex do affect someone’s survival chances. For a general machine learning problem, you would typically explore each feature in more detail, but, for now, it is time to move on to some feature engineering in CAS.


Step 6. Check for missing values

Now that you have a general idea about which features are important, you should clean up any missing values quickly by using CAS. By default, CAS replaces any missing values with the mean, but there are many other modes to choose from.

For this test case, you can keep the default because the data set is quite small. Use the CAS impute action to perform a data matrix (variable) imputation that fills in missing values.

First, check to see how many missing values are in the data set:

# Check for Missing Values
titanic_cas.nmiss()

Both fare and age are important numeric variables that have missing values, so run the impute action with SWAT to fill in any missing values:

# Impute Missing Values (Replace with Substituted Values [By Default w/ the Mean])
conn.loadactionset('dataPreprocess')
conn.dataPreprocess.impute(
     table            = 'titanic',
     inputs          = ['age','fare'],
     copyAllVars = True,
     casOut        = dict(name = 'titanic', replace = True)
)

And, just like that, CAS takes care of the missing numeric values and creates a new variable for each input, named with an IMP_ prefix (here, IMP_age and IMP_fare). This action would have taken more time to do with pandas or scikit-learn, so CAS was a nice time saver.
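To make the effect of mean imputation concrete, here is a rough local equivalent in pandas. It is only a sketch under stated assumptions: the helper name impute_with_mean and the toy values are hypothetical, and it imitates the IMP_ naming rather than reproducing the CAS action itself:

```python
import pandas as pd

def impute_with_mean(df, columns, prefix="IMP_"):
    """Return a copy of df with a mean-imputed copy of each listed column."""
    result = df.copy()
    for column in columns:
        # Missing entries are replaced by the mean of the non-missing values.
        result[prefix + column] = result[column].fillna(result[column].mean())
    return result

# Hypothetical toy data with one missing age and one missing fare.
toy = pd.DataFrame({"age": [20.0, None, 40.0], "fare": [10.0, 30.0, None]})
imputed = impute_with_mean(toy, ["age", "fare"])
print(imputed)
```

The original age and fare columns are left untouched, which mirrors how the imputed values land in separate IMP_ variables.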

Step 7. Load the data locally to create new features

You now have great, clean data that you can model in CAS. Sometimes, though, for a machine learning problem, you want to create your own custom features.

It is easy to do that by using a data frame locally, so bring the data back to your local machine to create some custom features.

Using the SWAT to_frame() method, convert the CAS data set into a local data frame. Keep only the variables needed for modeling:

# Use the SWAT to_frame() Method to Bring the CAS Table Locally into a Data Frame
titanic_pandas_df = titanic_cas.to_frame()

# Remove Some Features That Are Not Needed for Predictive Modeling
titanic_pandas_df = titanic_pandas_df[['embarked', 'parch', 'sex', 'pclass',
                                       'sibsp', 'survived', 'IMP_fare', 'IMP_age']]

After the predictive features are available locally, confirm that the CAS statistical imputation worked:
# Check How Many Values Are Null by Using the isnull() Function
total_missing = titanic_pandas_df.isnull().sum().sort_values(ascending=False)

# Find the Total Values
total = titanic_pandas_df.notnull().sum().sort_values(ascending=False)

# Find the Percentage of Missing Values per Variable
Percent = titanic_pandas_df.isnull().sum()/titanic_pandas_df.isnull().count()*100

# Round to One Decimal Place for Readability
Percent_Rounded = (round(Percent,1)).sort_values(ascending=False)

# Plot the Missing Data [Total Missing, Percentage Missing] with a Concatenation of Two Columns
Missing_Data = pd.concat([total, total_missing, Percent_Rounded], axis = 1,
                                            keys=['Non Missing Values', 'Total Missing Values', '% Missing'], sort=True)

As you can see from the output above, all features are clean, and no values are missing! Now you can create some new features.

Step 8. Create new features

With machine learning, there are times when you want to create your own features that combine useful information to create a more accurate model. Doing so can help reduce overfitting, lower memory usage, and more.

This demo shows how to build four new features:

  • Relatives
  • Alone_On_Ship
  • Age_Times_Class
  • Fare_Per_Person


Relatives and Alone_On_Ship

The sibsp feature is the number of siblings and spouses, and the parch variable is the number of parents and children. So, you can combine these two for a Relatives feature that indicates how many relatives a passenger had on the ship in total. If a passenger traveled completely alone, you can flag that by creating the categorical variable Alone_On_Ship:

# Create a Relatives Variable / Alone on Ship
data = [titanic_pandas_df]
for dataset in data:
     dataset['Relatives'] = dataset['sibsp'] + dataset['parch']
     dataset.loc[dataset['Relatives'] > 0, 'Alone_On_Ship'] = 0
     dataset.loc[dataset['Relatives'] == 0, 'Alone_On_Ship'] = 1
     dataset['Alone_On_Ship'] = dataset['Alone_On_Ship'].astype(int)


As discussed earlier in this demo, both age and class had an effect on survivability. So, create a new Age_Times_Class feature that combines a person’s age and class:

data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Age_Times_Class
for dataset in data:
     dataset['Age_Times_Class']= dataset['IMP_age'] * dataset['pclass']


The Fare_Per_Person variable is created by dividing the IMP_fare variable (cleaned by CAS) by the Relatives variable plus 1, which accounts for the passenger:

# Wrap the Data Frame in a List So That the Loop Below Can Iterate over It
data = [titanic_pandas_df]

# For Loop That Creates a New Variable, Fare_Per_Person
for dataset in data:
     dataset['Fare_Per_Person'] = dataset['IMP_fare']/(dataset['Relatives']+1)
     dataset['Sib_Div_Spouse'] = dataset['sibsp']
     dataset['Parents_Div_Children'] = dataset['parch']

# Drop the Parent Variable
titanic_pandas_df = titanic_pandas_df.drop(['parch'], axis=1)

# Drop the Siblings Variable
titanic_pandas_df = titanic_pandas_df.drop(['sibsp'], axis=1)
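
The same four features can also be written without explicit loops. The following is a minimal vectorized sketch, assuming the column names used in this demo; the helper name add_custom_features and the two toy rows are hypothetical:

```python
import pandas as pd

def add_custom_features(df):
    """Return a copy of df with the four custom features added."""
    out = df.copy()
    out["Relatives"] = out["sibsp"] + out["parch"]
    # 1 when the passenger has no relatives on board, otherwise 0.
    out["Alone_On_Ship"] = (out["Relatives"] == 0).astype(int)
    out["Age_Times_Class"] = out["IMP_age"] * out["pclass"]
    # Adding 1 to Relatives counts the passenger as well.
    out["Fare_Per_Person"] = out["IMP_fare"] / (out["Relatives"] + 1)
    return out

# Hypothetical toy rows: one passenger with three relatives, one traveling alone.
toy = pd.DataFrame({
    "sibsp": [1, 0],
    "parch": [2, 0],
    "pclass": [1, 3],
    "IMP_age": [30.0, 20.0],
    "IMP_fare": [120.0, 8.0],
})
features = add_custom_features(toy)
print(features)
```

Working on a copy keeps the input frame unchanged, which makes the step easy to rerun in a notebook.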

With these new features that you created with Python, here is how the data set looks:
# Look at How the Data Is Distributed

Step 9. Load the data back into CAS for model training

Now you can make some models! Load the clean data back into CAS and start training:

# Upload the Data Set
titanic_cas = conn.upload_frame(titanic_pandas_df, casout = dict(name='titanic', replace=True))

These code examples display the type of object that holds each table:
# The Data Frame Type of the CAS Data
type(titanic_cas)

# The Data Frame Type of the Local Data
type(titanic_pandas_df)

As you can see, the Titanic CAS table is a CASTable object, and the local table is a SASDataFrame, which is a subclass of the pandas DataFrame.

Here is a final look at the data that you are going to train:

If you were training a model using scikit-learn, it would not really achieve great results without more preparation. Most of the values are in float format, and there are a few categorical variables. Luckily, CAS can handle these issues when it builds the models.


Step 10. Create a testing and training set

One of the awesome things about CAS machine learning is that you do not have to manually separate the data set. Instead, you can run a partitioning function and then model and test based on these parameters. To do this, you need to load the Sampling action set. Then, you can call the srs action, which can quickly partition a data table.

# Load the Sampling Action Set
conn.loadactionset('sampling')

# Partitioning the Data
conn.sampling.srs(
     table = 'titanic',
     samppct = 80,
     partind = True,
     output = dict(casout = dict(name = 'titanic', replace = True), copyVars = 'ALL')
)

This code partitions the data and adds a partition indicator column to each data row to show whether the row belongs to the testing set or the training set. The indicator column is _PartInd_. When a data row has this indicator equal to 0, it is part of the testing set. Similarly, when it is equal to 1, the data row is part of the training set.
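
For intuition, here is a local sketch of what a simple random 80/20 partition with an indicator column looks like. It is an illustration in NumPy and pandas, not the srs action itself; the helper name add_partition_indicator, the seed, and the toy frame are assumptions:

```python
import numpy as np
import pandas as pd

def add_partition_indicator(df, train_pct=80, seed=42):
    """Return a copy of df with _PartInd_ = 1 for training rows, 0 for testing rows."""
    out = df.copy()
    n_train = int(round(len(out) * train_pct / 100))
    rng = np.random.default_rng(seed)
    # Draw row positions for the training set without replacement.
    train_positions = rng.choice(len(out), size=n_train, replace=False)
    indicator = np.zeros(len(out), dtype=int)
    indicator[train_positions] = 1
    out["_PartInd_"] = indicator
    return out

# Hypothetical ten-row table.
toy = pd.DataFrame({"passenger": range(10)})
partitioned = add_partition_indicator(toy)
train = partitioned[partitioned["_PartInd_"] == 1]
test = partitioned[partitioned["_PartInd_"] == 0]
print(len(train), len(test))
```

Filtering on the indicator column is the local analogue of a WHERE clause on _PartInd_.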


Step 11. Build various models with CAS

One of my favorite parts about machine learning with CAS is how simple building a model is. With looping, you can dynamically change the targets, inputs, and nominal variables. If you are trying to build an extremely accurate model, it would be a great solution.

Building a model with CAS requires you to do a few things:

  • Load the action set (in this demo, a forest model, decision tree, and gradient boosting model)
  • Set your model variables (targets, inputs, and nominals)
  • Train the model
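
The looping idea mentioned above can be sketched without a live connection by preparing the keyword arguments for each training action up front. This is only an illustration: the helper name build_train_requests is hypothetical, and the action names assume the forestTrain, dtreeTrain, and gbtreeTrain actions of the decisionTree action set:

```python
def build_train_requests(target, inputs, nominals, table="titanic"):
    """Map each training action name to the keyword arguments it would receive."""
    model_tables = {
        "forestTrain": "titanic_forest_model",
        "dtreeTrain": "titanic_decisiontree_model",
        "gbtreeTrain": "titanic_gradient_model",
    }
    requests = {}
    for action, model_table in model_tables.items():
        requests[action] = dict(
            # Train only on rows flagged as training data.
            table=dict(name=table, where="_PartInd_ = 1"),
            target=target,
            inputs=inputs,
            nominals=nominals,
            casOut=dict(name=model_table, replace=True),
        )
    return requests

requests = build_train_requests(
    target="survived",
    inputs=["sex", "pclass"],
    nominals=["sex", "pclass", "survived"],
)
for action, params in requests.items():
    print(action, "->", params["casOut"]["name"])
```

With an open connection, each entry could then be submitted with something like getattr(conn.decisionTree, action)(**params), so that changing the target or inputs in one place updates all three models.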


Forest Model

What does the data look like with a forest model? Here is the relevant code:

# Load the decisionTree CAS Action Set
conn.loadactionset('decisionTree')

# Set Our Target for Predictive Modeling
target = 'survived'

# Set Inputs to Use to Predict Survived (Numerical Variable Inputs)
inputs = ['sex', 'pclass', 'Alone_On_Ship', 
                'Age_Times_Class', 'Relatives', 'IMP_age', 
                'IMP_fare', 'Fare_Per_Person', 'embarked']

# Set Nominal Variables to Use in Model (Categorical Variable Inputs)
nominals = ['sex', 'pclass', 'Alone_On_Ship', 'embarked', 'survived']

# Train the Forest Model
conn.decisionTree.forestTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_forest_model', replace = True)
)

Why does the input table have a WHERE clause? That is because you are looking at rows that contain the training flag, which was created with the srs action.

After running that block of code, you also get a response from CAS detailing how the model was trained, including great parameters like Number of Trees, Confidence Level for Pruning, and Max Number of Tree Nodes. If you wanted to do hyper-parameter tuning, this output shows you what the model currently looks like before you even adjust how it executes.

Decision Tree Model

The code to train a decision tree model is similar to the forest model example:

# Train the Decision Tree Model
conn.decisionTree.dtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_decisiontree_model', replace = True)
)

Gradient Boosting Model

Lastly, you can build a gradient boosting model in this way:

# Train the Gradient Boosting Model
conn.decisionTree.gbtreeTrain(
     table = dict(name = 'titanic', where = '_PartInd_ = 1'),
     target = target,
     inputs = inputs,
     nominals = nominals,
     casOut = dict(name = 'titanic_gradient_model', replace = True)
)

Step 12. Score the models

How do you score the models with CAS? CAS has a score function for each model that is built. This function generates a new table that contains how the model performed on each data input.

Here is how to score the three models:

titanic_forest_score = conn.decisionTree.forestScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_forest_model',
     casout = dict(name='titanic_forest_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_decisiontree_score = conn.decisionTree.dtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_decisiontree_model',
     casout = dict(name='titanic_decisiontree_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

titanic_gradient_score = conn.decisionTree.gbtreeScore(
     table = dict(name = 'titanic', where = '_PartInd_ = 0'),
     model = 'titanic_gradient_model',
     casout = dict(name='titanic_gradient_score', replace = True),
     copyVars = target,
     encodename = True,
     assessonerow = True
)

When the scoring function runs, it creates the new variables P_survived1, which is the predicted probability that a passenger survived, and P_survived0, which is the predicted probability that the passenger did not survive. With this scoring function, you can see how accurately a model classified passengers in the testing set.

If you dive deeper into the Python object that this scoring function is set equal to, you can actually see the misclassification rate!

For example, examine how the forest model did by running this code:

# Display the Scoring Results for the Forest Model
titanic_forest_score


The scoring function read in the entire testing set and reported the misclassification error. By calculating [1 – Misclassification Error], you can see that the model was approximately 85% accurate. For barely exploring the data and testing and training on a small data set, this score is good. These scores can be misleading though, as they do not tell the entire story. Do you have more false positives or false negatives? When it comes to predicting human survival, those counts are important to investigate.
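
The relationship between misclassification error, accuracy, and the two kinds of mistakes can be sketched locally. The helper name confusion_counts, the 0.5 cutoff, and the five toy rows below are assumptions for illustration and are not output from the models in this article:

```python
import pandas as pd

def confusion_counts(actual, predicted_probability, cutoff=0.5):
    """Count correct and incorrect classifications at a probability cutoff."""
    actual = pd.Series(actual).astype(int)
    predicted = (pd.Series(predicted_probability) >= cutoff).astype(int)
    counts = {
        "true_positive": int(((actual == 1) & (predicted == 1)).sum()),
        "true_negative": int(((actual == 0) & (predicted == 0)).sum()),
        "false_positive": int(((actual == 0) & (predicted == 1)).sum()),
        "false_negative": int(((actual == 1) & (predicted == 0)).sum()),
    }
    errors = counts["false_positive"] + counts["false_negative"]
    counts["misclassification"] = errors / len(actual)
    counts["accuracy"] = 1 - counts["misclassification"]
    return counts

# Hypothetical actual outcomes and predicted survival probabilities.
actual = [1, 0, 1, 0, 1]
p_survived1 = [0.9, 0.2, 0.4, 0.7, 0.8]
result = confusion_counts(actual, p_survived1)
print(result)
```

Two models with the same accuracy can split their errors very differently between false positives and false negatives, which is why the assessment that follows looks beyond a single score.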

To analyze that, load the percentile CAS action set. This action set provides actions for calculating percentiles and box plot values. In your case, it also assesses models. With this information, CAS has an assess function to determine a final assessment of how the model did.

# Load the percentile CAS Action Set
conn.loadactionset('percentile')

prediction = 'P_survived1'

titanic_forest_assessed = conn.percentile.assess(
     table = 'titanic_forest_score',
     inputs = prediction,
     casout = dict(name = 'titanic_forest_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_decisiontree_assessed = conn.percentile.assess(
     table = 'titanic_decisiontree_score',
     inputs = prediction,
     casout = dict(name = 'titanic_decisiontree_assessed', replace = True),
     response = target,
     event = '1'
)

titanic_gradient_assessed = conn.percentile.assess(
     table = 'titanic_gradient_score',
     inputs = prediction,
     casout = dict(name = 'titanic_gradient_assessed', replace = True),
     response = target,
     event = '1'
)

This CAS action returns three types of assessments: lift-related assessments, ROC-related assessments, and concordance statistics. Python is great at graphing data, so now you can move the data locally and see how it did with the new assessments.


Step 13. Analyze the results locally

You can plot the receiver operating characteristic (ROC) curve and the cumulative lift to determine how the models performed. Using the ROC curve, you can then calculate the area under the ROC curve (AUC) to see overall how well the models predicted the survival rate.

What exactly is a ROC curve or lift?

  • A ROC curve is determined by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is the proportion of actual positive observations that were correctly predicted to be positive. The false positive rate is the proportion of actual negative observations that were incorrectly predicted to be positive.
  • A lift chart is derived from a gains chart. The X axis acts as a percentile, but the Y axis is the ratio of the gains value of our model and the gains value of a model that is choosing passengers randomly. That is, it details how many times the model is better than the random choice of cases.
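
Both definitions can be worked through on a handful of rows. The sketch below uses hypothetical helper names (roc_point, cumulative_lift) and ten made-up passengers; it is meant to illustrate the arithmetic, not to reproduce the CAS assess output:

```python
import numpy as np

def roc_point(actual, score, cutoff):
    """Return (false positive rate, true positive rate) at one cutoff."""
    actual = np.asarray(actual)
    predicted = np.asarray(score) >= cutoff
    tpr = predicted[actual == 1].mean()  # share of positives caught
    fpr = predicted[actual == 0].mean()  # share of negatives flagged
    return float(fpr), float(tpr)

def cumulative_lift(actual, score, depth):
    """Event rate in the top `depth` fraction of ranked cases over the overall rate."""
    actual = np.asarray(actual)
    order = np.argsort(-np.asarray(score), kind="stable")
    n_top = max(1, int(round(len(actual) * depth)))
    top_rate = actual[order][:n_top].mean()
    return float(top_rate / actual.mean())

# Hypothetical outcomes (1 = survived) and model scores for ten passengers.
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

fpr, tpr = roc_point(actual, score, cutoff=0.5)
lift_top_20 = cumulative_lift(actual, score, depth=0.2)
lift_all = cumulative_lift(actual, score, depth=1.0)
print(fpr, tpr, lift_top_20, lift_all)
```

At full depth the lift is always 1, because selecting everyone is no better than random selection.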

Before you can plot these tables locally, you need to create a connection to them. CAS created some new assessed tables, so create a connection to these CAS tables for analysis:

# Assess Forest
titanic_assess_ROC_Forest = conn.CASTable('titanic_forest_assessed_ROC')
titanic_assess_Lift_Forest = conn.CASTable('titanic_forest_assessed')

titanic_ROC_pandas_Forest = titanic_assess_ROC_Forest.to_frame()
titanic_Lift_pandas_Forest = titanic_assess_Lift_Forest.to_frame()

# Assess Decision Tree
titanic_assess_ROC_DT = conn.CASTable('titanic_decisiontree_assessed_ROC')
titanic_assess_Lift_DT = conn.CASTable('titanic_decisiontree_assessed')

titanic_ROC_pandas_DT = titanic_assess_ROC_DT.to_frame()
titanic_Lift_pandas_DT = titanic_assess_Lift_DT.to_frame()

# Assess GB
titanic_assess_ROC_gb = conn.CASTable('titanic_gradient_assessed_ROC')
titanic_assess_Lift_gb = conn.CASTable('titanic_gradient_assessed')

titanic_ROC_pandas_gb = titanic_assess_ROC_gb.to_frame()
titanic_Lift_pandas_gb = titanic_assess_Lift_gb.to_frame()

Now that there is a connection to these tables, you can use the Matplotlib library to plot the ROC curve. Plot each model on this graph to see which model performed the best:
# Plot ROC Locally
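plt.figure(figsize = (10,10))
plt.plot(1 - titanic_ROC_pandas_Forest['_Specificity_'],
             titanic_ROC_pandas_Forest['_Sensitivity_'], 'bo-', linewidth = 3)
plt.plot(1 - titanic_ROC_pandas_DT['_Specificity_'],
            titanic_ROC_pandas_DT['_Sensitivity_'], 'ro-', linewidth = 3)
plt.plot(1 - titanic_ROC_pandas_gb['_Specificity_'],
             titanic_ROC_pandas_gb['_Sensitivity_'], 'go-', linewidth = 3)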
plt.plot(pd.Series(range(0,11,1))/10, pd.Series(range(0,11,1))/10, 'k--')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])

You can also take the depth and cumulative lift scores from the assessed data set and plot that information:

# Plot Lift Locally
plt.figure(figsize = (10,10))
plt.plot(titanic_Lift_pandas_Forest['_Depth_'], titanic_Lift_pandas_Forest['_CumLift_'], 'bo-', linewidth = 3)
plt.plot(titanic_Lift_pandas_DT['_Depth_'], titanic_Lift_pandas_DT['_CumLift_'], 'ro-', linewidth = 3)
plt.plot(titanic_Lift_pandas_gb['_Depth_'], titanic_Lift_pandas_gb['_CumLift_'], 'go-', linewidth = 3)
plt.xlabel('Depth')
plt.ylabel('Cumulative Lift')
plt.title('Cumulative Lift Curve')
plt.legend(['Forest', 'Decision Tree', 'Gradient Boosting'])

Although these curves work well for exploring the model success and perhaps how you can better tune the data, typically most people just want to see overall how well the model performed. To get a general idea, you can integrate the ROC curve to get this overview. This area is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
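
The claim about ranking probability can be checked numerically on a small example. The following sketch builds ROC points by hand, integrates them with the trapezoid rule, and compares the area with the pairwise ranking probability. The helper names and the ten toy rows are hypothetical, and the trapezoid rule is written out explicitly rather than relying on a particular NumPy function:

```python
import numpy as np

def roc_curve_points(actual, score):
    """Return FPR and TPR arrays, one point per distinct cutoff."""
    actual = np.asarray(actual)
    score = np.asarray(score, dtype=float)
    # Start above every score so that the curve begins at (0, 0).
    cutoffs = np.concatenate(([np.inf], np.sort(np.unique(score))[::-1]))
    n_pos = (actual == 1).sum()
    n_neg = (actual == 0).sum()
    tpr = np.array([((score >= c) & (actual == 1)).sum() / n_pos for c in cutoffs])
    fpr = np.array([((score >= c) & (actual == 0)).sum() / n_neg for c in cutoffs])
    return fpr, tpr

def trapezoid_area(x, y):
    """Integrate y over x with the trapezoid rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def ranking_probability(actual, score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    actual = np.asarray(actual)
    score = np.asarray(score, dtype=float)
    pos = score[actual == 1]
    neg = score[actual == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical outcomes and scores for ten passengers.
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

fpr, tpr = roc_curve_points(actual, score)
auc = trapezoid_area(fpr, tpr)
rank_prob = ranking_probability(actual, score)
print(auc, rank_prob)
```

Both numbers agree on this toy data, which is the property that makes the area under the ROC curve a convenient single-number summary.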

# Forest Scores (1-D Arrays So That trapz Returns a Single Number)
x_forest = np.array(titanic_ROC_pandas_Forest['_Specificity_'])
y_forest  = np.array(titanic_ROC_pandas_Forest['_Sensitivity_'])

# Decision Tree Scores
x_dt = np.array(titanic_ROC_pandas_DT['_Specificity_'])
y_dt  = np.array(titanic_ROC_pandas_DT['_Sensitivity_'])

# GB Scores
x_gb = np.array(titanic_ROC_pandas_gb['_Specificity_'])
y_gb  = np.array(titanic_ROC_pandas_gb['_Sensitivity_'])

# Calculate Area Under Curve (Integrate)
area_forest = trapz(y_forest ,x_forest)
area_dt = trapz(y_dt ,x_dt)
area_gb = trapz(y_gb ,x_gb)

# Table For Model Scores
Model_Results = pd.DataFrame({
'Model': ['Forest', 'Decision Tree', 'Gradient Boosting'],
'Score': [area_forest, area_dt, area_gb]})


With the AUC ROC score, you can now see how well the model performs at distinguishing between positive and negative outcomes.



Any machine learning engineer should take time to further investigate integrating SAS Viya with their normal programming environment. When it works through the SWAT Python interface, CAS excels at quickly building and scoring a model. You can do this even with large data sets, because the data is stored in CAS memory. If you want to go further in-depth with using ensemble methods, I would recommend using SAS® Model Studio on SAS Viya, or perhaps one of the many great open-source libraries, like scikit-learn on Python.

The ability of CAS to quickly clean, model, and score a prediction is quite impressive. If you would like to take a further look at what SWAT and CAS can do, check out the different action sets that are available.

If you would like some more information about SWAT and SAS® Viya®, see these resources:

Would you like to see SWAT machine learning work with a larger data set, or perhaps use SWAT to build a neural network? Please leave a comment!


About Author

Kris Stobbe

Technical Support Engineer

Kris Stobbe is a Technical Support Engineer with a strong passion for data analytics, complex problem solving, and service-oriented work. In addition to his SAS experience, Kris is also an expert in areas such as Machine Learning, Git, Python, and APIs.
