Fitting a Support Vector Machine (SVM) Model - Learn how to fit a support vector machine model and use your model to score new data
In Part 6, Part 7, Part 9, Part 10, and Part 11 of this series, we fit a logistic regression, decision tree, random forest, gradient boosting and neural network model to the Home Equity data we saved in Part 4. In this post we will fit a support vector machine (SVM) model to the same data to predict who is likely to go delinquent on their home equity loan and we will score data with this model.
What is a Support Vector Machine (SVM)?
Support Vector Machines (SVM) are supervised machine learning models used for classification and regression tasks. They work by finding a boundary, known as a hyperplane, that best separates data points into different classes. In a classification context, the SVM identifies the hyperplane that maximizes the margin between the closest data points from each class, known as support vectors. This maximized margin helps the model make predictions with greater confidence and accuracy.
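For a linear SVM, the fitted hyperplane is described by a weight vector w and a bias b, and a new point is classified by the sign of the decision function f(x) = w·x + b; the margin the SVM maximizes has width 2/||w||. Here is a minimal plain-Python sketch of those two ideas, using made-up weights (this is illustrative only, not the SWAT API):

```python
import math

def decision_function(w, b, x):
    """f(x) = w . x + b; the sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Geometric width of the margin between the two classes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -1.0                     # hypothetical fitted weights and bias
print(decision_function(w, b, [1.0, 1.0]))  # 6.0 -> positive side of the hyperplane
print(margin_width(w))                      # 2 / 5 = 0.4
```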
SVMs are also effective when the data is not linearly separable: kernel functions implicitly map the data into a higher-dimensional space where a separating hyperplane can be found. This ability to handle both linear and non-linear relationships makes SVMs a versatile and powerful tool for predictive modeling tasks in a variety of domains.
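The kernel trick works because a kernel function computes an inner product in that higher-dimensional space without ever building the space explicitly. A minimal sketch of the degree-2 polynomial kernel (the same kernel family we specify for svmTrain later in this post; the coef0 constant is an illustrative choice):

```python
def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel K(x, z) = (x . z + coef0) ** degree.

    Equivalent to an inner product in a higher-dimensional feature
    space, computed without constructing that space.
    """
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (dot + coef0) ** degree

# Two points in 2-D input space: dot product is 4, so K = (4 + 1)^2
print(polynomial_kernel([1.0, 2.0], [3.0, 0.5]))  # 25.0
```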
What is the Support Vector Machine (SVM) Action Set?
The Support Vector Machine Action Set in SAS Viya offers an action for building support vector machine models.
Load the Modeling Data into Memory
Let’s start by loading the data we saved in Part 4 into CAS memory. I will load the sashdat file for my example; the csv and parquet files can be loaded using similar syntax.
conn.loadTable(path="homeequity_final.sashdat", caslib="casuser",
               casout={'name':'HomeEquity', 'caslib':'casuser', 'replace':True})
The home equity data is now loaded and ready for modeling.
Fit a Support Vector Machine
The svm action set needs to be loaded before we can fit a support vector machine model.
conn.loadActionSet(actionSet="svm")
The svm action set contains only one action for fitting support vector machine models. Let’s check the name of this action.
conn.help(actionSet='svm')
The action name is svmTrain. Let’s fit a support vector machine model using this action on the HomeEquity training data (i.e., rows where _PartInd_ = 1) and save the model to an astore file named svm_astore. Unlike the modeling types discussed in previous blog posts, the svmTrain action does not have an option to save the model to a model data file.
Also, specify the nominal columns in a list and set the kernel to polynomial with a degree of 2.
conn.svm.svmTrain(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 1'),
    target = "BAD",
    inputs = ['LOAN','IMP_REASON','IMP_JOB','REGION','IMP_CLAGE',
              'IMP_CLNO','IMP_DEBTINC','IMP_DELINQ','IMP_DEROG',
              'IMP_MORTDUE','IMP_NINQ','IMP_VALUE','IMP_YOJ'],
    nominals = ['BAD','IMP_REASON','IMP_JOB','REGION'],
    kernel = "POLYNOMIAL",
    degree = 2,
    id = {"BAD", "_partind_"},
    saveState = dict(name = 'svm_astore', replace = True)
)
The output shows that the inner product of weights is 88.6, the bias is -0.94, and there are 1,499 support vectors, 1,386 of which lie on the margin.
The Misclassification Matrix table in the output shows that of the 4,172 training observations, 3,340 are classified as good (value = 0) and 832 as bad (value = 1). The number of correctly predicted good observations is 3,308, and the number of correctly predicted bad observations is 572, giving an accuracy of (3,308 + 572) / 4,172, or about 93.0%.
As with the gradient boosting and neural network models, we can score using the astore file. Using the head method, we can see that the data in the astore is a binary representation of the model.
conn.CASTable('svm_astore').head()
Before scoring the validation data using the aStore file in SAS Viya, we need to load the aStore action set.
conn.loadActionSet('aStore')
The actions in the aStore action set are listed below:
conn.builtins.help(actionSet='aStore')
Using the score action in the aStore action set, score the validation data from the home equity data file. Use _PartInd_ = 0 to identify the validation rows from the data file. Create a new data set with the scored values named svm_astore_scored.
conn.aStore.score(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 0'),
    rstore = "svm_astore",
    out = dict(name = "svm_astore_scored", replace = True),
    copyVars = 'BAD'
)
The new data svm_astore_scored has 1788 rows and 7 columns.
Use the head method to view the first five rows of the svm_astore_scored file.
conn.CASTable('svm_astore_scored').head()
In this data file, you'll find seven columns, each serving a distinct purpose:
- _PartInd_: The partition indicator; a value of 0 means the row belongs to the validation data.
- BAD: This column represents our target variable, acting as the label or dependent variable (Y).
- _P_: The value of the decision function; if it is less than or equal to zero, the prediction is an event; otherwise, it is a nonevent.
- P_BAD1: The predicted probability of an individual becoming delinquent on a home equity loan.
- P_BAD0: The complement of P_BAD1 (1 - P_BAD1), representing the predicted probability of an individual not becoming delinquent on a home equity loan.
- I_BAD: Here, you'll find the predicted classification value derived from the model.
- _WARN_: Indicates why the model could not be applied, for example because of missing values or invalid input values.
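The relationships among _P_, P_BAD1, P_BAD0, and I_BAD described above can be sketched in plain Python. The 0.5 cutoff and the sign convention follow the column descriptions; the numeric values here are made up for illustration, and the probability calibration itself is internal to the action:

```python
def event_from_decision(decision_value):
    """_P_ column: a decision-function value <= 0 predicts the event (BAD = 1)."""
    return 1 if decision_value <= 0 else 0

def classify_from_probability(p_bad1, cutoff=0.5):
    """I_BAD column: classify from P_BAD1 at a probability cutoff.

    P_BAD0 is the complement of P_BAD1.
    Returns (I_BAD, P_BAD0).
    """
    p_bad0 = 1.0 - p_bad1
    i_bad = 1 if p_bad1 >= cutoff else 0
    return i_bad, p_bad0

print(event_from_decision(-0.8))        # 1 (event: predicted delinquent)
print(classify_from_probability(0.75))  # (1, 0.25)
print(classify_from_probability(0.25))  # (0, 0.75)
```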
Assess Model
We’ll evaluate the fit of the support vector machine model using the same metrics as we used with logistic regression, decision tree, random forest, gradient boosting, and neural network models: confusion matrix, misclassification rates, and ROC plot. These metrics will help us gauge the model’s performance and allow us to compare it to other models. To compute these metrics, we’ll use the percentile action set along with the assess action. Start by loading the percentile action set and displaying the available actions.
conn.loadActionSet('percentile')
conn.builtins.help(actionSet='percentile')
Now use the scored data set svm_astore_scored to create two data sets named svm_assess and svm_assess_ROC. We can use these data sets to create the graphs and metrics for the support vector machine model.
conn.percentile.assess(
    table = "svm_astore_scored",
    inputs = 'P_BAD1',
    casout = dict(name = "svm_assess", replace = True),
    response = 'BAD',
    event = "1"
)
We'll now use the fetch action to list the first five rows of the svm_assess data set for inspection. You can see the values at each depth of the data, starting at 5% and increasing in increments of 5%.
conn.table.fetch(table='svm_assess', to=5)
Using the fetch action again, take a look at the first five rows of the svm_assess_ROC data. Here, if we selected a cutoff of 0.03, the predicted true positives for our validation data would be 357 and the true negatives 0. Typically, the default cutoff value is 0.5, or 50%.
conn.table.fetch(table='svm_assess_ROC', to=5)
Let’s bring the output data tables to the client as data frames to calculate the confusion matrix, misclassification rate, and ROC plot for the support vector machine model.
svm_assess = conn.CASTable(name = "svm_assess").to_frame()
svm_assess_ROC = conn.CASTable(name = "svm_assess_ROC").to_frame()
Confusion Matrix
The confusion matrix is calculated using the columns generated in the svm_assess_ROC output file.
This matrix compares the predicted values to the actual values, offering a detailed breakdown of true positives, true negatives, false positives, and false negatives. These metrics provide critical insights into the model’s ability to accurately predict the target outcome.
For this analysis, a cutoff value of 0.5 will be used. If the predicted probability is 0.5 or higher, the SVM model predicts delinquency on a home equity loan; if it is below 0.5, the model predicts no delinquency. This threshold lets us evaluate the model's predictive performance systematically.
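Under the hood, the counts in the ROC output come from comparing each predicted probability to the cutoff and tallying the four outcomes. A plain-Python sketch of that tally, with a tiny made-up sample rather than the Home Equity data (the dictionary keys mirror the svm_assess_ROC column names):

```python
def confusion_counts(p_event, actual, cutoff=0.5):
    """Tally TP, FP, FN, TN for event predictions at a given cutoff.

    p_event: predicted probabilities of the event (BAD = 1)
    actual:  observed 0/1 labels
    """
    tp = fp = fn = tn = 0
    for p, y in zip(p_event, actual):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return {'_TP_': tp, '_FP_': fp, '_FN_': fn, '_TN_': tn}

# Illustrative sample only
probs  = [0.9, 0.6, 0.4, 0.2, 0.7]
labels = [1,   0,   1,   0,   1]
print(confusion_counts(probs, labels))
# {'_TP_': 2, '_FP_': 1, '_FN_': 1, '_TN_': 1}
```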
cutoff_index = round(svm_assess_ROC['_Cutoff_'],2)==0.5
conf_mat = svm_assess_ROC[cutoff_index].reset_index(drop=True)
conf_mat[['_TP_','_FP_','_FN_','_TN_']]
At the 0.5 cutoff value, the true positives are 107 and the true negatives are 1,409.
Misclassification Rate
Next, calculate the misclassification rate using the columns generated for the confusion matrix. The misclassification rate shows how frequently the model makes incorrect predictions.
conf_mat['Misclassification'] = 1-conf_mat['_ACC_']
miss = conf_mat[round(conf_mat['_Cutoff_'],2)==0.5][['Misclassification']]
miss
The misclassification rate is 15.2%, which is slightly lower than the 15.7% we calculated for the neural network model in the previous post.
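The rate itself is just the share of off-diagonal cells in the confusion matrix, i.e. 1 minus accuracy. A quick sketch: the TP and TN counts below come from the confusion matrix above and the total of 1,788 validation rows, which pins down FP + FN at 272; the individual FP and FN values shown are an illustrative split, not figures from the output.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Fraction of predictions that are wrong: (FP + FN) / total."""
    total = tp + fp + fn + tn
    return (fp + fn) / total

# TP = 107 and TN = 1409 are from the confusion matrix above;
# the FP/FN split (90/182) is hypothetical, only FP + FN = 272 is known.
print(misclassification_rate(tp=107, fp=90, fn=182, tn=1409))  # ~0.152
```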
ROC Plot
Let’s calculate and graph our final assessment metric, a ROC (Receiver Operating Characteristic) chart, which illustrates the trade-off between sensitivity and specificity. This chart provides a comprehensive overview of the support vector machine model's performance, with a curve closer to the top left corner representing a more effective model.
To generate the ROC curve, use the matplotlib package in Python.
import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
plt.plot(svm_assess_ROC['_FPR_'], svm_assess_ROC['_Sensitivity_'],
         label='SVM (C=%0.2f)' % svm_assess_ROC['_C_'].mean())
plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
plt.legend(loc='lower right', fontsize=15)
plt.show()
The curve generated by the support vector model does not fit the data as well as the gradient boosting model, which achieved a more effective fit with an AUC of 0.96. While the support vector curve is still above the diagonal, it is farther from the top left corner, indicating that there is significant room for improvement.
The AUC (Area Under the Curve) value for the support vector machine model is 0.81. This suggests the model performs better than a random classifier, which would have an AUC of 0.5. However, it falls short of the ideal scenario of an AUC of 1, which would represent a perfect model. Compared to the gradient boosting model, the support vector machine model’s performance highlights the importance of selecting the right algorithm for the dataset and problem at hand.
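For intuition, the AUC is simply the area under the (FPR, TPR) curve, which can be computed with the trapezoidal rule. A minimal sketch, assuming the points are sorted by increasing FPR (the assess action returns these in the _FPR_ and _Sensitivity_ columns):

```python
def trapezoidal_auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule.

    fpr, tpr: sequences of points sorted by increasing fpr,
    typically including the (0, 0) and (1, 1) endpoints.
    """
    auc = 0.0
    for i in range(1, len(fpr)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return auc

# A perfect classifier's curve hugs the top-left corner: AUC = 1
print(trapezoidal_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# The 45-degree diagonal (random guessing): AUC = 0.5
print(trapezoidal_auc([0.0, 1.0], [0.0, 1.0]))            # 0.5
```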
The Wrap-Up: Fitting a Support Vector Machine Model
In this post, we demonstrated how to fit a support vector machine (SVM) model using the SWAT package in SAS Viya to predict the likelihood of delinquency on a home equity loan. By leveraging the SVM action set, we successfully trained the model, scored the validation dataset, and assessed its performance using key metrics such as the confusion matrix, misclassification rate, and ROC plot. These steps provided a comprehensive evaluation of the SVM model's effectiveness.
While the SVM model achieved an AUC of 0.81, indicating it performs better than random guessing, it did not outperform the gradient boosting model we previously evaluated, which had an AUC of 0.96. This highlights an important takeaway: the choice of algorithm significantly impacts predictive performance. Each algorithm, from logistic regression to neural networks to SVMs, has strengths and weaknesses depending on the dataset and the problem being solved. In upcoming posts, we’ll continue exploring how to refine predictive modeling techniques and make data-driven decisions with SAS Viya.