The post Explaining complex models in SAS® Viya® with programmatic interpretability appeared first on The SAS Data Science Blog.
In the preceding two posts, we looked at issues around the interpretability of modern black-box machine learning models and introduced techniques available in SAS® Model Studio within SAS® Visual Data Mining and Machine Learning. Now we turn our attention to programmatic interpretability using SAS® Cloud Analytic Services directly.
The actions in the explainModel action set can be used to explain a model whose score code is saved as SAS DATA step code or as a SAS analytic store. The action set includes three actions: partialDependence, which computes partial dependence (PD) and individual conditional expectation (ICE) tables; linearExplainer, which produces LIME explanations; and shapleyExplainer, which computes Shapley value explanations.
For implementation details about these techniques and user options, see the chapter “Explain Model Action Set” in SAS Visual Data Mining and Machine Learning 8.5: Programming Guide.
Diagnosing illness is a frequent and difficult task for doctors in the healthcare industry. An incorrect diagnosis can have severe consequences. To help doctors, many studies apply machine learning by training a model on historical clinical cases. Unfortunately, without adequate interpretations, many of the most complex machine learning models cannot be used in some diagnostic procedures.
This section uses a healthcare application to demonstrate how to access interpretability through a programmatic interface. The full SAS Viya code for this example can be found on GitHub. We will use the interpretability actions available in SAS Viya to train a model that predicts the malignancy of potential breast cancer biopsies and to explain the predictions of this model.
Data for this section is from the University of Wisconsin Hospitals in Madison, WI, from Dr. William H. Wolberg, as described by Mangasarian and Wolberg (1990). The data contain nine observed quantities that are calculated by a physician upon the collection of a fine needle aspirate (FNA) from a potentially malignant area in a patient and are labeled according to their malignancy.
The data contains nine input variables and a target variable. The input variables are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. Each of these variables ranges from 1 to 10, with larger numbers generally indicating a greater likelihood of malignancy according to the examining physician. The target variable indicates whether the region was malignant or benign. Not one of these input variables by itself is enough to determine whether the region is malignant, so they must be aggregated in some way. A random forest model is trained to predict sample malignancy from these variables.
The data contains 699 observations. The bare nuclei variable contains a few instances of missing data, which were imputed to have value 0. The data was partitioned into a training set (70%) and a test set (30%). The following code trains a random forest model on the training data by using the decisionTree action set in SAS Visual Data Mining and Machine Learning:
proc cas;
   inputs = &inputs;
   decisionTree.forestTrain result = forest_res /
      table     = "BREAST_CANCER_TRAIN"
      target    = "class"
      inputs    = inputs
      oob       = True
      nTree     = 500
      maxLevel  = 12
      prune     = True
      seed      = 1234
      varImp    = True
      casOut    = {name = "FOREST_MODEL_TABLE", replace = True}
      savestate = {name = "FOREST_MODEL", replace = True};
run;
quit;
The macro variable &inputs defines a list that contains the names of the input variables (except the ID variable). Following training, a cutoff of 0.4 was selected to optimize the model’s misclassification rate on the training set. This cutoff was then used to assess the model on the test set, where it achieved a misclassification rate of 2.91%. This accuracy is competitive with other black-box models that have been trained and published on the same data.
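The cutoff-selection step can be illustrated with a small sketch. The following Python code is an illustrative stand-in with made-up probabilities and labels, not the SAS code used for the blog's model: it scans candidate cutoffs and keeps the one with the lowest misclassification rate.

```python
# Illustrative sketch (not the SAS implementation): choose the probability
# cutoff that minimizes the misclassification rate on a labeled sample.
# The probabilities and labels below are made up for demonstration.

def best_cutoff(probs, labels, step=0.01):
    """Scan cutoffs in (0, 1) and return the one with the lowest
    misclassification rate, along with that rate."""
    best_c, best_err = 0.5, 1.0
    c = step
    while c < 1.0:
        preds = [1 if p >= c else 0 for p in probs]
        err = sum(p != y for p, y in zip(preds, labels)) / len(labels)
        if err < best_err:
            best_c, best_err = c, err
        c = round(c + step, 10)
    return best_c, best_err

probs  = [0.95, 0.80, 0.45, 0.42, 0.30, 0.10, 0.05, 0.60]
labels = [1,    1,    1,    0,    0,    0,    0,    1]
cutoff, err = best_cutoff(probs, labels)
print(cutoff, err)
```

In practice the same scan would be run over the training-set predictions from the forest model, and the chosen cutoff would then be applied unchanged to the test set.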
Although the predictive accuracy is good, the forest model remains difficult to interpret because of its complexity. This complexity can render the model entirely useless in clinical settings, where interpretability is paramount for a patient’s well-being and the physician’s confidence.
Table 1 contains the model-specific variable importance of the input variables as calculated by the forestTrain action.
Table 1 shows that the five most important variables for the model’s predictions are the uniformity of cell size and shape, bare nuclei, bland chromatin, and clump thickness. As you can see, this table does not tell you the direct effect of these variables in the model; it only says which variables were important in the model’s construction. However, you can use this information to decide which variables to investigate further through other interpretability techniques such as PD and ICE plots.
The preceding information establishes that the forest model is highly predictive on unseen data and that not all the input variables are equally important. One remaining question is how each of the input variables affects the prediction of the model. The partial dependence plots for a particular variable illustrate how the model’s predictions change when that variable’s value fluctuates with all other variables being held constant.
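Conceptually, a PD curve is simple to compute: fix the variable of interest at each grid value in every observation, score the modified data, and average the predictions. The following Python sketch illustrates the idea with a toy stand-in scoring function; it is not the SAS implementation, and the variable names and data are made up.

```python
import math

# Illustrative sketch of a partial dependence (PD) computation. The
# scoring function below is a toy stand-in for any trained model; in the
# blog example it would be the forest astore scored through CAS.

def score(row):
    """Toy stand-in model: a weighted sum squashed to (0, 1)."""
    z = 0.5 * row["cell_size_uniformity"] + 0.3 * row["bare_nuclei"] - 4.0
    return 1.0 / (1.0 + math.exp(-z))

def partial_dependence(data, var, grid):
    """For each grid value, set `var` to that value in every observation
    and average the model's predictions."""
    curve = []
    for v in grid:
        preds = [score({**row, var: v}) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve

data = [{"cell_size_uniformity": 2, "bare_nuclei": 1},
        {"cell_size_uniformity": 5, "bare_nuclei": 7},
        {"cell_size_uniformity": 9, "bare_nuclei": 10}]
pd_curve = partial_dependence(data, "cell_size_uniformity", grid=range(1, 11))
# The toy model is increasing in this variable, so the PD curve rises.
print([round(p, 3) for p in pd_curve])
```

The partialDependence action performs this replicate-and-average computation at scale, which is why the iceTable parameter in the call below stores the replicated observations.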
The following CAS action call requests the partial dependence plot for the most important variable (cell size uniformity) of the variable importance table:
proc cas;
   /* Inputs and Nominals macro -> CASL Var */
   inputs = &inputs;

   /* Action Call */
   explainModel.partialDependence result = pd_res /
      table            = "BREAST_CANCER"
      inputs           = inputs
      modelTable       = "&model"
      modelTableType   = "ASTORE"
      predictedTarget  = "&pred_target"
      analysisVariable = {name = "&var_name", nBins = 50}
      iceTable         = {casout = {name = "ICE_TABLE", replace = True}, copyVars = "ALL"}
      seed             = 1234;
   run;

   /* Save PD Results */
   saveresult pd_res["PartialDependence"] dataset = PD_RES;
run;
quit;
Again, the &inputs macro variable is used to specify the list of input variables to the partialDependence action. The modelTable, modelTableType, and predictedTarget parameters specify the model being explained and its expected output column. The table and iceTable parameters indicate which data sets to use for the PD calculation and for storing replicated observations, respectively. The variable being explained is specified in the analysisVariable parameter, which contains other subparameters for slightly altering the produced PD calculation. For more information, see the chapter “Explain Model Action Set” in SAS Visual Data Mining and Machine Learning 8.5: Programming Guide.
Figure 1 shows the partial dependence of the forest’s predictions with respect to the cell size uniformity variable.
Figure 1 shows that the model’s predictions change greatly between the extremes of the cell size uniformity variable — for lower values of cell size uniformity, the model’s average prediction is lower than 0.3, and for larger values it reaches 0.6. The value of this variable in the input data can have a large effect on the prediction from the model, which is why its variable importance metric is so high. Figure 1 also shows that the model’s prediction with respect to cell size uniformity is monotonic, always increasing as the variable’s value increases, which is as expected based on the description of the data. This plot builds trust in the model by demonstrating that it is not behaving in an unexpected way.
Typically, the partial dependencies of variables are shown on individual plots. However, since all the input variables in this problem lie on the same scale and have similar impact on the target (larger values indicate increased chance of malignancy), the partial dependencies with respect to each variable can be overlaid as shown in Figure 2.
Figure 2 shows that the input variables are being used exactly as defined by the data description; the mean prediction of the model increases as the value of each of the input variables increases. Also depicted is the relative influence of each variable on the model’s prediction, with cell size uniformity, cell shape uniformity, bare nuclei, clump thickness, and bland chromatin all showing a large range in their mean predictions. These large ranges correspond to their high variable importance values and build further trust in the model. The mitoses variable’s partial dependence plot is almost perfectly flat, which would indicate that the variable contributes almost nothing to the prediction of the model.
The partial dependence plots are useful for understanding the effect of input variables across all observations. Sometimes, however, the role of a variable for an individual observation can differ considerably from its role in the overall population. For example, for a particular patient, it would be useful to determine which variables in an observation contribute most to the prediction of malignancy so that the patient can be further convinced of the need for a biopsy, a costly but necessary follow-up procedure. A reason such as “The thickness of the clumps in the FNA procedure leads us to believe that we should proceed with a biopsy” is more convincing to a patient than simply saying “Looking at your FNA, we think we should proceed with a biopsy.” Explanations like these can be generated by using the LIME and Shapley value methods.
The following code is used within a SAS macro to compute LIME coefficients:
explainModel.linearExplainer result = lex_res /
   table           = "BREAST_CANCER_TRAIN"
   query           = {name = "BREAST_CANCER_TRAIN", where = "sample_id = &observation;"}
   inputs          = inputs
   modelTable      = "FOREST_MODEL"
   modelTableType  = "ASTORE"
   predictedTarget = "P_classMALIGN"
   preset          = "LIME"
   explainer       = {standardizeEstimates = "INTERVALS", maxEffects = &num_vars+1}
   seed            = 1234;
run;
In the code, the preset parameter is used to select the LIME method for generating explanations. The table, modelTable, modelTableType, and predictedTarget parameters are used in the same way as in the previous partialDependence action call; they specify the data and model to use. The &observation macro variable specifies which observation’s prediction is being explained, and the &num_vars macro variable specifies how many input variables are reported by LIME. The standardizeEstimates parameter is set to INTERVALS, which tells the action to standardize the least squares estimates of the LIME coefficients so that their magnitudes can be compared.
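The idea behind LIME can be illustrated in a few lines. The Python sketch below is a toy, one-variable illustration, not SAS's implementation: it samples perturbations around a query point, weights them by proximity, and fits a weighted linear surrogate to a made-up black-box scoring function.

```python
import math
import random

# Conceptual LIME sketch (not SAS's implementation): sample perturbations
# around a query point, weight them by proximity, and fit a weighted
# linear surrogate to a black-box scoring function.

def blackbox(x):
    """Toy stand-in for the forest model: increasing in its input."""
    return 1.0 / (1.0 + math.exp(-(x - 5.0)))

def lime_slope(query, n=500, width=2.0, seed=1234):
    rng = random.Random(seed)
    xs = [query + rng.gauss(0, width) for _ in range(n)]
    ys = [blackbox(x) for x in xs]
    # Gaussian proximity kernel centered on the query point.
    ws = [math.exp(-((x - query) ** 2) / (2 * width ** 2)) for x in xs]
    # Weighted least squares for y ~ a + b*x (closed form for the slope).
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    b = (sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    return b

# The surrogate's slope is positive, matching the model's local behavior.
print(lime_slope(6.0) > 0)
```

The sign and magnitude of the surrogate coefficients are what the linearExplainer action reports; standardizing the estimates (as with standardizeEstimates = "INTERVALS") is what makes their magnitudes comparable across variables.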
The following code is used to compute the Shapley values for explaining an individual prediction:
explainModel.shapleyExplainer result = shx_res /
   table           = "BREAST_CANCER_TRAIN"
   query           = {name = "BREAST_CANCER_TRAIN", where = "sample_id = &observation;"}
   inputs          = inputs
   modelTable      = "FOREST_MODEL"
   modelTableType  = "ASTORE"
   predictedTarget = "P_classMALIGN";
run;
As in the linearExplainer action call, the &observation macro variable in the shapleyExplainer call is used to specify the observation to be explained. The shapleyExplainer action call reuses all previous parameters: table, query, inputs, modelTable, modelTableType, and predictedTarget. The values of the input variables for the specified observation are shown in Table 2. This observation was determined by the physician to be malignant. The input variables in this observation are large, which would correspond to a high likelihood of malignancy according to the attending physician. The forest model produces a predicted probability of 100% that this observation is malignant.
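Conceptually, a Shapley value averages a feature's marginal contribution over all coalitions of the other features. The following Python sketch is an illustrative brute-force computation on a toy additive model with made-up feature values, not the Kernel SHAP approximation that the shapleyExplainer action uses; it simply makes the definition concrete.

```python
import math
from itertools import combinations

# Brute-force Shapley sketch (illustrative only). Features absent from a
# coalition are filled in from a background observation.

def model(x):
    """Toy additive stand-in model."""
    return 0.5 * x[0] + 0.3 * x[1] + 0.2 * x[2]

def shapley_values(query, background):
    n = len(query)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Classic Shapley weight |S|! (n - |S| - 1)! / n!
                weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                          / math.factorial(n))
                with_i = [query[j] if (j in S or j == i) else background[j]
                          for j in range(n)]
                without_i = [query[j] if j in S else background[j]
                             for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

query, background = [8.0, 7.0, 10.0], [4.0, 4.0, 4.0]
phi = shapley_values(query, background)
print([round(p, 6) for p in phi])
```

Because exact enumeration is exponential in the number of features, practical implementations such as Kernel SHAP approximate this average with a weighted regression over sampled coalitions.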
Figure 3 shows the LIME coefficients for explaining the forest model for the observation that is shown in Table 2.
The LIME values for the observation are all positive, indicating that the model’s prediction increases as each variable value increases in the local region around this observation. This local explanation agrees with the global explanation that comes from the partial dependence plots, where the mean prediction increases along with each input variable.
Figure 4 shows the five largest Shapley values for explaining the prediction for the same observation that is shown in Table 2.
The Shapley values for this observation are all positive, indicating that the values of the input variables to the model in this observation increase the model’s prediction relative to other observations in the training data. This makes sense because the input variables in this observation are all high, taking values between 7 and 10, which would all indicate high likelihood for malignancy, and thus contribute positively to the model’s prediction. You can see that LIME and Shapley explanations mostly agree for explaining the pretrained forest model’s prediction for this observation.
Table 3 shows another sample from the data for which the model produced an incorrect prediction. Although the sample did prove to be malignant, the model predicts a likelihood of malignancy of 0.14, which is a large deviation from the truth. Most of the input variables in this observation take low values, meaning the investigating physician did not think any variable indicated a strong likelihood of malignancy. The LIME and Shapley values might provide insight into why the model produces this prediction.
Figure 5 shows the LIME coefficients for explaining the forest model’s false prediction.
The LIME coefficients offer little insight beyond what is already understood about the model—that its prediction increases with respect to each increasing input variable. Many of the variables’ values are small in this observation, and it would therefore be expected that the model’s prediction would also be small.
Figure 6 shows the Shapley values for the same observation that is shown in Table 3.
The Shapley values for the observation are more meaningful in the context of the incorrect prediction. It appears that the only variable that contributes positively to the prediction of malignancy is the bare nuclei variable (which takes an intermediate value of 5) and that all other input values cause the model’s prediction to decrease with respect to the other observations. It seems that the model is split in determining whether this observation is malignant. This can be caused by the model giving too much weight to some variables or there not being enough information in the input data to model this observation. Ultimately these results can identify a weakness in the modeling process or might indicate that this particular instance is really hard to diagnose on the basis of the available input variables. This information can be used to inform further data collection, feature engineering, and model tuning.
Model interpretability does not necessarily need to be confined to the end of a modeling process. Occasionally, interpretability results reveal information that leads to new feature engineering ideas, or they reveal that certain input variables are useless to the model and only contribute to the curse of dimensionality. For this data set, the partial dependence with respect to each input variable increases monotonically. The simple nature of the relationship between the input variables and the model’s prediction might lead you to think that a simpler model will perform just as well as the forest model for this problem. Furthermore, the mitoses variable seems to be effectively unused by the model according to both the variable importance table and the partial dependence plots, which means it can likely be dropped entirely from the input data.
With the preceding information in mind, a logistic regression model is trained on the same data, dropping the mitoses input. Backward selection is performed using the logistic action in the regression action set. Only the clump thickness, cell size uniformity, and bare nuclei variables remain after selection. Based on the training data, a cutoff of 0.19 is selected, which yields a misclassification rate of 4.37% on the test set, a mere 1.46 percentage point decrease in accuracy from the forest model. Figure 7 shows the partial dependence of each input variable with respect to the mean prediction from the logistic model.
The partial dependence curves of the input variables for the logistic regression model show a similar relationship to what they show in the forest model, with the mean prediction increasing as the variable values increase. However, as expected, the logistic curves are much smoother than those of the forest model.
Now you have two models of comparable accuracy, each with its drawbacks. The forest model demonstrates a higher accuracy than the regression model, but it is natively uninterpretable. The logistic regression enables you to directly use the regression coefficients to understand the model, but it has a slightly lower accuracy. Ultimately the best model to use is the one that maximizes prediction accuracy while meeting the necessary interpretability standard. If using the LIME, Shapley, and partial dependence values for interpretations provides meaningful explanations given the modeling context, then the forest model is better. If not, then the logistic regression should be chosen. Since this model will ultimately be consumed by a clinician who has worked closely with the patient and directly developed the input features, the forest model is likely a better choice, because the clinician is there to safeguard against model inaccuracies.
Read our quick guide: The Machine Learning Landscape
The post How to explain your complex models in SAS® Viya® appeared first on The SAS Data Science Blog.
A machine learning pipeline is a structured flow of analytical nodes, each of which performs a single data mining or predictive modeling task. Pipelines help automate the data workflows required for machine learning, simplifying steps that include variable selection, feature engineering, model comparison, and model training and deployment. Visibility into modeling pipelines can improve model reuse and model interpretability.
To build predictive modeling pipelines in SAS Visual Data Mining and Machine Learning, use the SAS Model Studio web-based visual interface. You can access Model Studio by selecting Build Models from the Analytics Life Cycle menu in SAS Drive, as shown in Figure 1.
When working in Model Studio, you construct pipelines by adding nodes through the point-and-click web interface. Figure 2 shows a simple Model Studio pipeline that performs missing data imputation, selects variables, constructs two logistic regression models and a decision tree model, and compares their predictive performances.
Building pipelines is considered a best practice for predictive modeling tasks because pipelines can be saved and shared with other SAS Visual Data Mining and Machine Learning users for training future machine learning models on similar data. In addition to many feature engineering capabilities, Model Studio also offers numerous ways to tune, assess, combine, and interpret models.
In Model Studio, model interpretability functionalities are provided as a post-training property of all supervised learning nodes. Changing post-training properties and retrieving interpretability results does not require retraining the machine learning model. Figure 3 shows the model interpretability properties for the Gradient Boosting supervised learning node.
Model interpretability results are presented in a user-friendly way that reduces the huge amount of information that might otherwise overwhelm users. Model Studio also includes text explanations, produced by natural language generation, that make the results easier to understand. These explanations enable users who are less experienced with these techniques to find meaningful insight into the relationships between the predictors and the target variable in black-box models. For more information about Model Studio, see SAS Visual Data Mining and Machine Learning: User's Guide.
Let's look at an example that demonstrates how to use SAS Model Studio to perform post hoc model interpretability, using financial data about home equity loans. We will build a model that determines whether loan applicants are likely to repay their loans. We will first explain how the model is built and then assess it using various interpretability techniques.
The data comes from the FICO xML Challenge and is contained in an anonymized data set of home equity line of credit (HELOC) applications made by real homeowners. A HELOC is a line of credit typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and any liens on the home). The customers in this data set have requested a credit line in the range of $5,000–$150,000.
The goal is to predict whether the applicants will repay their HELOC account within two years. The data set has 10,459 observations for a mix of 23 interval and nominal input variables, which include predictors such as the number of installment trades with balance, the number of months since the most recent delinquency, and the average number of months in the file. The target variable is RiskPerformance, a binary variable that takes the value Good or Bad. The value Bad indicates that a customer’s payment was at least 90 days past due at some point in the 24 months after the credit account was opened. The value Good indicates that payments were made without ever being more than 90 days overdue. The data are balanced with around 52% Bad observations.
This blog post focuses on explaining model interpretability, so it skips the data preprocessing and feature generation steps. All the input variables were taken as-is, except for the variables Max Delq/Public Records Last 12 Months and Max Delinquency Ever, which were converted to strings according to the FICO data dictionary. Also, an ID variable was created to identify individual applicants when performing local interpretability.
A Model Studio project (called heloc) is created using this data set. For more information about how to create a SAS Model Studio project, see the Getting Started with SAS Visual Data Mining and Machine Learning in Model Studio section in the SAS Visual Data Mining and Machine Learning: User's Guide.
The data are partitioned by the default values of 60% for training, 30% for validation, and 10% for test sets, as shown in Figure 4.
By clicking the Data tab, you can assign different roles to your input variables. Figure 5 shows that the newly created ID variable is assigned the role Key. This step is necessary if you want to specify individual predictions for local interpretability. Figure 5 also shows that the binary variable RiskPerformance is specified as the target variable.
When specifying the target variable, you can choose the event level of the target variable. Figure 6 shows that the event level is specified as Bad. This means the predicted probabilities of the trained models represent the probabilities of a customer making a late payment.
To train a gradient boosting model in Model Studio, you simply need to connect the Gradient Boosting supervised learning node to the Data node. For this example, the Gradient Boosting node runs with its default settings without any hyperparameter tuning. By default, the validation set is used for early stopping to decide when to stop training boosted trees. Figure 7 shows the fit statistics of the black-box gradient boosting model.
Figure 7 shows that the model’s misclassification rate on the test set is 27.8%. Figure 8 shows the corresponding event classification plot, where the larger portion of the model’s misclassification events are good applications that are predicted as bad.
To improve prediction accuracy, you can perform a hyperparameter search for your gradient boosting model by turning on the Perform Autotuning property, which is available in all the supervised machine learning nodes in Model Studio. To learn more about automated hyperparameter tuning functionality in SAS Viya, see Koch et al. (2018).
This section shows how you can request and view global interpretability plots.
Figure 9 shows the checkboxes you use to enable a node’s global interpretability methods (variable importance and partial dependence plots) in Model Studio. Note that because the model interpretability techniques covered here are post hoc, they are applied after the gradient boosting model is trained. This means that as long as you do not change any model training properties, changing a post-training property such as model interpretability does not require retraining the model.
The model variable importance table in Figure 10 shows the ranking of the importance of the features in the construction of the gradient boosting model.
In Model Studio, by default partial dependence plots are generated for the top five variables in the variable importance table. Figure 11 and Figure 12 show the partial dependency plots for the top two variables. In Figure 11, you can see that the predicted probability of payments being 90 days overdue decreases monotonically as the external risk estimate value increases. A text box to the right of the graph explains the graph by using natural language generation (NLG) in SAS Viya. All model interpretability plots have NLG text boxes. These explanations help clarify the graphs and are especially useful if you are not familiar with the graph type.
Figure 12 shows that the predicted probability of Bad payments decreases gradually as the applicant’s number of months in file increases from 50 months to 100 months. This is expected because applicants who have a longer credit history are deemed less risky. The heat map on the x-axis shows that not many observations have an average number of months in file greater than 100. After the number of months in file reaches 100, the probability of Bad payments first increases slightly and then flattens because the model has less information in this domain. Hence, you should be cautious in explaining the part of the plot where the population density is low.
Local interpretability helps you understand individual predictions. Figure 13 shows the Gradient Boosting node’s options for requesting local interpretability (ICE, LIME, and Kernel SHAP) for five applicants who are specified by their IDs. This identification variable should have unique values and must be specified to have the role Key (only one variable can have this role) on the Data tab.
Figure 14 shows the five ICE plots for the specified observations for the external risk estimate input variable. The change in the model’s prediction for each of these observations decreases as the external risk estimate increases, which matches the behavior that is seen in the partial dependence curve (shown in blue). Each observation is affected by the external risk estimate slightly differently. For observation 152, there is a steep decline in the model’s predicted probability of late payment when the external risk estimate is between 60 and 70, whereas for observation 4129, the decline is more gradual between 60 and 70 and is steeper after 70.
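Conceptually, an ICE curve is the per-observation analogue of a PD curve: instead of averaging over the data, the sweep is applied to a single observation. The following Python sketch illustrates the idea with a made-up scoring function, variable names, and values; it is not the Model Studio implementation.

```python
import math

# Illustrative ICE sketch: an ICE curve repeats the PD computation for a
# single observation instead of averaging. The scorer is a toy stand-in,
# not the Model Studio gradient boosting model.

def score(row):
    """Toy stand-in: risk falls as the external risk estimate rises."""
    z = -0.1 * row["external_risk_estimate"] + 0.02 * row["months_in_file"] + 5.0
    return 1.0 / (1.0 + math.exp(-z))

def ice_curve(row, var, grid):
    """Model predictions for one observation as `var` sweeps the grid."""
    return [score({**row, var: v}) for v in grid]

obs = {"external_risk_estimate": 62, "months_in_file": 80}
curve = ice_curve(obs, "external_risk_estimate", grid=range(50, 91, 10))
# Predictions fall as the external risk estimate increases.
print([round(p, 3) for p in curve])
```

Averaging such curves over every observation in the data recovers the PD curve, which is why the ICE curves in Figure 14 scatter around the blue partial dependence line.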
Figure 15 shows the LIME explanation of the prediction of the black-box gradient boosting model for instance 8351. The gradient boosting model predicts this instance as a high-risk HELOC application with a predicted probability of 0.965.
When LIME is implemented, a local model is fit after converting each feature into a binary feature according to its proximity to the explained observation. Therefore, the coefficients in the LIME explanation represent the impact of the observed feature values of the instance. The feature values of instance 8351 are shown in Table 1.
The LIME explanation for instance 8351 shows that Number Trades 60+ Ever=1 and Max Delinquency Ever=Never Delinquent decrease the risk of default, whereas all other predictors increase the risk of default.
Figure 16 shows the Kernel SHAP explanation of the prediction of the black-box gradient boosting model for instance 4129. The ground truth for this instance is Good, but the model outputs a high probability (0.904) of predicting it as Bad.
The Kernel SHAP explanation in Figure 16 and the feature values in Table 1 show that the features that contribute most toward increasing the high risk of late payments are Percent Trades Never Delinquent=67, Number Satisfactory Trades=5, and External Risk Estimate=62. Even though the model has such high confidence in its prediction, the same confidence is not seen in the top five Kernel SHAP explanations, which can serve as a warning sign for this false prediction.
Read our quick guide: The Machine Learning Landscape
The post Monotonic Constraints with SAS appeared first on The SAS Data Science Blog.
A monotonic relationship exists when a model’s output increases or stays constant as the model’s inputs increase. A relationship can be monotonically increasing or decreasing, with the distinction based on the directions in which the input and output move. A common example is in credit risk, where you would expect someone’s risk score to increase with the amount of debt they have relative to their income.
This makes sense to most people and has the benefit of making the relationship actionable: If you’d like a loan, reduce your debt. Imagine a nonmonotonic relationship and informing some individuals to reduce their debt, while informing others to increase their debt in order to get a loan. As we can see, monotonicity can support common sense as well as fairness.
Monotonic constraints are restrictions that force models to preserve monotonic relationships between a model’s inputs and its output. In the prior example, it’s a way of telling the model that the final result must show that any increase in a person’s debt-to-income ratio results in a nondecreasing change in the risk prediction.
SAS released its implementation of monotonic constraints for its gradient boosting procedure in SAS Visual Data Mining and Machine Learning 8.5. When this option is set, the algorithm is constrained at each local split while growing a tree to enforce the monotonic relationship. The option specifies a restriction on the relationship of the input variable with the prediction function for the target event level. Therefore, if you want a decreasing relationship between a model input and the output, you add the "monotonic=decreasing" option to your INPUT statement.
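As a rough illustration of the general mechanism (an assumption for illustration, not SAS's exact algorithm), the following Python sketch shows how a single candidate split might be screened against a monotonic constraint by comparing the child nodes' mean responses.

```python
# Illustrative sketch (an assumption about the general idea, not SAS's
# exact split algorithm): under a monotonic increasing constraint, a
# candidate split is only accepted if the left child's mean response
# does not exceed the right child's.

def candidate_split(x, y, threshold, direction="increasing"):
    """Return (left_mean, right_mean) if the split respects the
    constraint, or None if it violates it."""
    left  = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return None
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    ok = lm <= rm if direction == "increasing" else lm >= rm
    return (lm, rm) if ok else None

x = [1, 2, 3, 4, 5, 6]
y = [0.1, 0.2, 0.1, 0.6, 0.7, 0.9]
print(candidate_split(x, y, 3))                # accepted under "increasing"
print(candidate_split(x, y, 3, "decreasing"))  # violates the constraint
```

Applying a screen like this at every local split is what guarantees that the final ensemble's predictions never move against the requested direction for that input.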
As is often the case with machine learning models, it can be hard to determine the relationship between individual variables and the prediction function from the model itself. One of the easiest ways to validate these relationships is with partial dependence (PD) plots and individual conditional expectation (ICE) plots. For more background on these plots, I highly recommend reading the detailed blog post on model interpretability. Because ICE plots operate at the observation level, compared to the model average in PD plots, they could be considered more effective in validating a monotonic relationship. However, I chose to display the PD plots in this blog post for simplicity.
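One simple programmatic complement to eyeballing the plots is to check the extracted PD or ICE values directly. The following Python sketch, with made-up curve values, tests whether a curve ever moves against the expected direction.

```python
# Small helper (illustrative) for validating a monotonic relationship from
# PD or ICE output: check that the curve never moves against the expected
# direction. The sample curves below are made-up values.

def is_monotonic(values, direction="increasing"):
    pairs = list(zip(values, values[1:]))
    if direction == "increasing":
        return all(b >= a for a, b in pairs)
    return all(b <= a for a, b in pairs)

pd_debtinc = [0.10, 0.12, 0.12, 0.20, 0.35]   # never decreases
pd_clage   = [0.40, 0.35, 0.36, 0.30, 0.22]   # rises at the third point

print(is_monotonic(pd_debtinc, "increasing"))  # True
print(is_monotonic(pd_clage, "decreasing"))    # False
```

A check like this can be run over every interval input's PD (or each ICE curve) to flag exactly the kind of local violations highlighted in red in the plots below.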
For this walkthrough, I will use the ever-popular HMEQ data set from SAS, which is based on HELOC loan data, in a Model Studio project. I will focus on two variables that often have monotonic relationships with the probability of default: Debt-to-Income Ratio (DEBTINC – increasing relationship) and age of oldest trade line in months (CLAGE – decreasing relationship). First, we will run a standard gradient boosting model with all the default settings. We will add the Model Interpretability options to produce PD plots that depict the relationship between the input variables and the average prediction of the model. Of note for this particular data set is that the target variable "BAD" is coded as 0 and 1, with our desired predicted event level being 1, representing a bad loan. Because SAS gradient boosting defaults its event level to the first value for nominal targets (0 in our case), we will need to reverse the relationships to match our desired state.
The PD plots above show that there are clear increasing and decreasing relationships between the two variables and the predicted probability of loan default. However, we would not categorize these relationships as monotonic, as there are several points in each plot that conflict with the overall direction, as highlighted with the red outline. Now, we will add the monotonic constraints to see how this impacts the models.
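A quick programmatic check can complement the visual inspection of a PD curve. The sketch below is my own helper (not part of the Model Studio output), with made-up PD values for illustration; it reports every adjacent pair of grid points that moves against the expected direction:

```python
def monotone_violations(pd_values, direction="increasing"):
    """Return index pairs where adjacent PD points move against `direction`."""
    bad = []
    for i in range(len(pd_values) - 1):
        diff = pd_values[i + 1] - pd_values[i]
        if (direction == "increasing" and diff < 0) or \
           (direction == "decreasing" and diff > 0):
            bad.append((i, i + 1))
    return bad

# Illustrative (made-up) PD values for DEBTINC: mostly rising, one dip
pd_debtinc = [0.10, 0.12, 0.11, 0.18, 0.25]
print(monotone_violations(pd_debtinc, "increasing"))  # [(1, 2)]
```

An empty result means the curve is monotonic in the stated direction.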
We will first run the models with a monotonic decreasing constraint. While this option is not available in Model Studio today, we can always use a SAS Code node to run the gradient boosting model with the desired option. The code below can be dropped into a SAS Code node and run in any Model Studio pipeline, as it uses the standard macros created by the software.
/* SAS code */
/* Run Gradient Boosting using gradboost procedure */
proc gradboost data=&dm_data
   earlystop(tolerance=0 stagnation=5 minimum=NO metric=LOGLOSS)
   binmethod=QUANTILE maxbranch=2 assignmissing=USEINSEARCH minuseinsearch=1
   ntrees=100 learningrate=0.1 samplingrate=0.5 lasso=0 ridge=1
   maxdepth=4 numBin=50 minleafsize=5 seed=12345;
   %if &dm_num_interval_input %then %do;
      input %dm_interval_input / level=interval monotonic=decreasing;
   %end;
   %if &dm_num_class_input %then %do;
      input %dm_class_input / level=nominal;
   %end;
   %if "&dm_dec_level" = "INTERVAL" %then %do;
      target %dm_dec_target / level=interval;
   %end;
   %else %do;
      target %dm_dec_target / level=nominal;
   %end;
   &dm_partition_statement;
   ods output VariableImportance = &dm_lib..VarImp
              Fitstatistics = &dm_data_outfit;
   savestate rstore=&dm_data_rstore;
run;

/* Add reports to node results */
%dmcas_report(dataset=VarImp, reportType=Table,
              description=%nrbquote(Variable Importance));
%dmcas_report(dataset=VarImp, reportType=BarChart, category=Variable,
              response=RelativeImportance,
              description=%nrbquote(Relative Importance Plot));
This code will develop a gradient boosting model and force a monotonic decreasing constraint on all interval variables. The PD plot below now shows the relationship between the input variables and the predicted target as only increasing or constant throughout the model. However, since I've applied that constraint to all interval variables, there is a less intuitive relationship with CLAGE.
For demonstrative purposes, let’s see what happens when we tweak the code above and change the monotonic constraint from decreasing to increasing. Upon inspection of the resulting PD plots, the CLAGE plot is now constrained optimally to show our desired relationship, but the DEBTINC plot shows an undesirable relationship.
Like most real-world problems, our data set contains a mix of input variables with varying desired relationships to the target variable. Luckily for us, SAS allows you to use multiple input statements to reflect the desired relationships in your data. The code below will enforce the monotonically increasing relationship with DEBTINC and the monotonically decreasing relationship with CLAGE, and it will not force any constraints on the remaining variables in the model.
/* SAS code */
/* Run Gradient Boosting using gradboost procedure */
proc gradboost data=&dm_data
   earlystop(tolerance=0 stagnation=5 minimum=NO metric=LOGLOSS)
   binmethod=QUANTILE maxbranch=2 assignmissing=USEINSEARCH minuseinsearch=1
   ntrees=100 learningrate=0.1 samplingrate=0.5 lasso=0 ridge=1
   maxdepth=4 numBin=50 minleafsize=5 seed=12345;
   input Mortdue value clno yoj / level=interval;
   input clage / level=interval monotonic=increasing;
   input debtinc / level=interval monotonic=decreasing;
   %if &dm_num_class_input %then %do;
      input %dm_class_input / level=nominal;
   %end;
   %if "&dm_dec_level" = "INTERVAL" %then %do;
      target %dm_dec_target / level=interval;
   %end;
   %else %do;
      target %dm_dec_target / level=nominal;
   %end;
   &dm_partition_statement;
   ods output VariableImportance = &dm_lib..VarImp
              Fitstatistics = &dm_data_outfit;
   savestate rstore=&dm_data_rstore;
run;

/* Add reports to node results */
%dmcas_report(dataset=VarImp, reportType=Table,
              description=%nrbquote(Variable Importance));
%dmcas_report(dataset=VarImp, reportType=BarChart, category=Variable,
              response=RelativeImportance,
              description=%nrbquote(Relative Importance Plot));
What was the result? Just as Goldilocks would say about Baby Bear’s bed, this model appears just right. In fact, if we examine the Model Comparison, our custom monotonic model has the highest AUC. This can often be attributed to unconstrained models overfitting the data; the monotonic constraints in our custom model actually helped it generalize better in this instance, which is not uncommon in real-world applications. Even so, organizations are often willing to sacrifice some model accuracy for a model that makes business sense and is easier to gain regulatory approval for.
For those working in regulated industries such as financial services, I highly recommend giving gradient boosting models with monotonic constraints a shot to see if you can develop a more accurate model while adhering to regulatory guidelines.
I would like to thank Guixian Lin, Katherine Taylor, and Andrew Christian for the feedback and guidance on this post.
The post Monotonic Constraints with SAS appeared first on The SAS Data Science Blog.
The post Portfolio Optimization using SAS and Python appeared first on The SAS Data Science Blog.
Analytics can be categorized into four levels: descriptive, diagnostic, predictive, and prescriptive. Descriptive analytics explore what happened, diagnostic analytics examine why something happened, and predictive analytics anticipate what will happen. Today, we will focus on the final level, prescriptive analytics, which tries to find the best possible outcome. Specifically, we will focus on using optimization to select a stock portfolio that maximizes returns while taking risk tolerance into account.
Before we can solve our optimization problem, we must first set it up. An optimization problem consists of several pieces, including the objective function, the decision variables, and the constraints. The objective function is our goal and takes the form of a minimization or maximization; our goal is to maximize expected return. Decision variables are the factors we have control over to change the value of the objective function. In our example, the decision variables are the proportions of the portfolio allocated to each candidate stock. Finally, constraints are bounds on the solution based on what is possible. For our problem, we cannot hold a negative proportion of any stock, we must invest exactly the money we have (no more, no less), and we cannot exceed our risk threshold.
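In symbols, the problem described above is a quadratic program. Writing \(x_i\) for the proportion allocated to asset \(i\), \(r_i\) for its expected return, \(\Sigma\) for the covariance matrix of returns, and \(t\) for the risk threshold (the notation is mine; the post introduces these quantities in prose), the formulation is:

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i} r_i x_i && \text{(maximize expected return)}\\
\text{s.t.} \quad & \sum_{i} x_i = 1 && \text{(invest all of, and only, the available money)}\\
& x^{\top} \Sigma\, x \le t && \text{(portfolio variance within the risk threshold)}\\
& x_i \ge 0 \;\; \forall i && \text{(no negative holdings)}
\end{aligned}
```

Each line of this formulation corresponds to one of the sasoptpy statements that follow.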
As I was studying for the SAS Forecasting and Optimization Specialist certification exam, the sasoptpy open-source package was released. This is a Python package that provides a modeling interface for SAS Viya optimization solvers. It supports linear problems (LP), mixed-integer linear problems (MILP), nonlinear problems (NLP), and quadratic problems (QP). To solidify my studies, I took the portfolio optimization problem and translated it into Python using sasoptpy in this Jupyter Notebook.
I started by declaring my parameters and sets, including my risk threshold, my stock portfolio, the expected return of my stock portfolio, and the covariance matrix, estimated using the shrinkage estimator of Ledoit and Wolf (2003). I will use these pieces of information in my objective function and constraints. Now I will need SWAT, sasoptpy, and my optimization model object.
import swat
import sasoptpy as so

conn = swat.CAS('localhost', 5570, authinfo='~/.authinfo', caslib="CASUSER")
m = so.Model(name='portfolio_opt', session=conn)
Next, I declared my decision variables, which are the proportions each stock will hold in my optimal portfolio. When declaring my decision variables, I added a lower bound of zero to ensure that I will have nonnegative proportion values.
proportion = m.add_variables(ASSETS, name='proportion', lb=0)
Then I declared my constraints. My first constraint ensures that the proportion of stocks in my portfolio sum to one. This means I will invest all of my money, but not more.
money_con = m.add_constraint( (so.expr_sum(proportion[i] for i in ASSETS) == 1), name='money_con') 
I used the covariance matrix to constrain my risk, as measured by variance, such that it was smaller than my risk threshold.
risk_con = m.add_constraint( (so.expr_sum(COVAR.at[i,j]*proportion[i]*proportion[j] for i in ASSETS for j in ASSETS) <= RISK_THRESHOLD), name='risk_con') 
Finally, I created my objective function to maximize my expected value.
total_return = m.set_objective(so.expr_sum(RETURNS['EXPECTED_RETURNS'][i]*proportion[i] for i in ASSETS), sense=so.MAX, name='total_return') 
Now, solving the optimization problem once it is set up is a piece of cake. SAS will automatically decide which optimization algorithm to use based on my problem.
sol = m.solve() 
And just like that, an optimal solution was found using the Interior Point Direct algorithm, with a potential expected return of 10.14 for the given stock proportions.
The full problem setup, additional information, completed code and results are available in my Jupyter Notebook!
We used sasoptpy to set up an optimization problem to select our stock portfolio for us. This oversimplified example should not be used as investment advice, and it is only one use case for sasoptpy. Other really cool use cases to explore are more efficiently allocating political campaign funds, optimizing shipping logistics, solving the Stigler diet problem, and building the best sports teams.
If you are interested in learning more about optimization, check out the Optimization Concepts for Data Science and Artificial Intelligence course. It's a great course to take, especially if you are hoping to grab your SAS Professional Certification in Artificial Intelligence and Machine Learning!
The post Build your ML web application using SAS AutoML appeared first on The SAS Data Science Blog.
The image above shows two separate applications. Decisioning in Action (right) shows a customer who is applying for a loan. After filling out their information, one click of the Am I eligible? button can decide whether they are eligible to receive a loan.
Model Ops (left) shows the point of view of a data scientist who is tasked with creating a model that will decide whether to approve a loan application. The Create AutoML Model button invokes an API to develop an automated model.
However, the model is not a "black box" in many ways. AutoML can be configured before creation (for example, how long to train the models). In addition, the algorithms created by AutoML can be viewed and customized.
The ability to view and change models makes AutoML less of a "black box" algorithm and more of a "model recommendation" for data scientists to refer to. Models can be governed, monitored, and deployed (in containers, cloud servers such as EC2 instances, and so on) using SAS Model Manager on SAS Viya by simply clicking on Publish Model.
Note: SAS Model Studio, the tool that developed this model, requires a SAS VDMML license. SAS Model Manager, where models are registered for governance, monitoring, and deployment, is available with a SAS MM license.
Developing a loan approval application is a sensitive task, since automatically approving loans for customers who will default can be costly for a lender. SAS enables quick and easy development and exposure of decision-making models from a single source. A simple and robust environment can make decisions less prone to errors. In case you are wondering how the application was created, the section below generalizes the main components and how SAS interacts with the application.
The diagram below describes the steps it took to develop the application. On the frontend side of the application, HTML, CSS, and JavaScript were used to populate, design, and add dynamic behavior to the web content. Python was used for the backend to run a lightweight web framework and transfer data between the frontend and the SAS environment via REST APIs. JavaScript could have been used for the backend instead, because the API calls can be executed from many languages. The application was containerized and can easily be deployed on other machines with the Docker platform.
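To make the data flow concrete, here is a rough sketch of what a backend scoring call could look like. The endpoint URL, module name, and input field names below are hypothetical placeholders; the payload shape follows the `inputs`/`name`/`value` convention used by SAS Micro Analytic Service-style REST scoring, but you should check the API of your own deployment:

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute your deployment's scoring module URL
# and the actual input variables of your published model.
SCORE_URL = "https://viya.example.com/microanalyticScore/modules/loan_model/steps/score"

def build_score_request(applicant):
    """Wrap form fields in the {'inputs': [{'name': ..., 'value': ...}]} shape
    that MAS-style scoring endpoints expect."""
    return {"inputs": [{"name": k, "value": v} for k, v in applicant.items()]}

def score(applicant, token):
    """POST the applicant data to the scoring endpoint (network call)."""
    req = urllib.request.Request(
        SCORE_URL,
        data=json.dumps(build_score_request(applicant)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_score_request({"LOAN": 10000, "DEBTINC": 35.2})
print(payload["inputs"][0])  # {'name': 'LOAN', 'value': 10000}
```

The frontend's Am I eligible? button would ultimately trigger a call like `score(...)` and display the returned prediction.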
The post 4 model-agnostic interpretability techniques for complex models appeared first on The SAS Data Science Blog.
Interpretability techniques can help overcome the challenges with black-box models that make them hard to understand. In this series, we'll introduce four model-agnostic interpretability techniques you can use to explain and understand machine learning models, including:
We'll also demonstrate the use of these techniques in two case studies. For the first case study, we will use the Model Studio application in SAS Viya, and for the second case study we will use the SAS Viya programming interface.
Modern machine learning algorithms can make accurate predictions by modeling the complex relationship between inputs and outputs. The algorithms build predictive models by learning from the training data, and then make predictions on new observations. Although machine learning algorithms can learn complex relationships, the models that are produced can be equally complex, making it difficult to understand the association between the inputs and outputs. Because of their complexity, many machine learning models are black-box models, producing predictions without explaining why and how they are made.
One example of a black-box machine learning model is a simple neural network model with one or two hidden layers. Though you can write out the equations that link every input in the model to every output, you might not be able to grasp the meaning of the connections simply by examining the equations. This has less to do with the shortcomings of the models, and more to do with the shortcomings of human cognition. Often, the higher the predictive accuracy of a model, the harder it is to interpret its inner workings. This is where interpretability techniques come into play by providing a lens through which you can view these complex models.
Model interpretability can meet different needs for different users, such as regulators, executives, data scientists, and domain experts.
Inherently interpretable models (also called explainable models) incorporate interpretability directly into the model structure, and thus are self-explanatory. One commonly used type of inherently interpretable model is the generalized linear model (GLM), which includes linear and logistic regression. The coefficient estimates of GLMs directly reflect feature contributions; hence, these models can be explained through these coefficients.
More recently introduced examples of inherently interpretable models achieve interpretability by forcing the models to use fewer features for prediction or by enabling features to have monotonic relationships with the prediction (Ustun and Rudin 2015). Another inherently interpretable model is the generalized additive model with pairwise interactions (GA2M). These models enable you to understand the contribution of features through their additive components (Caruana et al. 2015). Overall, constraints on features can make complex models simpler and increase the model’s comprehensibility to users. However, imposing these constraints can also decrease the predictive ability of the model when compared to an unrestricted model.
In this series, we will explore model-agnostic interpretability methods that are used to explain trained supervised machine learning models, such as boosted trees, forests, and neural networks. These post-hoc techniques explain the predictions of these models by treating the models as black boxes and then generating explanations without inspecting the internal model parameters.
Model-agnostic interpretability techniques enable fully complex models to be interpreted either globally or locally. Global interpretability provides explanations about the general behavior of the model over the entire population. For example, global interpretability might explain which variables played an important role in the construction of the model or describe the impact of each feature on the overall prediction of the model. Variable importance plots and partial dependence plots are global interpretability techniques.
In contrast, local interpretability provides explanations for a specified prediction of the model. In general, local interpretability techniques assume that machine learning predictions in the neighborhood of a particular instance can be approximated by a white-box interpretable model such as a regularized linear regression model (LASSO). This local model does not have to work well globally, but it must approximate the behavior of the pretrained model in a small local region around the instance of interest. Then the parameters of the white-box model can be used to explain the prediction of the pretrained model. LIME (local interpretable model-agnostic explanations) and Shapley values are local explanation techniques.
Variable importance tables indicate the statistical contribution of each feature to the underlying model. There are various ways to calculate model-agnostic feature importance. One method includes fitting a global surrogate decision tree model to the black-box model predictions and using the variable importance table that is produced by this simple decision tree model.
Another commonly used approach is permutation-based feature importance as described in Altmann et al. (2010). This approach measures the decrease in model predictive performance when a single feature is randomly shuffled. This technique can be very expensive computationally if the number of predictors is very large, because it requires training a new model (on the perturbed data) for each feature.
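The retraining-based variant described above can be costly. A cheaper and widely used variant (Breiman-style) simply re-scores the already-trained model on data with one feature shuffled and reports the drop in performance. Here is a minimal sketch of that cheaper variant; the model, data, and metric are illustrative stand-ins, not SAS output:

```python
import random

def permutation_importance(model, X, y, col, metric, seed=0):
    """Drop in `metric` when column `col` of X is shuffled; model unchanged."""
    base = metric(y, [model(row) for row in X])
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return base - metric(y, [model(row) for row in X_perm])

accuracy = lambda y, p: sum(a == b for a, b in zip(y, p)) / len(y)
# Toy "trained model" that only looks at column 0
model = lambda row: int(row[0] > 0.5)
X = [[0.9, 7], [0.1, 7], [0.8, 7], [0.2, 7]]
y = [1, 0, 1, 0]
print(permutation_importance(model, X, y, col=1, metric=accuracy))  # 0.0 -- unused feature
```

A feature the model ignores (column 1 here) shows zero importance, because shuffling it cannot change any prediction.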
If the pretrained model is tree-based (decision tree, gradient boosting, or forest), you can also use the model-specific variable importance table that is generated during model construction. Generation of these tree-based model variable importance tables is often based on the number of times a feature is used to split the data.
Both PD and ICE provide explanations that are based on data perturbation, where the contribution of each feature is determined by measuring how a model’s prediction changes when the feature is altered. Partial dependence (PD) plots depict the relationship between a feature and the average prediction of the pretrained model.
PD plots focus on the average effect of a feature over the entire data, whereas ICE plots focus on the effect of a feature for a single observation (Goldstein et al. 2014), which makes the ICE technique a local explanation method. Although ICE is a local explanation technique, it is more closely related to PD than other local explanation methods are. By examining various ICE plots, you gain insight into how the same feature can have a different effect for different individuals or observations in the data.
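In code, the two techniques differ only in whether the per-observation curves are averaged. The sketch below is a toy illustration of the idea (not how SAS Viya computes these plots): ICE sweeps one feature over a grid for each row, and PD is the pointwise average of the ICE curves.

```python
def ice_curves(model, X, col, grid):
    """One curve per row: prediction as X[row][col] sweeps over `grid`."""
    curves = []
    for row in X:
        curves.append([model(row[:col] + [g] + row[col + 1:]) for g in grid])
    return curves

def partial_dependence(model, X, col, grid):
    """PD is the pointwise average of the ICE curves."""
    curves = ice_curves(model, X, col, grid)
    return [sum(c[j] for c in curves) / len(curves) for j in range(len(grid))]

# Toy model: prediction rises with feature 0, shifted by feature 1
model = lambda row: row[0] + 10 * row[1]
X = [[0.0, 0], [0.0, 1]]
grid = [0.0, 1.0, 2.0]
print(partial_dependence(model, X, col=0, grid=grid))  # [5.0, 6.0, 7.0]
```

The two ICE curves here are parallel but offset, so the PD curve sits halfway between them; divergent ICE curves would signal interactions that PD alone hides.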
For more information about how PD and ICE plots are generated in SAS Viya, see Wright (2018).
LIME (local interpretable model-agnostic explanations) explains the predictions of any model by building a white-box local surrogate model (Ribeiro, Singh, and Guestrin 2016). The method first samples the feature space in the neighborhood of an individual observation with respect to a training data set. Then, a sparse linear regression model, such as LASSO, is trained on this generated sample, using the predictions that are produced by the pretrained model as a target. This surrogate model approximates the behavior of the black-box model locally, but it is much easier to explain by examining the regression coefficients.
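To make the recipe concrete, here is a one-feature toy version of the idea (my own illustration, not SAS code): sample around the instance, weight the samples by proximity, and fit a weighted linear model. For brevity it uses closed-form weighted least squares instead of LASSO; the slope it returns plays the role of the local explanation.

```python
import math
import random

def lime_slope(model, x0, width=0.5, n=200, seed=1):
    """Weighted least-squares slope of `model` around x0 (one-feature toy LIME)."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n)]   # perturb around x0
    ys = [model(x) for x in xs]                         # query the black box
    w = [math.exp(-(x - x0) ** 2 / (2 * width ** 2)) for x in xs]  # proximity weights
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
    return num / den

# Near x0 = 1, the nonlinear model x**2 behaves like a line with slope about 2
slope = lime_slope(lambda x: x * x, x0=1.0)
print(slope)
```

The surrogate line is a poor global description of `x**2`, but its slope is close to the true local derivative at the instance being explained, which is exactly the point of LIME.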
Like LIME, Shapley values explain individual predictions (Kononenko 2010). Unlike LIME coefficients, Shapley values for feature contributions do not come directly from a local regression model. In regression models, the coefficients represent the effect of a feature assuming all the other features are already in the model. It is well-known that the values of the regression coefficients depend highly on the collinearity of the feature of interest with the other features included in the model. To eliminate this bias, Shapley values calculate feature contributions by averaging across all permutations of the features joining the model. This enables Shapley values to control for variable interactions.
The Shapley values are additive, meaning that you can attribute a portion of the model’s predictive value to each of the observation’s input variables. For example, if you have a model that is built with three input variables, then you can write the predicted value as the summation of the corresponding Shapley values plus the average predicted value across the input data set. Note that even though Shapley values are additive, they are not ordered.
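The additivity property is easy to verify with a brute-force computation on a tiny model. The sketch below is an illustrative toy (not the SAS implementation): it computes exact Shapley values by averaging each feature's marginal contribution over every ordering in which the features join the model, with a fixed baseline vector standing in for "absent" features (the role the average prediction plays above).

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features join the model."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        for feat in order:
            before = model(current)
            current[feat] = x[feat]        # feature `feat` joins the coalition
            phi[feat] += model(current) - before
    return [p / len(orders) for p in phi]

# Toy model with an interaction between features 0 and 1
model = lambda v: 2 * v[0] + v[1] + v[0] * v[1] + 3 * v[2]
x, base = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shapley_values(model, x, base)
print(phi)                     # [2.5, 1.5, 3.0] -- interaction split between 0 and 1
print(sum(phi) + model(base))  # 7.0, exactly the prediction model(x)
```

Note how the interaction term's credit (1.0) is split evenly between the two interacting features, and how the values sum back to the prediction, which is the additivity property. The brute force costs n! orderings, which is why SAS Viya relies on approximations for realistic feature counts.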
Because of computational complexities, there are multiple methods for computing approximations to Shapley values. SAS Viya offers the Kernel SHAP and HyperSHAP methods.
For more information about the SAS Viya implementation of local interpretability techniques in this section, see the chapter “Explain Model Action Set” in SAS Visual Data Mining and Machine Learning 8.5: Programming Guide.
The post Automated Machine Learning Vs. The Data Scientist appeared first on The SAS Data Science Blog.
Ever since automated machine learning has entered the scene, people are asking, "Will automated machine learning replace data scientists?" I personally don't think we need to be worried about losing our jobs any time soon. Automated machine learning is great at efficiently trying a lot of different options and can save a data scientist hours of work. The caveat is that automated machine learning cannot replace all tasks. Automated machine learning does not understand context as well as a human being. This means that it may not know to create specific features that are the norm for certain tasks. Also, automated machine learning may not know when it has created things that are irrelevant or unhelpful.
To strengthen my points, I created a competition between myself and two SAS Viya tools for automated machine learning. To show that we are really better together, I combined my work with automated machine learning and compared all approaches. Before we dive into the results, let's discuss the task.
My predefined task is to predict if there are monetary costs associated with flooding events within the United States. In comparing various approaches, I will weigh ease of use, personal time spent, compute time, and several metrics such as area under the curve (C statistic), Gini, and accuracy.
The data I am using comes from the NOAA Storm Events Database. I downloaded 30 years of data and appended the data together into one data set. Next, I binned similar event types together to create larger groupings of storm events. This took 68 event types down to 15 event types.
Additionally, I created a variable measuring total costs by adding all of the costs together and adding a flag for when costs are greater than 0. This binary flag will be the target variable. I then subset the data to focus only on flood events, which included flash floods, floods, and lakeshore floods. Moving forward, let's assume that all of this preparation work does not prove my usefulness yet.
When we examine the distribution of our target below, we see that there are nearly twice as many storm events without an associated cost as there are with one. Only 35% of floods accrued any cost. Consequently, if we predict 0 for every flood event, we will be right ~65% of the time. Thus, at a bare minimum, our models should have an accuracy greater than 0.65.
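That baseline figure comes from simple arithmetic: always predicting the majority class is right exactly as often as the majority class occurs. A one-line sanity check:

```python
def majority_baseline_accuracy(n_event, n_total):
    """Accuracy of always predicting the majority class."""
    return max(n_event, n_total - n_event) / n_total

# 35% of floods accrued a cost, so predicting "no cost" everywhere is 65% accurate
print(majority_baseline_accuracy(35, 100))  # 0.65
```

Any model scoring below this number is doing worse than a constant guess.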
I will order my models from most effort to least effort.
To make this fair, I went through my analytical process without any automation. This means I did not use my well-loved hyperparameter autotuning or automated feature generation. I started by partitioning and visually exploring my data.
I created several new columns. First, I calculated how far the storm traveled using the starting coordinates and the ending coordinates. Second, I calculated how long the storm was active using the starting datetime and the ending datetime. Finally, I checked if the storm crossed zip codes, county lines, and state lines. To be perfectly honest, this feature creation took a lot of time. Translating coordinates to zip codes, counties, and states without using a paid service is not a simple task, but these variables are important to the insurance industry, so I pressed through. This effort for five new columns means that the time and effort are inflated for any method that uses these features.
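The first two of these features can be sketched with a few lines of standard-library Python: the haversine formula gives the great-circle distance between the start and end coordinates, and the duration falls out of the timestamp difference. The coordinates and the timestamp format string below are illustrative assumptions; adjust them to match the NOAA files you download.

```python
import math
from datetime import datetime

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def duration_hours(start, end, fmt="%d-%b-%y %H:%M:%S"):
    """Storm duration in hours from begin/end timestamp strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

print(haversine_miles(35.78, -78.64, 36.10, -79.44))          # distance traveled, miles
print(duration_hours("01-Apr-19 14:00:00", "01-Apr-19 17:30:00"))  # 3.5
```

The boundary-crossing flags are the hard part, since they require a reverse-geocoding lookup for each coordinate pair, which is exactly where the time went.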
To clean my data, I grouped rare values, imputed missing data and added a missing indicator, and dropped columns that had high cardinality, high missingness, or high consistency, or that were simply not useful. Next, I tried several models, including a LASSO Regression, a Stepwise Regression, a Decision Tree, a Random Forest, and a Gradient Boosting model. Additionally, I adjusted various hyperparameters such as learning rate, regularization, selection, and tree structure.
My winning model was a tuned Gradient Boosting model with stronger selection, higher learning rate, and deeper structure. Results of the process and performance are summarized in this table:
Method: Data Scientist
Difficulty (1=Easy & 5=Hard): 4/5
Personal Time Spent: > 5 Hours
Computational Time: > 2 hours
Best Model: Gradient Boosting
AUC of Best Model: 0.85
Accuracy of Best Model: 0.80
Gini of Best Model: 0.71
For this combined approach, I applied my knowledge of context with automated machine learning's ability to try a lot of different things. I started by taking the five features I created earlier (distance traveled, storm duration, changed zip codes, changed counties, and changed states) into Model Studio on Viya. Using Model Studio on SAS Viya, you can quickly chain together data mining preprocessing nodes, supervised learning nodes, and postprocessing nodes into a comprehensive analytics pipeline. In Model Studio, I created a pipeline using my favorite nodes and hyperparameter autotuning. Additionally, I used a provided advanced modeling template with autotuning and I asked Model Studio to create an automated machine learning pipeline. This approach took less than 10 minutes of my time to set up (after my features were already generated) and ran in the background for about 20 minutes while I completed other work.
My best performing model was an autotuned gradient boosting from the template, but all five of the variables I created were selected to be included in the model. Results of the process and performance are summarized in this table (including time for feature engineering):
Method: SAS Model Studio and Me
Difficulty (1=Easy & 5=Hard): 3/5
Personal Time Spent: > 5 Hours
Computational Time: > 2 hours
Best Model: Gradient Boosting
AUC of Best Model: 0.85
Accuracy of Best Model: 0.80
Gini of Best Model: 0.70
Next, I combined my features with the automation of Data Science Pilot. Here, I specified that it should create all of the available transformations, and I selected the decision tree, random forest, and gradient boosting models using 5-fold validation. Writing out the SAS code took a few minutes, but running this block of code took hours. Computationally, this method took the longest, but it achieved the highest model performance. Results of the process and performance are summarized in this table (including time for feature engineering):
Method: Data Science Pilot and Me
Difficulty (1=Easy & 5=Hard): 3/5
Personal Time Spent: > 5 Hours
Computational Time: > 2 hours
Best Model: Gradient Boosting
AUC of Best Model: 0.89
Accuracy of Best Model: 0.82
Gini of Best Model: 0.78
Next, I took the same Data Science Pilot code block and pointed it at the data set without my new features or cleaning efforts. Again, it only took a few minutes to write the code, but this method ran for hours. This approach also achieved high performance, but without my precious five features, it fell short of my previous method. Results of the process and performance are summarized in this table:
Method: Data Science Pilot
Difficulty (1=Easy & 5=Hard): 2/5
Personal Time Spent: < 10 Minutes
Computational Time: > 2 hours
Best Model: Gradient Boosting
AUC of Best Model: 0.88
Accuracy of Best Model: 0.81
Gini of Best Model: 0.76
Within SAS Model Studio on Viya, there is an option to automatically generate a pipeline within a set time limit. I removed the time limit and clicked run. This process took less than 1 minute of my time and ran in the background as I completed other tasks. Unlike Data Science Pilot, I do not have control over the various policies and parameters as the pipeline is built, which has led to slightly different results. Results of the process and performance are summarized in this table:
Method: SAS Model Studio Auto ML
Difficulty (1=Easy & 5=Hard): 1/5
Personal Time Spent: < 1 Minute
Computational Time: ~ 10 Minutes
Best Model: Ensemble
AUC of Best Model: 0.77
Accuracy of Best Model: 0.73
Gini of Best Model: 0.55
The chart above compares the area under the curve (AUC) for the best performing model produced by each method with my scale of effort, with 5 being hard and 1 being easy. Our easiest method had the worst model performance, but the accuracy of this model (0.73) exceeds the minimum accuracy we defined earlier (0.65). Data Science Pilot is a powerful automation tool, and with only a few specifications, it achieved the second-best performance. But even better performance was achieved by adding in the features I created. In conclusion, automated machine learning is a powerful tool, but it should be viewed as a complement to the data scientist rather than a competitor.
The post Video: Modeling employee retention using deep learning with Python (DLPy) and SAS Viya appeared first on The SAS Data Science Blog.
The example used in this video is about employee attrition. In this dataset, there are 15,000 observations, with the time employees spend at the company ranging from 2 to 10 years. At the time of the study, approximately 76% of employees were still at the company. This means that the attrition event did not occur in the designated time period. This is called censoring, as the time-to-event outcome variable does not contain the exact tenure time for these employees.
Survival models are needed to account for censoring to provide good prediction and valid inference. In the attrition example, we compare the Cox proportional hazards model with a deep survival model. The Cox proportional hazards model is a type of regression model. The main difference between the Cox model and the deep survival model is that the Cox model has no hidden layer. That is, the Cox model fits a simple linear function for prediction, while the deep survival model with hidden layers can automatically learn a nonlinear and complex function, which can lead to a better performing model.
The concordance index (C-index) is the most useful criterion for evaluating a model’s overall predictive performance in survival analysis. It evaluates how well a model predicts the ordering of survival times based on individual risk scores. In this case, we want to know who will leave the company and when (e.g., employee 101 will leave the company after 5 years, employee 39 will leave the company after 6 years). This is very different from a typical model evaluation metric such as mean squared error.
For example, a C-index equal to .5 corresponds to a random model, whereas a C-index equal to 1 corresponds to a model with a perfect predicted ranking of survival times. The higher the C-index, the better the model. In our employee attrition model, the C-index is .93 for the deep learning model and .85 for the Cox model. The higher C-index value tells us that the deep survival model obtains more accurate predictions than the Cox model.
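As a rough illustration of how the ranking is scored, here is a small Python sketch of the C-index computation. The function and toy data are purely illustrative (they are not part of DLPy or any SAS API): a pair of employees is "comparable" when the one with the shorter tenure actually left (was not censored), and the pair is concordant when that employee also received the higher predicted risk.

```python
from itertools import combinations

def concordance_index(times, risks, events):
    """Fraction of comparable pairs whose predicted risks are ordered
    consistently with their observed survival times. A pair is comparable
    when the earlier time is an observed event (not censored); tied risk
    predictions count as 0.5."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so that i has the earlier time
        if times[j] < times[i]:
            i, j = j, i
        if times[i] == times[j] or not events[i]:
            continue  # skip: tied times, or the earlier time is censored
        comparable += 1
        if risks[i] > risks[j]:      # higher risk left earlier: concordant
            concordant += 1.0
        elif risks[i] == risks[j]:   # tied risks count as half
            concordant += 0.5
    return concordant / comparable

# Toy data: tenure in years, predicted risk, event=1 if the employee left
times  = [2, 3, 5, 8, 10]
risks  = [0.9, 0.7, 0.5, 0.3, 0.1]   # risks perfectly track tenure
events = [1, 1, 1, 0, 0]             # the two longest tenures are censored
cindex = concordance_index(times, risks, events)  # a perfect ranking gives 1.0
```

A value of 1.0 here mirrors the "perfect predicted ranking" case described above; a model that ranked the pairs at random would hover around .5.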
DLPy makes it easy to take advantage of deep learning. Simply choose your models, modify them and begin deep learning using the notebooks and examples on GitHub. And if you want to contribute to the DLPy library, create a pull request on GitHub, as SAS gladly accepts them.
Just a note: to use the package for model development, a SAS Visual Data Mining and Machine Learning license is required. If you do not yet have a license, consider a free 14-day trial at sas.com/tryvdmml.
In case you missed them, here are the previous blogs with videos on DLPy:
The post Maximize model performance without maximizing effort appeared first on The SAS Data Science Blog.
Training and tuning a model are two different tasks. In model training, you strive to find the best set of model parameters that map how your inputs affect your target. For instance, coefficient values are the parameters in a regression and splitting rules are parameters in a decision tree. In model tuning, you strive to create the best model configuration for the task at hand. For example, maximum tree depth is a hyperparameter in a decision tree model and the number of hidden layers is a hyperparameter in a neural network.
There are various methods used to search the hyperparameter space for improved model performance. These methods include a grid search, a random search, Latin Hypercube Sampling (LHS), a Genetic Algorithm (GA), and a Bayesian search.
Grid search was my pre-SAS approach to hyperparameter autotuning. The first step of grid search is to take each hyperparameter of interest and select a set of values. Next, models are trained and assessed using each combination of potential hyperparameter values, and the best-performing models win. The downside of grid search is that it can be computationally costly to create a grid that is granular enough to find the optimal combination. In the figure below, notice that nine combinations were examined. Even though nine models were trained and assessed, only three potential values were examined for each hyperparameter, and the optimal points were missed.
Random search involves training and assessing models with hyperparameter combinations chosen at random. Consequently, random search may find the optimal combination of hyperparameters by chance! However, random search may also miss the optimal points altogether. The figure below also uses nine combinations, but it may try more values per hyperparameter.
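The contrast between the two strategies can be sketched in a few lines of Python. Everything below is a toy stand-in: the `score` function plays the role of "train a model and return its validation performance," and the hyperparameter names and ranges are invented for illustration.

```python
import itertools
import random

# Toy objective over two hyperparameters (a stand-in for real training;
# its peak is at depth=6, lr=0.07, which neither grid value set contains)
def score(depth, lr):
    return -(depth - 6) ** 2 - 100 * (lr - 0.07) ** 2

# Grid search: 3 values per hyperparameter -> 9 fixed combinations
grid = list(itertools.product([2, 5, 8], [0.01, 0.1, 1.0]))
best_grid = max(grid, key=lambda c: score(*c))

# Random search: the same budget of 9 evaluations, but each draw can
# land on a value the grid never tries
random.seed(0)
rand = [(random.randint(1, 10), random.uniform(0.001, 1.0))
        for _ in range(9)]
best_rand = max(rand, key=lambda c: score(*c))
```

With the same budget of nine evaluations, the grid is locked to three values per axis, while random search samples up to nine distinct values per axis, which is exactly the trade-off the figures illustrate.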
If grid search and random search don't sound optimal, it's because they're not. To search the hyperparameter space more efficiently, SAS uses Local Search Optimization (LSO) capabilities. Latin Hypercube Sampling (LHS) is one Local Search Optimization strategy. Latin Hypercube Sampling examines more values for each hyperparameter and ensures that each value shows up only once in randomly blended combinations. By identifying good values for each hyperparameter, stronger combinations can then be built from those good values. This strategy allows for a more efficient search using the same number of points. The figure below also uses nine combinations, but it examines nine values for each hyperparameter, evenly spread across the entire range to ensure that the full range is studied.
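The defining property of a Latin hypercube sample (each hyperparameter axis is cut into as many bins as there are sample points, and every bin is used exactly once per axis) can be sketched in a few lines of Python. This is a minimal illustration, not the SAS implementation:

```python
import random

def latin_hypercube(n, dims, seed=0):
    """n sample points in [0, 1)^dims: each axis is divided into n equal
    bins, one jittered point is drawn per bin, and the bin order is
    shuffled independently per axis."""
    rng = random.Random(seed)
    axes = []
    for _ in range(dims):
        points = [(b + rng.random()) / n for b in range(n)]  # one point per bin
        rng.shuffle(points)                                  # random pairing across axes
        axes.append(points)
    return list(zip(*axes))  # n points, each a tuple of per-axis coordinates

# Nine combinations, as in the figure: nine distinct values per hyperparameter
sample = latin_hypercube(9, 2)
```

Scaling each unit-interval coordinate into a real hyperparameter range (for example, a learning rate span) turns this into the evenly covered nine-value design described above.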
A Genetic Algorithm (GA) uses the principles of natural selection and evolution to find an optimal set of hyperparameters. First, the Genetic Algorithm creates an initial population using Latin Hypercube Sampling. By default, the number of Latin Hypercube Sampling evaluations used to initialize the Genetic Algorithm is fewer than the number of evaluations used when Latin Hypercube Sampling is the search method itself. Next, these configurations can experience crossover or random mutations to create new configurations. In crossover events, new hyperparameter combinations are created from the hyperparameter values of the parents. Random mutation events can assign new random values to hyperparameters. An additional Generating Set Search (GSS) step performs local perturbations of the best hyperparameter combinations to add more potentially high-performing configurations. The new configurations are then evaluated, and the best configurations can serve as parents for a new generation of configurations in this iterative process.
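The select-crossover-mutate loop can be pictured with a toy Python sketch. The objective, ranges, population size, and mutation rate here are invented for illustration, and the SAS implementation (including its LHS initialization and GSS step) is considerably more sophisticated:

```python
import random

rng = random.Random(1)

def score(cfg):  # toy stand-in for a model's validation performance
    depth, lr = cfg
    return -(depth - 6) ** 2 - 100 * (lr - 0.07) ** 2

def crossover(a, b):
    # child takes each hyperparameter value from one parent at random
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(cfg, rate=0.3):
    # occasionally replace a value with a brand-new random one
    depth, lr = cfg
    if rng.random() < rate:
        depth = rng.randint(1, 10)
    if rng.random() < rate:
        lr = rng.uniform(0.001, 1.0)
    return (depth, lr)

# Initial population (in SAS this comes from an LHS sample)
pop = [(rng.randint(1, 10), rng.uniform(0.001, 1.0)) for _ in range(8)]
for generation in range(20):
    parents = sorted(pop, key=score, reverse=True)[:4]      # selection
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(4)]
    pop = parents + children                                # next generation
best = max(pop, key=score)
```

Because the best configurations survive each generation and seed the next, the search concentrates evaluations around promising regions instead of spreading them uniformly.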
Bayesian search uses a surrogate model to improve upon Latin Hypercube Sampling. Latin Hypercube Sampling is used to initialize a Kriging surrogate model (also known as Gaussian process surrogate regression) that approximates the objective function. A Gaussian process is a probabilistic model in which any finite collection of points has a multivariate normal distribution; it uses the similarity between observed points to predict values at unseen points. The surrogate model proposes new configurations to evaluate, and their results are used to iteratively update the surrogate.
Of grid search, random search, and Latin Hypercube Sampling, Latin Hypercube Sampling tends to perform best. It is important to note that for grid search, random search, and Latin Hypercube Sampling, all configuration evaluations can be done in parallel, but no learning occurs. With the Genetic Algorithm and Bayesian search, the configuration evaluations within each iteration can be done in parallel, and better configurations can be achieved through learning across sequential iterations.
Ultimately, which algorithm runs fastest depends on the resources used to run it. A system that can run all of the evaluations in parallel would find that grid search, random search, and Latin Hypercube Sampling run the fastest. Because such a system may not always be available, the speed differences across the search methods may be small or even negligible. Moreover, better performance on the objective function is more likely to be achieved using a learning-based search like the Genetic Algorithm or Bayesian search. With these considerations, I now see why the Genetic Algorithm is the default hyperparameter search method!
Getting started with hyperparameter autotuning on Viya is a piece of cake! You can build stronger models in Viya's visual interface or programming interface in one or two steps.
In SAS Model Studio on Viya, autotuning hyperparameters can be added with a single button click. For Bayesian networks, Support Vector Machines (SVM), decision trees, forests, gradient boosting models, and neural networks, hyperparameter autotuning is available in the options.
To add hyperparameter autotuning to a model created through a task, you only have to click a checkbox. The options for the forest, gradient boosting, decision tree, neural network, Support Vector Machine (SVM), and factorization machine tasks all include hyperparameter autotuning.
If coding is more your style, hyperparameter autotuning can be added to many Visual Data Mining and Machine Learning (VDMML) procedures (including the gradient boosting example above) by using the AUTOTUNE statement. Additionally, for CASL programming, you can use the Autotune action set.
In conclusion, you should be ready to maximize your model's performance without maximizing your effort! If you are interested in reading more, this blog goes into depth on autotuning's value, this blog discusses autotuning within machine learning best practices, this blog discusses Local Search Optimization for autotuning, this communities article includes a brief overview of autotuning, and this whitepaper and this whitepaper go into a deep dive on hyperparameter autotuning.
A special thanks to Patrick Koch for his help in refining the search methods pieces of this blog!
The post Getting started with deep learning using the SAS Language appeared first on The SAS Data Science Blog.
SAS has a rich set of established and unique capabilities with regard to deep learning. These capabilities can be accessed through many programming languages including Python, R, Java, Lua, and SAS, as well as through REST APIs. In this and subsequent blog posts, I’ll focus on how to use the SAS language to build deep learning models, with specific examples using different types of modeling that SAS supports.
In the examples, I use the SAS Cloud Analytic Services Language (CASL), which is called using the CAS procedure. CASL might look intimidating, but the language is actually easy to learn and use.
Before we get started, I’ll explain the three categories of deep learning models in SAS:
1) Deep feedforward neural networks (DNN)
2) Convolutional neural networks (CNN)
3) Recurrent neural networks (RNN)
Each category has unique capabilities. In the first example below, I create a basic deep feedforward neural network. The DNN model type is the most basic deep learning model category. The CNN and RNN variants of SAS deep learning models have characteristics that are unique for a specialized task and include capabilities far beyond the DNN type.
For instance, the CNN type consumes images as inputs. Additionally, you can use the CNN model type for traditional tabular data, because the CNN includes a richer set of layers that can improve the analysis of tabular data, such as batch normalization, multi-task learning, reshape, and so forth. A detailed description of each model type will be presented in later blogs in this series.
Here are two examples that illustrate the use of DNN modeling:
In this example, I demonstrate how you can manually build a deep learning model architecture from scratch. I start a Cloud Analytic Services (CAS) session, assign the libref mycaslib, load the SASHELP.BASEBALL data set as an in-memory table, and then partition the data in preparation for modeling.
cas;
libname mycaslib cas;

data mycaslib.baseball;
   set sashelp.baseball;
run;

proc partition data=mycaslib.baseball samppct=75 samppct2=25
               seed=12345 partind;
   output out=mycaslib.baseball_part;
run;
Next, I use PROC CAS to specify my CAS actions. I load the Deep Learning action set, called deepLearn, and use multiple actions from it in the upcoming steps. First, I create an empty deep feedforward neural network. I then add individual layers to the model to gradually define the network architecture. Each time you add a layer, you identify the name of the model to add the layer to, in addition to specifying the name and type of the layer. You can also specify hyperparameters specific to the layer type, such as the activation function and the number of hidden units.
proc cas;
   /* Load action set */
   loadactionset 'deeplearn';

   /* Build a model shell */
   deepLearn.BuildModel /
      modeltable={name='DLNN', replace=1}
      type='DNN';

   /* Add an input layer */
   deepLearn.AddLayer /
      model='DLNN' name='data'
      layer={type='input' STD='STD' dropout=.05};

   /* Add several fully connected hidden layers */
   deepLearn.AddLayer /
      model='DLNN' name='HLayer1'
      layer={type='FULLCONNECT' n=30 act='ELU' init='xavier' dropout=.05}
      srcLayers={'data'};

   deepLearn.AddLayer /
      model='DLNN' name='HLayer2'
      layer={type='FULLCONNECT' n=20 act='RELU' init='MSRA' dropout=.05}
      srcLayers={'HLayer1'};

   deepLearn.AddLayer /
      model='DLNN' name='HLayer3'
      layer={type='FULLCONNECT' n=10 act='RELU' init='MSRA' dropout=.05}
      srcLayers={'HLayer2'};

   /* Add an output layer */
   deepLearn.AddLayer /
      model='DLNN' name='outlayer'
      layer={type='output'}
      srcLayers={'HLayer3'};
quit;
In this example, I create a deep feedforward neural network that includes batch normalization layers. Check out this video for a better understanding of batch normalization, a technique that enables you to train your neural network more easily.
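In essence, a batch normalization layer standardizes each mini-batch of activations to zero mean and unit variance before applying a learned scale (gamma) and shift (beta). Here is a minimal Python sketch of the forward pass for a single feature; it is illustrative only and omits the running statistics that a real layer maintains for scoring:

```python
def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a mini-batch of scalar activations, then apply the
    learned scale (gamma) and shift (beta). eps guards against a
    zero-variance batch."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batch_norm_forward([2.0, 4.0, 6.0, 8.0])
```

After this step the batch has (approximately) zero mean and unit variance, which is what keeps the activations in a well-behaved range as training proceeds.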
To incorporate batch normalization layers, I need to specify the type of model as CNN in the buildModel action.
proc cas;
   deepLearn.BuildModel /
      modeltable={name='BatchDLNN', replace=1}
      type='CNN';

   /* Add an input layer */
   deepLearn.AddLayer /
      model='BatchDLNN' name='data'
      layer={type='input' STD='STD'};

   /* First hidden layer */
   deepLearn.AddLayer /
      model='BatchDLNN' name='HLayer1'
      layer={type='FULLCONNECT' n=30 act='ELU' init='xavier'}
      srcLayers={'data'};

   /* Second hidden layer */
   deepLearn.AddLayer /
      model='BatchDLNN' name='HLayer2'
      layer={type='FULLCONNECT' n=20 act='identity' init='xavier' includeBias=False}
      srcLayers={'HLayer1'};
   deepLearn.AddLayer /
      model='BatchDLNN' name='BatchLayer2'
      layer={type='BATCHNORM' act='TANH'}
      srcLayers={'HLayer2'};

   /* Third hidden layer */
   deepLearn.AddLayer /
      model='BatchDLNN' name='HLayer3'
      layer={type='FULLCONNECT' n=10 act='identity' init='xavier' includeBias=False}
      srcLayers={'BatchLayer2'};
   deepLearn.AddLayer /
      model='BatchDLNN' name='BatchLayer3'
      layer={type='BATCHNORM' act='TANH'}
      srcLayers={'HLayer3'};

   /* Add an output layer */
   deepLearn.AddLayer /
      model='BatchDLNN' name='outlayer'
      layer={type='output'}
      srcLayers={'BatchLayer3'};
run;
quit;
With the data uploaded and the model defined, you can begin to train the model. This next program trains the model by using the dlTrain action and creates a plot of the model fit error across iterations, with labels for the best error performance values for each partition.
ods output OptIterHistory=ObjectModeliter;
proc cas;
   dlTrain /
      table={name='baseball_part', where='_PartInd_=1'}
      model='BatchDLNN'
      modelWeights={name='ConVTrainedWeights_d', replace=1}
      bestweights={name='ConVbestweights', replace=1}
      inputs={'CrAtBat', 'nBB', 'CrHits', 'CrRuns', 'nAtBat', 'Position', 'CrRbi'}
      nominals={'Position'}
      target='logSalary'
      validTable={name='baseball_part', where='_PartInd_=2'}
      optimizer={minibatchsize=5,
                 algorithm={method='ADAM', beta1=0.9, beta2=0.999,
                            learningrate=0.001, lrpolicy='Step',
                            gamma=0.5, stepsize=3},
                 regl1=0.00001, regl2=0.00001, maxepochs=60}
      seed=12345;
quit;

/* Store minimum training and validation error in macro variables */
proc sql noprint;
   select min(FitError) into :Train separated by ' '
   from ObjectModeliter;
quit;
proc sql noprint;
   select min(ValidError) into :Valid separated by ' '
   from ObjectModeliter;
quit;

/* Plot performance */
proc sgplot data=ObjectModeliter;
   yaxis label='Error Rate' max=35 min=0;
   series x=Epoch y=FitError / curvelabel="&Train" curvelabelpos=end;
   series x=Epoch y=ValidError / curvelabel="&Valid" curvelabelpos=end;
run;
The model performance seems reasonable because the performance improves for a period of time and the two performance curves track well together. However, tuning the hyperparameters might improve model performance further. SAS offers easy-to-use tuning algorithms to improve the deep learning model. You can surface the Hyperband approach by using the dlTune action. dlTune is very similar to dlTrain with regard to its coding options, except that you can specify upper and lower bounds for the hyperparameter search. (NOTE: SAS has improved on the original Hyperband method with regard to sampling the hyperparameter space.)
proc cas;
   dlTune /
      table={name='baseball_part', where='_PartInd_=1'}
      model='BatchDLNN'
      modelWeights={name='ConVTrainedWeights_d', replace=1}
      bestweights={name='ConVbestweights', replace=1}
      inputs={'CrAtBat', 'nBB', 'CrHits', 'CrRuns', 'nAtBat', 'Position', 'CrRbi'}
      nominals={'Position'}
      target='logSalary'
      validTable={name='baseball_part', where='_PartInd_=2'}
      optimizer={miniBatchSize=5, numTrials=45, tuneIter=2, tuneRetention=0.8,
                 algorithm={method='ADAM', lrpolicy='step',
                            gamma={lowerBound=0.3 upperBound=0.7},
                            beta1=0.9, beta2=0.99,
                            learningRate={lowerBound=0.0001 upperBound=0.01},
                            clipGradMax=100 clipGradMin=-100}
                 regl1={lowerBound=0.0001 upperBound=0.05}
                 regl2={lowerBound=0.0001 upperBound=0.05}
                 maxepochs=10}
      seed=1234;
quit;
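The intuition behind this style of tuning can be pictured with a simple successive-halving loop in Python. Everything here is invented for illustration (the objective, the doubling schedule, and the field names are not SAS's), and SAS's Hyperband variant differs in its details; the sketch only shows the core idea of partially training many trials and repeatedly keeping a retained fraction of the best:

```python
import random

rng = random.Random(7)

def partial_train(cfg, epochs):
    """Toy stand-in for a short dlTrain run: returns a validation error
    that shrinks with more epochs and depends on the learning rate."""
    lr = cfg['learningRate']
    return 100 * (lr - 0.0097) ** 2 + 1.0 / epochs + rng.random() * 0.01

# 45 random trials, echoing numTrials=45 above
trials = [{'learningRate': rng.uniform(0.0001, 0.01)} for _ in range(45)]

epochs = 1
while len(trials) > 1:
    scored = sorted(trials, key=lambda c: partial_train(c, epochs))
    trials = scored[:max(1, int(len(scored) * 0.8))]  # keep the top 80%
    epochs *= 2                                       # survivors earn more budget
best = trials[0]
```

Weak trials are discarded after cheap, short runs, so most of the training budget flows to the configurations that keep proving themselves.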
The dlTune results provide a table that contains the hyperparameter values along with the model’s performance for each respective hyperparameter combination.
Note the large difference between the best (0.287) and worst (30.68) combinations with regard to the validation set error.
Now insert the best performing combination of hyperparameter values discovered by dlTune into the dlTrain code to retrain the model.
ods output OptIterHistory=ObjectModeliter;
proc cas;
   dlTrain /
      table={name='baseball_part', where='_PartInd_=1'}
      model='BatchDLNN'
      modelWeights={name='ConVTrainedWeights_d', replace=1}
      bestweights={name='ConVbestweights', replace=1}
      inputs={'CrAtBat', 'nBB', 'CrHits', 'CrRuns', 'nAtBat', 'Position', 'CrRbi'}
      nominals={'Position'}
      target='logSalary'
      validTable={name='baseball_part', where='_PartInd_=2'}
      optimizer={minibatchsize=5,
                 algorithm={method='ADAM', beta1=0.9, beta2=0.999,
                            learningrate=0.00967, lrpolicy='Step',
                            gamma=0.5888888889, stepsize=3},
                 regl1=0.0028722222, regl2=0.0161788889, maxepochs=60}
      seed=12345;
quit;

/* Store minimum training and validation error in macro variables */
proc sql noprint;
   select min(FitError) into :Train separated by ' '
   from ObjectModeliter;
quit;
proc sql noprint;
   select min(ValidError) into :Valid separated by ' '
   from ObjectModeliter;
quit;

/* Plot performance */
proc sgplot data=ObjectModeliter;
   yaxis label='Error Rate' max=35 min=0;
   series x=Epoch y=FitError / curvelabel="&Train" curvelabelpos=end;
   series x=Epoch y=ValidError / curvelabel="&Valid" curvelabelpos=end;
run;
The results of the tuned deep learning model appear to be much better when compared to the original model.
Rescaling the y-axis provides a clearer understanding of the model’s performance.
In summary, it’s easy to build and tune a deep learning model by using the SAS language. However, building a great model is not always easy, especially if you’re training a model with millions of parameters on a large amount of data. Check out the following Practitioner’s Guide video to learn a few additional tips that can help you build a good deep learning model.