Easier Feature Generation and New Data Science Pilot Actions

This blog is a part of a series on the Data Science Pilot Action Set. In this blog, we discuss updates to Visual Data Mining and Machine Learning with the release of Viya 3.5.

In the middle of my blog series, SAS released Viya 3.5. Included in Viya 3.5 was the release of Visual Data Mining and Machine Learning (VDMML) 8.5, which includes several new Data Science Pilot features and actions. Therefore, I will go over the two new actions, the new SAS Studio task, and the new node for the visual Model Studio interface. As a result of these additions, the Data Science Pilot action set is more powerful and feature generation has just gotten even more easier and accessible!

The DetectInteractions and GenerateShadowFeatures Actions

There are now nine Data Science Pilot Actions. The two newest actions are the detectInteractions action and the generateShadowFeatures action.

The DetectInteractions Action

The detectInteractions action will assess the interactions between pairs of predictor variables and the correlation of that interaction on the response variable. Specially, it will see if the product of any pair of predictor variables correlate with the response variable. Since checking the correlation between the product of every predictor pair and the response variable can be computationally intensive, this action relies on the XYZ algorithm to search for these interactions efficiently in a high-dimensional space.

The detectInteractions action requires that all predictor variables be in a binary format, but the response variable can be numeric, binary, or multi-class. Additionally, the detectInteractions action can handle data in a sparse format, such as when predictor variables are encoded using an one-hot-encoding scheme.

/* Transform Input Variables into Binary Variables */
proc cas; 
	transform / 
		table = "hmeq"
		pipelines = {
                	{inputs		= ${loan, mortdue} 
			discretize 	= {method = 'quantile', args={nbins=5}}},
			{inputs  		= ${job, reason}
                     	cattrans 		= {method = 'label'}}}
    		 copyvars = ${bad}
     		outVarsNameGlobalPrefix = ''
     		casout = {name='transformed_hmeq' replace=True};
run; 
 
/* Detect Interactions */
proc cas;
	loadactionset "dataSciencePilot";
	dataSciencePilot.detectInteractions /	
                table 		= "transformed_hmeq"
		target 		= "BAD"
		event 		= "1"
		sparse  		= True
 		inputs 		= ${loan, mortdue, job, reason}
     		inputLevels	= {5, 5, 6, 2}
     		casout    		= {name = 'intdet3_out', replace=True};
	run;
quit;

The GenerateShadowFeatures Action

The generateShadowFeatures action performs a scalable random permutation of input features to create shadow features. The shadow features are randomly selected from a matching distribution of each input feature. These shadow features can be used for all-relevant feature selection. In all-relevant feature selection, the inputs whose variable importance are lower than the shadow feature’s variable importance are removed from consideration. The shadow features can also be used in a post-fit analysis using Permutation Feature Importance (PFI). By replacing each input with its shadow feature one-by-one and measuring the change on model performance, one can determine that features importance based on relative size of the model’s performance change.

/* Generate shadow features */
proc cas;
	loadactionset "dataSciencePilot";
	dataSciencePilot.generateShadowFeatures
	/	table 		= "hmeq"
		casOut 		= {name = "SHADOW_FEATURES_OUT", replace=True}
		nProbes		= 2
		inputs		= {{name = "LOAN"}, {name = "MORTDUE"}, {name = "VALUE"},
				{name = "YOJ"}, {name = "DEROG"}, {name = "DELINQ"},
 				{name = "CLAGE"}, {name = "CLNO"}, {name = "DEBTINC"},
 				{name = "NINQ"}, {name = "REASON"}, {name = "JOB"}}
		nominals	= {"REASON", "JOB"}
		copyVars	= {"BAD"}; 	
	run;
quit;

Automated Feature Engineering SAS Studio Tasks

The tasks in SAS Studio make programming in SAS much easier. Tasks build out the code in real time based on the inputs you select from a point-and-click interface. As a result, you can write SAS code faster than you can type while not worry about syntax errors. In the latest release of Viya 3.5, the featureMachine action was included as a task. To use the task, simply select it from the task menu, select the data, and select the target variable. The task includes additional options for variable screening and feature engineering. In addition, you can choose what kinds of output you want, including the generated features, the feature metadata, the feature pipelines, and an Astore to score new data.

Automated Feature Engineering Task in SAS Studio

Feature Machine Node in Model Studio

In addition to a SAS Studio task, the featureMachine action is also available from the Model Studio visual interface as a node. In Model Studio, you can quickly chain together data mining preprocessing nodes, supervised learning nodes, and postprocessing nodes into a comprehensive analytics pipeline. To use the node, simply add it to your pipeline and select the desired transformations and screening rules. Now your Model Studio pipelines can include Everything But the Kitchen Sink Feature Generation!

Conclusion

In short, with the release of VDMML 8.5 on Viya 3.5, the Data Science Easy Button has just got easier and more powerful. Want to know more about the Data Science Pilot Action set? Then check out the other blogs in my series on the Data Science Pilot Action Set!