An Easy Button for Data Science

This post is part of a series on the Data Science Pilot Action Set. Here, we introduce the action set and the actions for building data understanding.

Do you ever wish there was an easy button for data science? Do you ever think to yourself, "I want a solution that could examine the importance of missing values, create new features, and build models with the best possible performance"? I am still looking for that easy button. However, the Data Science Pilot Action Set comes pretty close.

The Data Science Pilot Action Set

The Data Science Pilot Action Set is included with SAS Visual Data Mining and Machine Learning (VDMML) and consists of actions that implement a policy-based, configurable, and scalable approach to automating data science workflows. These actions allow you to move from a data set to a deployable model in a flash! There are seven actions in total:

[Image: Data Science Pilot Actions]

The first four actions help you find data-quality issues through thorough, exploratory analysis. The next two actions help generate and select model features. The final action creates a complete automated machine learning pipeline. In addition to the seven actions, there are four policies, which define rules for specific actions. In this post of my blog series, I will introduce the first four actions and their corresponding policies.

The exploreData action

The exploreData action automatically calculates various statistics for each column in your data. These statistics include the minimum, maximum, mean, median, mode, number of missing values, and more. In addition, the exploreData action groups variables according to their data type and their values on various statistical metrics. These groupings allow for a pipelined approach to data cleaning. There is also an optional explorationPolicy, which specifies how the variables are grouped. Using the exploreData action is as easy as specifying your data table, your target, and the name of your output table.

[Image: dataSciencePilot exploreData action]
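
To get a feel for the kind of per-column summary described above, here is a minimal plain-Python sketch. It is not the exploreData action or its output format; the function name `summarize_column`, the toy table, and the use of `None` for missing values are all assumptions made for illustration.

```python
from statistics import mean, median, multimode


def summarize_column(values):
    """Summarize one column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "n_missing": len(values) - len(present),
        # multimode returns all modes; keep the first one encountered
        "mode": multimode(present)[0] if present else None,
    }
    # Numeric statistics are only added when every present value is a number
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            minimum=min(present),
            maximum=max(present),
            mean=mean(present),
            median=median(present),
        )
    return summary


# Hypothetical toy table: column name -> list of values
table = {
    "loan": [1100, 1300, None, 1500, 1300],
    "reason": ["HomeImp", "DebtCon", "DebtCon", None, "DebtCon"],
}
summaries = {name: summarize_column(col) for name, col in table.items()}
```

Even in this tiny version, the numeric and character columns end up with different sets of statistics, which hints at why grouping variables by data type is a natural first step toward a cleaning pipeline.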

The exploreCorrelation action

The exploreCorrelation action performs a correlation analysis on your data. There are two ways you can approach this analysis. If you specify a target variable, the analysis only includes correlations between each input variable and the target. If you do not specify a target, the analysis covers every pair of variables. That is a lot of correlations! It is even more impressive considering that not every correlation measure is valid for every pair of data types. This action takes care of managing data types and choosing the appropriate correlation measures, with no additional user input required.

[Image: dataSciencePilot exploreCorrelation action]
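
The difference between the two modes can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses Pearson correlation on complete numeric columns, whereas the action itself picks measures suited to each data type. The names `pearson`, `correlations`, and the toy table are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlations(table, target=None):
    """Correlate inputs with the target, or every pair if no target is given."""
    if target is not None:
        pairs = [(name, target) for name in table if name != target]
    else:
        pairs = list(combinations(table, 2))
    return {(a, b): pearson(table[a], table[b]) for a, b in pairs}


# Hypothetical toy table with complete numeric columns
table = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
with_target = correlations(table, target="y")  # one result per input variable
all_pairs = correlations(table)                # one result per pair of variables
```

With p columns, specifying a target yields p - 1 correlations, while omitting it yields p(p - 1)/2, which is why the all-pairs mode grows so quickly on wide tables.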

The analyzeMissingPatterns action

The analyzeMissingPatterns action helps a data scientist better understand missingness in their data set. Before you throw out a column because of missing data, check to see if that missingness has any meaning. Does a specific value or missing value in one column make it more likely for a value to be missing in another column? For this action, you can specify a target variable to narrow down the analysis, so that it only checks for patterns between each input variable and the target. Otherwise, patterns between all variables are examined.

[Image: dataSciencePilot analyzeMissingPatterns action]
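
One simple way to ask whether missingness in one column makes missingness in another more likely is to compare two conditional rates. The sketch below does that in plain Python. It is not how the analyzeMissingPatterns action is implemented; the function name `missing_rates`, the toy columns, and the use of `None` for missing values are assumptions for illustration.

```python
def missing_rates(table, given, other):
    """Return (P(other missing | given missing), P(other missing | given present))."""
    rows = list(zip(table[given], table[other]))
    flags_when_missing = [o is None for g, o in rows if g is None]
    flags_when_present = [o is None for g, o in rows if g is not None]

    def rate(flags):
        # Share of True flags, or None when there are no rows to compare
        return sum(flags) / len(flags) if flags else None

    return rate(flags_when_missing), rate(flags_when_present)


# Hypothetical toy table; None marks a missing value
table = {
    "job": [None, None, "Sales", "Mgr", "Office", None],
    "income": [None, None, 50, None, 60, 70],
}
rate_if_job_missing, rate_if_job_present = missing_rates(table, "job", "income")
```

In this toy data, income is missing in two of the three rows where job is missing, but in only one of the three rows where job is present. A gap like that suggests the missingness carries information and is worth a closer look before the column is dropped.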

The screenVariables action

The final action covered in this post is the screenVariables action. The screenVariables action recommends whether to keep, transform, or remove each variable. This recommendation is based on several factors, including missingness, consistency, redundancy, variance, leakage, and more. By editing the screenPolicy, you can control how much data messiness is acceptable. In short, this action helps a data scientist understand their data quality, allowing them to think of next steps for data cleaning.

[Image: dataSciencePilot screenVariables action]
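
The idea of a policy that turns data-quality measurements into a keep, transform, or remove recommendation can be sketched as follows. This is a toy illustration only: it looks at just missingness and constant columns, and the policy keys `remove_above` and `transform_above` are hypothetical names, not the parameters of the actual screenPolicy.

```python
def screen_column(values, policy):
    """Recommend 'keep', 'transform', or 'remove' for one non-empty column."""
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values)

    if missing_fraction > policy["remove_above"]:
        return "remove"      # too much missing data to be useful
    if len(set(present)) <= 1:
        return "remove"      # constant column, no variance
    if missing_fraction > policy["transform_above"]:
        return "transform"   # usable, but needs imputation first
    return "keep"


# Hypothetical policy thresholds and toy columns; None marks a missing value
policy = {"remove_above": 0.5, "transform_above": 0.1}
table = {
    "constant": [1, 1, 1, 1],
    "mostly_missing": [None, None, None, 5],
    "some_missing": [1, None, 3, 4],
    "clean": [1, 2, 3, 4],
}
recommendations = {name: screen_column(col, policy) for name, col in table.items()}
```

Loosening or tightening the thresholds changes which columns survive, which is the same lever the screenPolicy gives you over how much messiness you are willing to accept.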

Conclusion 

As the snippets show, the code for the Data Science Pilot Action Set is incredibly easy to use. The only requirements are the name of the CAS table and, for most actions, the name of the target variable. In addition, we can use the explorationPolicy and the screenPolicy to take more control over the process. This means that we can gather a great deal of information in just a few lines of code! In this post, we focused on building our data understanding and examining data quality. In upcoming posts, we will examine feature engineering and automated machine learning.

About Author

Sophia Rowland

Systems Engineer on FSBU Data Science Team

Sophia Rowland is a Systems Engineer at SAS on a team that focuses on Data Science applications for the Financial Services industry. Upon joining SAS in January of 2019, Sophia attended and completed the Customer Advisory Academy. Sophia is an alumnus of both UNC-Chapel Hill and Duke but remains loyal to UNC when it comes to sports. At UNC-Chapel Hill, Sophia double majored in Computer Science and Psychology. At Duke, Sophia attended the Fuqua School of Business and completed a Master of Science in Quantitative Management: Business Analytics. Sophia is comfortable in a variety of programming languages. Sophia has worked on several projects spanning open source integration, text analytics, fraud detection, product recommendation, model deployment, and automation. After work, Sophia can be found reading, hiking, running, baking, and playing video games.
