Machine learning best practices: Add features to training data


This is the final post in my series on machine learning best practices. If you missed the earlier posts, start at the beginning of the series.

While post four in the series was about combining different types of models, this post is about combining different types of data and using a variety of variables in your model.

Building the training data set

Training data sets require several example predictor variables to classify or predict a response. In machine learning, the predictor variables are called features and the responses are called labels. Data scientists typically spend 85 percent of the total modeling effort building this training data set. They aggregate transactional data into features, such as average balance and amount spent. They then combine those features with overlay data, such as demographics, geospatial data and social media, into a single training data set.
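To make the aggregate-then-join step concrete, here is a minimal sketch in plain Python. The records, field names and the build_training_rows helper are all hypothetical, made up for illustration; a real pipeline would read from your transaction and overlay sources instead.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactional records: (customer_id, balance, amount_spent)
transactions = [
    (101, 500.0, 20.0),
    (101, 300.0, 60.0),
    (102, 1000.0, 5.0),
]

# Hypothetical overlay data keyed by customer ID
demographics = {
    101: {"age": 34, "region": "NE"},
    102: {"age": 51, "region": "SW"},
}

def build_training_rows(transactions, overlay):
    """Aggregate transactions per customer, then join overlay attributes."""
    balances = defaultdict(list)
    spend = defaultdict(float)
    for customer_id, balance, amount in transactions:
        balances[customer_id].append(balance)
        spend[customer_id] += amount

    rows = {}
    for customer_id in balances:
        features = {
            "avg_balance": mean(balances[customer_id]),
            "total_spent": spend[customer_id],
        }
        # Left join: customers with no overlay record keep only the aggregates
        features.update(overlay.get(customer_id, {}))
        rows[customer_id] = features
    return rows

training_rows = build_training_rows(transactions, demographics)
```

Each customer ends up as one row of features, which is the shape most modeling tools expect.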

I like to infuse my training data with many readily available features. Blowing out your training set with lots of features can help you derive a better-fitting model.

Infuse models with the voice of the customer

One example is to infuse your model with customer feedback data. What the customer is saying or doing is very predictive! Make sure your model is listening to your customer.

If you have groups of customers who have complained about the cost of your services, then infuse this information into your churn model. Do this by incorporating textual data, like surveys and customer correspondence, into the training data feature space. Use text analytics to first parse the text into a term-by-document frequency table. Now your textual data is in a numeric representation. Next, apply singular value decomposition (SVD) to that table to create singular value dimensions, and use those dimensions as candidate features in your model.
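The two text steps, a term-by-document table followed by singular value dimensions, can be sketched as follows. This assumes NumPy is available and uses simple whitespace tokenization; the sample comments and the helper names term_document_matrix and svd_features are hypothetical, and a real text analytics tool would also handle stemming, stop words and term weighting.

```python
import numpy as np

# Hypothetical customer comments, one document per customer
documents = [
    "price too high cancel service",
    "great service friendly support",
    "price increase too high",
]

def term_document_matrix(docs):
    """Return (sorted vocabulary, terms-by-documents count matrix)."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            counts[index[word], j] += 1
    return vocab, counts

def svd_features(counts, k):
    """Project each document onto the top-k singular dimensions."""
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Rows of vt.T correspond to documents; scale by the singular values
    return vt[:k].T * s[:k]

vocab, counts = term_document_matrix(documents)
text_features = svd_features(counts, k=2)  # shape: (n_documents, k)
```

Each row of text_features is a short numeric summary of one customer's text, ready to be joined to the rest of the training data by customer.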

Infuse models with purchase data

Another example is to infuse your models with purchase history data. What customers have bought can be important when building purchase propensity or next best offer models.

I like to take purchase transactional data and compute market basket rules. An example rule might be: if a customer bought an XBOX, then she is 80 percent likely to buy a Nintendo Switch. I then output the top 100 rules and pivot them so they can be used as binary features in my training set. Finally, I join those rule features back in with the rest of my features, including the singular value decomposition dimensions. Now we have a pretty diverse training data set with lots of candidate features.

Customer ID | XBOX > Echo Dot | XBOX > Beats | Google Home > Beats | XBOX > Nintendo Switch
23545443 | 0 | 0 | 0 | 1
21243 | 0 | 1 | 0 | 0

Table 1. Association Rules Pivoted as Binary Predictors
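As a rough sketch of how rules like these could be mined and pivoted, the plain-Python example below scores only single-item rules (A > B) by confidence. It is a simplification, not a full market basket algorithm such as Apriori. The baskets and the helpers mine_pair_rules and pivot_rules are hypothetical, and it assumes a customer's flag is 1 when the purchase history contains both sides of the rule, which is one reading of the table above.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories: customer ID -> set of products bought
baskets = {
    23545443: {"XBOX", "Nintendo Switch"},
    21243: {"XBOX", "Beats"},
    30001: {"XBOX", "Nintendo Switch", "Echo Dot"},
    30002: {"Google Home", "Beats"},
}

def mine_pair_rules(baskets, top_n):
    """Score every single-item rule A > B by confidence = P(B | A)."""
    item_counts = Counter()
    pair_counts = Counter()
    for items in baskets.values():
        item_counts.update(items)
        pair_counts.update(permutations(sorted(items), 2))
    rules = [
        (a, b, pair_counts[(a, b)] / item_counts[a])
        for (a, b) in pair_counts
    ]
    # Highest confidence first; ties broken alphabetically for stable output
    rules.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rules[:top_n]

def pivot_rules(baskets, rules):
    """One row per customer, one 0/1 column per rule."""
    return {
        customer_id: {
            f"{a} > {b}": int(a in items and b in items)
            for a, b, _ in rules
        }
        for customer_id, items in baskets.items()
    }

rules = mine_pair_rules(baskets, top_n=100)
rule_features = pivot_rules(baskets, rules)
```

The resulting rule_features rows are keyed by customer ID, so they can be joined to the other features the same way as any overlay data.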

Seeing is often believing in machine learning. You can also infuse images as features into your models by using convolutional neural networks. I call this whole process integrated machine learning. You can continue to add both high- and low-quality sources to ultimately build high-quality machine learning models.

More about machine learning

Thanks for following along with this series. Recently, I’ve presented these tips to fellow data scientists who find them useful for their own machine learning efforts. If you find them useful, let me know in the comments. Or add a few more tips there for others.

Stay tuned for my next post about deep learning, or learn more about the opportunities and challenges for machine learning in the paper, “The Evolution of Analytics.”


About Author

Wayne Thompson

Manager Data Science Technologies

Wayne Thompson, Chief Data Scientist at SAS, is a globally renowned presenter, teacher, practitioner and innovator in the fields of data mining and machine learning. He has worked alongside the world's biggest and most challenging organizations to help them harness analytics to build high performing organizations. Over the course of his 24 year tenure at SAS, Wayne has been credited with bringing to market landmark SAS analytics technologies, including SAS Text Miner, SAS Credit Scoring for Enterprise Miner, SAS Model Manager, SAS Rapid Predictive Modeler, SAS Visual Statistics and more. His current focus initiatives include easy to use self-service data mining tools along with deep learning and cognitive computing tool kits.
