Machine learning best practices: the basics

I started my training in machine learning at the University of Tennessee in the late 1980s. Of course, we didn’t call it machine learning then, and we didn’t call ourselves data scientists yet either. We used terms like statistics, analytics, data mining and data modeling.

Regardless of what you call it, I’ve spent more than 30 years building models that help global companies solve some of their most pressing problems. I’ve also had the good fortune to learn from some of the best data scientists on the planet, including Will Potts, Chief Data Scientist at Capital One, Dr. Warren Sarle, a distinguished researcher here at SAS, and Dr. William Sanders while I was at the University of Tennessee.

Through hundreds of projects and dozens of mentors over the years, I’ve caught on to some of the best practices for machine learning. I’ve narrowed those lessons down to my top ten tips. These are tips and tricks that I’ve relied on again and again over the years to develop the best models and solve difficult problems.

I’ll be sharing my tips in a series of posts over the next few weeks, starting with the first three tips here. The next tips will be longer, but these first three are short and sweet, so I've included them in one post:

Look at your data.
You spend 80 percent or more of your time preparing a training data set, so before building a model, please look at your data at the observational level. I always use PROC PRINT with OBS=20 in Base SAS®, the FETCH action in SAS® Viya, and the head or tail methods in Python (pandas) to see and almost touch the observations. You can quickly discern whether you have the right data in the correct form just by looking at it. It’s not uncommon to make mistakes when first building out your training data, so this tip can save you a lot of time. Naturally, you then want to generate measures of central tendency and dispersion. To isolate key trends and anomalies, compute summary statistics for your features in relation to your label. If the label is categorical, compute summary measures using the label as a group-by variable. If the label is interval, compute correlations between the features and the label. If you have categorical features, use those as your by groups.
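As a minimal sketch of this tip in Python, assuming pandas is available: the toy data set and its column names (age, income, region, churned) are made up for illustration and do not come from any real project.

```python
import pandas as pd

# Hypothetical toy training set; column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 27],
    "income": [38.0, 82.0, 55.0, 91.0, 60.0, 41.0],
    "region": ["east", "west", "east", "west", "east", "west"],
    "churned": ["no", "yes", "no", "yes", "no", "no"],  # categorical label
})

# Step 1: look at raw observations before anything else.
print(df.head(20))  # first 20 rows (fewer if the table is smaller)
print(df.tail(3))   # last 3 rows

# Step 2: measures of central tendency and dispersion for numeric features.
print(df.describe())

# Step 3a: categorical label -> summarize features with the label as group-by.
by_label = df.groupby("churned")[["age", "income"]].agg(["mean", "std"])
print(by_label)

# Step 3b: interval label -> correlations between features and the label.
# Here income is treated as the label purely for illustration.
corr_with_income = df[["age", "income"]].corr()["income"].drop("income")
print(corr_with_income)
```

The same group-by pattern works with a categorical feature such as region in place of the label.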
Slice and dice your data.
Usually, there’s some underlying substructure in your data. So I often slice my data up like a pizza (although the slices are not all the same size) and build separate models for each slice. I may use a group-by variable like REGION or VEHICLE_TYPE that already provides built-in stratification for my training data. When I have a target, I also build a shallow decision tree, treat its leaves as segments, and then build separate models for each segment. I rarely use clustering algorithms to build segments if I have a target. I just don’t like ignoring my target.
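The group-by variant of this tip can be sketched as follows. This is an illustration under stated assumptions, not a recipe: the data are simulated, the column names x and y and the helper fit_line are hypothetical, and a straight-line fit stands in for whatever model you would really build per slice. The decision-tree variant is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data (an assumption): the slope of y on x differs by REGION.
n = 200
region = rng.choice(["NORTH", "SOUTH"], size=n)
x = rng.uniform(0, 10, size=n)
slope = np.where(region == "NORTH", 2.0, -1.0)
y = slope * x + rng.normal(0, 0.5, size=n)
df = pd.DataFrame({"REGION": region, "x": x, "y": y})

def fit_line(group):
    """Least-squares fit of y = a*x + b for one slice of the data."""
    a, b = np.polyfit(group["x"], group["y"], deg=1)
    return {"slope": a, "intercept": b, "n_obs": len(group)}

# One model per slice, keyed by the group-by value.
models = {name: fit_line(g) for name, g in df.groupby("REGION")}

# A single pooled model for comparison; it blurs the two regimes together.
pooled = fit_line(df)

for name, m in models.items():
    print(name, round(m["slope"], 2), m["n_obs"])
print("pooled", round(pooled["slope"], 2))
```

Each per-slice slope should land close to the value used to simulate that region, while the pooled slope falls somewhere in between and describes neither slice well.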
Remember Occam’s Razor.
The object of Occam learning is to output a succinct representation of the training data. The rationale is that you want as simple a model as possible to make informed decisions. Many data scientists no longer believe in Occam’s Razor, since building more complex models to extract as much as you can from your data is an important technique. However, I also like to build simple, white-box models using regression and decision trees, and I’ll use a gradient boosting model as a quick check on how well my simple models are performing. I might add first-order interactions or other basic transformations to improve the performance of my regression model. I commonly use L1 regularization to shrink the number of effects in my model (watch for more about this in an upcoming post). Simpler models are also easier to deploy, which makes the IT and systems operation teams happy. Finally, using the simplest model possible also makes it easier to explain results to business users, who will want to understand how you’ve arrived at a conclusion before making decisions with the results.
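To make the L1 point concrete, here is a small sketch assuming scikit-learn is installed. The data are simulated so that only two of ten features matter, and the penalty strength alpha=0.2 is an arbitrary value chosen for the illustration; in practice you would tune it, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Simulated data (an assumption): 10 features, only the first two drive y.
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=n)

# Ordinary least squares keeps every effect in the model.
ols = LinearRegression().fit(X, y)

# The L1 penalty pushes weak effects to exactly zero.
# alpha=0.2 is an assumed value for this sketch, not a recommendation.
lasso = Lasso(alpha=0.2).fit(X, y)

kept = np.flatnonzero(lasso.coef_)
print("OLS nonzero effects:  ", np.count_nonzero(ols.coef_))
print("Lasso nonzero effects:", kept.tolist())
print("Lasso coefficients:   ", np.round(lasso.coef_[kept], 2))
```

The surviving coefficients are shrunk toward zero as a side effect of the penalty, so some practitioners refit an unpenalized regression on just the selected effects.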

My next post will be about detecting rare events, and you can click on the image below to continue reading all ten machine learning best practices as I publish them.

If there are other tips you want me to cover, leave a comment here.

11 Comments

Bart Baesens on July 13, 2017 12:48 am

Nice contribution! Couldn't agree more on the Occam Razor point!
Wayne Thompson on July 13, 2017 4:29 pm

Thanks Bart. Appreciate your reply.
Beth Ebersole on July 14, 2017 3:11 am

This is great. I look forward to your future blogs as well! Thank you, Wayne!
Karolyn Hector on July 14, 2017 5:29 pm

Love your blog.
Wayne Thompson on July 16, 2017 12:31 pm

Thanks Karolyn and Beth - More in-depth tips to come.
Hasan Akhtar on July 18, 2017 3:57 am

Good tips. Very useful
Sunny kumar on July 18, 2017 2:56 pm

Thanks Wayne ..
How about more on EDA? I mean, could you share something about the problem of missing values and outliers in the data?
Thanks again
Wayne Thompson on July 19, 2017 1:38 pm

Sunny - Definitely a great topic that I will expand on later. One thing I like to do is bin continuous variables using something like tree-based methods and weights of evidence. You can maintain a separate bin for missing values. Binning also helps with extreme values in the tails. I also like using tree-based methods that allow the missing values to float or that use surrogate rules. Another approach, if you have a good threaded SAS compute cluster, is to treat the feature (input) with missing values as a target and use some of the other features to predict its missing values. All of these are super easy to do in Enterprise Miner with just a simple process flow diagram (what many now call an ML pipeline). Outliers are a big topic. Anyway, thanks for reading my blog.
Abdul on July 20, 2017 12:42 am

Hello sir,
That really helped me. I was confused, and now I am clear. Sir, I need your suggestion on where to start my analysis of how hot weather affects public health (i.e., skin diseases). As it is a multivariate problem, will you please suggest what type of machine learning algorithm is most suitable, along with any other recommendations?
Surya on July 20, 2017 2:11 am

Thank you for the information on ML. I look forward to your next post on ML.
Pingback: Machine learning best practices: combining lots of models - Subconscious Musings
