Machine learning best practices: the basics


I started my training in machine learning at the University of Tennessee in the late 1980s. Of course, we didn’t call it machine learning then, and we didn’t call ourselves data scientists yet either. We used terms like statistics, analytics, data mining and data modeling.

Regardless of what you call it, I’ve spent more than 30 years building models that help global companies solve some of their most pressing problems. I’ve also had the good fortune to learn from some of the best data scientists on the planet, including Will Potts, Chief Data Scientist at Capital One, Dr. Warren Sarle, a distinguished researcher here at SAS, and Dr. William Sanders while I was at the University of Tennessee.

Through hundreds of projects and dozens of mentors over the years, I’ve caught on to some of the best practices for machine learning. I’ve narrowed those lessons down to my top ten tips. These are tips and tricks that I’ve relied on again and again over the years to develop the best models and solve difficult problems.

I’ll be sharing my tips in a series of posts over the next few weeks, starting with the first three tips here. The next tips will be longer, but these first three are short and sweet, so I've included them in one post:

  1. Look at your data.
    You'll spend 80 percent or more of your time preparing a training data set, so before building a model, look at your data at the observation level. I always use PROC PRINT with OBS=20 in Base SAS®, the FETCH action in SAS® Viya, and the head or tail methods in Python to see, and almost touch, the observations. You can quickly discern whether you have the right data in the correct form just by looking at it. It's not uncommon to make initial mistakes when building out your training data, so this tip can save you a lot of time. Naturally, you then want to generate measures of central tendency and dispersion. To isolate key trends and anomalies, compute summary statistics for your features against your label. If the label is categorical, compute summary measures using the label as a group-by variable. If the label is interval, compute correlations. If you have categorical features, use those as your by group.
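The Python side of this tip can be sketched in a few lines of pandas. This is a minimal illustration on made-up data (the column names are hypothetical), not a prescription:

```python
import pandas as pd
import numpy as np

# Hypothetical training data -- column names are illustrative only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 12_000, 200),
    "age": rng.integers(18, 80, 200),
    "region": rng.choice(["EAST", "WEST"], 200),
    "churn": rng.choice([0, 1], 200),          # categorical label
})

# Eyeball the observations themselves (the analogue of PROC PRINT with OBS=20)
print(df.head(20))

# Central tendency and dispersion for every numeric feature
print(df.describe())

# Categorical label: summarize features with the label as the group-by variable
print(df.groupby("churn")[["income", "age"]].agg(["mean", "std"]))

# Interval label: compute correlations instead
print(df[["income", "age"]].corr())
```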
  2. Slice and dice your data.
    Usually, there’s some underlying substructure in your data. So I often slice my data up like a pizza – although the slices are not all the same size – and build separate models for each slice. I may use a group-by variable like REGION or VEHICLE_TYPE that already provides built-in stratification for my training data. When I have a target, I also build a shallow decision tree and then build separate models for each segment. I rarely use clustering algorithms to build segments if I have a target. I just don’t like ignoring my target.
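The shallow-tree approach above can be sketched with scikit-learn. This is a rough illustration on synthetic data, assuming a classification target; the depth, the segment models, and the data are all stand-ins:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data -- a stand-in for a real training set
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Step 1: a shallow tree carves the data into a few coarse segments
segmenter = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
segments = segmenter.apply(X)            # leaf id for each observation

# Step 2: fit a separate model inside each segment
models = {}
for leaf in np.unique(segments):
    mask = segments == leaf
    if len(np.unique(y[mask])) < 2:      # skip pure leaves -- nothing left to model
        continue
    models[leaf] = LogisticRegression().fit(X[mask], y[mask])

print(f"{len(models)} segment models fitted")
```

Scoring then routes a new observation through `segmenter.apply` first and hands it to the matching segment model.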
  3. Remember Occam’s Razor.
    The object of Occam learning is to output a succinct representation of the training data. The rationale is that you want as simple a model as possible to make informed decisions. Many data scientists have moved away from Occam’s Razor, since building more complex models to extract as much as you can from your data is an important technique. However, I still like to build simple, white-box models using regression and decision trees. Or I’ll use a gradient boosting model as a quick check on how well my simple models are performing. I might add first-order interactions or other basic transformations to improve the performance of my regression model. I commonly use L1 regularization to shrink the number of effects in my model (watch for more about this in an upcoming post). Simpler models are also easier to deploy, which makes the IT and systems operation teams happy. Finally, using the simplest model possible also makes it easier to explain results to business users, who will want to understand how you’ve arrived at a conclusion before making decisions with the results.
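The interactions-plus-L1 recipe above can be sketched with scikit-learn. This is a toy example on synthetic data (the alpha value and data are illustrative assumptions, not tuned advice):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# Only two main effects and one interaction actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

# Add first-order interactions, then let the L1 penalty prune the rest
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)
model = Lasso(alpha=0.1).fit(X_int, y)

kept = np.sum(model.coef_ != 0)
print(f"{kept} of {X_int.shape[1]} effects survive the L1 penalty")
```

The surviving coefficients give you a short, explainable list of effects to show business users.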

My next post will be about detecting rare events. If there are other tips you want me to cover, leave a comment here.


About Author

Wayne Thompson

Manager Data Science Technologies

Wayne Thompson, Chief Data Scientist at SAS, is a globally renowned presenter, teacher, practitioner and innovator in the fields of data mining and machine learning. He has worked alongside the world's biggest and most challenging organizations to help them harness analytics to build high-performing organizations. Over the course of his 24-year tenure at SAS, Wayne has been credited with bringing to market landmark SAS analytics technologies, including SAS Text Miner, SAS Credit Scoring for Enterprise Miner, SAS Model Manager, SAS Rapid Predictive Modeler, SAS Visual Statistics and more. His current focus initiatives include easy-to-use self-service data mining tools along with deep learning and cognitive computing tool kits.


  1. Beth Ebersole

    This is great. I look forward to your future blogs as well! Thank you, Wayne!

  2. Thanks Wayne ..
    How about more on EDA? For example, could you share something about handling missing values and outliers in the data?
    Thanks again

  3. Wayne Thompson

    Sunny - Definitely a great topic that I will expand on later. One thing I like to do is bin continuous variables using something like tree-based methods and weights of evidence. You can maintain a separate bin for missing values. Binning also helps with extreme values in the tails. I also like using tree-based methods that allow the missing values to float or use surrogate rules. Another approach, given a good threaded SAS compute cluster, is to treat the feature (input) with missing values as a target and use some of the other features to predict those missing values. All of these are super easy to do in Enterprise Miner with just a simple process flow diagram (what many now call an ML pipeline). Outliers are a big topic of their own. Anyway, thanks for reading my blog.
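    The binning-with-a-missing-bin idea in the reply above can be roughed out in pandas. A minimal sketch on fabricated data (quantile binning stands in for the tree-based or weight-of-evidence binning Wayne mentions):

```python
import pandas as pd
import numpy as np

# Toy skewed feature with sprinkled-in missings -- illustrative only
rng = np.random.default_rng(3)
income = pd.Series(rng.lognormal(10, 1, 100))
income.iloc[::10] = np.nan                     # every 10th value is missing

# Quantile binning tames the extreme values in the tails;
# missings get their own dedicated bin instead of being imputed away
bins = pd.qcut(income, q=5, labels=[f"B{i}" for i in range(5)])
bins = bins.cat.add_categories("MISSING").fillna("MISSING")

print(bins.value_counts())
```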

  4. Hello sir,
    That really helped me - I was confused and now I am clear. Sir, I need your suggestion on where to start my work analyzing how hot weather affects public health (i.e. skin diseases). As it is a multivariate problem, will you please suggest what type of machine learning algorithm is most suitable, along with any other recommendations?
