I started my training in machine learning at the University of Tennessee in the late 1980s. Of course, we didn’t call it machine learning then, and we didn’t call ourselves data scientists yet either. We used terms like statistics, analytics, data mining and data modeling.
Regardless of what you call it, I’ve spent more than 30 years building models that help global companies solve some of their most pressing problems. I’ve also had the good fortune to learn from some of the best data scientists on the planet, including Will Potts, Chief Data Scientist at Capital One, Dr. Warren Sarle, a distinguished researcher here at SAS, and Dr. William Sanders while I was at the University of Tennessee.
Through hundreds of projects and dozens of mentors over the years, I’ve caught on to some of the best practices for machine learning. I’ve narrowed those lessons down to my top ten tips. These are tips and tricks that I’ve relied on again and again over the years to develop the best models and solve difficult problems.
I’ll be sharing my tips in a series of posts over the next few weeks, starting with the first three tips here. The next tips will be longer, but these first three are short and sweet, so I've included them in one post:
- Look at your data.
You spend 80 percent or more of your time preparing a training data set, so before building a model, please look at your data at the observation level. I always use PROC PRINT with OBS=20 in Base SAS®, the FETCH action in SAS® Viya, and the head() or tail() methods in Python (pandas) to see, and almost touch, the observations. You can quickly discern whether you have the right data in the correct form just by looking at it. It’s not uncommon to make mistakes when first building out your training data, so this tip can save you a lot of time. Next, generate measures of central tendency and dispersion. To isolate key trends and anomalies, compute summary statistics for your features alongside your label. If the label is categorical, compute summary measures using the label as a group-by variable. If the label is interval, compute correlations, and if you have categorical features, use those as your by group.
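In pandas, that first look plus label-grouped summary statistics might look like the following sketch; the tiny data set and its column names (REGION, CLAIM_AMOUNT, FRAUD_FLAG) are hypothetical, invented only for illustration:

```python
import pandas as pd

# A small, made-up training set for illustration.
df = pd.DataFrame({
    "REGION": ["East", "West", "East", "West", "East", "West"],
    "CLAIM_AMOUNT": [120.0, 340.5, 98.2, 410.0, 150.3, 275.8],
    "FRAUD_FLAG": [0, 1, 0, 1, 0, 0],   # categorical label
})

# Look at the observations themselves, like PROC PRINT with OBS=.
print(df.head(3))
print(df.tail(3))

# Categorical label: summary statistics grouped by the label.
print(df.groupby("FRAUD_FLAG")["CLAIM_AMOUNT"]
        .agg(["mean", "std", "min", "max"]))
```

Seeing the grouped means side by side is often enough to spot a feature that behaves very differently across label values, or a column that was loaded in the wrong form.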
- Slice and dice your data.
Usually, there’s some underlying substructure in your data. So I often slice my data up like a pizza – although the slices are not all the same size – and build separate models for each slice. I may use a group-by variable like REGION or VEHICLE_TYPE that already provides built-in stratification for my training data. When I have a target, I also build a shallow decision tree and then build separate models for each segment. I rarely use clustering algorithms to build segments if I have a target. I just don’t like ignoring my target.
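The tree-then-segment-models idea can be sketched with scikit-learn as follows; the synthetic data, the depth-1 tree, and the per-segment linear models are all my own illustrative assumptions, not the post's exact recipe:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with a hidden structural break at X[:, 0] > 5:
# the relationship between X[:, 1] and y flips between the two regimes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 3.0 * X[:, 1] + 10.0, -2.0 * X[:, 1])

# A shallow decision tree defines the segments -- using the target,
# rather than an unsupervised clustering that would ignore it.
segmenter = DecisionTreeRegressor(max_depth=1).fit(X, y)
segment_id = segmenter.apply(X)   # leaf index for each observation

# Fit a separate, simple model on each slice.
models = {}
for leaf in np.unique(segment_id):
    mask = segment_id == leaf
    models[leaf] = LinearRegression().fit(X[mask], y[mask])
```

Each per-segment model stays simple and interpretable, while the tree captures the substructure a single global model would blur together.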
- Remember Occam’s Razor.
The object of Occam learning is to output a succinct representation of the training data. The rationale is that you want as simple a model as possible to make informed decisions. Many data scientists have moved away from Occam’s Razor, since building more complex models to extract as much as you can from your data is an important technique. However, I still like to build simple, white-box models using regression and decision trees. Or I’ll use a gradient boosting model as a quick check on how well my simple models are performing. I might add first-order interactions or other basic transformations to improve the performance of my regression model. I commonly use L1 regularization to shrink the number of effects in my model (watch for more about this in an upcoming post). Simpler models are also easier to deploy, which makes the IT and systems operations teams happy. Finally, using the simplest model possible also makes it easier to explain results to business users, who will want to understand how you’ve arrived at a conclusion before making decisions with the results.
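A minimal sketch of that L1 shrinkage, using scikit-learn's Lasso; the synthetic data, the alpha value, and the feature count are all hypothetical choices of mine:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only 2 of 10 features actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# The L1 penalty drives small coefficients to exactly zero,
# leaving a sparser, simpler model.
lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # indices of surviving effects
print("effects kept:", kept)
```

With a suitable penalty strength, the noise features typically drop out entirely, so the model you hand to IT and explain to business users carries only the effects that matter.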