Even though the first machine learning papers were published in the 1950s, one could argue the field goes back even further, to the work of Alan Turing and other early computer scientists. So why has this way of modeling become so popular now?
Because data has become a commodity. Large amounts of many different kinds of data are widely available. A lot of this data is messy and unstructured and does not easily lend itself to regression modeling.
For instance, even after preprocessing, text data is often sparse and can easily have more variables than rows (i.e., more distinct terms in the corpus than documents; the sketch below makes this concrete). Likewise, pixels in images or videos are often highly correlated, and enterprise data warehouses and data lakes are full of nasty multibyte character variables, missing values, and high-cardinality nominal variables.
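To illustrate that sparsity point, here is a minimal sketch, assuming scikit-learn is available; the three-document corpus is a hypothetical stand-in. Even a tiny corpus produces a document-term matrix with far more columns (terms) than rows (documents), and most entries are zero.

```python
# Minimal sketch: a tiny, hypothetical corpus still yields a document-term
# matrix with more columns (terms) than rows (documents), stored sparsely.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning models need lots of data",
    "regression models are easier to interpret",
    "unstructured text data is sparse after preprocessing",
]

X = CountVectorizer().fit_transform(docs)  # returns a SciPy sparse matrix

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)       # fraction of entries that are nonzero
print(f"{n_docs} documents x {n_terms} terms, density = {density:.2f}")
```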
Of course, the exponential decrease in the price of powerful computing resources is also a driver of machine learning’s current popularity. Basically, data that justifies the business use case for machine learning has become common. These large amounts of data also help machine learning models become more accurate. And businesses and organizations can now afford the computing resources required to train iterative machine learning models.
It’s important to understand that, despite all the recent hype, machine learning is really just another tool in your toolkit. Like everything else, it has strengths and weaknesses. Interestingly, one of machine learning’s biggest weaknesses is also one of its biggest strengths: because most machine learning models are not directly interpretable, you are forced to stop trying to understand every detail of the business problem you are modeling. This can be a good thing when a manufacturing process or social network is simply too complex for any one person to understand.
In these complex cases, you may need to give control of the modeling process over to a complex neural network or random forest to get the best results. However, you need to decide if not having an exact understanding of a particular business problem is appropriate for the project at hand.
Machine learning techniques are not as widely understood as regression techniques, and that’s another significant drawback. You can expect pushback from your management chain if you show them a model that simply can’t be explained. To gain acceptance of machine learning modeling techniques in your organization, you need to prove their business value in a way that is clear and compelling to decision makers. Don’t pitch a machine learning proposal using equations. Instead, use concise data visualizations, keep the bottom line in mind, and don’t propose “fixing” time-tested analytical processes based on regression techniques with machine learning.
If you do choose to move forward with a machine learning project, never expect magical results, be very wary of overfitting your training data, and don’t cheat yourself by spending hours tuning hyperparameters that make your model look amazing on one data set. It’s easy to tweak a machine learning algorithm to get great results on one data set; it’s much more difficult to train a machine learning model that generalizes well to new data, which is why it pays to evaluate on held-out data, as in the sketch below. Two recent KDnuggets posts point out other common machine learning pitfalls to keep in mind as well.
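As a rough illustration, here is a minimal sketch, assuming scikit-learn; the synthetic data and the random forest are placeholders, not a recommendation. The score on the data the model was fit to can look amazing while cross-validated scores on held-out folds give a more honest picture of how it will generalize.

```python
# Minimal sketch: compare the training score with cross-validated scores
# on held-out folds to get an honest read on generalization.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

train_score = model.score(X, y)                 # evaluated on the data it saw
cv_scores = cross_val_score(model, X, y, cv=5)  # evaluated on held-out folds

print(f"training accuracy:        {train_score:.2f}")
print(f"cross-validated accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```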
For more real-world tips and a deeper discussion of machine learning, tune in to the webcast, Machine Learning: Principles and Practice, where I review some of the fundamental ideas of machine learning and a few lessons learned from helping customers use machine learning in the real world.
This webcast is the first in a series on machine learning techniques: the second covers principal component analysis, and subsequent sessions will present other methods such as clustering and ensemble modeling.
Image credit: photo by Elliott Brown, used under a Creative Commons license.