Sophisticated models with big data sets and powerful algorithms will always give you better results than a simpler approach – right? Not necessarily, says Andy Pulkstenis, Director of Data Science and Advanced Analytics at State Farm Insurance. This rationale assumes that the math is more important than the data. That’s a flawed assumption, he says. “Really good data with basic models beats bad data with amazing models almost all the time.”
Another flawed assumption many people have is that open source is key to everything because of its low cost and easy implementation. “We’ve over-steered in this direction,” Pulkstenis says. When you consider the true cost of running multiple free programs – training overhead, cobbled-together solutions, version control, and compatibility issues – this rationale can quickly crumble.
Underlying assumptions like these often lead to poor choices and unnecessary mistakes – especially when using sophisticated methods like machine learning. Want to skip the mistakes others have made? Pulkstenis shares six of the most common machine learning pitfalls he has observed throughout his data science career.
Pitfall 1: Thinking big data and advanced analytics will solve all your problems
Analytics is incredibly important for getting insights about your business. But to make the best use of big data analytics – especially when you’re applying complicated math to massive data sets – you’ll need to cover the basics first.
If you haven’t managed your data properly – which includes accessing and integrating all the right types and sources of data, cleansing it, and ensuring that you have up-to-date information – some of your classic analytics problems may get worse instead of better.
Be mindful of these analytics basics that could trip you up:
- False positives. Many data scientists have made the mistake of concluding that something is true when it is, in fact, false. Powerful algorithms paired with powerful computers can detect all kinds of patterns and relationships in the data – even if they are only occurring through chance. You should thoroughly examine your data, confirm that your results are repeatable on new data, and ask questions. Don’t jump to a premature decision without having a full picture of the situation.
- Blind spots in the data. Keep an eye out for hidden bias in your data that’s left over from historical decisions. This could have been introduced by your company’s former strategies, macroeconomic shifts, or even your customers’ decisions. Whatever the origins, hidden bias in data can lead to inaccurate conclusions.
- Correlation versus causation. It’s easy to mistake correlations in the data for actual causal factors. Remember that correlation and causation are not one and the same. If you don’t recognize the distinction, you may make faulty decisions.
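The false-positive risk in the list above can be illustrated with a small simulation. This is a minimal sketch, not a method from Pulkstenis: the function names, the 200 noise "features," and the 0.2 correlation threshold are all assumptions chosen for illustration. Every feature is pure random noise, so any "relationship" it finds is a product of chance.

```python
import random
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def chance_findings(n_rows=100, n_features=200, threshold=0.2, seed=0):
    """Screen pure-noise features against a pure-noise target.

    Returns (hits, survivors): the features that look correlated in the
    original sample, and the subset that still clears the same threshold
    on a fresh, independent sample.
    """
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_rows)]

    target_old, target_new = noise(), noise()
    hits, survivors = [], []
    for j in range(n_features):
        feat_old, feat_new = noise(), noise()
        if abs(pearson(feat_old, target_old)) > threshold:
            hits.append(j)  # looks like a pattern, but is chance
            if abs(pearson(feat_new, target_new)) > threshold:
                survivors.append(j)  # rarely repeats on new data
    return hits, survivors


if __name__ == "__main__":
    hits, survivors = chance_findings()
    print(f"{len(hits)} of 200 noise features passed the first screen")
    print(f"{len(survivors)} of those passed again on new data")
```

Screening many candidate variables almost guarantees a few apparent hits, which is why checking that results repeat on new data matters.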
Pitfall 2: Relying too much on observational data
Thanks to advancements in technology like embedded smart devices and telematics, and relatively cheap and available big data storage – along with a desire to inject more data science into business decisions – organizations are virtually drowning in observational data. But the sheer availability of data can lead to what Pulkstenis calls “observational data dumpster diving.”
While it’s valuable to collect data, you’re still dealing with non-random data that reflects all the historical decisions and economic conditions that were in place when the data was originally collected. And that can lead to erroneous conclusions. Still, many data scientists succumb to the temptation of fishing for insights without a hypothesis.
Don’t discount the value of experimentation and science in favor of observational data dumpster diving. It pays to be more deliberate about what you’re looking for when you’re using a powerful technique like machine learning. Leverage your extensive observational data resources to generate potentially powerful hypotheses – but confirm those hypotheses with A/B or multivariate testing to separate fact from fiction.
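As one illustration of what confirming a hypothesis with an A/B test can look like, here is a minimal sketch of a two-proportion z-test using only the Python standard library. The function name and the conversion counts are hypothetical, and the test assumes users were randomly assigned to the two groups and that samples are large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a / n_a: conversions and sample size in control group A
    conv_b / n_b: conversions and sample size in treatment group B
    Returns (z, p_value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10% conversion in A versus 15% in B
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Because assignment is randomized, a small p-value here speaks to the effect of the change itself rather than to whatever historical conditions shaped the observational data.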
Pitfall 3: Ignoring existing barriers or implementation struggles in your company
Many artificial intelligence and machine learning projects fail. The reasons go back to a few long-running challenges: data integrity and infrastructure issues, a lack of unity across the organization, and the challenge of moving models into production. Although these issues have been around for years, they’re just as important today as they were 20 years ago.
Think through which people and organizational divisions you’ll need to bring together to succeed with a machine learning project. Be prepared to face pushback, too. If your data-driven approach turns up a surprising answer, some experts may apply manual rules to the data to get the results they expected.
Better algorithms and machine learning won’t help if your company is still struggling to implement simple data science solutions.
Pitfall 4: Being blinded by buzzwords
As math, computers and data analytics continue to evolve and improve, buzzwords have spiraled out of control. For example, Pulkstenis says a term he hears a lot today is “data robot.” But what is that? A data robot is just a modeling or rule-based automation solution – something we’ve had for decades. Giving something a sexier name doesn’t change what it is or the issues it presents.
Watch out for people who toss around buzzwords and make empty promises – and protect yourself through self-education. Delve into topics like applied data science. Learn its history, how it’s being used, and the pitfalls of the approach, both the good and the bad. Your knowledge will help you separate fact from fiction.
Pitfall 5: Believing that proper training or theory doesn’t matter
Fundamental math is called “fundamental” for a reason. To create good models, data scientists need a solid understanding of mathematical underpinnings and rules. And that comes from proper training and education.
Watch for citizen data scientists who are just smart enough to get into trouble. For example, consider the case of a citizen data scientist who had very large data sets and estimates from a model with tiny confidence intervals. Because of the sample size and these very precise estimates, this person thought it was safe to ignore a basic law of modeling. But the results of the model were only half right: the estimates were precise, but because of the theoretical violation they were centered at the wrong value. The result was very precise incorrect answers.
Luckily, other colleagues who were reviewing the model caught this mistake, so no harm was done. Otherwise, decisions made from the model would have been dramatically incorrect. Remember that data science is not magic – it’s still based on fundamental mathematical rules.
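The article does not say which modeling rule was violated in this case. As one hypothetical way such a result can arise, the sketch below simulates omitted-variable bias: a confounder that drives both the input and the outcome is left out of a simple regression. All names, coefficients, and sample sizes are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import fmean


def simple_ols_slope(xs, ys):
    """Slope and its standard error for a one-variable least-squares fit."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = sqrt(ssr / (n - 2) / sxx)
    return slope, se


def simulate(n=50_000, true_effect=1.0, seed=1):
    """Fit y on x while ignoring a confounder z that drives both."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)              # confounder, omitted from the model
        x = z + rng.gauss(0, 1)          # x depends on z
        y = true_effect * x + 2.0 * z + rng.gauss(0, 1)  # so does y
        xs.append(x)
        ys.append(y)
    return simple_ols_slope(xs, ys)


if __name__ == "__main__":
    slope, se = simulate()
    low, high = slope - 1.96 * se, slope + 1.96 * se
    print(f"true effect: 1.0, estimate: {slope:.3f}")
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

With these assumed coefficients the fitted slope settles near 2 rather than the true value of 1, and the large sample makes the interval around that wrong value very narrow. More data shrinks the interval but does not move its center.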
Pitfall 6: Overconfidence – thinking you have it all figured out
In truth, no one has it all figured out. Don’t be afraid to share and act on what you know – but acknowledge what you don’t know, too. The most successful people and organizations routinely question, measure, and reevaluate. They aren’t afraid to change what didn’t work, or what needs to evolve with the times. Be open to criticism and challenges – it’s the best path to success.