The Google Flu Trends application has received negative press since 2013 over its inability to accurately detect flu outbreaks. The latest critique, “The Parable of Google Flu: Traps in Big Data Analysis,” from Science magazine compares Google Flu Trends data to CDC data and dissects where the Google analysis went wrong.
As you might remember, Google Flu Trends was designed to pinpoint flu outbreaks by analyzing search data for flu-related keywords. The problem? At least 80 percent of people who conduct flu-related searches don’t actually have the flu.
Why does this story fascinate us? Partly because we can relate to it: Most of us have searched Google for medical information, and many of us, at one point or another, have thought we had the flu when we did not. But also because we like complex problems that are hard to solve.
The real lessons, though, are in the analysis. And this story reminds us of some important truths:
- Crowdsourced data is dirty data. It needs to be cleaned and managed before it is used for any type of official analysis.
- Social data is just one data point. Whether you’re working with Twitter, Facebook or Google data, it’s more powerful when combined with other data sources, such as CDC data, than when used as a standalone source.
- Keep monitoring and evaluating. You can’t just build a model and walk away. You have to monitor results and rebuild your model repeatedly before you arrive at an accurate representation of reality.
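To make the last point concrete, here is a minimal sketch of what "keep monitoring" could look like in practice: compare a model's recent estimates against a trusted reference series and flag the model for a rebuild when the two drift apart. This is not how Google Flu Trends or the Science authors did it. The function names, the window size, the tolerance and the sample numbers are all made up for illustration.

```python
from statistics import mean


def mean_absolute_error(estimates, reference):
    """Average absolute gap between model estimates and reference values."""
    if len(estimates) != len(reference):
        raise ValueError("estimates and reference must have the same length")
    return mean(abs(e - r) for e, r in zip(estimates, reference))


def needs_refit(estimates, reference, window=4, tolerance=1.0):
    """Return True when the average error over the latest `window` periods
    exceeds `tolerance`. Both parameters are arbitrary illustrative defaults."""
    recent_estimates = estimates[-window:]
    recent_reference = reference[-window:]
    return mean_absolute_error(recent_estimates, recent_reference) > tolerance


# Hypothetical weekly values, NOT real Google Flu Trends or CDC figures.
model_estimates = [2.1, 2.4, 3.0, 4.2, 5.9, 7.5]
reference_values = [2.0, 2.3, 2.8, 3.1, 3.4, 3.6]

if needs_refit(model_estimates, reference_values):
    print("Model has drifted from the reference data; time to rebuild it.")
else:
    print("Model still tracks the reference data.")
```

In this toy series the model starts out close to the reference and then overshoots, so the check flags it. The same comparison also illustrates the second lesson: the search-based estimate is only useful here because there is another data source to check it against.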
Be sure to read the Science magazine article for additional (and more scientific) lessons.