Big data hubris


While big data is rife with potential, as Larry Greenemeier explained in his recent Scientific American blog post Why Big Data Isn’t Necessarily Better Data, context is often lacking when data is pulled from disparate sources, leading to questionable conclusions. His blog post examined the difficulties that Google Flu Trends (GFT) has experienced while attempting to accurately provide real-time monitoring of influenza cases worldwide based on Google searches that matched terms for flu-related activity.

Greenemeier cited university research positing that one contributing factor in GFT mistakes is big data hubris, which the researchers explained is the “often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.” The mistake of many big data projects, the researchers noted, is that they are not based on technology designed to produce valid and reliable data amenable to scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than from carefully vetted participants and scientific instruments.

Google explained that the premise of its model is that certain search query terms, such as “flu symptoms,” have a high historical correlation with doctor visits for influenza-like illness and so may be useful predictors of such visits in the future. They hypothesized that one reason for the GFT algorithm’s inaccurate predictions for the 2012-2013 flu season was a spike in search volume caused by people reacting to heightened media coverage. In other words, the media reporting that Google predicted a bad flu season caused a lot of people to google “flu symptoms” any time they didn’t feel well, which the GFT algorithm took as a strong signal that it was, in fact, a bad flu season. This big data echo chamber caused GFT to overestimate the prevalence of flu in 2012-2013 by more than 50 percent and to predict more than twice the number of doctor visits for influenza-like illness as the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from a number of laboratories.
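The historical-correlation premise behind the model can be illustrated with a short sketch. The weekly figures below are hypothetical, invented only for illustration; they are not Google's or the CDC's actual data.

```python
import numpy as np

# Hypothetical weekly data: search volume for "flu symptoms" and
# doctor visits for influenza-like illness (ILI) over the same weeks.
query_volume = np.array([120, 150, 210, 340, 520, 480, 300, 180], dtype=float)
ili_visits   = np.array([ 80, 100, 140, 260, 400, 390, 240, 130], dtype=float)

# Pearson correlation between the two series: a value near 1 suggests
# the query series tracks ILI visits closely, which is what motivates
# using near-real-time search data as a predictor of future visits.
r = np.corrcoef(query_volume, ili_visits)[0, 1]
print(round(r, 3))
```

The catch, as the echo-chamber episode shows, is that a correlation measured in one period can break down when the behavior generating the searches changes.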

Google explained that it added “spike detectors” to the GFT algorithm to identify patterns of anomalous spikes in query volume due to flu-related media reports, and that it is experimenting with using independent measures of flu in the news media to modulate the signal of certain flu-related queries.
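Google has not published the details of its spike detectors, but one common approach to flagging anomalous bursts in a time series is a trailing z-score test. The function and the sample series below are a minimal sketch of that general technique, not Google's actual method.

```python
import statistics

def detect_spikes(volumes, window=4, threshold=3.0):
    """Flag indices where volume exceeds the trailing mean by more than
    `threshold` trailing standard deviations (a simple z-score test)."""
    spikes = []
    for i in range(window, len(volumes)):
        trailing = volumes[i - window:i]
        mean = statistics.fmean(trailing)
        stdev = statistics.pstdev(trailing)
        if stdev > 0 and (volumes[i] - mean) / stdev > threshold:
            spikes.append(i)
    return spikes

# A media-driven burst at index 6 against an otherwise steady series.
weekly_volume = [100, 105, 98, 102, 101, 99, 400, 110]
print(detect_spikes(weekly_volume))  # -> [6]
```

A query whose volume is flagged this way could then be down-weighted rather than fed straight into the flu estimate.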

While the ability to replicate an experiment is a hallmark of science, as the university researchers noted, “platforms such as Google, Twitter, and Facebook are always re-engineering their software, and whether studies based on data collected at one time could be re-done with data collected from earlier or later periods is an open question.”

None of this is meant to imply big data analytics has limited value. As the researchers noted, “greater value can be obtained by combining GFT with other near-real-time health data. By combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone.”
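The researchers' idea of combining GFT with lagged CDC data and dynamically recalibrating can be sketched as an ordinary least-squares fit that uses both signals as predictors. All of the weekly rates below are hypothetical illustration data, and this is one simple way to combine the signals, not the researchers' exact model.

```python
import numpy as np

# Hypothetical weekly ILI rates: the current CDC value we want to
# estimate, the raw GFT estimate for the same week (which overshoots),
# and CDC data lagged two weeks (official surveillance arrives with a
# reporting delay, so only the lagged values are available in real time).
cdc_actual = np.array([1.2, 1.5, 2.1, 3.4, 4.0, 3.2, 2.0, 1.4])
gft_raw    = np.array([1.5, 1.9, 2.8, 5.0, 6.1, 4.4, 2.6, 1.7])
cdc_lagged = np.array([1.0, 1.1, 1.2, 1.5, 2.1, 3.4, 4.0, 3.2])

# Least-squares fit: cdc_actual ~ a*gft_raw + b*cdc_lagged + c.
# Recalibrating GFT against lagged official data damps its overestimates.
X = np.column_stack([gft_raw, cdc_lagged, np.ones_like(gft_raw)])
coef, *_ = np.linalg.lstsq(X, cdc_actual, rcond=None)
combined = X @ coef

print(np.round(combined, 2))
```

By construction the fitted combination cannot do worse in-sample than raw GFT alone, which captures the spirit of the researchers' finding that the blended estimate outperforms either source by itself.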

The value of big data analytics, therefore, lies in combining it with, and using it to supplement, traditional analytics.


About Author

Jim Harris

Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.

