Errors, lies, and big data


My previous post pondered the term disestimation, coined by Charles Seife in his book Proofiness: How You’re Being Fooled by the Numbers to warn us about understating or ignoring the uncertainties surrounding a number, mistaking it for a fact instead of the error-prone estimate that it really is.

Sometimes this fact appears to be acknowledged when numbers are presented along with a margin of error.

This, however, according to Seife, is “arguably the most misunderstood and abused mathematical concept. There are two important things to remember about the margin of error. First, the margin of error reflects the imprecision caused by statistical error—it is an unavoidable consequence of the randomness of nature. Second, the margin of error is a function of the size of the sample—the bigger the sample, the smaller the margin of error. In fact, the margin of error can be considered pretty much as nothing more than an expression of how big the sample is.”
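Seife's second point can be made concrete with the standard formula for the margin of error of a polled proportion at 95% confidence, which depends on nothing but the sample size once you take the worst case p = 0.5. A minimal sketch (the sample sizes are illustrative, not from the book):

```python
import math

# Margin of error for a polled proportion at 95% confidence (z = 1.96),
# using the worst case p = 0.5. Note that n is the only input that
# varies -- which is Seife's point: the margin of error is pretty much
# just an expression of how big the sample is.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in [100, 1_000, 10_000, 1_000_000]:
    print(f"n = {n:>9,}: +/- {margin_of_error(n):.2%}")
# n = 100       -> +/- 9.80%
# n = 10,000    -> +/- 0.98%
# n = 1,000,000 -> +/- 0.10%
```

Quadrupling the sample only halves the margin of error, but it does always shrink as n grows, which is exactly what tempts people into the fallacy discussed next.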

A Big Lie about Big Data

That second point often provides the basis for a big lie about big data—quantity improves quality. In other words, people falsely believe big data has fewer data quality issues since larger data sets have smaller margins of error. Or stated more succinctly: more data, less statistical error.

However, there are plenty of other errors that creep into data that aren’t statistical in nature. “A more insidious kind of error,” Seife explained, “is systematic error. Unlike statistical error, systematic error doesn’t diminish as the sample size grows.” One systematic error I have discussed in a previous post is sampling bias, which is when a sample, regardless of how big it is, isn’t randomly collected but instead reflects a deep data collection bias that skews the statistical results toward a false conclusion.
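A quick simulation makes the contrast vivid. Suppose the true rate of people who genuinely like a product is 50%, but the collection method over-samples enthusiasts (the 15-point bias and the rates below are hypothetical numbers chosen for illustration). Statistical noise fades as the sample grows; the systematic offset does not:

```python
import random

random.seed(42)

# Hypothetical scenario: the true "like" rate is 50%, but a biased
# collection method reaches fans disproportionately, so each respondent
# reports "like" with probability true_rate + bias. The statistical
# error shrinks as n grows; the systematic offset stays put.
def biased_sample_rate(n, true_rate=0.5, bias=0.15):
    return sum(random.random() < true_rate + bias for _ in range(n)) / n

for n in [100, 10_000, 1_000_000]:
    est = biased_sample_rate(n)
    print(f"n = {n:>9,}: estimate = {est:.3f} (truth = 0.500)")
```

No matter how large n gets, the estimate converges on roughly 0.65, not the true 0.50; a bigger sample only makes you more precisely wrong.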

Lies You Tell Big Data

While “lies, damn lies, and statistics” is a common data-bashing refrain uttered by people who don’t like what statistics are showing them, people themselves are another significant source of systematic error. More precisely, lying people. Before you set your pants on fire by claiming not to be a liar, consider the following examples:

  • Are you really “friends” with the people you connect with or customers of the products and services you “like” on social networking websites?
  • Do you really read the books you review on Amazon or the content you re-tweet on Twitter?
  • When you complete an online survey (e.g., for a chance to win a new iPhone or iPad), do you honestly answer questions like “Annual family income”?
  • Do all of the job titles and keywords in your LinkedIn profile reflect your actual professional experience?  (Just in case anyone asks, I was a Vice President at Vandelay Industries.)
  • When you sign up for a free trial of a web service or download a white paper, do you provide an active email address or select the country you actually live in from the drop-down list?

Well, maybe you don’t tell lies, but at the very least you have to admit that the truthiness of your self-reported data makes it rather quality-ish. This injects an insidious systematic error into the volume and variety of data fed into the statistical analysis being applied to an increasing number of business areas including market segmentation, campaign effectiveness, consumer behavior, sales forecasting, and sentiment analysis.

Don’t Lie to Yourself about Big Data

Big data contains tremendous business potential. Big data also contains errors and lies. The errors can be both statistical and systematic. The lies can be both what we tell big data and what we tell ourselves about big data. Data science, like all science, is a quest to get as close as possible to the truth. To make sure your quest gets as close as possible to the truth, don’t lie to yourself about big data.


About Author

Jim Harris

Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.
