Big data poses some interesting challenges for disciplines such as data integration and data governance, but this blog series addresses some of the most common questions big data raises related to data quality.
Does data quality matter less in larger data sets?
Some believe huge data volumes make individual data quality issues insignificant due to the law of large numbers. This view posits individual data quality issues will only make up a tiny part of the mass of big data, assuming data quality issues do not scale with increased data volume. My favorite analogy for this involves Kool-Aid. Adding one spoonful of the drink mix to a glass of water creates a tasty beverage (my favorite drink as a kid). Adding one spoonful to a gallon of water, however, will only make colorful water that still tastes like water. People who believe data quality matters less in larger data sets imagine big data pouring in gallons at a time while data quality issues trickle in only spoonfuls at a time. Don’t drink this Kool-Aid.
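To put rough numbers on the Kool-Aid analogy, here is a minimal sketch comparing the two assumptions. The function name, the fixed count of 1,000 defective records and the 2% defect rate are all hypothetical values chosen for illustration, not figures from any real data set.

```python
def dilution_vs_scaling(volumes, fixed_defects=1_000, defect_rate=0.02):
    """Compare two assumptions about how defects behave as data volume grows.

    fixed_defects and defect_rate are hypothetical, illustrative values.
    Returns one tuple per volume:
    (volume, defect share if defects stay fixed,
     defect count if defects scale, defect share if defects scale)
    """
    rows = []
    for n in volumes:
        # "Spoonful" view: the same number of bad records, ever more diluted.
        diluted_share = fixed_defects / n
        # Scaling view: every new record has the same chance of being bad.
        scaled_defects = round(defect_rate * n)
        scaled_share = scaled_defects / n
        rows.append((n, diluted_share, scaled_defects, scaled_share))
    return rows


if __name__ == "__main__":
    for n, diluted, count, scaled in dilution_vs_scaling([10**5, 10**7, 10**9]):
        print(f"{n:>13,} records | fixed defects: {diluted:.6%} | "
              f"scaling defects: {count:,} bad records ({scaled:.2%})")
```

Under the spoonful assumption the defective share shrinks toward nothing. If defects arrive at a steady per-record rate instead, the share never moves, and the absolute number of bad records grows right along with the data.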
Others argue quantity automatically improves quality since larger data sets have smaller margins of error (i.e., less statistical error) than smaller data sets. Big data does reduce sampling error by using all (or as much as possible) of the data, as opposed to just a sample, but this does not prevent a systematic bias from overwhelming a big data set. Consider data sets drawn from social media, for example, where users are often disproportionately young, urban and technologically savvy.
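A small simulation can make the difference between the two kinds of error concrete. This is only a sketch under made-up assumptions: a hypothetical population that is 30% "young" (average score 70) and 70% "older" (average score 50), so the true average is 56, and a hypothetical skewed source, standing in for a social media feed, where 80% of records come from young users.

```python
import math
import random
import statistics


def compare_sources(n, seed=0):
    """Estimate a population mean from a representative and a skewed source.

    All group shares and score distributions are hypothetical.
    True population mean: 0.3 * 70 + 0.7 * 50 = 56.
    Mean the skewed source converges to: 0.8 * 70 + 0.2 * 50 = 66.
    """
    rng = random.Random(seed)

    def draw(share_young):
        # Pick a group, then draw a score around that group's average.
        is_young = rng.random() < share_young
        return rng.gauss(70 if is_young else 50, 10)

    results = {}
    for name, share_young in (("representative", 0.3), ("skewed", 0.8)):
        values = [draw(share_young) for _ in range(n)]
        mean = statistics.fmean(values)
        # The standard error shrinks as n grows, whether or not the source is biased.
        std_error = statistics.stdev(values) / math.sqrt(n)
        results[name] = (mean, std_error)
    return results


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        for name, (mean, se) in compare_sources(n).items():
            print(f"n={n:>9,} {name:>14}: mean={mean:6.2f}  std error={se:.4f}")
```

As n grows, the standard error shrinks for both sources, but the skewed source settles confidently on the wrong answer (about 66 instead of 56). More volume narrows the margin of error without touching the bias.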
Crowdsourcing exemplifies how quality can sometimes be achieved through quantity. Spelling errors and other mistakes in web search terms are a great example. Google alone receives 100 billion searches per month. With that many queries, correct spelling can be determined by successful searches without using a spell checking algorithm. However, in most cases the quality of big data will require a more formal assessment and improvement. Just remember to first determine the business impact of data quality issues before taking corrective action.
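As a toy illustration of the crowdsourced spelling idea, the sketch below picks the most frequently successful variant among similar-looking queries. It is not how any real search engine works. The function name and the query counts are invented, and using Python's `difflib.get_close_matches` to group similar queries is my own assumption; no dictionary is consulted.

```python
import difflib
from collections import Counter


def suggest_spelling(query, successful_counts, cutoff=0.8):
    """Return the most frequently successful query that resembles `query`.

    successful_counts maps a query string to how often it led to a
    successful search (hypothetical data). If nothing similar has been
    seen, the query is returned unchanged.
    """
    candidates = difflib.get_close_matches(
        query, list(successful_counts), n=5, cutoff=cutoff
    )
    if not candidates:
        return query
    # The crowd decides: the variant with the most successful searches wins.
    return max(candidates, key=lambda c: successful_counts[c])


if __name__ == "__main__":
    # Hypothetical log of successful searches.
    log = Counter({
        "data quality": 950,
        "data qualty": 12,
        "data governance": 400,
    })
    print(suggest_spelling("data qualty", log))
```

The misspelled variant appears in the log too, but it is vastly outnumbered by the correct one, which is the whole point: the quality comes from the quantity.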