Big data quality – Part 3 - The Data Roundtable

Big data poses some interesting challenges for disciplines such as data integration and data governance, but this blog series addresses some of the most common questions big data raises related to data quality.

What’s the meta with big data?

Sometimes the best way to know what’s the matter with your data quality is to ask: What’s the meta with your data? In other words, the quality of your data’s metadata often determines your data’s usage. This is especially true with big data, since most of it is created outside of your organization. Determining how useful big data can be, and how much quality big data needs, begins with assessing big data’s metadata.

A lot of big data is unstructured and tagging is a common strategy for making unstructured data more usable. Tagging, however, often produces homonyms (i.e., the same tags used with different meanings) and synonyms (i.e., multiple tags for the same concept), which may lead to inappropriate data relationships and inefficient searches for data about a particular subject. Music genres, photo captions and movie categories are just a few examples of our dependence on the quality of big data’s metadata.
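To make the synonym and homonym problem concrete, here is a minimal Python sketch of tag cleanup. Everything in it is an assumption for illustration: the SYNONYMS map, the function names normalize_tag and find_possible_homonyms, and the sample tags are hypothetical, not part of any particular tool. It collapses known synonyms into one canonical tag and lists tags that show up in more than one context as homonym candidates for a person to review.

```python
from collections import defaultdict

# Hypothetical synonym map: each variant spelling points to one canonical tag.
SYNONYMS = {
    "hip hop": "hip-hop",
    "hiphop": "hip-hop",
    "sci fi": "science-fiction",
    "scifi": "science-fiction",
}


def normalize_tag(tag):
    """Lowercase, collapse whitespace, then map known synonyms to a canonical tag."""
    cleaned = " ".join(tag.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def find_possible_homonyms(tagged_items):
    """Return normalized tags that appear in more than one context.

    tagged_items is an iterable of (context, tag) pairs, for example
    ("music", "Rock"). A tag shared across contexts is only a candidate
    homonym; deciding whether the meanings really differ needs human review.
    """
    contexts = defaultdict(set)
    for context, tag in tagged_items:
        contexts[normalize_tag(tag)].add(context)
    return {tag: sorted(ctxs) for tag, ctxs in contexts.items() if len(ctxs) > 1}


items = [
    ("music", "Rock"),
    ("photo", "rock"),
    ("music", "Hip  Hop"),
    ("music", "hiphop"),
    ("movie", "Sci Fi"),
]
candidates = find_possible_homonyms(items)
```

With the sample items, "Hip  Hop" and "hiphop" collapse into a single tag, while "rock" is flagged because it appears under both music and photo. A lookup table like this only catches the synonyms someone has already listed, so it is a starting point rather than a complete answer.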

Will sensor data be defect-free?

Either by creating defective data or by assuming data quality is someone else’s responsibility, people are one of the leading root causes of poor data quality. Human-generated data is often rife with defects. Can machines not only do better but perhaps even create defect-free data?

The Internet of Things (IoT) consists of machines with embedded software, sensors and connectivity that enable them to collect and exchange data. IoT is the source of the more-structured category of big data known as sensor data. Although machine-generated data is largely immune to human error, its quality faces other issues. Poor data quality could be caused by a poorly calibrated sensor. If that sensor is monitoring oxygen levels on a spacecraft carrying humans (for example, on a journey to Mars), then poor data quality could make the astronauts dead on arrival. Other issues, such as a power loss, mechanical failure, loss of connectivity or software crash, could also cause poor sensor data quality. Sensor data, therefore, may be mostly human-free, but it's not defect-free.
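The failure modes above tend to leave recognizable traces in the data, and a minimal Python sketch can show what checking for them might look like. The function name check_readings, its parameters and the thresholds in the example are all hypothetical; in particular, the oxygen range used here is illustrative only and is not a real life-support limit. The sketch flags out-of-range values (a possible calibration problem), long gaps between timestamps (possible power or connectivity loss) and runs of identical values (a possibly stuck sensor).

```python
def check_readings(readings, low, high, max_gap_s, stuck_run=5):
    """Flag suspicious sensor readings.

    readings: list of (timestamp_seconds, value) pairs sorted by timestamp.
    low, high: assumed plausible range for the value.
    max_gap_s: longest acceptable time between consecutive readings.
    stuck_run: number of identical consecutive values treated as "stuck".
    Returns a list of (index, issue) tuples.
    """
    issues = []
    run = 1  # length of the current run of identical values
    for i, (ts, value) in enumerate(readings):
        # Out-of-range values may point to a poorly calibrated sensor.
        if not (low <= value <= high):
            issues.append((i, "out_of_range"))
        if i > 0:
            prev_ts, prev_value = readings[i - 1]
            # A long silence may point to power or connectivity loss.
            if ts - prev_ts > max_gap_s:
                issues.append((i, "gap"))
            # Identical consecutive values may point to a stuck sensor.
            run = run + 1 if value == prev_value else 1
            if run == stuck_run:
                issues.append((i, "stuck"))
    return issues


# Illustrative oxygen percentages; the 19.5 to 23.5 range is an assumption.
oxygen = [(0, 20.9), (1, 20.8), (2, 35.0), (10, 20.7)]
problems = check_readings(oxygen, low=19.5, high=23.5, max_gap_s=2)
```

Here the third reading is flagged as out of range and the fourth as arriving after a gap. Checks like these can only say a reading looks suspicious; they cannot confirm that a reading inside the expected range is actually correct.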

What say you?

What other questions or issues about the relationship between big data and data quality were not covered during this series? Post them in a comment below and I will address them in a follow-up post.
