Our world is now so awash in data that many organizations have an embarrassment of riches when it comes to available data to support operational, tactical and strategic activities of the enterprise. Such a data-rich environment is highly susceptible to poor-quality data. This is especially true when swimming in data lakes – the increasingly popular (and, arguably, increasingly necessary) storage repositories that hold a vast amount of raw data in its native format, including structured, semistructured and unstructured data. In data lakes, of course, data structures and business requirements do not have to be defined until the data is needed.
As the TDWI Best Practices Report "Improving Data Preparation for Business Analytics" explains, a key reason why organizations are creating data lakes is simply to make more data available for analytics, even if consistency and data quality are uncertain. The report also notes that Hadoop is playing an important role in this data availability.
A major limitation of early Hadoop versions was that they were computationally coupled with MapReduce. This meant that for data quality functions to process data stored in the Hadoop Distributed File System (HDFS), one of two things had to happen:
- The data had to be extracted from HDFS and processed outside of Hadoop (negating most of Hadoop's efficiency and scalability benefits).
- Or the functionality had to be rewritten in MapReduce so it could be executed in Hadoop (which was not only specialized and labor-intensive work, but some data quality functionality isn’t MapReduce-able).
Thankfully, Hadoop versions since 2.0, which introduced YARN to separate cluster resource management from MapReduce, enable non-MapReduce data processing to be executed natively on data stored in HDFS. This is why leading data management vendors now offer solutions for improving the quality of big data in Hadoop.
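To give a feel for what rewriting a data quality function in MapReduce terms involved, here is a minimal sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_record`, `reduce_counts`) and the sample records are made up for illustration. It computes a simple completeness check, the share of missing values per column, as a map step that emits per-field counts and a reduce step that sums them.

```python
from collections import defaultdict


def map_record(record):
    """Map step: for each field, emit (column, (missing_count, total_count))."""
    for column, value in record.items():
        is_missing = value is None or str(value).strip() == ""
        yield column, (1 if is_missing else 0, 1)


def reduce_counts(pairs):
    """Group by column (the 'shuffle'), sum the counts, return missing rates."""
    totals = defaultdict(lambda: [0, 0])
    for column, (missing, total) in pairs:
        totals[column][0] += missing
        totals[column][1] += total
    return {column: missing / total for column, (missing, total) in totals.items()}


# Hypothetical sample records standing in for rows stored in a data lake.
records = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "", "email": None},
    {"name": "Bob", "email": ""},
    {"name": "Cy", "email": "cy@example.com"},
]

pairs = (pair for record in records for pair in map_record(record))
missing_rate = reduce_counts(pairs)
print(missing_rate)
```

A per-column count like this splits cleanly into independent map and reduce steps. Functions that need to compare records against each other, such as fuzzy duplicate matching, are much harder to express this way, which is one plausible reading of the point that some data quality functionality isn’t MapReduce-able.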
Perspectives on big data quality vary, with some arguing big data has a smaller margin of error (i.e., less statistical error) and others arguing big data eliminates sampling bias. While there’s some merit to these arguments, neither can prevent a systematic bias from overwhelming big data: collecting more records shrinks random error, but an error built into how the data is captured stays the same size no matter how much data you have. The bottom line, as I've previously blogged: data quality is still a big deal when it comes to big data. And it's becoming a bigger deal since, as the report notes, business users, analysts and data scientists are getting more and more requests to blend internal and external data views for analytics. So as data lakes extend their tributaries to grow into the wider data oceans outside the enterprise, the importance of big data quality grows as well.
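The point about systematic bias can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of any real data set: the true value (50), the noise level (10), the bias (+2) and the function name `sample_mean` are all invented for the example. It estimates a mean from ever larger samples, once with purely random measurement noise and once with the same noise plus a constant offset added to every record.

```python
import random
import statistics

TRUE_MEAN = 50.0   # assumed "real" value being measured
NOISE_SD = 10.0    # assumed random measurement noise
BIAS = 2.0         # assumed systematic error added to every record


def sample_mean(n, bias=0.0, seed=0):
    """Average n noisy measurements, each shifted by a constant bias."""
    rng = random.Random(seed)
    return statistics.fmean(
        rng.gauss(TRUE_MEAN, NOISE_SD) + bias for _ in range(n)
    )


results = {}
for n in (100, 10_000, 1_000_000):
    unbiased = sample_mean(n, bias=0.0, seed=1)
    biased = sample_mean(n, bias=BIAS, seed=1)
    results[n] = (unbiased, biased)
    print(
        f"n={n:>9,}  "
        f"error without bias: {unbiased - TRUE_MEAN:+.3f}  "
        f"error with bias: {biased - TRUE_MEAN:+.3f}"
    )
```

As the sample grows, the error of the unbiased estimate shrinks toward zero, while the biased estimate settles on a value that is off by roughly the size of the bias. More data makes the wrong answer more precise, not more correct.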