Most discussions about big data focus on two things: data lakes and Hadoop. A data lake is a big data repository for collecting and storing a vast amount of raw data in its native format, including structured, semistructured and unstructured data. Hadoop is the open-source software framework most commonly used by data lakes. It stores data in the Hadoop Distributed File System (HDFS) and runs applications on clusters of commodity hardware to access and analyze the contents of data lakes.
According to Gartner, Hadoop distribution software revenue in 2016 was $800 million, 40% growth over 2015 – and the Hadoop market is expected to reach $50 billion by 2020. While Hadoop adoption rates are on the rise, production deployments remain low: Gartner estimates that only 14% of Hadoop deployments are in production. And despite the hype about data lakes replacing data warehouses, recent TDWI reports show this has happened in less than 2% of Hadoop deployments – with Hadoop used as a data warehousing complement in 17% of deployments (a use case predicted to double in the next three years), and 78% of data warehouse environments not currently using Hadoop in any capacity.
But no matter how fast the times are changing, or how much data exists in various formats, data collection and data storage are still not the same thing as data management and data governance. I believe this is one reason for the slow adoption and low production deployment of data lakes and Hadoop. In this two-part series, I advocate addressing data quality and data governance issues on the way to data lakes and Hadoop.
A Hadoop-enabled data lake reminds me of the project management triangle: fast, cheap, good – pick any two. Data lakes and Hadoop deliver a fast and cheap way to collect and store an enormous amount of data from different systems with varying structures, as long as you don’t stop to assess how good any of the data is. I acknowledge different data uses have different data quality requirements (e.g., aggregate analytics, where sometimes bigger, lower-quality data is better, such as the five-star ratings on Netflix). But in general I question the usability of the data lake when its data quality is not only questionable but often unquestioned.
A major limitation of early Hadoop versions was that they were computationally coupled with MapReduce. For data quality functions to process Hadoop data, this meant the data had to be either extracted from HDFS and externally processed, or the functionality had to be rewritten in MapReduce so it could be executed within HDFS. The first approach negated most of Hadoop’s efficiency and scalability benefits. The second not only required esoteric programming skills (another major barrier to Hadoop adoption), but some data quality functionality isn’t MapReduce-able.
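To make the "MapReduce-able" distinction concrete, here is a minimal sketch (plain Python, not actual Hadoop code) showing why a simple data quality check like a per-column null count decomposes naturally into map and reduce steps, while a function like an exact median does not:

```python
from functools import reduce

def map_nulls(record):
    # Map step: emit 1 for each column whose value is missing in this record.
    return {col: (1 if val is None else 0) for col, val in record.items()}

def reduce_nulls(a, b):
    # Reduce step: combine partial counts. Because addition is associative,
    # this can run in parallel across a cluster.
    return {col: a.get(col, 0) + b.get(col, 0) for col in set(a) | set(b)}

# Hypothetical records used only for illustration.
records = [
    {"id": 1, "name": "Alice", "email": None},
    {"id": 2, "name": None, "email": "b@example.com"},
    {"id": 3, "name": "Carol", "email": None},
]

null_counts = reduce(reduce_nulls, map(map_nulls, records))
# An exact median, by contrast, needs all of a column's values in one place,
# which is why not every data quality function fits this decomposition.
```

The null count parallelizes because partial results merge with simple addition; functions requiring a global view of the data (exact medians, cross-record deduplication) resist this split, which is the barrier described above.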
Thankfully, the latest Hadoop versions enable non-MapReduce data processing to be executed natively in HDFS. This is why leading data management vendors offer in-database technologies for improving the quality of data already loaded into HDFS.
An even better option, in my opinion, is pre-processing before loading data into Hadoop. Yes, going back to the project management triangle, this means you sacrifice some fast to get good. But it also means you get better data loaded into your Hadoop-enabled data lake. At the very least, as Clark Bradley blogged, perform data profiling to capture descriptive statistics such as “table storage volume, table column data types, average column value, standard deviation, null count per column or null percent per column, and median value per column (just to name a few). This metadata does not exist in a single place or a unified form in Hadoop.”
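A profiling pass of the kind Bradley describes can be sketched in a few lines. This is a minimal illustration using pandas on a small in-memory table; the column names and data are hypothetical, and a real pre-load profile would run against the source systems feeding the data lake:

```python
import pandas as pd

# Hypothetical source extract used only for illustration.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.0, None, 30.0, 20.0],
})

profile = {}
for col in df.columns:
    series = df[col]
    stats = {
        "dtype": str(series.dtype),                      # column data type
        "null_count": int(series.isna().sum()),          # null count per column
        "null_percent": float(series.isna().mean() * 100),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update(
            mean=float(series.mean()),      # average column value
            std=float(series.std()),        # standard deviation
            median=float(series.median()),  # median value per column
        )
    profile[col] = stats
```

Capturing a profile like this before loading gives the data lake the unified metadata that, as the quote notes, Hadoop does not maintain on its own.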