As I explained in the first blog post of this series, most discussions about big data focus on two things: data lakes and Hadoop. A data lake is a big data repository for collecting and storing a vast amount of raw data in its native format, including structured, semistructured and unstructured data. Hadoop is the open-source software framework most commonly used by data lakes, storing data in the Hadoop Distributed File System (HDFS) and running applications on clusters of commodity hardware to access and analyze their contents.
According to Gartner, Hadoop distribution software revenue reached $800 million in 2016 (40% growth over 2015), and the Hadoop market is expected to reach $50 billion by 2020. However, while Hadoop adoption rates are on the rise, production deployments remain low, with Gartner estimating that only 14% of Hadoop deployments are in production. And despite the hype about data lakes replacing data warehouses, recent TDWI research shows this has actually happened in less than 2% of Hadoop deployments. Hadoop is being used as a complement to the data warehouse in 17% of deployments (a use case predicted to double in the next three years), while 78% of data warehouse environments are not currently using Hadoop in any capacity.
One reason for the slow adoption and low production deployment of data lakes and Hadoop is that no matter how fast the times are changing or how much data there is in various formats, data collection and storage are still not the same thing as data management and data governance. In this two-part series, I advocate addressing data quality and data governance issues on the way to data lakes and Hadoop.
Data governance for data lakes and Hadoop
In general, data governance provides the guiding principles and context-specific policies that frame the processes and procedures of data management. Hadoop-enabled data lakes largely circumvent this oversight. The much-lauded schema-on-read capability of Hadoop enables the data lake to rapidly acquire data by bypassing predefined data structures, such as those required by relational databases and data warehouses. But to use data, you eventually have to pay the cost of giving it structure. And when the data stored in Hadoop-enabled data lakes is used by a business process or analytical application, structure has to be imposed.
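To make that trade-off concrete, here is a minimal, hypothetical sketch of schema-on-read using PySpark over HDFS, one common way to query a Hadoop-based lake. The paths, field names and types are illustrative assumptions, not details from any particular deployment. Ingest is just a file copy; every structural decision is deferred until someone needs to query the data.

# Hypothetical sketch: schema-on-read with PySpark over HDFS.
# Paths, field names and types are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Ingest was cheap: files were simply copied into the lake as-is,
# e.g. hdfs dfs -put clickstream.json /lake/raw/clickstream/
raw_path = "hdfs:///lake/raw/clickstream/"

# The cost is deferred to read time: someone must now decide what the
# fields mean and what types they carry before analysis can happen.
clickstream_schema = StructType([
    StructField("user_id",  StringType(),    True),
    StructField("event_ts", TimestampType(), True),
    StructField("page",     StringType(),    True),
    StructField("duration", DoubleType(),    True),
])

events = spark.read.schema(clickstream_schema).json(raw_path)
events.createOrReplaceTempView("clickstream")
spark.sql("SELECT page, COUNT(*) AS visits FROM clickstream GROUP BY page").show()

The sketch's point is simply that skipping the schema at ingest does not make the schema work go away; it only defers it to whoever eventually reads the data.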
So, while data lakes and Hadoop offer a fast and cheap way to collect and store an enormous amount of data from different systems with varying structures, it can be slow and expensive to actually use the contents of the data lake.
In addition, data integration at its core is about relating multiple data sources and bringing them together to make data more valuable. And while Hadoop is great at collecting data, it does not automatically integrate data. Metadata provides the definitions across data sources that make integration possible, covering not only technical column definitions but also business terminology. Those predefined metadata definitions are also bypassed by Hadoop for the sake of rapid acquisition.
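As a hypothetical illustration of that point, the PySpark sketch below relates two raw sources that landed in the lake with different column names. The table names, columns and the tiny business glossary are assumptions made up for this example. Nothing in Hadoop itself links cust_no to customer_id; an explicitly defined metadata relationship has to supply that link before the join is even possible.

# Hypothetical sketch: metadata definitions are what make integration possible.
# Table names, columns and the glossary below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-integration-sketch").getOrCreate()

# Two sources landed in the lake with no shared definitions:
crm    = spark.read.option("header", True).csv("hdfs:///lake/raw/crm_accounts/")  # contains cust_no
weblog = spark.read.json("hdfs:///lake/raw/web_sessions/")                        # contains customer_id

# A governed metadata definition (business term -> physical column per source)
# supplies the relationship that Hadoop does not infer on its own:
glossary = {
    "customer_identifier": {"crm_accounts": "cust_no", "web_sessions": "customer_id"},
}
crm_col = glossary["customer_identifier"]["crm_accounts"]
web_col = glossary["customer_identifier"]["web_sessions"]

# Only once the definitions are related can the data be related.
joined = crm.join(weblog, crm[crm_col] == weblog[web_col], "inner")
print(joined.select(crm_col).distinct().count())  # customers now linked across the two sources

In practice a mapping like this would live in a governed metadata catalog (for example, a Hive metastore plus a business glossary) rather than a hard-coded dictionary, but the dependency is the same: the definitions have to be related before the data can be.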
For the organization to realize the data lake's business potential – especially for content originating outside of the enterprise – the data lake needs to be integrated with the enterprise data and applications already under the purview of data governance. “Practical aspects of integration and interoperability between Hadoop systems and the existing infrastructure still pose challenges to many organizations,” David Loshin explained. “Many struggle to understand how all the various components fit together.”
Download a paper about bringing the power of SAS to Hadoop