Integrating big data into existing data management processes and programs has become something of a siren call for organizations on the odyssey to become 21st century data-driven enterprises. To help you avoid getting lost along the way, this post offers a few tips for successful big data integration.
Collection is not integration
Hadoop provides a cheap and fast way to capture data from a variety of sources, many of which originate outside the enterprise, enabling organizations to collect a veritable treasure trove of potential business insights. However, for organizations to realize that potential, this data needs to be integrated with enterprise data under the purview of data governance. While Hadoop is great at collecting data, it does not automatically integrate data. So, although the data lake has become a popular term associated with big data, and Hadoop is the most popular tool for pumping data into one, a data lake is more akin to a data dump until data integration is performed.
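To make the distinction concrete, here's a minimal sketch (the paths, column names and customer master table are all hypothetical) of the difference between data that has merely been landed in a Hadoop data lake and data that has been related to governed enterprise data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collection-vs-integration").getOrCreate()

# Collection: raw clickstream files an ingest job dumped into the data lake.
# The data is stored, but not yet integrated or governed.
raw_clicks = spark.read.json("hdfs:///landing/clickstream/")

# Integration: relate the raw events to data under governance (a customer
# master) so they can be combined with the rest of the enterprise's data.
customers = spark.read.parquet("hdfs:///governed/customer_master/")

integrated = (
    raw_clicks
    .withColumn("event_date", F.to_date("event_timestamp"))
    .join(customers, on="customer_id", how="inner")
)

integrated.write.mode("overwrite").parquet("hdfs:///integrated/clickstream_by_customer/")
```

Until something like that second step happens, the clickstream files are just sitting in the lake.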
Cleanse the elephant
Not only does Hadoop not integrate data by default; as I previously blogged, data quality functionality also has to be embedded into every data integration process. A major limitation of Hadoop version 1.x was that its computation was coupled to MapReduce. This meant that for integration and quality functions to process data stored in the Hadoop Distributed File System (HDFS), either the data had to be extracted from HDFS and processed outside of Hadoop (negating most of Hadoop's efficiency and scalability benefits), or the functionality had to be rewritten in MapReduce so that it could be executed in Hadoop (a specialized and labor-intensive effort, and some data quality functionality isn't MapReduce-able).
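To give a feel for what that rewriting entailed, here's a sketch of a Hadoop Streaming mapper implementing a trivial quality rule; the field layout and rules are made up for illustration:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: a trivial data quality rule
# (standardize a country code, flag records missing an email address)
# reshaped so it can run as a Hadoop 1.x MapReduce job.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        # Emit malformed records under a reject key for later review.
        print("REJECT\t" + line.rstrip("\n"))
        continue
    customer_id, email, country = fields[0], fields[1], fields[2]
    country = country.strip().upper()
    status = "OK" if "@" in email else "MISSING_EMAIL"
    print("\t".join([status, customer_id, email, country]))
```

Even this toy rule has to be recast as a record-at-a-time mapper; quality functions that need to compare records against each other, such as matching and survivorship, are much harder to express this way.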
Thankfully, Hadoop version 2.x, which introduced YARN (Yet Another Resource Negotiator) as the new resource management framework, enabled non-MapReduce data processing to be performed within Hadoop clusters. This allows data integration and data quality functions to be executed natively in Hadoop, directly against data stored in HDFS.
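As a rough illustration of what that enables, here's a sketch of a data quality step running as a Spark job on YARN (the paths and rules are hypothetical), cleansing data in place rather than extracting it from the cluster first:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Runs as a Spark application on YARN (e.g., spark-submit --master yarn),
# reading and writing HDFS directly, so the data never leaves the cluster.
spark = SparkSession.builder.appName("native-data-quality").getOrCreate()

customers = spark.read.parquet("hdfs:///landing/customers/")

cleansed = (
    customers
    .withColumn("email", F.lower(F.trim("email")))           # standardize
    .filter(F.col("email").rlike(r"^[^@]+@[^@]+\.[^@]+$"))   # validate
    .dropDuplicates(["email"])                                # de-duplicate
)

cleansed.write.mode("overwrite").parquet("hdfs:///cleansed/customers/")
```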
Leading data management vendors offer solutions for cleansing the elephant (i.e., improving the quality of big data in Hadoop) that you should familiarize yourself with, such as the SAS offerings Guido Oswald recently blogged about.
Master the metadata first
Integrating big data into master data management (MDM) implementations is an oft-cited top priority, especially integrating social media data into MDM. However, it's more important to master metadata first. Data integration, at its core, is about relating multiple data sources and bringing them together to make your data more valuable. Metadata provides the common definitions across data sources that make this possible. In addition, metadata allows you to trace which data moved when, how it was changed, the business rules that were applied and the effects those changes might have. Failure to place enough emphasis on metadata will result in problems later, as David Loshin recently blogged.
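As a rough sketch of what that tracing depends on, here's a hypothetical lineage record; the structure and field names are illustrative only, not any particular product's metadata model:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class LineageRecord:
    source_dataset: str          # where the data came from
    target_dataset: str          # where it landed after integration
    moved_at: datetime           # when the movement happened
    transformations: List[str]   # how the data was changed along the way
    business_rules: List[str]    # governance rules applied in transit

record = LineageRecord(
    source_dataset="hdfs:///landing/clickstream/",
    target_dataset="hdfs:///integrated/clickstream_by_customer/",
    moved_at=datetime(2015, 6, 1, 2, 30),
    transformations=["parsed event_timestamp into event_date",
                     "joined to customer_master on customer_id"],
    business_rules=["reject events without a customer_id"],
)
```

Capturing records like this for every movement of data is what lets you answer, after the fact, which data changed, when and under which rules.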
SAS is a leader in the Gartner Magic Quadrant for data integration tools for the fifth consecutive year.