I've noticed a recurring pattern among a number of our clients over the past few months. With the growing popularity of two different yet complementary big data storage paradigms, there's a corresponding interest in surveying the existing data environment and exposing more data to the community of data consumers. The two storage paradigms are Hadoop (or more precisely, the Hadoop Distributed File System, or HDFS) and cloud storage (such as Amazon’s S3, Microsoft Azure’s Blob storage and Google Cloud Storage).
Both of these paradigms offer scalable, extensible storage capacity that serves as a foundation for a data lake into which different functions in the organization can push data assets to be shared across the enterprise. This has triggered interest in revising the enterprise data strategy to rapidly embrace a cocktail of on-premises HDFS systems, cloud-based storage repositories and cloud-based Hadoop environments that can connect to HDFS as well as alternate cloud storage services.
Understanding the obstacles
The problem that continues to pop up has little to do with moving data into the data lake – and everything to do with getting data out of it. This becomes more critical as analytics teams seek out additional data sets that can be folded into their analytical models. The issue is that there is typically little or no knowledge about what data sets are in the data lake, let alone what each of those data sets contains. As a result, usable data sets are overlooked, and data consumers create their own data products and dump them into the data lake even though similar data products already exist. This points to several impediments to effectively using the data lake:
- Awareness. How can data consumers know what data assets have been added to the data lake?
- Comprehension. For each data asset in the data lake, can the designers ascertain the details of the structure and the semantics of the data contained within the data asset?
- Availability. Where are the data assets stored, and how can you access them?
- Consistency. Can the designers ensure that their anticipated use of an existing data asset won't conflict with other uses of that data asset?
Taking away the barriers
One approach to removing these impediments involves creating a catalog of the data assets that are in the data lake. The data catalog maintains information about each data asset to facilitate data usability – including, but not limited to:
- Structural metadata. For structured assets, enumerate the data elements by name, type and description.
- Business metadata. Provide definitions and descriptions of business terms, reference data sets and any applied business rules for the data asset.
- Searchability. Provide a means for indexing definitions and descriptions and linking them to their enclosing data assets so that they can be searched using keywords, phrases and even concepts. This searchability enables data asset discovery, so data consumers can more easily find assets that meet their needs.
- Access methods. Simplify accessibility by describing the location and corresponding connector methods for access.
- Services. Describe lightweight services that downstream applications can use to query the data.
- Data protection. List the methods in place for access control, encryption and masking, among other data protection techniques.
- Lineage. Provide a description of the processes and data sources used to produce each data asset.
- Data quality. Describe any rules applied to ensure accuracy, consistency, completeness and currency, among other data quality dimensions.
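To make the idea concrete, the catalog entries described above can be modeled as simple records that bundle structural and business metadata and support keyword search for asset discovery. The sketch below is illustrative only – the class names, fields and search logic are assumptions, not a reference to any particular catalog product:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataElement:
    """Structural metadata for one data element: name, type and description."""
    name: str
    dtype: str
    description: str


@dataclass
class CatalogEntry:
    """One data asset in the catalog, combining the metadata categories above."""
    asset_name: str
    location: str                      # access method, e.g. an hdfs:// or s3:// URI
    elements: List[DataElement] = field(default_factory=list)
    business_terms: Dict[str, str] = field(default_factory=dict)   # term -> definition
    lineage: List[str] = field(default_factory=list)               # upstream processes/sources
    protection: List[str] = field(default_factory=list)            # e.g. masking, encryption
    quality_rules: List[str] = field(default_factory=list)

    def keywords(self) -> set:
        """Collect searchable words from names, descriptions and business terms."""
        words = set(self.asset_name.lower().split("_"))
        for el in self.elements:
            words.update(el.description.lower().split())
        for term, definition in self.business_terms.items():
            words.add(term.lower())
            words.update(definition.lower().split())
        return words


def search(catalog: List[CatalogEntry], query: str) -> List[CatalogEntry]:
    """Return entries whose indexed keywords contain every word in the query."""
    terms = set(query.lower().split())
    return [entry for entry in catalog if terms <= entry.keywords()]


# Example: register one asset, then discover it by business vocabulary.
orders = CatalogEntry(
    asset_name="retail_orders",
    location="hdfs://lake/retail/orders",
    elements=[DataElement("order_id", "string", "unique order identifier")],
    business_terms={"order": "a confirmed customer purchase"},
)
print([e.asset_name for e in search([orders], "customer purchase")])  # -> ['retail_orders']
```

A real catalog would of course persist these entries and use a proper inverted index, but even this minimal shape shows how structural metadata, business metadata and searchability combine to support data asset discovery.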
Keep an eye out for emerging technologies that will help as you design and populate the information for data catalogs. These will help overcome the impediments to effective use of enterprise data stored in a data lake. For more, download TDWI's best practices report about data lakes.