Crowdsourcing data assets in the data lake

A long time ago, I worked for a company that had positioned itself as basically a third-party “data trust” to perform collaborative analytics. The business proposition was to engage different types of organizations whose customer bases overlapped, ingest their data sets, and perform a number of analyses using the accumulated data sets. This work could not be done by any of the clients on their own.

By providing a trusted repository for the data, the company ensured that no client would have access to any other client’s data; yet all would benefit from the analytics applied across the board when value could be derived from merging the data sets.

Two byproducts of this process were improved data quality and identity resolution. Standardizing each data set simplified deterministic matching between the data sets, and the improved standard representations also helped the probabilistic matching algorithms. For example, consider the case where the same individual was a customer of two of the companies, but neither company had a complete record for that customer. Once the two records were linked, each of the original records could be augmented using data pulled from the record with which it was paired.
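To make the linking and augmenting idea concrete, here is a minimal sketch. It assumes each record is a dictionary and that a normalized email address serves as the deterministic match key. The field names, the helper functions and the sample records are all hypothetical, chosen only for illustration; a real data trust would use far richer standardization rules and would add probabilistic matching on top.

```python
def standardize(record):
    """Return a copy with trimmed strings and a lowercased email (hypothetical rules)."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            value = value if value else None  # treat blank strings as missing
        cleaned[key] = value
    if cleaned.get("email"):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def augment(record, partner):
    """Fill fields that are missing in `record` from its linked `partner` record."""
    merged = dict(record)
    for key, value in partner.items():
        if merged.get(key) is None and value is not None:
            merged[key] = value
    return merged


def link_and_augment(set_a, set_b, key="email"):
    """Deterministically match two data sets on `key` and augment both sides."""
    a = [standardize(r) for r in set_a]
    b = [standardize(r) for r in set_b]
    # Index the second data set by the match key for exact lookups.
    index_b = {r[key]: i for i, r in enumerate(b) if r.get(key)}
    out_a, out_b = list(a), list(b)
    for i, rec in enumerate(a):
        j = index_b.get(rec.get(key))
        if j is not None:
            out_a[i] = augment(rec, b[j])
            out_b[j] = augment(b[j], rec)
    return out_a, out_b


# Two incomplete records for the same (made-up) customer.
company_a = [{"email": " Pat@Example.com ", "name": "Pat Lee", "phone": None}]
company_b = [{"email": "pat@example.com", "name": None, "phone": "555-0100"}]
augmented_a, augmented_b = link_and_augment(company_a, company_b)
```

Note that the match only succeeds because both email values are standardized first, which is the point made above: standardization is what makes simple deterministic matching workable.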

Combining data from multiple origins is one form of data crowdsourcing. The growth of data lakes as the collection point for the data sets an enterprise manages provides a slightly different framework for data asset creation. But the same basic principles hold: collect data sets from different contributors, place them in a repository where the data can be analyzed, and publish the results back out to the contributors.

The data lake differs a bit, though, because of three challenges:

Data variety. Many organizations are ecstatic about having a resting place for structured data extracted from existing systems, data acquired from external sources, and reclaimed data (such as old documents and emails) pulled from their ancient archives, presumably to be put to better use.
Absence of governance. There is another presumption that data analysts always want access to all of the data in its original format. So the best way to make that data available is to populate the data lake by moving all corporate data into it. However, without properly documenting what is added, along with the data set’s relevant business metadata (that is, what's in the file and what can I do with it), it becomes difficult to even know what data is available, let alone whether it can be adapted for any particular analysis.
Interpretation on read. Finally, allowing the analyst to use the data in its original format when there are no predetermined standards for what the data means forces data consumers to come up with their own definitions, standardizations and interpretations.

There is an emerging methodology for addressing these issues. It involves proactively scanning data sets in the data lake to infer enough information about their contents to establish de facto standards for combining the different data sets. For example, an application can:

Scan an unstructured document.
Identify any recognizable entities (such as individuals by name, or locations based on the address).
Catalog what's in the data set (from the metadata and business glossary perspectives).
Prepare a structured view into that data in accordance with references to those entities. In turn, that structured “template” for the data asset can be linked with other structured and unstructured data sets.
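The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: entity recognition is reduced to a few hypothetical regular expressions (a real tool would use trained recognizers and a business glossary), and the catalog is a plain list of dictionaries. The pattern names, function names and sample document are made up for the example.

```python
import re

# Hypothetical patterns standing in for real entity recognition.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "person": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?"),
}


def scan(text):
    """Scan the text and collect the recognizable entities, grouped by type."""
    return {kind: sorted(set(p.findall(text))) for kind, p in PATTERNS.items()}


def catalog_entry(name, text):
    """Catalog what is in the data set: which entity types it contains."""
    entities = scan(text)
    return {
        "data_set": name,
        "entity_types": [kind for kind, found in entities.items() if found],
        "entities": entities,
    }


def structured_view(entry):
    """Prepare a structured view: one row per entity reference."""
    return [
        {"data_set": entry["data_set"], "entity_type": kind, "value": value}
        for kind, values in entry["entities"].items()
        for value in values
    ]


def search(catalog, entity_type, value):
    """Find the data sets that reference a given entity."""
    return [
        e["data_set"]
        for e in catalog
        if value in e["entities"].get(entity_type, [])
    ]


# A made-up unstructured document.
doc = "Dr. Pat Lee moved to Springfield, IL 62704. Contact: pat.lee@example.com"
entry = catalog_entry("letter_001.txt", doc)
rows = structured_view(entry)
catalog = [entry]
hits = search(catalog, "email", "pat.lee@example.com")
```

The rows produced by `structured_view` share entity values with rows from other data sets, which is what would let the structured "template" be linked to both structured and unstructured sources.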

In essence, to manage data in a lake, it's essential to be able to create a catalog that can be searched. In my next post we will look at the characteristics of the tools that can be used for cataloging data sets to enable data asset creation.
