Data cataloging for data asset crowdsourcing

What does it really mean when we talk about a data asset? For the purposes of this discussion, let's say that a data asset is a manifestation of information that can be monetized. In my last post, we explored how bringing many data artifacts together in a single repository enabled linkage, combination and analysis that could lead to profitable business actions.

On the one hand, the more data that's available, the better the chance of combining multiple artifacts in ways that can be monetized. On the other hand, the more data there is to search through, the more difficult it is to figure out what you need, whether it exists and how it can be used.

That is where data cataloging comes in. No matter what format a data artifact takes, there will always be some concepts or markers that describe what the artifact contains. The easiest example is a structured data set.

The values in each of the data set's columns can be profiled, then compared against existing reference data sets and abstract data type formats (like Social Security numbers, ZIP codes or telephone numbers). An inventory of what's in the file can then be created and added to a catalog that's shared among all data consumers. Unstructured text can be scanned as well, looking for sentinel phrases or formats (name formats, address formats, etc.). A collection of all identifiable entities can be linked to that artifact’s entry in the shared data catalog. More sophisticated image processing applications can scan graphics, images and videos, and can pluck out concepts and individuals by matching against databases of images. Again, any identified objects associated with the data artifact can be added to the data catalog.
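
To make the column profiling idea concrete, here is a minimal sketch in Python. It assumes the data set has already been loaded as a list of rows (dictionaries of column name to string value). The pattern table, the 90% match threshold, and the function names `profile_column` and `profile_table` are all hypothetical choices for illustration. A real catalog tool would likely use far richer reference data than three regular expressions.

```python
import re

# Hypothetical format patterns (illustrative only, US-style formats).
PATTERNS = {
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
    "us_phone": re.compile(r"(\(\d{3}\) ?|\d{3}[-. ])\d{3}[-.]\d{4}"),
}


def profile_column(values, threshold=0.9):
    """Return the format label matched by most values, or None.

    A label is reported only if at least `threshold` of the non-empty
    values match its pattern in full (an assumed cutoff).
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    best_label, best_ratio = None, 0.0
    for label, pattern in PATTERNS.items():
        matched = sum(1 for v in non_empty if pattern.fullmatch(v))
        ratio = matched / len(non_empty)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= threshold else None


def profile_table(rows):
    """Build a simple inventory: column name -> detected format (or None)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(vals) for name, vals in columns.items()}


rows = [
    {"id": "123-45-6789", "zip": "02139", "phone": "(617) 555-0100", "note": "hello"},
    {"id": "987-65-4321", "zip": "10001-1234", "phone": "212-555-0199", "note": "world"},
]
inventory = profile_table(rows)
print(inventory)
```

The resulting inventory is the kind of record that could be stored as the artifact's entry in the shared catalog. Columns that match no known format are recorded as `None` rather than guessed.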

Finally, any business term or concept that can be associated with the data artifact can be added to a searchable index. This index can embed a semantic network for concepts as well. That way, a person searching for data about “cars” will not only be provided with pointers to data artifacts containing the word “cars,” but will also be pointed to artifacts containing any terms that are in some way aligned with cars (such as trucks, vans, buses, SUVs, etc.).
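
The "cars" example can be sketched as a small term index with query expansion. This is only an illustration under simplifying assumptions: the semantic network is a hand-written dictionary (`RELATED`), and the class name `CatalogIndex`, its methods, and the artifact names are hypothetical. A production catalog would more likely draw on a maintained taxonomy or ontology.

```python
from collections import defaultdict

# Hypothetical semantic network: each concept maps to aligned concepts.
RELATED = {
    "cars": {"trucks", "vans", "buses", "suvs"},
}


class CatalogIndex:
    """Searchable index from terms to the artifacts that contain them."""

    def __init__(self, related=None):
        self._postings = defaultdict(set)  # term -> set of artifact ids
        self._related = related or {}

    def add(self, artifact_id, terms):
        """Register the terms found in one data artifact."""
        for term in terms:
            self._postings[term.lower()].add(artifact_id)

    def search(self, term):
        """Return artifacts matching the term or any related concept."""
        term = term.lower()
        expanded = {term} | self._related.get(term, set())
        hits = set()
        for t in expanded:
            hits |= self._postings.get(t, set())
        return sorted(hits)


index = CatalogIndex(RELATED)
index.add("sales.csv", ["cars", "dealers"])
index.add("fleet.parquet", ["Trucks", "vans"])
index.add("hr.csv", ["employees"])

print(index.search("cars"))
```

Here a search for "cars" returns both the artifact that mentions cars directly and the one that only mentions trucks and vans. A term with no entry in the network falls back to a plain word lookup.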

A data catalog that lists all data artifacts and what each one contains, combined with a semantic index, will provide a breadth of results to data scientists looking for the right data to incorporate into an analytic project. Look for tools that embrace the capabilities necessary to make this happen, which should include features like these:

Data profiling to scan through structured data sets.
Term indexing to create the index for direct word/phrase searches.
Text analytics for entity identification and extraction in free-form content.
A semantic network linking concepts to enhance the search capability.
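
Of the features listed above, entity identification in free-form content is perhaps the easiest to illustrate. The sketch below assumes that simple regular expressions stand in for a real text analytics engine. The pattern set and the function name `extract_entities` are hypothetical, and the formats are US-centric. Name and address recognition would generally need dedicated tooling that is out of scope here.

```python
import re

# Hypothetical entity patterns (illustrative, not exhaustive).
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_entities(text):
    """Return a dict of entity label -> sorted unique matches in the text."""
    found = {}
    for label, pattern in ENTITY_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[label] = matches
    return found


text = (
    "Contact jane.doe@example.com or call 617-555-0100. "
    "Ship to Cambridge, MA 02139."
)
entities = extract_entities(text)
print(entities)
```

The extracted entities can then be linked to the artifact's catalog entry, so that a search for a given entity type points back to the document it came from.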
