Data cataloging for data asset crowdsourcing

What does it really mean when we talk about the concept of a data asset? For the purposes of this discussion, let's say that a data asset is a manifestation of information that can be monetized. In my last post, we explored how bringing many data artifacts together in a single repository enabled linkage, combination and analysis that could lead to profitable business actions.

On the one hand, the more data that's available, the better the chance of combining multiple artifacts in ways that can be monetized. On the other hand, the more data there is to search through, the more difficult it is to figure out what you need, whether it exists and how it can be used.

That is where data cataloging comes in. No matter what format a data artifact takes, there will always be some concepts or markers that describe what the artifact contains. The easiest example is a structured data set.

Each column’s values can be profiled and then compared against existing reference data sets and abstract data type formats (such as Social Security numbers, ZIP codes or telephone numbers). An inventory of what's in the file can then be created and added to a catalog that's shared among all the data consumers. Unstructured text can be scanned as well, looking for sentinel phrases or formats (name formats, address formats, etc.), and the collection of identifiable entities can be linked to that artifact’s entry in the shared data catalog. More sophisticated image processing applications can scan graphics, images and videos, picking out concepts and individuals by matching against databases of images. Here too, any identified objects associated with the data artifact can be added to the data catalog.
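To make the profiling step for structured data concrete, here is a minimal Python sketch. It assumes a table held in memory as column name to list of strings, and the pattern set, the 90 percent threshold and the names `FORMAT_PATTERNS`, `profile_column` and `profile_table` are all hypothetical. A real profiling tool would use far richer rules plus reference data lookups.

```python
import re

# Hypothetical format patterns for a few abstract data types.
# These are deliberately simple and US-centric; they are assumptions,
# not the rules any particular cataloging tool uses.
FORMAT_PATTERNS = {
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
}


def profile_column(values, threshold=0.9):
    """Return the format labels matched by at least `threshold` of the non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in FORMAT_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


def profile_table(columns):
    """Build a catalog entry mapping each column name to its inferred formats."""
    return {name: profile_column(values) for name, values in columns.items()}


# Made-up sample data for illustration only.
sample = {
    "id_number": ["123-45-6789", "987-65-4321"],
    "postal": ["20910", "10001-1234"],
    "contact": ["(301) 555-0100", "212-555-0199"],
    "notes": ["call back", "n/a"],
}
catalog_entry = profile_table(sample)
```

The resulting `catalog_entry` is the kind of inventory that could be stored against the artifact in a shared catalog: each column is tagged with the formats its values appear to follow, and columns with no recognizable format get an empty list.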

Finally, any business term or concept that can be associated with the data artifact can be added to a searchable index. This index can embed a semantic network for concepts as well. That way, a person searching for data about “cars” will not only be provided with pointers to data artifacts containing the word “cars,” but will also be pointed to artifacts containing any terms that are in some way aligned with cars (such as trucks, vans, buses, SUVs, etc.).
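As a rough illustration of how a term index with an embedded semantic network might behave, consider the Python sketch below. The class name `SemanticIndex`, the `RELATED_TERMS` mapping and the artifact file names are all invented for this example, and the "semantic network" is reduced to a one-level lookup table, which is a big simplification of what a real tool would maintain.

```python
from collections import defaultdict

# Hypothetical semantic network: each concept maps to related terms.
RELATED_TERMS = {
    "cars": {"trucks", "vans", "buses", "suvs"},
}


class SemanticIndex:
    """Toy searchable index that expands a query with related concepts."""

    def __init__(self, related_terms):
        self.related_terms = related_terms
        self.postings = defaultdict(set)  # term -> set of artifact ids

    def add(self, artifact_id, terms):
        """Register the terms associated with one data artifact."""
        for term in terms:
            self.postings[term.lower()].add(artifact_id)

    def search(self, query):
        """Return artifacts tagged with the query term or any related term."""
        term = query.lower()
        expanded = {term} | self.related_terms.get(term, set())
        results = set()
        for t in expanded:
            results |= self.postings.get(t, set())
        return sorted(results)


# Made-up artifacts for illustration only.
index = SemanticIndex(RELATED_TERMS)
index.add("dealer_sales.csv", ["cars", "dealers"])
index.add("fleet_maintenance.csv", ["trucks", "vans"])
index.add("hr_payroll.csv", ["employees"])

hits = index.search("cars")
```

Here a search for "cars" returns both `dealer_sales.csv`, which is tagged with the literal term, and `fleet_maintenance.csv`, which is only tagged with related terms. The payroll file is left out.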

A data catalog that lists all data artifacts and what they contain, combined with a semantic index, will provide a breadth of results to data scientists looking for the right data to incorporate into an analytic project. Look for tools that embrace the capabilities necessary to make this happen, which should include features like these:

  • Data profiling to scan through structured data.
  • Term indexing to create the index for direct word/phrase searches.
  • Text analytics for entity identification and extraction in free-form content.
  • A semantic network linking concepts to enhance the search capability.
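For the text analytics item in particular, entity identification in free-form content can be approximated very crudely with pattern scanning. The Python sketch below is only that: the patterns in `ENTITY_PATTERNS`, the function `extract_entities` and the sample sentence are assumptions made for illustration, and production text analytics typically relies on trained language models and reference lists rather than a handful of regular expressions.

```python
import re

# Hypothetical sentinel formats to look for in free text.
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "person_name": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?"),
}


def extract_entities(text):
    """Return the distinct matches found in `text`, grouped by entity type."""
    found = {}
    for label, pattern in ENTITY_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[label] = matches
    return found


# Made-up sample text for illustration only.
memo = "Dr. Jamie Lee can be reached at jlee@example.com or 301-555-0100."
entities = extract_entities(memo)
```

The `entities` dictionary groups whatever was recognized by type, and that grouping is what would be linked to the artifact's entry in the catalog.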

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
