Data preparation is the task of blending, shaping and cleansing data to get it ready for analytics or other business purposes. It begins with finding the data best suited to a specific purpose, an exercise that many users cite as frustrating and time-consuming. With so much data available from numerous sources in various formats, it's crucial to be able to discover the right data quickly. Only then can you speed up the process of getting value out of the data assets spread across your enterprise.
Creating and maintaining a comprehensive, well-documented data catalog (i.e., a metadata repository) is an essential part of making the discovery process more efficient. The catalog provides a descriptive index pointing to the location of available data. This index comprises both business and technical metadata, including technical definitions, data profiling statistics (e.g., row counts, column data types, null counts, and min, max and median column values), business terminology, source data lineage, relationships to other data, data usage recommendations, associated data governance policies and identified data stewards.
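To make the profiling statistics mentioned above concrete, here is a minimal Python sketch of how they might be computed for a small table. The function names (`profile_column`, `profile_table`) and the sample `orders` data are hypothetical and used only for illustration. The sketch also assumes each column holds a single type, which real catalog tools do not take for granted.

```python
from statistics import median


def profile_column(values):
    """Return basic profiling statistics for one column (a list of values)."""
    non_null = [v for v in values if v is not None]
    stats = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        # Simplifying assumption: the first non-null value decides the type.
        "data_type": type(non_null[0]).__name__ if non_null else "unknown",
    }
    # Min, max and median are reported for numeric columns only.
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric:
        stats.update(min=min(non_null), max=max(non_null), median=median(non_null))
    return stats


def profile_table(rows):
    """Profile every column of a table given as a list of dictionaries."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data, for illustration only.
orders = [
    {"order_id": 1, "amount": 20.0, "region": "EMEA"},
    {"order_id": 2, "amount": None, "region": "APAC"},
    {"order_id": 3, "amount": 35.5, "region": None},
]

catalog_entry = profile_table(orders)
```

Here `catalog_entry["amount"]` records three rows, one null, and a median of 27.75. Statistics like these would sit in the catalog alongside the business metadata for the same data set.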
Metadata is too often a discarded by-product of many enterprise initiatives and ad hoc projects. In her blog series on the importance of metadata, Joyce Norris-Montanari discusses why formally collecting and cataloging this metadata bridges the gaps in the enterprise’s understanding of its data. Surveying the existing data environment and creating a catalog of data assets, as David Loshin recently blogged, facilitates data usability by providing a “means for indexing definitions and descriptions and linking them to their enclosing data assets so that they can be searched using keywords, phrases and even concepts. This enables data asset discovery, so data consumers can more easily find assets that meet their needs.”
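The keyword search that Loshin describes can be sketched as a simple inverted index over asset descriptions. This is an illustrative sketch only, not how any particular catalog product works. The asset names, descriptions and function names below are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keyword_index(catalog):
    """Map each keyword to the set of data assets whose description contains it."""
    index = defaultdict(set)
    for asset, description in catalog.items():
        for word in tokenize(description):
            index[word].add(asset)
    return index


def search(index, query):
    """Return the assets whose descriptions contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical catalog entries, for illustration only.
catalog = {
    "sales.orders": "Customer orders with amount and region",
    "crm.customers": "Customer master data with contact details",
    "hr.payroll": "Monthly payroll amounts per employee",
}

index = build_keyword_index(catalog)
customer_assets = search(index, "customer")
regional_assets = search(index, "customer region")
```

In this sketch, `customer_assets` contains `sales.orders` and `crm.customers`, while the narrower query returns only `sales.orders`. Searching by phrase or concept, as the quote suggests, would need more than this exact-keyword matching.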
Making it easier for users to find relevant data also makes it easier for them to put that data to use quickly. Even better, discoverability enables self-service data preparation, empowering business users to work with data on their own and freeing up IT to work on other tasks. In the process, the entire organization becomes more productive.