It’s safe to say that no data scientist ever went into the job wanting to spend most of their time on data preparation and other data management tasks. Nor do organizations hire these valuable resources to perform such tasks. But here we are. In 2021, it’s still the reality. The time-consuming tasks that have to be done before you can start using data represent one of the worst parts of the data user experience. Not just for data scientists, but for business users too.
Instead of being able to use data to generate insights, data scientists and others waste considerable time simply finding and understanding data. Clearly, the practice of data management has advanced in many ways – for example, augmented data preparation. But there are still issues. Consider the difficulty in:
- Knowing all the data that’s available at your organization.
- Determining the content of a particular data set.
- Evaluating the data applicable to making specific business decisions.
- Measuring data quality – that is, the data’s fitness for a particular use.
- Expediting time to value for the business.
As more data sets arrive in this big data environment, the issues compound. This greatly affects the value of analytical outputs and other business uses of enterprise data assets. With so much information available from numerous data sources – and formats – it’s crucial to be able to discover the right data quickly. To help, many leading organizations have adopted data catalog tools.
So: what is a data catalog?
A data catalog is a comprehensive, well-documented metadata repository that provides an organized, descriptive and searchable inventory of business data assets. It provides a descriptive index pointing to the location of available data.
This descriptive index consists of business, technical and operational metadata, which includes:
- Business terminology.
- Technical definitions.
- Data profiling statistics (e.g., row counts, column data types, min, max, and median column values, null counts).
- Data lineage.
- Relationships to other data.
- Data usage recommendations.
- Associated data governance policies and related data stewards.
In essence, metadata management via a data catalog helps users discover, understand, trust and efficiently use enterprise data for a variety of business purposes.
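To make the data profiling statistics listed above a little more concrete, here is a minimal Python sketch that computes them for a single column. The `profile_column` function and the sample values are hypothetical, chosen only for illustration; real data catalog tools gather these statistics automatically and at much larger scale.

```python
from statistics import median

def profile_column(values):
    """Compute simple profiling statistics for one column.

    `values` is a list in which None marks a missing (null) value.
    """
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        # Data type is inferred from the first non-null value (a simplification).
        "data_type": type(non_null[0]).__name__ if non_null else None,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "median": median(non_null) if non_null else None,
    }

# Hypothetical sample column: order amounts with one missing value.
order_amounts = [120, 45, None, 300, 80]
stats = profile_column(order_amounts)
print(stats)
```

Stored alongside business terms and lineage, statistics like these let a user judge a data set's fitness for a particular use without opening the data itself.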
Discovering the data best suited for a specific business purpose is perhaps the central capability of a data catalog. Data discovery is enabled through searching, by keywords, tags, filters and other parameters. Many data catalogs automatically sort data assets by relevance, viewing frequency and usage. That makes the best data easier to find.
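The idea of searching by keyword and tag, then sorting by usage, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `CatalogEntry` fields, the `search` function and the sample data sets do not come from any specific catalog product, and real tools typically combine several relevance signals rather than usage alone.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A hypothetical, heavily simplified catalog record."""
    name: str
    description: str
    tags: set = field(default_factory=set)
    usage_count: int = 0  # how often the data set has been used

def search(entries, keyword=None, tag=None):
    """Return entries matching the keyword and tag, most used first."""
    results = []
    for entry in entries:
        text = f"{entry.name} {entry.description}".lower()
        if keyword and keyword.lower() not in text:
            continue  # keyword filter: must appear in name or description
        if tag and tag not in entry.tags:
            continue  # tag filter: entry must carry the requested tag
        results.append(entry)
    # Rank by usage so the most frequently used data sets surface first.
    return sorted(results, key=lambda e: e.usage_count, reverse=True)

# Hypothetical sample catalog.
catalog = [
    CatalogEntry("customer_orders", "Orders placed by customers", {"sales"}, 42),
    CatalogEntry("customer_profiles", "Customer contact details", {"crm", "pii"}, 17),
    CatalogEntry("web_clicks", "Raw clickstream events", {"marketing"}, 5),
    CatalogEntry("customer_churn", "Monthly churn scores per customer", {"crm"}, 63),
]

matches = search(catalog, keyword="customer")
names = [entry.name for entry in matches]
print(names)
```

Adding `tag="crm"` to the same call narrows the results to the two CRM data sets, still ordered by usage.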
By making it faster and easier for people to find and use relevant data, a data catalog improves data usability across the organization. This also enables self-service data preparation. And that empowers business users to work with data on their own, freeing IT to work on other tasks. In the process, the entire organization becomes more productive.
What makes a data catalog so valuable?
- Serves as an inventory of data assets.
- Helps users quickly find the best data for each purpose.
- Provides information for evaluating fitness of data.
- Prevents wasted time and effort spent on data discovery.
- Empowers users to focus on getting insights from data.
Data shopping
One of my favorite analogies for the usefulness of a data catalog is also one of the few reasons I leave my house these days: grocery shopping. Imagine how hard it would be to find items in a grocery store if they weren’t organized in aisles with lots of signage. And sales specials or soon-to-expire items are usually moved to a more discoverable location (like an end cap, or near the checkout counters).
The task is more complicated if you need to shop at different grocery stores. Just like data at businesses, all store layouts are different – with items organized in slightly different ways.
A good analogical bridge between grocery shopping and data catalogs is the option I sometimes choose – online ordering. Grocery store websites are driven by a digitized catalog of products. This serves as a good model for actual data catalogs, too.
Searching and sorting by brand, price, expiration date and dietary needs (e.g., gluten-free) is less time-consuming than walking up and down aisles. Recommendation engines can identify related items to add to your virtual shopping cart (e.g., if you’re buying hamburgers, maybe you need buns, ketchup, mustard, relish and sliced cheese). Perhaps most useful are ratings and reviews from other users. These help you evaluate general quality as well as fitness for specific uses (e.g., cheese that melts quickly without burning).
Data shopping is often the first – and most time-consuming – task in business uses of data. That’s especially true for data analytics. Data shopping includes:
- Knowing what data is available.
- Identifying the choices for particular subject areas and data domains.
- Evaluating the preparation tasks that might be required to use certain data.
- Learning what the data has been used for previously.
- Assessing how previous users felt about the business results the data produced.
Find more data, find more insights
The more data you can apply to a business problem, the better its potential solutions. While there’s no shortage of data available today, it’s often difficult to know what data you have and how it can be used. For example, privacy regulations require businesses to provide data security – particularly for sensitive data, like personally identifiable information – which is hard to do without knowing where that data resides.
A data catalog uses metadata to help users quickly search an organization’s entire data landscape – including sources like data lakes. It also helps them understand the data available to them and operationalize that data for insight-driving analyses and other business applications.
But the true value of a data catalog is at the user level. A data catalog gives users a powerful solution for avoiding manual workarounds or having to ask IT for help. That frees up time for them to do what they were hired to do – finding insights for the company.