Data profiling is probably the most common data quality activity that companies start with. One problem is that, when starting out with data profiling, it's easy to fall into bad habits. So in this post I'm going to outline three simple techniques that you'll find useful as you develop your "data profiling chops."
Tip #1: Adopt an “Inside-Out” and “Outside-In” Mindset
A data profiling tool can give you lots of information and insight into where problems exist, but it also generates a lot of noise. If you're examining a legacy system for the first time, you can often find literally thousands of seemingly erroneous records.
This is the challenge you'll face right out of the gate. To make your profiling activity meaningful, you have to not only inspect data from the inside and draw conclusions about the outside world; you also have to operate in reverse. Work from the outside, too. Speak to business users. Create short feedback surveys. Look at customer complaint reports. Analyse external trends and their impact on the business.
Combine this outside knowledge with your inside-out data profiling results and create a cohesive process that is not biased toward any one approach. You'll be able to answer the "who cares?" question much more easily and get the early-phase traction you so desperately need.
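The inside-out half of this process can start very simply. Here's a minimal sketch of column-level profiling over some hypothetical customer records (the data and field names are invented for illustration) - counting missing values, distinct values and the most common entries, the sort of raw signal a profiling tool surfaces before you layer outside-in context on top:

```python
from collections import Counter

def profile_column(rows, column):
    """Return simple inside-out profiling stats for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "total": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "most_common": Counter(non_null).most_common(3),
    }

# Hypothetical records: the 'country' column hides inconsistent codings
rows = [
    {"id": 1, "country": "UK"},
    {"id": 2, "country": "uk"},
    {"id": 3, "country": ""},
    {"id": 4, "country": "United Kingdom"},
]
print(profile_column(rows, "country"))
```

Note that the three distinct spellings of the same country are exactly the kind of finding that only a conversation with business users can confirm is a real problem.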
Tip #2: Adopt the Right Profiling Approach for the Right Type of Data
There are different types of data. For example, in any system you will probably find:
- Transactional Data (e.g. when we buy a tin of beans from the supermarket)
- Reference Data (e.g. the industry coding standard for the item that is updated weekly from the supplier)
- Master Data (e.g. a central master inventory record for the item)
When profiling, you have to be mindful of the impacts you’re witnessing based on the underlying type of data.
If you find a missing value in a transactional record, any issue is likely to be localised to that transactional process. Obviously the transaction can feed into downstream systems and cause knock-on problems, and there may be an underlying defect that has affected many more transactions in the past. But in terms of issue scope, the problem is confined to that transaction.
However, if you find a defect in your master data - for example, you have an incorrect product classifier for a popular stock item - then this issue could affect thousands of transactions instantly.
When you're looking at reported defects, it's not enough to look at physical counts and percentages; you need to consider the impact scope of each defect - how far will that data reach across the organisation and cause issues?
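One way to make impact scope concrete is to weight each defect by how many downstream records it touches. This sketch (the field names and records are hypothetical) treats a transactional defect as confined to its own record, while a master data defect fans out to every transaction that references the defective key:

```python
def impact_scope(defect, transactions):
    """Estimate how many transactions a defect touches.

    A transactional defect is confined to its own record; a master
    data defect affects every transaction referencing that key.
    """
    if defect["data_type"] == "transactional":
        return 1
    return sum(
        1 for t in transactions
        if t.get(defect["key_field"]) == defect["key_value"]
    )

# Hypothetical sales transactions referencing a master product record
transactions = [
    {"txn_id": 1, "product_id": "P100"},
    {"txn_id": 2, "product_id": "P100"},
    {"txn_id": 3, "product_id": "P200"},
]
master_defect = {"data_type": "master",
                 "key_field": "product_id", "key_value": "P100"}
print(impact_scope(master_defect, transactions))  # 2 transactions affected
```

Ranking defects by this kind of reach, rather than by raw counts, is what turns a profiling report into a prioritised action list.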
Tip #3: Segment Your Data Along Information Chains For Faster, More Relevant Results
Modern data profiling tools don’t have the performance challenges we used to face; they can typically cope with full volumes of data. However, this doesn’t mean it’s in your best interest to always perform an exhaustive profile of the entire data set, particularly when reporting back your findings.
Historical analysis is useful for spotting trends and quantifying long-term impacts, but when time is not on your side I find it far easier to take a slice of data that relates to a specific process, product line, service or something else that matters to the business.
For example, if you know that 80% of your revenue comes from 20% of your product line then you can often eliminate a huge amount of data to create a more focused analysis on the products that really matter to the business. Information chain analysis is critical here; you need to “walk the data paths” that information takes as it fulfills the process. Extract data along this path, segment accordingly and tell a much more compelling story.
By joining up the data along the information chain and creating an "end-to-end" profiling story, you'll be adding far more value to the business. You've also created the basis for ongoing data quality rules management and monitoring.
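The 80/20 segmentation step can be expressed directly in code. This sketch (the revenue figures are invented for illustration) picks the smallest set of products that covers a given share of revenue, so profiling effort can be focused on the records that feed those products' information chains:

```python
def top_revenue_products(revenue_by_product, threshold=0.8):
    """Return the smallest set of products covering `threshold` of revenue."""
    total = sum(revenue_by_product.values())
    selected, running = [], 0.0
    # Walk products from highest revenue down until the threshold is met
    for product, revenue in sorted(revenue_by_product.items(),
                                   key=lambda kv: kv[1], reverse=True):
        selected.append(product)
        running += revenue
        if running / total >= threshold:
            break
    return selected

# Hypothetical revenue per product line
revenue = {"beans": 500.0, "bread": 300.0, "milk": 150.0, "salt": 50.0}
print(top_revenue_products(revenue))  # ['beans', 'bread']
```

Filtering your profiling extract to just these products - and to the systems their data flows through - is a cheap way to keep the analysis both fast and relevant.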
What tips do you have for more effective data profiling? I welcome your ideas.