Data profiling is a core technique of data quality management and often the starting point for so many projects these days. Because it’s such a relatively simple technique to apply, it’s easy to overlook some of the more advanced techniques that can take your data profiling to the next level. I'd like to share some "profiling power tips" to help improve your data profiling skills.
Power tip : Segment your data
Profiling tools give you the ability to check for statistics such as uniqueness, completeness, value distribution, format distribution and a whole host of other metrics.
The problem is that profiling is often carried out on an entire set of data, but this can have the effect of seriously skewing the findings.
For example, in one past assignment we analysed a utilities system that contained millions of equipment records. We were particularly interested in the power rating value and whether it was populated or not.
Profiling this attribute in isolation told us that the organisation had a great deal of work to do in order to improve the power rating value. When we segmented the data by criteria – such as active equipment that wasn’t facing decommission – we realised the picture wasn’t as bleak as the initial profiling stats led us to believe.
The lesson here is that it’s vital to split your records into distinct subsets so you can make greater sense of what is really happening with your data quality. You will often be bombarded with questions by business users and sponsors when you present your profiling findings, so make sure you’ve explored many angles with your analysis.
Power tip : Augment your data
In order to better segment your data and add greater relevance to your profiling results, you often need to extend or augment with existing information.
For example, if you have a database of equipment that contains an identification code, then parsing out each component of this code into separate constituents or attributes can make your profiling much more effective.
Your data will often contain links to other data within the same data store, across disparate systems or even outside the organisation entirely. By pulling these information sources into your profiling environment, you can start to focus your efforts and add more business relevance to your findings.
Power tip : Transform your data
When profiling data you always have to be mindful of the differences between what we mean by "missing data" and the computer definition.
Profiling tools will typically look for NULL or empty attribute values as part of a completeness check. The problem here, of course, is that users can often add a variety of values that mean the data is missing.
For example, in one system we found more than 80 different values that translated to empty content.
A user may write values such as:
-
TBC - this means ‘To Be Confirmed’
-
Pending
-
UK - Unknown
-
"---" or various combinations such as "???"
It is critical that you translate these values into their NULL variant so your profiling tool can make sense of them and report an accurate statistic. Quite often this requires the creation of a complex data quality rule to map each value to its NULL equivalent but it is a worthwhile activity as completeness checking will be a corporate-wide activity.
Obviously you need to ensure that you adapt the rule for situations where "pending" or "UK" are valid values.
Summary
Data profiling tools are a great starting point with any data quality initiative, but take some time to learn the more advanced techniques listed in this article and employ them in your own initiatives. They will help you create far more impact and relevance to your data quality presentations and reports.