Start with the end in mind: wise words that apply to just about everything, and in the world of big data they mean we have to change the way we look at managing the data we have.
There was a time when we managed data quality, and the main goal was to meet a metric that said data should be x% accurate. I’d argue that this is no longer relevant. Now, before I’m hunted down by all the data analysts out there, I’d like to clarify that I’m referring to managing data for data’s sake. Often we manage the value out of the data right when we need it most.
Does this mean I’m advocating that we cease performing any data quality work in the data stores holding that information? The answer is yes and no. That’s helpful, isn’t it?
The answer became nebulous around the time big data architectures introduced new capabilities. Let me explain what I mean.
Context is king
Context has become king! How do we know what state we need the data in if we don’t know what it’s going to be used for?
When we clean data to meet a data quality KPI, do we strip out a lot of its value? That depends on the context in which the data will be used. For example, if you’re looking to analyse for fraud, removing anything that looks “wrong” (duplicated, transposed or simply an error) may in fact hide the very fraud that is occurring.
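To make that fraud example concrete, here’s a minimal sketch (Python with pandas; the transactions, accounts and column names are invented purely for illustration) of how a routine de-duplication step can erase exactly the pattern a fraud analyst needs to see.

```python
import pandas as pd

# Made-up card transactions: account B2 has been charged the same amount
# by the same merchant three times, a classic duplicate-billing signal.
transactions = pd.DataFrame({
    "account":  ["A1", "A1", "B2", "B2", "B2", "C3"],
    "merchant": ["ShopX", "ShopX", "ShopY", "ShopY", "ShopY", "ShopZ"],
    "amount":   [120.00, 120.00, 45.50, 45.50, 45.50, 300.00],
})

# A typical "quality" step: drop apparent duplicates to hit the accuracy KPI.
cleansed = transactions.drop_duplicates()

# The repeated-charge pattern is obvious in the raw data...
suspicious = (
    transactions
    .groupby(["account", "merchant", "amount"])
    .size()
    .reset_index(name="occurrences")
    .query("occurrences > 1")
)
print(suspicious)   # B2 / ShopY / 45.50 appears three times

# ...but invisible once the data has been "cleaned".
print(cleansed)     # one row per account, so nothing left to flag
```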
What if we’re looking at an internal data discovery process where we also want the murkiness? There are times when strong data quality is required (we certainly don’t want to send any company directors to jail), but do we now compromise the value of our data by managing quality too early?
Bring on the data swamp
We often hear about the concept of a data lake (an all-encompassing data store), with the stated goal of IT and data departments being to ensure the lake doesn’t turn into a data swamp. Well, I disagree: a data swamp is exactly what we need!
For all discovery processes we want the murkiness, because hidden value may be waiting in the deepest, darkest recesses of your transactional data. I’m certainly not advocating that this is where your analysis should occur, but it is the consolidated repository from which your data analysts (or wranglers, if you prefer) should draw.
Based on the answer or action they’re trying to determine, or on whether this is simply an internal discovery expedition, they can then identify the level of quality that is acceptable within the data.
Once value is discovered, good data governance dictates that we start again from the swamp to extrapolate, perform appropriate data quality activities and validate that the findings still hold true before moving into a well-governed operationalisation process.
So, is it realistic to create a great big data swamp? With the rise of big data technologies and the ability to use cheap commodity hardware for both storage and processing, this is now very much a possibility. The one caveat is that a comprehensive data strategy, with a solid data and process governance plan, must be in place.
Right place, right time
No longer should we be looking to manage data quality in a repository by default. The goal is to manage the quality of the data we’re about to analyse based on the context that analysis requires. Start with the end in mind, ensure your data quality processes at the point of creation are strong, discover from your swamp, and then let your data strategy and data governance capabilities take over.
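As one way of picturing that “right place, right time” approach, here’s a hedged sketch (again Python with pandas; the function names, quality rules and data are my own illustration, not a prescribed method) of keeping a single raw store and applying context-specific quality rules only at the point of analysis.

```python
import pandas as pd

# One shared "swamp": raw records kept exactly as they arrived.
raw = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", None],
    "amount":      [100.0, 100.0, -50.0, 75.0],
})

def prepare_for_fraud_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Fraud context: keep duplicates and oddities -- they are the signal."""
    return df.copy()

def prepare_for_regulatory_reporting(df: pd.DataFrame) -> pd.DataFrame:
    """Reporting context: strict quality -- dedupe, drop incomplete rows
    and reject negative amounts before anything leaves the building."""
    cleaned = df.drop_duplicates().dropna(subset=["customer_id"])
    return cleaned[cleaned["amount"] > 0]

# The same raw data, shaped differently depending on the question being asked.
print(prepare_for_fraud_analysis(raw))          # 4 rows, warts and all
print(prepare_for_regulatory_reporting(raw))    # 1 row: C1 / 100.0
```

The point of the sketch is simply that the quality rules live with the analysis, not with the repository: the swamp stays raw, and each context decides how clean is clean enough.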
For more information on creating that data strategy, download this white paper: The 5 Essential Components of a Data Strategy.