I have recently qualified as a volunteer first responder to assist ambulance crews in my rural community, which is an interesting break from the world of data.
But not a break entirely.
During my training, it occurred to me that we’re simply not equipping many data quality practitioners with the right techniques to get to the complex root causes of a data quality problem.
Tackling data quality defects is very similar to treating patients at the scene. Your first task is to assess for any serious difficulties that you can tackle immediately. The same applies to your data. If a customer order is incomplete and lacking the correct address details, you will actively chase down that information in order to keep the customer happy (not to mention the delivery driver!).
With the immediate danger over, you then look to perform some basic observations to discover a range of potential problems. In a data quality context this may involve data profiling and data quality assessment activities to help you gather some baseline statistics of the data and its lineage.
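As a concrete illustration, here is a minimal sketch of that kind of baseline profiling in Python with pandas. The file name and column set are hypothetical stand-ins for whatever dataset you happen to be assessing.

```python
import pandas as pd

# Hypothetical extract of customer orders; substitute your own dataset.
orders = pd.read_csv("customer_orders.csv")

# Baseline statistics worth recording before any remediation work starts.
profile = {
    "row_count": len(orders),
    # Share of populated (non-null) values per column.
    "completeness": (1 - orders.isna().mean()).round(3).to_dict(),
    # Distinct values per column, a quick signal for keys and categories.
    "distinct_values": orders.nunique().to_dict(),
    "duplicate_rows": int(orders.duplicated().sum()),
}

print(profile)
```

Even a crude snapshot like this gives you something to compare against later, which is the whole point of taking baseline observations.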
As first responders we are trained to use a methodology called SAMPLE, and I think elements of this can certainly benefit the data quality practitioner.
The ‘S’ stands for "signs and symptoms." These are markers that help you gain deeper insight into a patient’s condition. For example, a patient may present with a severe toothache, but when you combine that with other observations the pain could point to a possible issue with their heart.
Data quality practitioners need to spend a lot of time gathering signs and symptoms for their data defects, because this will help them not only build a case for action but also uncover some of the broader issues that are happening. Just as medical staff will seek out friends, family and bystanders for input, data quality practitioners can’t find all the answers at a computer. You need to talk with people out in the field about the signs and symptoms they’ve been witnessing.
‘A’ represents known allergies. In data quality terms these are the known problems, the things you already know are likely to cause defects. For example, during one of my past data quality assessments I found that certain workers would repeatedly make the same data entry mistakes. In another system, I knew that each time a feed arrived from a utilities servicing partner, certain attributes would contain defects.
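Those "allergies" are worth codifying as explicit checks that run on every load, rather than living in someone’s memory. Here is a rough sketch in pandas, assuming a hypothetical partner feed with meter_id and postcode columns; the rules themselves are illustrative only.

```python
import pandas as pd

# Hypothetical feed from the utilities servicing partner; column names are illustrative.
feed = pd.read_csv("partner_feed.csv", dtype=str)

# Checks that encode the defects we already know this feed is prone to.
known_allergies = {
    "missing_meter_id": feed["meter_id"].isna() | (feed["meter_id"].str.strip() == ""),
    "malformed_postcode": ~feed["postcode"].fillna("").str.match(
        r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"  # rough UK-style postcode pattern
    ),
}

for name, mask in known_allergies.items():
    print(f"{name}: {int(mask.sum())} rows affected")
```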
‘M’ stands for medication. In data quality terms this would perhaps represent known fixes. How do people fix the data currently? What workarounds are adopted? Who is responsible for maintaining them? How long do they take to implement? Lots of questions to ask here.
‘P’ stands for past medical history, and gathering it is one of the most critical skills you learn as a responder, because the presenting illness is so often caused by some past issue manifesting itself in a different form. This is obviously very relevant in data quality terms, because we need to understand what has happened before with the data under investigation.
For example, have there been any recent updates or design changes to the data? What is the data’s past health record? Has it ever been subjected to a health check? Where does the data come from? What is its history and lineage?
‘L’ (last oral intake) doesn’t translate well to data quality, but ‘E’ certainly does, because it stands for "events leading up to the emergency." From a data quality perspective this is critical, because you want to understand things like the trigger event of the episode. Was the user trying to do something different with the order process? It is also useful to understand how the problem was detected, so you can put in place measures to either prevent it or improve the reporting and alert process in the future.
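On the detection side, even a simple threshold alert wired into the load process beats waiting for a user to complain. A minimal sketch, assuming a hypothetical address_line_1 column and an arbitrary 98% completeness threshold:

```python
import pandas as pd

def completeness_alert(df: pd.DataFrame, column: str, threshold: float = 0.98) -> None:
    """Flag the column if the share of populated values drops below the threshold."""
    completeness = 1 - df[column].isna().mean()
    if completeness < threshold:
        # In practice this might post to a monitoring channel; here we just print loudly.
        print(f"ALERT: {column} completeness {completeness:.1%} is below {threshold:.0%}")
    else:
        print(f"OK: {column} completeness {completeness:.1%}")

orders = pd.read_csv("customer_orders.csv")
completeness_alert(orders, "address_line_1")
```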
Admittedly I’ve stretched the medical metaphor as far as I can take it, but I’ve learned a great deal lately by watching seasoned professionals applying the methodology in a wide variety of situations.
Working with highly experienced paramedics has shown me how important this investigative phase really is. Emergency crews all follow this system and record a considerable amount of data for each emergency. I think this sense of structure and method is something we’ve perhaps lost in the data quality profession: people generally "follow their noses" towards the more obvious causes of data issues, often overlooking deeper causes such as poor training, bad system design and a lack of inbound data SLAs.
Hopefully there are some takeaways in this article you can use to expand your own data quality analysis methodology. If so, let me know how you get on out in the field.