Many moons ago I was responsible for the technical architecture and data quality of a financial information system. When we set out to improve data quality, I found that adding a few simple attributes to the data structures made a big difference to our data quality processes.
The first attribute we added was a system source value. This was just a three-character field indicating the system on which the data was originally created. For example, data could be entered into our system via a web form or a standard application screen, and later we introduced feeds straight from supplier interfaces. Instantly we could trace defects back to the original source. Without this information it would take us literally hours to fathom out where the original information had come from. It also proved invaluable when we performed legacy migrations and moved old data onto the new platform.
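For illustration, the source values might look something like the sketch below (the codes and their meanings are invented for the example, not the ones we actually used):

```python
from enum import Enum

class SourceSystem(Enum):
    """Three-character codes identifying where a record was first created."""
    WEB = "WEB"  # entered via the web form
    APP = "APP"  # entered via the standard application screen
    FED = "FED"  # loaded via a direct supplier feed
```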
Next we added a source creation user attribute. This was a longer variable-character field holding an identifier for the person who created the original record. So if Mary Smith entered the original record via the web, that information was captured and stored. This proved invaluable because, when issues or anomalies were observed, we could go directly to the user in question and establish whether they had made a typo or whether an extreme event really was unfolding in financial circles.
After this we added timestamps for when the record was created and last updated. This helped us in several ways. First, we could reduce the amount of processing required to monitor data quality rules, because we only needed to check records that had been updated since the last check. With millions and millions of records this became a godsend (this was the mid '90s, after all!). Second, we could identify which datasets had high volatility, which gave us a clue to the most important information sources.
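A minimal sketch of that incremental check, assuming a last-checked watermark and an updated_at field on each record (the names are illustrative, not from the original system):

```python
from datetime import datetime, timezone

def records_to_check(records: list[dict], last_checked: datetime) -> list[dict]:
    """Return only the records updated since the previous quality run,
    so the rule engine does not rescan the whole dataset."""
    return [r for r in records if r["updated_at"] > last_checked]

# After a run completes, advance the watermark so the next pass
# starts from here instead of from the beginning of time.
last_checked = datetime.now(timezone.utc)
```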
Next we added a sequence number that allowed us to gauge the growth of datasets. This was particularly useful for reporting data quality over time. We could illustrate how fast a dataset was growing and therefore demonstrate both the benefits we were delivering and the impact on the business if our initiatives were pulled.
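Because the sequence number only ever increases, the growth between two reporting dates is simply the difference of the highest sequence numbers seen at each date. A trivial sketch, with assumed figures:

```python
def growth(max_seq_now: int, max_seq_previous: int) -> int:
    """Records added between two snapshots, assuming a monotonically
    increasing sequence number assigned at creation time."""
    return max_seq_now - max_seq_previous

# e.g. 150,000 new records this reporting period
print(growth(1_250_000, 1_100_000))
```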
Finally we added update attributes so we could see who last updated a record. We also kept a separate audit log of all updates, but since that log was purged periodically, it was useful to have a persistent record on the row itself.
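Pulled together, the extra attributes amount to a record shape roughly like the one below. This is a sketch only: the field names are mine, and the real system was a relational database rather than Python.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TrackedRecord:
    source_system: str    # three-character code for the originating system
    created_by: str       # identifier of the user who entered the record
    created_at: datetime  # when the record was first created
    updated_at: datetime  # when the record was last changed
    updated_by: str       # who last changed it (survives audit-log purges)
    sequence_no: int      # monotonically increasing, used to gauge growth
    # ...the business data itself follows...
```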
Although the team responsible for managing the data was small (fewer than 15), these simple techniques became invaluable in our quest for better quality data.
Hopefully this has provided food for thought on your data quality quest. Perhaps you can suggest additional attributes to add to the list?
1 Comment
Very good summary of some basic fields, which are often a good idea. For improved traceability, I would also suggest ensuring that you never delete data: use an isDeleted attribute instead, as long as the data is not sensitive, of course. The alternative would be to introduce enhanced logging capabilities, but it might be more convenient to use the existing analysis features mentioned above, by simply querying the DB for changes.
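A minimal sketch of the commenter's soft-delete suggestion (the isDeleted name comes from the comment; everything else is illustrative):

```python
from dataclasses import dataclass

@dataclass
class SoftDeletableRecord:
    data: str
    is_deleted: bool = False  # flag the row instead of physically deleting it

def delete(record: SoftDeletableRecord) -> None:
    """Mark the record deleted; the row stays queryable for traceability."""
    record.is_deleted = True

def active(records: list[SoftDeletableRecord]) -> list[SoftDeletableRecord]:
    """Filter that every normal query applies to hide soft-deleted rows."""
    return [r for r in records if not r.is_deleted]
```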