Both the fiscal crisis and the swine flu pandemic have caused a great deal of worry and panic. Worry is sometimes a good defense, but panic only leads to more panic, and more panic only leads to more trouble.
At the same time, both crises have brought to light the importance of having accurate, timely data. As data professionals, this is what we do every day. But it is not often that the stuff of our professional lives becomes talking points for the talking heads on cable TV. In the past couple of days, some of the reporters and pundits covering the swine flu story have sounded like they were leading an IT staff meeting.
If you listen to the news today, you can actually hear people talking about the importance of a denominator. “Unless we know the true denominator for the number of people who have contracted swine flu, we can’t know whether it is rarely or often fatal.” That is absolutely true, but you don’t often hear a public discussion about data quality.
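To see why the denominator matters so much, here is a toy calculation with made-up numbers (none of them describe the actual outbreak): the same count of deaths implies a very different fatality rate depending on which denominator you divide by.

```python
# Toy illustration with made-up numbers: the same death count looks very
# different depending on which denominator you trust.
confirmed_deaths = 20

lab_confirmed_cases = 400      # only the cases that were tested and reported
estimated_total_cases = 20000  # hypothetical estimate including mild, untested cases

print(f"Rate against confirmed cases: {confirmed_deaths / lab_confirmed_cases:.1%}")   # 5.0%
print(f"Rate against estimated cases: {confirmed_deaths / estimated_total_cases:.1%}") # 0.1%
```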
Unfortunately, while the crisis has drawn attention to the need for accurate data and things like valid denominators, the solutions being proposed are completely unrealistic. Some people think the answer is easy: One simple, absolutely accurate source of all data that will allow us to answer any question we have. That’s all we need. We’ll just get some computers and make a file like that. But anyone who has worked with data in the real world knows that such a “master file” will never exist.
The “master file” solution rests on the assumptions that there were never any data entry errors, that everyone who was supposed to enter data actually did so, and that there is only a single copy of that master file, housed on a super-secure server in a temperature-controlled vault in an undisclosed location. People who actually work with data know better. There will never be a single, omniscient master file. What there can be, however, is a data integration process that draws together all of the available data, performs data cleansing and merge resolution, and sets up routines to keep that process running as new data become available from the source systems.
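As a rough illustration of what that process involves, here is a minimal sketch in Python using pandas, with entirely hypothetical file and column names. It is not anyone’s production pipeline, just the shape of the work: pull the sources together, cleanse the fields you match on, resolve the duplicates, and keep enough provenance to do it all again tomorrow.

```python
import pandas as pd

def integrate(sources: list[str]) -> pd.DataFrame:
    """Pull case records from several source files, cleanse them, and
    resolve duplicates. File and column names here are hypothetical."""
    frames = []
    for path in sources:
        df = pd.read_csv(path)  # assumes columns: patient_name, birth_date, report_date, ...
        # Cleansing: normalize the fields we will match on.
        df["patient_name"] = df["patient_name"].str.strip().str.title()
        df["report_date"] = pd.to_datetime(df["report_date"], errors="coerce")
        df["source_file"] = path  # keep provenance for the metadata
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    # Merge resolution: treat records with the same name and birth date as
    # one case, keeping the most recently reported row.
    combined = combined.sort_values("report_date").drop_duplicates(
        subset=["patient_name", "birth_date"], keep="last"
    )
    return combined

# Re-run on a schedule so the process continues as new data arrive.
cases = integrate(["hospital_extract.csv", "lab_reports.csv", "clinic_dump.csv"])
print(len(cases), "unique cases after integration")
```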
But data integration is not the stuff of headlines. For most people, data integration is very boring stuff. Data integration is the stuff of IT journals, computer science courses, and data governance seminars held at the budget hotel by the airport. To tell the truth, even those of us charged with data integration tasks recognize that data work can be mind-numbing and thankless.
The IT staff charged with merging two data sets will be the first to have to report that the two systems, which everyone had assumed held basically parallel, almost duplicate data, don’t match at all. The IT staff will be the ones who need to determine whether the two twelve-year-old boys named “John Smith” in Capitol City are really two different boys or a duplicate entry for the same boy. The IT staff will be the ones who find a way to merge the massive, older FORTRAN mainframe records with the newer, sparsely populated MySQL database built by a programmer who left last year. The IT staff will be the ones who document every minor irregularity in the metadata, so that an analyst using those data next year will know that “Flu1” and “Flu1a” are actually the same variable, drawn from two different databases, one represented as numeric and the other as alpha, and that, between them, there is only an 87% match. That is data integration.
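To make one of those chores concrete, here is a small sketch of the “Flu1”/“Flu1a” problem: map the alpha codes onto the numeric ones and measure how often the two databases actually agree. Every name, code, and value below is invented for illustration; the real coding scheme would come from the metadata.

```python
# Hypothetical harmonization of "Flu1" (numeric) and "Flu1a" (alpha):
# map the alpha codes onto the numeric ones, then measure how often
# the two databases actually agree for the same patient.
ALPHA_TO_NUMERIC = {"Y": 1, "N": 0, "U": None}  # assumed coding scheme

def agreement_rate(records_a: dict, records_b: dict) -> float:
    """records_a: patient_id -> Flu1 (0/1); records_b: patient_id -> Flu1a ('Y'/'N'/'U')."""
    shared = set(records_a) & set(records_b)
    comparable = [
        pid for pid in shared
        if records_a[pid] is not None and ALPHA_TO_NUMERIC.get(records_b[pid]) is not None
    ]
    matches = sum(records_a[pid] == ALPHA_TO_NUMERIC[records_b[pid]] for pid in comparable)
    return matches / len(comparable) if comparable else 0.0

# Toy data: in real life the answer is something like "only an 87% match",
# and that fact has to be written down in the metadata for next year's analyst.
flu1  = {"p001": 1, "p002": 0, "p003": 1, "p004": 1}
flu1a = {"p001": "Y", "p002": "Y", "p003": "Y", "p004": "U"}
print(f"Agreement between Flu1 and Flu1a: {agreement_rate(flu1, flu1a):.0%}")
```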
Statistical Consultants get bonuses for creating remarkable statistical models that make highly accurate predictions. Report Writers get popular by creating beautiful dashboards and graphs that make the data easy to understand. Data Integration staff get ulcers, absolutely certain that the next update to the SQL database that houses Table 456.2B is going to invalidate the automatic merge algorithm used to correct for weaknesses in the old Table 456.2A extract.
Data integration tools have done much to increase the speed, ease, and efficiency of data integration. At the same time, databases and software have become more complex with larger systems and more one-off, undocumented custom builds floating around the enterprise. All of these must be brought together if you want a denominator you can trust.
William Carlos Williams, the American poet, once wrote:
so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
Sometimes the fate of the world can depend on the most ordinary things. Simple things that seem almost irrelevant. Things you would think you could take for granted.
Today, one of the most important pieces of information medical experts need to know is a denominator: What is the total number of people who may have contracted swine flu? Meanwhile, in countries around the world, in basement offices and cubicles and even on laptops in a conference room that just happens to be empty from 10-2 today, IT professionals are working on data integration processes. That work may be exceptionally frustrating and time-consuming. But developing a process for getting the most complete and accurate model of all of your available data is critical to being able to respond to opportunities… and emergencies. And nothing is more important than that.