One of our clients is a government agency that, among many other directives, is tasked with collecting data from many sources, merging that data into a single asset and then making that collected data set available to the public. Interestingly, the source data sets themselves represent aggregations pulled from different collections of internal transactions between any particular company and a domain of individuals within a particular industry.
The agency must then collect the data from the many different companies, link the records for each individual across those companies, sum the totals for the sets of transactions, and present the collected totals for each individual.
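To make the agency's side of this concrete, here is a minimal Python sketch of the link-then-sum step. Everything in it is an assumption for illustration: the company names, the sample rows, and the helper names `normalize_name` and `aggregate_by_individual` are hypothetical, and matching on a normalized name is a deliberately crude stand-in for real record linkage, which would typically draw on more attributes and fuzzier matching.

```python
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Build a crude linkage key: lowercase, replace punctuation with spaces, sort tokens.

    Hypothetical helper; real linkage would use more than the name alone.
    """
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def aggregate_by_individual(company_reports):
    """Link rows across companies by normalized name and sum their totals.

    company_reports maps a company name to a list of (individual, total) rows,
    where each total is already an aggregate computed inside that company.
    """
    totals = defaultdict(float)
    for rows in company_reports.values():
        for name, amount in rows:
            totals[normalize_name(name)] += amount
    return dict(totals)


# Made-up sample data: the same person is formatted differently by each company.
reports = {
    "Company A": [("Smith, Jane", 1200.0), ("Lee, Sam", 300.0)],
    "Company B": [("Jane Smith", 800.0)],
}

totals = aggregate_by_individual(reports)
for individual, total in sorted(totals.items()):
    print(individual, total)
```

Even in this toy form, the weak point is visible: the result is only as good as the linkage key, and two companies that format or spell a name differently enough will split one individual's total into two.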
This scenario poses a curious challenge: there is an integration, an aggregation, another integration, then another aggregation. But the first integration and aggregation occur behind each company's firewall, while the second set is performed by a third party, the agency. That is why I titled this blog post “Data Inte-Aggregation,” in reference to this dual-phase data consolidation that crosses administrative barriers.
In fact, it is the crossing of administrative boundaries that piqued my interest. I am confident that the readers of the Data Roundtable are keenly aware of the challenges of linking data about individual entities (such as customer or product) pulled from different internal sources, especially when those representations can differ slightly (or even more than slightly) across the different systems. However, this challenge becomes even greater when trying to link across data sets coming from different organizations, where there can be no expectation of similarity of structure or model. So over the next few weeks, we will drill down into this situation in greater detail.
1 Comment
What you describe here sounds like a design challenge. Context often must be discovered or inferred in these cases. I'd think that if your data were in Hadoop, various databases, XML or other sources, the same techniques might apply. Wouldn't data profiling also play an important role?