Data Inte-Aggregation


One of our clients is a government agency that, among many other directives, is tasked with collecting data from many sources, merging that data into a single asset, and then making that collected data set available to the public. Interestingly, the source data sets themselves represent aggregations pulled from different collections of internal transactions between a given company and a domain of individuals within a particular industry.

The agency must then collect the data from the many different companies, link each individual's records across those companies, sum the totals for the sets of transactions, and present the collected totals for each individual.

This scenario poses a curious challenge: there is an integration, an aggregation, another integration, then another aggregation. But the first set of integration and aggregation occurs behind the corporate firewall, while the second set is performed by a third party. That is why I titled this blog post “Data Inte-Aggregation,” in reference to this dual-phase data consolidation that crosses administrative barriers.
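To make the two phases concrete, here is a minimal Python sketch of the inte-aggregation flow. The record layout (an individual identifier and an amount) and the assumption that every company reports the same clean identifier are mine for illustration; real feeds are rarely that tidy, which is exactly where the trouble starts.

```python
from collections import defaultdict

def aggregate_company_transactions(transactions):
    """Phase 1 (behind each company's firewall): roll up raw transactions
    into per-individual totals. Field names here are hypothetical."""
    totals = defaultdict(float)
    for txn in transactions:
        totals[txn["individual_id"]] += txn["amount"]
    return dict(totals)

def aggregate_across_companies(company_extracts):
    """Phase 2 (at the agency): link each company's extract on a shared
    identifier and sum the totals per individual."""
    collected = defaultdict(float)
    for extract in company_extracts:
        for individual_id, total in extract.items():
            collected[individual_id] += total
    return dict(collected)

# Two companies report totals for an overlapping set of individuals.
company_a = aggregate_company_transactions([
    {"individual_id": "P-001", "amount": 250.0},
    {"individual_id": "P-002", "amount": 75.0},
])
company_b = aggregate_company_transactions([
    {"individual_id": "P-001", "amount": 100.0},
])

print(aggregate_across_companies([company_a, company_b]))
# {'P-001': 350.0, 'P-002': 75.0}
```

The interesting part, of course, is the assumption baked into phase 2: that the agency has a reliable way to decide which records from different companies describe the same individual.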

In fact, it is the crossing of administrative boundaries that piqued my interest. I am confident that the readers of the Data Roundtable are keenly aware of the challenges of linking data about individual entities (such as customer or product) pulled from different internal sources, especially when those representations can differ slightly (or even more than slightly) across the different systems. However, this challenge becomes even greater when trying to link across data sets coming from different organizations, where there can be no expectation of similarity of structure or model. So over the next few weeks, we will drill down into this situation in greater detail.
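As a toy illustration of why crossing the organizational boundary raises the stakes, consider a naive similarity match (a sketch using Python's standard difflib; the names and threshold are made up). A slight variation in representation still links, but a record structured differently by another organization falls below the threshold and is missed entirely.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate, or None if nothing clears the
    threshold. A deliberately naive stand-in for real record linkage."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, candidate = max(scored)
    return candidate if score >= threshold else None

# Representations of the same two people as reported by another organization.
other_org = ["SMITH, JONATHAN", "Maria Lopez Garcia"]

print(best_match("Maria Lopez-Garcia", other_org))  # links: 'Maria Lopez Garcia'
print(best_match("Jonathan Q. Smith", other_org))   # misses: None
```

Real matching pipelines rely on parsed and standardized attributes, multiple comparators, and survivorship rules rather than a single string score, but even a toy like this shows how quickly the problem escalates once you cannot dictate the structure of the incoming data.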


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David writes prolifically on data management best practices via the expert channel at b-eye-network.com and through numerous books, white papers, and web seminars. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.

1 Comment

  1. What you describe here sounds like a design challenge. Context often must be discovered or inferred in these cases. I'd think that if your data was in hadoop, various dbs, XML or other sources the same techniques might apply. Wouldn't data profiling also play an important role?
