Applying a modern programming model to data integration and aggregation

In my post a few weeks back, I shared a sequence of steps for the hierarchical integration of aggregate transaction data across a community of organizations, with the result published by a single coordinator. Those steps were:

  1. Each organization extracts data from a variety of sources.
  2. Transactions are organized by individual.
  3. Sets of individual transactions are aggregated (e.g., each individual's transactions are summed).
  4. The coordinator collects interim results from the community of organizations.
  5. The coordinator then sorts collected interim results by individual.
  6. The coordinator then generates the final result by finalizing the aggregation across the collected interim results.
  7. The final result is packaged for publication.
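
To make the sequence concrete, here is a minimal sketch of steps 2 through 6 in Python (the process itself is not tied to any particular platform). The function names, the (individual, amount) record layout, and the sample data are assumptions purely for illustration.

```python
from collections import defaultdict

def aggregate_by_individual(transactions):
    """Steps 2-3: organize an organization's transactions by individual and sum them locally."""
    totals = defaultdict(float)
    for individual, amount in transactions:
        totals[individual] += amount
    return dict(totals)

def coordinator_merge(interim_results):
    """Steps 4-6: collect interim results from the community, group them by individual,
    and finalize the aggregation across all organizations."""
    final = defaultdict(float)
    for org_totals in interim_results:
        for individual, subtotal in org_totals.items():
            final[individual] += subtotal
    return dict(final)

# Illustrative data: two organizations contribute partial sums for overlapping individuals.
org_a = aggregate_by_individual([("alice", 10.0), ("bob", 5.0), ("alice", 2.5)])
org_b = aggregate_by_individual([("alice", 7.0), ("carol", 3.0)])
print(coordinator_merge([org_a, org_b]))  # {'alice': 19.5, 'bob': 5.0, 'carol': 3.0}
```
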

As I was reflecting on this sequence, it occurred to me that this process was somewhat familiar, but not from the perspective of the integration. Rather, what struck me was the repetition of the same steps at different levels of an aggregation hierarchy, and it reminded me of some of the Hadoop MapReduce algorithms I have recently been thinking about.

Let’s look at it a little more carefully:

  • Step 1 loads data into an analytical environment
  • Steps 2 and 3 aggregate data by individual at each “processing node”
  • Step 4 communicates the interim results to a single coordinator
  • Step 5 “flips” the data for collection across interim results
  • Step 6 calculates the final totals

In other words, the individual calculations are the Map phase, and the collections and final aggregations are the Reduce phase – a really big example of a MapReduce approach, even if it is not actually deployed in a MapReduce environment.
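
To spell out the analogy, here is a toy map/reduce rendering of the same computation – not actual Hadoop code, just the map, shuffle, and reduce phases written out directly in Python. The map_phase and reduce_phase names and the record format are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(transaction):
    """Map: emit a (key, value) pair per transaction, keyed by individual
    (steps 1-3, performed at each organization's 'processing node')."""
    individual, amount = transaction
    return (individual, amount)

def reduce_phase(individual, amounts):
    """Reduce: finalize the aggregation per individual at the coordinator (step 6)."""
    return (individual, sum(amounts))

# Illustrative transactions drawn from across the community of organizations.
transactions = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 3.0)]

# The "shuffle": collect and sort the mapped pairs by key (steps 4-5),
# then reduce each group of values.
mapped = sorted(map(map_phase, transactions), key=itemgetter(0))
final = [reduce_phase(key, (value for _, value in group))
         for key, group in groupby(mapped, key=itemgetter(0))]
print(final)  # [('alice', 12.5), ('bob', 5.0), ('carol', 3.0)]
```
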

So that makes me think: are there other general coordinated operational scenarios that can be modeled using a similar abstract programming model? If so, are there ways to embed certain services and governance to effect some degree of standardization across the community? And if so, what types of tools, techniques, and oversight would be needed to make it work seamlessly across administrative boundaries? I would be interested to hear from readers who have had similar thoughts…

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management topics. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
