The streaming data paradigm shift for legacy systems

[Image: Streaming data in a smart city. Learn about SAS Event Stream Processing.]

No doubt about it: there's a lot of streaming data these days, especially when you consider web activity data streams, Internet of Things (IoT) devices, self-driving cars that generate loads of data, smart cities, etc. In fact, documenting the need for improved information management processes and procedures to handle the volume, velocity and variety of myriad streaming data sources is sort of like shooting fish in a barrel.

I was in a strategy meeting recently that made me rethink the concept of streaming data in the context of much more mundane (and much more “legacy”) systems. Here is the scenario. The customer is reviewing their current products used for data integration, entity resolution and master data management. One question that was raised had to do with reconsidering the approach used for identity resolution. The current approach extracts and then matches collections of records in batches, assigns unique identifiers to matched records, and then creates a master index that links all source records together that represent the same entity.
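The conventional batch approach described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not any vendor's implementation: the field names, the toy similarity function, and the O(n²) comparison pass are all hypothetical simplifications of real batch entity resolution.

```python
def similarity(a, b):
    """Toy similarity score: fraction of key fields that match exactly."""
    fields = ("name", "email", "phone")
    hits = sum(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)
    return hits / len(fields)

def batch_resolve(records, threshold=0.5):
    """Batch-style identity resolution: match records within the batch,
    assign a shared entity ID to matches, and build a master index."""
    entity_of = {}   # record index -> assigned entity ID
    next_id = 0
    for i, rec in enumerate(records):
        # Naive pairwise pass: compare against already-processed records
        for j in range(i):
            if similarity(rec, records[j]) >= threshold:
                entity_of[i] = entity_of[j]   # link to the existing entity
                break
        else:
            entity_of[i] = next_id            # no match: new entity
            next_id += 1
    # Master index: entity ID -> all source record indexes for that entity
    master = {}
    for idx, eid in entity_of.items():
        master.setdefault(eid, []).append(idx)
    return master
```

Real products replace the pairwise pass with blocking and indexing to avoid comparing every record to every other, but the output is the same in spirit: a periodically rebuilt master index linking source records to entities.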

Apparently, the client had been introduced to a number of newer MDM products that did not build a traditional master data index. Instead, these products combined some of the conventional methods of similarity scoring (that is, determining how closely related any pair of records are) with more innovative big data platform tools for distributing data, indexing objects and parallel searching to effectively create a dynamic capability for approximate record linkage. In other words, as opposed to processing batches of data to form a master index, the system maintains an inventory of entity data objects, and searches and matches records on demand.
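A minimal sketch of that search-on-demand style might look like the following. Again, this is an illustrative assumption, not a description of any specific MDM product: the blocking key, field names, and scoring are hypothetical stand-ins for the distributed indexing and parallel search those platforms actually use.

```python
class EntityStore:
    """Sketch of dynamic, on-demand record linkage: records are indexed
    under a blocking key when loaded, and matching happens at query time
    instead of in a periodic batch run that rebuilds a master index."""

    def __init__(self):
        self.index = {}   # blocking key -> list of records

    @staticmethod
    def block_key(rec):
        # Hypothetical blocking key: first three letters of the surname
        return rec["surname"][:3].lower()

    def add(self, rec):
        self.index.setdefault(self.block_key(rec), []).append(rec)

    def match(self, query, threshold=0.5):
        """Score only the candidates sharing the query's block, best first."""
        candidates = self.index.get(self.block_key(query), [])
        fields = ("surname", "email", "phone")
        scored = []
        for rec in candidates:
            score = sum(rec.get(f) == query.get(f) for f in fields) / len(fields)
            if score >= threshold:
                scored.append((score, rec))
        return sorted(scored, key=lambda pair: -pair[0])
```

The design point is that no persistent master index exists: the `match` call resolves identity at the moment it is needed, which is exactly what makes this approach feel alien to batch-oriented shops.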

If you work for an organization whose systems largely operate in batch, this approach to master data management is not only foreign, it seems to make no sense at all. For example, how could you populate the customer domain of a static data warehouse if you cannot extract that domain from your MDM system?

To what would one attribute this alternative take on MDM? A lot of it boils down to use cases and consumption patterns. Think about it: If your primary objective for record linkage and matching is to populate a data warehouse, you are not considering the operational business contexts in which the matched records are being used. This is a typical old-fashioned “consolidation-oriented” approach to MDM: Collect the data, perform integration, dump the data into a repository and don't worry about what happens next.

This is where streaming data comes in. Many business applications that deal with entity data (such as e-commerce sales, call center customer service operations, telecommunications, etc.) involve continuously streaming transactions. Sometimes these are conventional transactions (for an e-commerce application something like “customer buys product”) while others may involve micro-transactions (in a multiplayer game app, something like “player turns right”).

Where the paradigm shift begins

A system that must manage millions of entities interacting via streams of micro-transactions is not going to tolerate the latency of batch extracts, matching, indexing and storing to figure out what customers are online or which individuals are currently playing the game. Most of the data interactions are streaming, and therefore modern applications are streaming applications. So it is not unreasonable to expect that the associated data integration (and consequently entity resolution and record matching) be configured for streaming data.

So what about our old-fashioned batch applications? Here is where the paradigm shift happens. Don't think in terms of data batches; view each record processed within the batch as a single transaction in a streaming sequence of transactions. Legacy systems also process data in streams, even if those streams are configured as collections of data and processed at specific times or across cyclic periods. Shorten the interval between batches, then shorten it again, and all of a sudden your legacy system is a streaming application!
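That reframing can be made concrete with a small sketch: if the per-record logic is factored out, the same handler serves both a periodic batch loop and a record-at-a-time stream, and the only difference is how often records arrive. The handler below (a running count per customer) is a hypothetical stand-in for whatever matching or merging the real pipeline does.

```python
def handle(record, state):
    """Per-record logic shared by both paths; here, a stand-in that
    keeps a running transaction count per customer."""
    key = record["customer"]
    state[key] = state.get(key, 0) + 1

def run_batch(batches, state):
    # Legacy view: records arrive as periodic collections
    for batch in batches:
        for record in batch:
            handle(record, state)

def run_stream(stream, state):
    # Streaming view: the same records, one transaction at a time
    for record in stream:
        handle(record, state)
```

Both paths produce identical state from the same records; shrink the batch size toward one record and the two loops become the same loop, which is the paradigm shift in miniature.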

If, using that perspective, you see how the approaches to streaming data integration and dynamic MDM adequately meet the combined needs of the user communities, there may be an opportunity to modernize the master data environment.

Download a free paper: Channeling Streaming Data for Competitive Advantage

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, through his expert channel and numerous books, white papers and web seminars. His book Business Intelligence: The Savvy Manager's Guide (June 2003) has been hailed as a resource allowing readers to "gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together." His book Master Data Management has been endorsed by data management industry leaders, and he is also the author of The Practitioner's Guide to Data Quality Improvement.
