Differentiating process, persistence, and publication: Data management for analytics


As part of two of our client engagements, we have been tasked with providing guidance on a platform strategy for the analytics environment. More concretely, the goal is to assess the systems that currently compose the “data warehouse environment” and identify the considerations for selecting the optimal platforms to support the objectives of each reporting/analytics system.

Interestingly, yet not surprisingly, the systems are designed to execute data warehousing tasks of varying complexity and to support a community of users ranging from those who load extracts into Excel to build simple bar charts to sophisticated analysts performing data discovery and predictive/prescriptive analytics. Yet the siloed approach to management, design, and development of these systems has led to some unintentional replication.

The challenge is to determine what that replication really is. In some cases there is obvious data replication: a data subset is extracted from a set of files and is then loaded into an auxiliary data mart. In some cases the replication is process replication: an input data source is scanned, parsed, and transformed for loading into one data environment, and later the same data is pulled from the data environment, scanned, parsed, and transformed prior to loading into a data system that lies further downstream.
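This kind of process replication can be made concrete with a toy sketch: the same scan/parse/transform logic ends up implemented at two hops in the pipeline, once on the way into the staging environment and again on the way to the downstream mart. All names and the parsing rule here are hypothetical illustrations, not a description of any particular client system.

```python
# Toy illustration of process replication: identical parse/transform logic is
# executed at two hops in the pipeline. All names here are hypothetical.

def parse_and_transform(text):
    """Scan, parse, and normalize one raw record (the shared logic)."""
    name, value = text.split("|")
    return {"name": name.strip().lower(), "value": int(value)}

# Hop 1: source file -> staging environment.
def load_into_staging(source_lines):
    return [parse_and_transform(line) for line in source_lines]

# Hop 2: staging -> downstream mart. In the replicated design, the records are
# serialized back to text and parsed *again* with duplicate logic, instead of
# reusing the already-structured output of hop 1.
def load_into_mart(staging_records):
    reserialized = [f"{r['name']}|{r['value']}" for r in staging_records]
    return [parse_and_transform(line) for line in reserialized]  # duplicated work

staging = load_into_staging(["Alice | 10", "BOB|20"])
mart = load_into_mart(staging)
```

The duplication is invisible to either team in isolation; it only shows up when the two hops are examined side by side, which is exactly the horizontal view discussed below.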

From an enterprise architecture perspective, both of these replications seem unnecessary, especially when the processes push and pull data from platform to platform and when the processing stages are sometimes run on a mainframe system using hand-coded COBOL and at other times on a set of servers using an ETL tool. In fact, the replication is purely a byproduct of management-imposed constraints.

I inferred an interesting insight from reviewing some of the systems’ operational characteristics. First, there was one simple assertion about these data warehouses and marts: each of the systems combined three different facets of the analytics pipeline:

  1. Data ingestion, preprocessing, and transformations
  2. Data processing
  3. Facilitation of data access and publication
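The three facets above can be sketched as separable, composable stages, which is what makes it possible to reason about each one's platform independently. A minimal illustration follows; the stage functions, field names, and transformations are hypothetical stand-ins, not any client's actual logic.

```python
# Minimal sketch of the three facets as separable, composable stages.
# All field names and transformations are hypothetical illustrations.

def ingest(raw_lines):
    """Facet 1: ingestion and preprocessing -- parse and cleanse raw records."""
    records = []
    for line in raw_lines:
        fields = line.strip().split(",")
        if len(fields) == 2:                      # drop malformed rows
            records.append({"region": fields[0], "sales": float(fields[1])})
    return records

def process(records):
    """Facet 2: processing -- aggregate sales by region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return totals

def publish(totals):
    """Facet 3: access/publication -- format results for consumers."""
    return [f"{region}: {amount:.2f}" for region, amount in sorted(totals.items())]

raw = ["east,100.0", "west,250.5", "east,50.0", "bad-row"]
report = publish(process(ingest(raw)))
```

Because each stage consumes only the previous stage's output, each can in principle run on a different platform (for example, batch preprocessing on one, query processing on another) without changing the others.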

The insight was that the horizontal view of the ways all of the systems worked suggested that all of the ingestion and preprocessing stages could be consolidated, much of the data processing could be consolidated, and much of the data accessibility and presentation could be addressed using the same set of tools and technologies.

For example, the data to be loaded into system A required edits and cleansing prior to loading. System B then extracted data from system A, sorting and aggregating it while applying some transformations prior to loading. The preprocessing for both system A and system B was data-independent and highly parallelizable, and could easily be migrated to an application developed on Hadoop. However, the types of queries performed on system A and on system B were much better suited to some kind of data warehouse appliance or in-memory computing platform.
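"Data-independent" here means each record can be cleansed without reference to any other record, so the preprocessing is a pure per-record map, exactly the shape that Hadoop-style frameworks distribute across mapper tasks. A sketch, using a thread pool as a stand-in for distributed workers; the cleansing rule itself is a hypothetical example:

```python
from concurrent.futures import ThreadPoolExecutor

# The cleansing rule below is a hypothetical stand-in; the point is that it
# depends only on its own record, so the map can be split across workers
# (threads here, mapper tasks in a Hadoop-style framework) with no coordination.
def cleanse(record):
    value = record.get("value", "").strip()
    return {"id": record["id"], "value": value.upper() or "UNKNOWN"}

records = [{"id": 1, "value": " ok "}, {"id": 2, "value": ""}]

with ThreadPoolExecutor(max_workers=4) as pool:
    cleansed = list(pool.map(cleanse, records))
```

The aggregation-heavy queries, by contrast, touch many records at once, which is why they favor an appliance or in-memory engine rather than a record-at-a-time batch framework.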

In all cases, though, one could differentiate the processing and data preparation tasks from data ingestion and persistent storage, and both could be segregated from data accessibility. At the same time, the similarities of the processing, persistence, and publication components suggest that selecting best-of-breed data management tools provides a solid foundation for engineering the data management platform for analytics. In my upcoming posts, we will look at the determination of business needs, followed by the characteristic tool suite required for effective data management for analytics.




About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
