Differentiating process, persistence, and publication: Data management for analytics


As part of two of our client engagements, we have been tasked with providing guidance on a platform strategy for the analytics environment. More concretely, the goal is to assess the systems that currently compose the “data warehouse environment” and identify the considerations for selecting the optimal platforms to support the objectives of each reporting/analytics system.

Interestingly, yet not surprisingly, the systems are designed to execute data warehousing tasks of varying complexity and to support a community of users ranging from those who load extracts into Excel to build simple bar charts to sophisticated analysts performing data discovery and predictive/prescriptive analytics. Yet the siloed approach to management, design, and development of these systems has led to some unintentional replication.

The challenge is to determine what that replication really is. In some cases there is obvious data replication: a data subset is extracted from a set of files and is then loaded into an auxiliary data mart. In some cases the replication is process replication: an input data source is scanned, parsed, and transformed for loading into one data environment, and later the same data is pulled from the data environment, scanned, parsed, and transformed prior to loading into a data system that lies further downstream.
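This kind of process replication can be made concrete with a toy sketch: the same scan/parse/transform logic ends up implemented at two hops in the pipeline, once on the way into the staging environment and again on the way to the downstream mart. All names and the parsing rule here are hypothetical illustrations, not a description of any particular client system.

```python
# Toy illustration of process replication: identical parse/transform logic is
# executed at two hops in the pipeline. All names here are hypothetical.

def parse_and_transform(text):
    """Scan, parse, and normalize one raw record (the shared logic)."""
    name, value = text.split("|")
    return {"name": name.strip().lower(), "value": int(value)}

# Hop 1: source file -> staging environment.
def load_into_staging(source_lines):
    return [parse_and_transform(line) for line in source_lines]

# Hop 2: staging -> downstream mart. In the replicated design, the records are
# serialized back to text and parsed *again* with duplicate logic, instead of
# reusing the already-structured output of hop 1.
def load_into_mart(staging_records):
    reserialized = [f"{r['name']}|{r['value']}" for r in staging_records]
    return [parse_and_transform(line) for line in reserialized]  # duplicated work

staging = load_into_staging(["Alice | 10", "BOB|20"])
mart = load_into_mart(staging)
```

The duplication is invisible to either team in isolation; it only shows up when the two hops are examined side by side, which is exactly the horizontal view discussed below.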

From an enterprise architecture perspective, both of these replications seem unnecessary, especially when the processes push and pull data from platform to platform and when the processing stages are sometimes run on a mainframe system using hand-coded COBOL and at other times on a set of servers using an ETL tool. In fact, the replication is purely a byproduct of management-imposed constraints.

I inferred an interesting insight from reviewing some of the systems’ operational characteristics. First, there was one simple assertion about these data warehouses and marts: each of the systems combined three different facets of the analytics pipeline:

  1. Data ingestion, preprocessing, and transformations
  2. Data processing
  3. Facilitation of data access and publication
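The three facets above can be sketched as separable, composable stages, which is what makes it possible to reason about each one's platform independently. A minimal illustration follows; the stage functions, field names, and transformations are hypothetical stand-ins, not any client's actual logic.

```python
# Minimal sketch of the three facets as separable, composable stages.
# All field names and transformations are hypothetical illustrations.

def ingest(raw_lines):
    """Facet 1: ingestion and preprocessing -- parse and cleanse raw records."""
    records = []
    for line in raw_lines:
        fields = line.strip().split(",")
        if len(fields) == 2:                      # drop malformed rows
            records.append({"region": fields[0], "sales": float(fields[1])})
    return records

def process(records):
    """Facet 2: processing -- aggregate sales by region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return totals

def publish(totals):
    """Facet 3: access/publication -- format results for consumers."""
    return [f"{region}: {amount:.2f}" for region, amount in sorted(totals.items())]

raw = ["east,100.0", "west,250.5", "east,50.0", "bad-row"]
report = publish(process(ingest(raw)))
```

Because each stage consumes only the previous stage's output, each can in principle run on a different platform (for example, batch preprocessing on one, query processing on another) without changing the others.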

The insight was that the horizontal view of the ways all of the systems worked suggested that all of the ingestion and preprocessing stages could be consolidated, much of the data processing could be consolidated, and much of the data accessibility and presentation could be addressed using the same set of tools and technologies.

For example, the data to be loaded into system A required edits and cleansing prior to loading. System B then extracted data from system A, sorting and aggregating it while applying some transformations prior to loading. The preprocessing for both system A and system B was data-independent and highly parallelizable, and could easily be migrated to an application developed on Hadoop. However, the types of queries performed on system A and on system B were much better suited to some kind of data warehouse appliance or in-memory computing platform.
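"Data-independent" here means each record can be cleansed without reference to any other record, so the preprocessing is a pure per-record map, exactly the shape that Hadoop-style frameworks distribute across mapper tasks. A sketch, using a thread pool as a stand-in for distributed workers; the cleansing rule itself is a hypothetical example:

```python
from concurrent.futures import ThreadPoolExecutor

# The cleansing rule below is a hypothetical stand-in; the point is that it
# depends only on its own record, so the map can be split across workers
# (threads here, mapper tasks in a Hadoop-style framework) with no coordination.
def cleanse(record):
    value = record.get("value", "").strip()
    return {"id": record["id"], "value": value.upper() or "UNKNOWN"}

records = [{"id": 1, "value": " ok "}, {"id": 2, "value": ""}]

with ThreadPoolExecutor(max_workers=4) as pool:
    cleansed = list(pool.map(cleanse, records))
```

The aggregation-heavy queries, by contrast, touch many records at once, which is why they favor an appliance or in-memory engine rather than a record-at-a-time batch framework.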

In all cases, though, one could differentiate the processing and data preparation tasks from data ingestion and persistent storage, and both could be segregated from data accessibility. At the same time, the similarities of the processing, persistence, and publication components suggest that selecting best-of-breed data management tools provides a solid foundation for engineering the data management platform for analytics. In my upcoming posts, we will look at the determination of business needs, followed by the characteristic tool suite required for effective data management for analytics.




About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
