The scope of data integration has changed significantly over the past two decades. In what has essentially become the de facto standard for flowing transactional and operational data into the enterprise data warehouse, many organizations extract data from their operational systems and move it to a designated staging area. There, the extracted data is standardized, cleansed and validated, then transformed and reorganized into the target data warehouse model's format and structure in preparation for periodic batch loads. While this approach may have been satisfactory for conventional, on-premises data warehouses, the information world has changed rapidly over the past few years – creating numerous data integration challenges.
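To make the traditional pattern concrete, here is a minimal batch ETL sketch in Python of the staging-area flow described above: extract, cleanse and validate, transform into the target model, then load in a periodic batch. The source records, field names and cleansing rules are hypothetical stand-ins, not any particular warehouse's model.

```python
# A minimal batch ETL sketch of the staging-area pattern described above.
# The source records, field names, and cleansing rules are illustrative only.
from datetime import datetime, timezone

def extract(operational_rows):
    """Pull raw records from an operational system (here, an in-memory stand-in)."""
    return list(operational_rows)

def cleanse_and_validate(staged_rows):
    """Standardize formats and drop records that fail basic validation."""
    valid = []
    for row in staged_rows:
        if not row.get("customer_id"):
            continue  # reject records missing a business key
        row["order_date"] = datetime.fromisoformat(row["order_date"]).date()
        row["amount"] = round(float(row["amount"]), 2)
        valid.append(row)
    return valid

def transform(clean_rows):
    """Reshape cleansed records into the target warehouse model (a simple fact row)."""
    return [
        {
            "customer_key": r["customer_id"].strip().upper(),
            "order_date": r["order_date"],
            "amount_usd": r["amount"],
            "load_ts": datetime.now(timezone.utc),
        }
        for r in clean_rows
    ]

def load(fact_rows, warehouse):
    """Append the transformed rows to the warehouse table in one periodic batch."""
    warehouse.extend(fact_rows)

if __name__ == "__main__":
    source = [
        {"customer_id": "c-101", "order_date": "2024-03-01", "amount": "19.990"},
        {"customer_id": "", "order_date": "2024-03-01", "amount": "5.00"},  # rejected
    ]
    warehouse_fact_orders = []
    load(transform(cleanse_and_validate(extract(source))), warehouse_fact_orders)
    print(warehouse_fact_orders)
```

The point of the sketch is the shape of the pipeline: everything funnels through one staging step toward one target, on a batch schedule – exactly the assumption that the changes described below undermine.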
Many organizations now deploy reporting and analytical environments both on-premises and in the cloud. Numerous analytical applications depend on a hybrid architecture that combines information from across different physical data centers. There is greater interest in integrating data sourced both from within and outside the organization’s firewall. And as more Internet of Things (IoT) applications come online, there are increasing demands on the data warehouse to ingest a massive number of simultaneous, continuous data streams and make that data available in real time. I refer to this hybrid environment as the “extended information enterprise.” With it, the scope of data management extends beyond traditional organizational boundaries.
Extended information enterprise = data integration challenges
These drastic changes have introduced some notable challenges when it comes to data integration. Consider:
- Increased complexity. Because the reporting and analytics environment is no longer confined to a single target data warehouse/repository, data preparation and delivery have become more complex. The growing number of external data sources adds further integration complexity.
- Broader requirements. Hybrid architectures have components that store and manage data differently, leading to different integration requirements.
- Data currency. On-premises systems and cloud-based systems are refreshed at different rates, resulting in unsynchronized production cycles and refresh cadences.
- Time-to-availability. Data consumers expect immediate data availability. And while continuously streaming data sources produce data at different rates, all streams need to be ingested and processed in real time (see the sketch after this list).
- Uncertainty. The formats and structures of API- and services-based data sources are subject to unannounced changes.
- Scalability. Increased data volumes impose greater requirements for scalability; increased consumer demand imposes greater requirements for accessibility and performance. The data integration processes must accommodate both of these scalability expectations.
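To illustrate the time-to-availability point above, here is a minimal sketch, using Python's asyncio, of ingesting several continuous streams that emit at different rates and processing readings as soon as they arrive. The sensor names, emit intervals and simulated generators are assumptions for illustration; a production pipeline would consume from a message broker or streaming platform rather than in-process generators.

```python
# A minimal sketch of ingesting several continuous streams that produce data
# at different rates and processing readings as they arrive. The sensor names
# and emit intervals are hypothetical.
import asyncio
import random
import time

async def stream(name: str, interval: float, queue: asyncio.Queue):
    """Simulate one IoT source emitting readings at its own cadence."""
    while True:
        await queue.put({"source": name, "value": random.random(), "ts": time.time()})
        await asyncio.sleep(interval)

async def ingest(queue: asyncio.Queue, duration: float = 2.0):
    """Drain the shared queue continuously so readings are available immediately."""
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            reading = await asyncio.wait_for(queue.get(), timeout=0.5)
        except asyncio.TimeoutError:
            continue
        # In a real pipeline this would land in a low-latency store or trigger analytics.
        print(f"{reading['source']:>9}  {reading['value']:.3f}")

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    sources = [("sensor_a", 0.2), ("sensor_b", 0.5), ("gateway_7", 0.9)]
    producers = [asyncio.create_task(stream(n, i, queue)) for n, i in sources]
    await ingest(queue)
    for p in producers:
        p.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```

Even in this toy form, the contrast with the batch sketch earlier is clear: there is no staging window to hide behind, and every source sets its own pace.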
Old-fashioned ETL just won’t cut the mustard when we look at the emerging extended data warehouse architecture and environments. Data integration processes in these newer computing paradigms have to be hardened to withstand rapid change. For example, API-based data source owners often change their interfaces with little or no advance warning. At the same time, downstream consumers’ thirst for more information will necessitate seeking out and plugging into a steady pipeline of new data sources. And as data velocities accelerate, attempting to straddle both onsite and externally hosted systems will tax the organization’s ability to maintain synchronization and coherence.
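One way to absorb unannounced interface changes is to validate each API payload against the fields the integration expects and quarantine anything that drifts, rather than letting the load fail outright. The sketch below assumes a hypothetical customer payload and field list; it is not any particular vendor's API.

```python
# A minimal sketch of guarding ingestion against unannounced changes in an
# API-based source's payload format. The expected fields and sample payloads
# are hypothetical.
EXPECTED_FIELDS = {"id": int, "email": str, "signup_date": str}

def validate_payload(payload):
    """Return (record, issues): missing or retyped fields are flagged instead of crashing the load."""
    issues = []
    record = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            issues.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            issues.append(f"type changed for {field}: got {type(payload[field]).__name__}")
        else:
            record[field] = payload[field]
    unexpected = set(payload) - set(EXPECTED_FIELDS)
    if unexpected:
        issues.append(f"new fields appeared: {sorted(unexpected)}")
    return record, issues

if __name__ == "__main__":
    payloads = [
        {"id": 1, "email": "a@example.com", "signup_date": "2024-01-15"},
        {"id": "2", "email_address": "b@example.com", "signup_date": "2024-02-03"},
    ]
    for p in payloads:
        record, issues = validate_payload(p)
        if issues:
            print("quarantined:", issues)  # route to a review queue rather than the warehouse
        else:
            print("loaded:", record)
```

Drift detection of this kind doesn't eliminate the uncertainty, but it turns a silent breakage into a visible, manageable event.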
DevOps and agile development methodologies are sympathetic to these challenges, but it is important to recognize when platform and system engineering decisions impede your ability to quickly adapt as the number of touchpoints grows. In my next blog post, we'll look at some data integration best practices that can be adopted to address (and eliminate) the data integration challenges described here.
Learn more in this white paper: Data Integration Deja Vu