I've seen a number of articles and webinars recently that discuss data integration as a cloud-based service. So I thought it was worth exploring what this really means in the context of big data – specifically when the objective is to exploit many sources of streaming data for analytics.
My initial reaction to the concept was confusion. If I were managing a data environment whose goal was to absorb data from a diverse set of sources, I would expect to instantiate some kind of hybrid system. Such a system might comprise a large-scale storage component linked to an event stream processing component capable of absorbing many simultaneous streams.
While this configuration might be good for an event stream analytics engine, its results would still need to be forwarded to other reporting and analytics platforms. For example, I might need an event stream system to monitor social media channels as part of a brand management application, but I would still need to link the output of that monitoring system to a data warehouse with profiles of known entities (customers, prospects, etc.). This linking would be important because it could help determine the actions to be taken once a brand risk event was identified. In this scenario, all the data integration takes place within your own environment.
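To make the linking step concrete, here is a minimal Python sketch of joining flagged events to warehouse profiles and choosing an action. Everything in it is an assumption for illustration: the `BrandEvent` and `Profile` shapes, the `risk_score` field, the 0.7 threshold, and the action names are hypothetical, and a plain dictionary stands in for the data warehouse.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class BrandEvent:
    """One item emitted by the (hypothetical) event stream monitor."""
    author_handle: str
    text: str
    risk_score: float  # assumed 0..1 score assigned upstream


@dataclass(frozen=True)
class Profile:
    """A known entity, as it might be stored in the data warehouse."""
    handle: str
    segment: str  # e.g. "customer" or "prospect"


@dataclass(frozen=True)
class Action:
    event: BrandEvent
    profile: Optional[Profile]
    action: str


def choose_action(profile: Optional[Profile]) -> str:
    """Pick a follow-up based on who raised the issue (illustrative rules)."""
    if profile is None:
        return "monitor"
    if profile.segment == "customer":
        return "escalate_to_support"
    return "route_to_marketing"


def link_events_to_profiles(
    events: Iterable[BrandEvent],
    profiles: Dict[str, Profile],
    risk_threshold: float = 0.7,
) -> Iterator[Action]:
    """Keep only risky events and enrich each with its warehouse profile."""
    for event in events:
        if event.risk_score < risk_threshold:
            continue  # below the assumed threshold: not a brand risk event
        profile = profiles.get(event.author_handle)
        yield Action(event, profile, choose_action(profile))
```

The point of the sketch is only that the stream output is of limited use until it is joined with what you already know about the author; in a real deployment the lookup would be a warehouse query rather than a dictionary.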
So where would “cloud-based” big data integration happen?
Cloud-based big data integration
Before addressing the question about where cloud-based big data integration happens, we need to reconsider the concept of data integration in a world of increasing data volumes. In one case, your goal might be to hoard as much data as possible in case you want to analyze it later. In another scenario, you might want to monitor for certain data patterns that would trigger specific actions within a defined time frame – and in this case, you might not want to archive any data other than a select set of recognized events. At either extreme, though, you have to consider the data sources themselves.
In the past, much of the data to be analyzed originated within the enterprise or was stored in the enterprise. Customer profiles, product information, sales transactions, etc. were all captured by on-premises systems, and they were stored on-premises as well. Today, there are many cloud-based application services (Salesforce.com and Marketo immediately spring to mind) in which your corporate data sits in virtualized environments off-premises. Cloud-based data warehousing is gaining acceptance, as are cloud-based or hosted Hadoop environments. This means that less and less corporate data sits locally, and more and more of it sits in these more agile hosted environments.
If the integration tools run on-premises, integrating data from the different hosted environments means pulling all that data down to your systems, performing data integration tasks, and then pushing the resulting data sets back to their hosted systems. If most of your data is sitting in the cloud, it seems to make sense to push data integration to the cloud as well.
Why use cloud-based data integration?
There are two main benefits of cloud-based data integration. First, in the evolving API economy, more and more data streams are going to emerge that might be beneficial to your analyses. A cloud-based data integration platform could reduce the effort involved in integrating new sources because it allows a single team to develop access interfaces to the various sources. Second, using hosted virtualized environments can reduce the operational costs of maintaining the platform and its associated software stack.
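The first benefit is easier to picture with a small sketch. Assuming a single team owns one shared connector interface, adding a new source comes down to writing one more connector rather than another bespoke integration. The names below (`SourceConnector`, `InMemoryConnector`, `integrate`) and the field mappings are hypothetical, and in-memory rows stand in for real API calls.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


class SourceConnector(ABC):
    """Common access interface that every source is wrapped in."""

    name: str

    @abstractmethod
    def fetch(self) -> Iterator[Record]:
        """Yield records already mapped to the shared field names."""


class InMemoryConnector(SourceConnector):
    """Stand-in for a real API client; rows are supplied directly."""

    def __init__(self, name: str, rows: List[Record], field_map: Dict[str, str]):
        self.name = name
        self._rows = rows
        self._field_map = field_map  # source field name -> shared field name

    def fetch(self) -> Iterator[Record]:
        for row in self._rows:
            yield {
                target: row.get(source)
                for source, target in self._field_map.items()
            }


def integrate(connectors: Iterable[SourceConnector]) -> List[Record]:
    """Pull every source through the same interface and tag its origin."""
    merged: List[Record] = []
    for connector in connectors:
        for record in connector.fetch():
            merged.append({**record, "source": connector.name})
    return merged
```

With this shape, the consumers of `integrate` never see source-specific field names; only the connector for each source knows them.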
Do you have to go to a cloud-based data integration vendor to get these benefits? Not necessarily. With the proper planning and management, an organization can develop its own hosted data integration capability. That environment can be managed as an operational resource to which the rest of the organization subscribes. The environment could be designed and deployed as a special shared service using internal platform resources, or it could be developed internally yet deployed externally in a cloud-based environment. In the latter case, you'd still derive the same benefits.
I'm beginning to see the value in cloud-based data integration. But the deployment and implementation models still have room to evolve. Is the solution going to be developed by external parties? Will it be done by your favorite contracted system integrator? Or will it be deployed in-house?
We can revisit this topic in a few months to see how the market has continued to change…