At a recent TDWI conference, I was strolling the exhibition floor when I noticed an interesting phenomenon. A surprising percentage of the exhibiting vendors fell into one of two product categories. One group was selling cloud-based or hosted data warehousing and/or analytics services. The other group was selling data integration products.
Of course, when you think about it, this makes a lot of sense. The economics of cloud computing have shown clear benefits with software-as-a-service products like Salesforce.com, and the same paradigm significantly reduces the cost of developing and managing big data projects using tools like Hadoop, without the capital outlay for the necessary hardware. But as data moves off-premises, the need for internal data accessibility for in-house reporting does not go away. That means being able to integrate data wherever the data lives.
Therein lies the problem: as organizational applications migrate to hosted environments, so does the data. And once that data is sitting in someone else’s environment, you begin to lose control over it. Think about this: When you access customer data sitting in a SaaS CRM product, you are not only bound to the provider's internal data models, you're also constrained by its data accessibility methods. Most importantly, you're constrained by its semantics – what the data elements are, how they are specified and defined, and how the application interprets them. Alternatively, consider a hosted Hadoop deployment where the data sets are managed as schema-on-read objects, with little or no preprocessing or data quality assurance before storage.
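To make the schema-on-read contrast concrete, here is a minimal Python sketch (the field names and records are hypothetical, not taken from any particular product): raw records land in storage unvalidated, and structure is imposed only at the moment the data is read.

```python
import json

# Raw records as they might land in a hosted, schema-on-read store:
# nothing is enforced at write time, so keys and types are inconsistent.
raw_records = [
    '{"cust_id": "C001", "name": "Acme Corp", "revenue": "125000"}',
    '{"customer": "C002", "name": "Beta LLC"}',            # different key, missing field
    '{"cust_id": "C003", "name": "Gamma Inc", "revenue": 98000}',
]

def read_with_schema(line):
    """Impose a schema only at read time: normalize keys and coerce types."""
    rec = json.loads(line)
    return {
        "customer_id": rec.get("cust_id") or rec.get("customer"),
        "name": rec.get("name"),
        "revenue": float(rec["revenue"]) if "revenue" in rec else None,
    }

records = [read_with_schema(line) for line in raw_records]
```

Every consumer of this data set has to carry that normalization logic (or something equivalent) itself, which is exactly the quality burden the article describes.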
The rampant distribution of data across on-premises and off-premises environments creates a demand for what I have started to refer to as “cross-platform integration” products. In the best scenario, these products will provide three main functions:
- They will streamline data movement between off-premises hosted or cloud-based environments and the enterprise data environment.
- They will enable native access to a wider variety of off-premises data sources, especially streaming sources.
- They will allow for incorporation of data standardization and validation rules to ensure proper alignment at the integration point.
That third item is critical. It implies that developers can have the integration system semantically align concepts that have different representations or that rely on different reference data sets. This helps ensure some degree of data quality for downstream users – especially when combining data from internal systems, SaaS providers, cloud-based big data applications and numerous data streams.
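A short sketch of what that semantic alignment might look like in practice (the source names, codes, and canonical set here are all illustrative assumptions): each source system encodes customer status differently, and the integration layer maps every variant onto one canonical reference set, flagging values it cannot align rather than passing them through.

```python
# Hypothetical canonical reference set shared by downstream consumers.
CANONICAL_STATUS = {"ACTIVE", "LAPSED", "PROSPECT"}

# Hypothetical per-source mappings into the canonical set.
SOURCE_MAPPINGS = {
    "crm_saas":  {"A": "ACTIVE", "L": "LAPSED", "P": "PROSPECT"},
    "warehouse": {"active": "ACTIVE", "inactive": "LAPSED", "lead": "PROSPECT"},
}

def align_status(source, value):
    """Map a source-specific code to the canonical reference set,
    raising on anything that cannot be aligned."""
    canonical = SOURCE_MAPPINGS.get(source, {}).get(value)
    if canonical not in CANONICAL_STATUS:
        raise ValueError(f"Unmapped status {value!r} from source {source!r}")
    return canonical

print(align_status("crm_saas", "A"))      # ACTIVE
print(align_status("warehouse", "lead"))  # PROSPECT
```

The design point is that the mapping lives at the integration layer, not in each consuming report, so every downstream user sees the same vocabulary.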
What puzzles me is the absence of awareness about this challenge. At the same time, I'm surprised that vendors seem to struggle to communicate what their products do, how they do it and why anyone would care. I don't anticipate that this messaging vacuum will last long. I predict that in the next three to six months, more vendors will actively promote products addressing the need for high-quality, cross-platform integration.