What is big data integration?

It's an important and profound question, one that executives will increasingly be asking themselves in the coming years. What's more, much like the definition of big data, it's folly to think of "one" or "the right" definition of the term "big data integration."

Here's mine.

Think about most mature enterprise back-office systems for a moment. Sure, they contain the standard fields spread across multiple tables. Examples include customer ID, last name, first name and the like. Companies control the values in those fields, either through manual employee input or some type of batch processing.

An example

Let's make this less abstract. Let's look at Madrigal Communications, a fictitious Internet service provider (ISP). (Yes, the Breaking Bad reference is intentional.) Somewhere in Madrigal's internal systems, a table similar to the one I've mocked up below exists:

This is nothing new. Big data integration, however, links those fields with externally generated data – i.e., data generated by customers, partners and users. Moreover, this data isn't "born" in internal systems. Rather, it's linked or imported from outside sources.

Again, let's turn to our Madrigal example. Note the existence of a new field on the right (Twitter handle):

(Note that I possess no insider knowledge of Madrigal's systems and database tables. I'm just explaining what big data integration would look like.)

Now, via the Twitter API, Madrigal can import and link to any or all of my 25,000 or so tweets. However, that might be overkill. That is, Madrigal probably doesn't care about my über-fascinating tweets from Rush or Marillion concerts. Of greater importance are the ones that include @Madrigal, @Madrigalhelp, and/or @Madrigalcomm (the fake company's real Twitter account).

It wouldn't be terribly difficult to retrieve and store all Madrigal-related tweets in a database table, not only for me but for every customer who tweets. (Hence the term big data integration.)
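
To make the linking step less abstract, here's a minimal Python sketch. It assumes the tweets have already been fetched and sit in memory; no actual Twitter API calls are made. The Customer and Tweet classes, their field names, and the link_tweets function are all hypothetical, invented purely for illustration. The three handles come from the example above.

```python
from dataclasses import dataclass

# Handles from the example above; matching is case-insensitive.
MADRIGAL_HANDLES = {"@madrigal", "@madrigalhelp", "@madrigalcomm"}


@dataclass
class Customer:
    # Hypothetical internal record: fields the company controls...
    customer_id: int
    last_name: str
    first_name: str
    # ...plus the one externally sourced field.
    twitter_handle: str


@dataclass
class Tweet:
    # Hypothetical, already-fetched tweet (no API call happens here).
    author_handle: str
    text: str


def mentions_madrigal(text: str) -> bool:
    """Return True if any whitespace-separated token is a Madrigal handle."""
    tokens = {tok.strip(".,!?:;()").lower() for tok in text.split()}
    return not tokens.isdisjoint(MADRIGAL_HANDLES)


def link_tweets(customers, tweets):
    """Pair each Madrigal-related tweet with the customer who wrote it."""
    by_handle = {c.twitter_handle.lower(): c for c in customers}
    linked = []
    for tweet in tweets:
        customer = by_handle.get(tweet.author_handle.lower())
        # Keep only tweets from known customers that mention Madrigal.
        if customer is not None and mentions_madrigal(tweet.text):
            linked.append((customer.customer_id, tweet.text))
    return linked
```

Each (customer_id, text) pair is, in effect, one row of the "Madrigal-related tweets" table: the customer ID ties the external tweet back to the internal record. Concert tweets and tweets from non-customers simply fall out.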

What could Madrigal do with this information? Let me count the potential benefits:

Via the geolocation metadata possibly embedded in each tweet, Madrigal could see which areas experience slow connectivity or outages. Imagine having problems fixed while you sleep based on others' tweets.
Madrigal could determine which of its customers were the most active and theoretically influential on Twitter (based on proxies like number of followers). If Jim (see above) starts venting on the social network, it's more likely that he'll cause a stir – all else being equal. Perhaps it could nip the next United Breaks Guitars in the bud.
Madrigal could more effectively target ads on Twitter to customers in areas in which it operates. Consumers cannot yet pick just any cable company to service their needs. Let's say that Madrigal competes with Verizon FIOS in Austin, TX. Why not try to poach a customer by showing an ad via Twitter?
Madrigal could create alerts for its customer service reps to call me if I use certain words and/or hashtags such as #fail or #sucks. That goes double if the tweets cluster in a short period of time or persist over an extended period.
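
The alert idea in the last item can be sketched in a few lines of Python. This is only one possible rule, under stated assumptions: the one-hour window and the two-tweet threshold are values I made up, the customers_to_call function and the (customer_id, posted_at, text) tuple layout are hypothetical, and only the "short period of time" case is covered. The two hashtags are the ones mentioned above.

```python
from datetime import datetime, timedelta

ALERT_HASHTAGS = {"#fail", "#sucks"}  # hashtags from the list above
WINDOW = timedelta(hours=1)           # assumed: what counts as a "short period"
THRESHOLD = 2                         # assumed: complaints needed to trigger a call


def has_alert_hashtag(text: str) -> bool:
    """Return True if the tweet text contains one of the alert hashtags."""
    tokens = {tok.strip(".,!?:;()").lower() for tok in text.split()}
    return not tokens.isdisjoint(ALERT_HASHTAGS)


def customers_to_call(linked_tweets):
    """Return sorted IDs of customers with THRESHOLD complaints inside WINDOW.

    linked_tweets is an iterable of (customer_id, posted_at, text) tuples,
    where posted_at is a datetime.
    """
    times_by_customer = {}
    for customer_id, posted_at, text in linked_tweets:
        if has_alert_hashtag(text):
            times_by_customer.setdefault(customer_id, []).append(posted_at)

    flagged = []
    for customer_id, times in times_by_customer.items():
        times.sort()
        # Slide over the sorted timestamps: if any THRESHOLD consecutive
        # complaints fit inside WINDOW, flag the customer once.
        for i in range(len(times) - THRESHOLD + 1):
            if times[i + THRESHOLD - 1] - times[i] <= WINDOW:
                flagged.append(customer_id)
                break
    return sorted(flagged)
```

Catching complaints that drag on over an extended period would need a second, longer window with its own threshold; the same sliding check would apply.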

I could go on, but you get my point. There are certainly degrees of integration, but I wanted to keep this example relatively simple. As I'll argue in a forthcoming post, it's unwise to put all data into a single repository. In other words, it's not wise to go all-in.

Simon Says

In this simple example, I examined the potential benefits of integrating internal and relational enterprise data (read: small data) with the larger, messier stuff. Big data integration offers enormous possibilities.