Data integration: Comparing traditional sources and big data

While not on the same level as Rush, I do fancy myself a fan of The Who. I'm particularly fond of the band's 1973 epic, Quadrophenia. From the track "5:15":

Inside outside, leave me alone
Inside outside, nowhere is home
Inside outside, where have I been?

The inside-outside distinction is rather apropos when thinking about traditional data integration versus its newer, bigger, more dynamic counterpart.

Why should I care?

Many if not most organizations have historically struggled with integrating enterprise data – i.e., internal data ostensibly under their control. Even data imported via ETL or electronic data interchange (EDI) from, say, a bank or an insurance carrier was almost always, in a word, predictable. That is, Aetna or ADP would not suddenly decide to reformat its data, insert new fields and/or remove old ones willy-nilly. Third parties and software vendors have typically communicated well in advance any forthcoming data-related changes affecting their clients, lest they bury their own reps with urgent calls from irate clients. In turn, IT departments would eventually make the requisite tweaks to interfaces, batch routines and other loads to ensure seamless integration and happy employees – whatever was necessary to keep the lights on.

Of course, you probably know all of this, but what about the following?

Throw most of that accountability, predictability and control out the window with big data.

Let's say that an organization wants to import or link to LinkedIn data via one of its application programming interfaces (APIs). That's not terribly hard to do, and LinkedIn is hardly unique in encouraging external developers to take its core offering in new and exciting directions. (This is a key point in The Age of the Platform.)

But is the format of that data "locked" to the same extent as some of those aforementioned third-party sources, never mind homegrown applications? Is it as static? In each case, the answer is probably not.

LinkedIn may suddenly decide that it no longer wants its API to include a field upon which your organization relies. Perhaps LinkedIn continues to collect that information but chooses to no longer pass it along to developers. Then there are all sorts of non-data-related changes, like modifications to its developer program. Or, what about the nuclear option, like restricting and/or removing access to its public API altogether? Yes, that has happened as well.

Note that I'm not excoriating LinkedIn here. I could have just as easily used examples from Twitter, Facebook, Google and a host of other companies.

Simon Says: Big data isn't "set it and forget it."

The point here is two-fold. First, big data generally lies outside of the control of organizations. Second, it tends to be more dynamic than traditional enterprise data. As such, it's essential to closely monitor external data sources to ensure data integration within an enterprise.
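
What might that monitoring look like in practice? Here's a minimal sketch in Python, not a definitive implementation. It compares the fields in an incoming record against the fields your integration depends on and reports any drift. The field names, the DriftReport class and the check_payload function are all hypothetical; they are not LinkedIn's actual schema or API, and the sample record stands in for a real API response.

```python
from dataclasses import dataclass, field

# Hypothetical fields our integration relies on (an assumption for
# illustration, not any vendor's real schema).
EXPECTED_FIELDS = frozenset({"id", "first_name", "last_name", "headline"})


@dataclass
class DriftReport:
    """Differences between the fields we expect and the fields we received."""
    missing: set = field(default_factory=set)      # expected but absent
    unexpected: set = field(default_factory=set)   # present but not expected

    @property
    def ok(self) -> bool:
        # New fields are usually harmless; missing ones break downstream loads.
        return not self.missing


def check_payload(payload: dict, expected: frozenset = EXPECTED_FIELDS) -> DriftReport:
    """Compare the top-level keys of one record against the expected fields."""
    present = set(payload)
    return DriftReport(missing=set(expected - present),
                       unexpected=set(present - expected))


# Simulated record standing in for an external API response:
# "headline" has been dropped and "industry" has appeared.
sample = {"id": 42, "first_name": "Pete", "last_name": "Townshend", "industry": "Music"}
report = check_payload(sample)

if not report.ok:
    print(f"Missing fields: {sorted(report.missing)}")
if report.unexpected:
    print(f"New fields: {sorted(report.unexpected)}")
```

Run on a schedule against a sample of live responses, a check like this surfaces a dropped field before a nightly load fails or, worse, quietly fills a column with blanks.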

To use a sports analogy, imagine a golfer who isn't hitting the ball well with his own clubs. What happens when he has to use a loaner set? To be sure, not all 6-irons are created equal. There's going to be an adjustment period, and probably an ugly one at that.

Understand that on the practice range. Don't be surprised while you're on the first tee.