Big data integration: The case against an "all-in" approach

I've spent a great deal of time in my consulting career railing against multiple systems of record, data silos and disparate versions of the truth. In the mid-1990s, I realized that Excel could only do so much. To quickly identify and ultimately ameliorate thorny data issues, I had to up my game. I became proficient at SQL, Microsoft Access, Crystal Reports and other reporting tools precisely because much of my clients' data was incredibly messy. If I could solve these problems, then I could keep myself billable.

I'll still argue for the benefits of single data sources (where possible) until I'm blue in the face. Here, I'm talking about small data controlled by the enterprise. But what about integration with big data? Does a single data repository make sense? Should an organization go all-in?

Tabling the tables

Color me a hypocrite, but my "single data repository" line of thinking vanishes when I think about big data – the largely unstructured information that (critically) sits outside of organizational control.

Make no mistake: "Connectors" aside, you can't realistically cram what we call big data into a relational database such as Oracle or Microsoft SQL Server. Tables, schmables. Many "big-data solutions" such as Bigtable eschew normalized tables altogether. If anything, the arrow would have to go the other way. In other words, an organization would need to use a robust framework like Hadoop to store, process and retrieve "simple" structured data, semi-structured data and petabytes of the much messier unstructured stuff.

Could this be done? Sure. Can and should, though, are two very different things, especially for mature organizations. True or "total" integration means that organizations would need to rip out or significantly alter their back-office systems (read: CRM and ERP) and put absolutely everything into Hadoop. Put differently, they would have to consolidate all of their enterprise data in one place – and integrate that data with dynamic third-party and external data sources.

I'll let the enormity of that task sink in for a moment.

It's no overstatement to call this a Sisyphean task. Mission-critical reports that pointed to tables in a relational database would have to be rewritten. Ditto for dashboards, never mind the myriad ad hoc reports, Access databases and Excel spreadsheets that permeate corporate America today. I could keep going.

As I know all too well, once mature organizations configure these types of applications, they are hardly apt to tinker with them very much, and often for good reason. The drawbacks of doing so often dwarf the benefits. (For more on this, see Why New Systems Fail.)

Simon Says: Don't go all-in. Consider better options.

For mature organizations, less might be more. That is, better integration solutions include the following:

Building some type of bridge/ETL tool.
Using one of the aforementioned connectors.
Doing something creative with data virtualization.
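
To make the first option a bit more concrete, here is a minimal sketch of what a bridge/ETL job might look like. Everything in it is an assumption for illustration: the sample records, the customer_mentions table and the function names are hypothetical, and an in-memory SQLite database stands in for whatever relational system an organization actually runs. The idea is simply that the messy external data stays where it is, and only a small, structured slice of it crosses over to the relational side.

```python
import json
import sqlite3

# Hypothetical semi-structured records standing in for an external feed.
RAW_EVENTS = [
    '{"customer_id": 101, "channel": "twitter", "text": "Love the new release!"}',
    '{"customer_id": 102, "channel": "email"}',
    'not valid json',
    '{"customer_id": 101, "channel": "twitter", "text": "Support was slow today."}',
]


def extract(lines):
    """Parse each line as JSON, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def transform(records):
    """Keep only the fields the relational side needs, with defaults for gaps."""
    for record in records:
        if "customer_id" not in record:
            continue
        yield (
            int(record["customer_id"]),
            record.get("channel", "unknown"),
            len(record.get("text", "")),
        )


def load(conn, rows):
    """Write the transformed rows into a (hypothetical) reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_mentions "
        "(customer_id INTEGER, channel TEXT, text_length INTEGER)"
    )
    conn.executemany("INSERT INTO customer_mentions VALUES (?, ?, ?)", rows)
    conn.commit()


def run_bridge(conn, lines):
    """Run extract, transform and load in sequence."""
    load(conn, transform(extract(lines)))


# SQLite in memory stands in for the existing relational database.
conn = sqlite3.connect(":memory:")
run_bridge(conn, RAW_EVENTS)

summary = conn.execute(
    "SELECT customer_id, COUNT(*) FROM customer_mentions "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(summary)
```

Existing reports and dashboards keep pointing at ordinary tables; nothing in the back office has to be ripped out.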

One caveat before concluding: for greenfield startups, the benefits of an all-in approach may exceed its costs. Given the import of big data, a single data repository might make more sense and require far less heavy lifting. After all, there isn't the same volume of legacy systems and data to replace, alter and reformat.