Hadoop is not Beetlejuice

In the 1988 film Beetlejuice, the title character, hilariously portrayed by Michael Keaton, is a bio-exorcist (a ghost capable of scaring the living) hired by a recently deceased couple in an attempt to scare off the new owners of their house. Beetlejuice is summoned by saying his name three times. (Beetlejuice. Beetlejuice. Beetlejuice.)

Nowadays it seems like every time big data is discussed, Hadoop comes up in the conversation eventually. Need a way to deal with all that unstructured data? Hadoop. Need a way to integrate big data into your other applications? Hadoop. Need a replacement for your clunky old data warehouse? Hadoop.

Hadoop. Hadoop. Hadoop. It’s like saying it three times summons a data exorcist, a supernatural solution to all your big data problems. Unfortunately, neither big data nor Hadoop is that simple. I decided, therefore, to conduct a seance with a few of my fellow Data Roundtablers and exorcise three common misconceptions about Hadoop.

Structure still matters

The much-lauded schema-on-read capability of Hadoop enables the rapid acquisition and storage of data. Unlike relational databases, which require predefined data structures, Hadoop just stores data files, and those data files can be in just about any format. It is only later, when the data stored in Hadoop’s file system is processed so that it can be put to use, that structure is imposed. That second step is often conveniently forgotten, leading people to mistakenly believe that structuring data doesn’t matter with Hadoop. However, as Tamara Dull has blogged, structuring data in Hadoop environments is just as important as it is in relational environments. For more details, read her excellent post: The Hadoop experiment: To model or not to model.
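To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the file names, the sample records, and the read_orders helper are all hypothetical, made up for illustration. The raw data is stored exactly as it arrived, in two different formats, and the structure (an integer order_id and a floating-point amount) is only imposed at the moment the data is read.

```python
import csv
import io
import json

# Raw records landed as-is, in mixed formats (hypothetical sample data).
# Nothing about their structure was declared when they were stored.
RAW_FILES = {
    "orders.csv": "order_id,amount\n1001,25.50\n1002,10.00\n",
    "orders.jsonl": '{"order_id": 1003, "amount": "7.25"}\n',
}


def read_orders(name, text):
    """Impose the schema (order_id: int, amount: float) at read time."""
    if name.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(text))
    else:
        rows = (json.loads(line) for line in text.splitlines() if line.strip())
    for row in rows:
        # A record that does not fit the schema fails here, at read time,
        # not when the file was written.
        yield {"order_id": int(row["order_id"]), "amount": float(row["amount"])}


orders = [
    order
    for name, text in RAW_FILES.items()
    for order in read_orders(name, text)
]
total = sum(order["amount"] for order in orders)
```

Note that the schema did not go away. It simply moved out of the storage layer and into the reading code, which is why structuring data still matters.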

Data warehouses still matter

Enterprise data warehouses (EDW) have a long and storied history, not only throughout the data management industry, but also within your organization. Master data management (MDM) caused a kerfuffle over the continuity of the EDW story, since MDM started managing data that was also managed by EDW. This led many to initially question whether MDM would replace EDW. It didn’t. One reason is that EDW manages more than just master data. Over time, EDW and MDM became symbiotic, with each focusing on managing different data and sharing data with the other. And now Hadoop’s popularity is leading many to question whether it will replace EDW. It won’t. For more details, read Mark Torr’s excellent post: Three ways to use a Hadoop data platform without throwing out your data warehouse.

SQL still matters

In a series of recent blog posts, David Loshin discussed how, although Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics, using it effectively in this capacity raises serious concerns about data latency and query optimization. Bringing robust and efficient SQL-like capabilities to the Hadoop environment, especially to support less technical users, is an ongoing evolution. For more details, read Loshin’s post: Using Hadoop: Emerging options for improved query performance.

About Author

Jim Harris

Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.

3 Comments

  1. Eelco van Gelderen on

    I agree. But SQL is a 50 year old technology.
    This can be done entirely differently!
    And I know how.

      • Cindy Turner on

        According to Internet research, SQL was invented in the mid-1970s, but it didn't become a standard until the mid-80s. SAS was founded in 1976.
