Hadoop's evolution, part 2: Thoughts on the past, present and the future

In my last post, I described the evolution of Amazon Web Services (AWS). Although a one-to-one comparison with Hadoop isn't remotely valid for several reasons, I believe that the latter's evolution may mimic AWS's on a few levels. First, remember that most of AWS's original products, integrations (such as Active Directory) and features didn't "ship" with Amazon's pilot product. The same holds true with Hadoop. As Bob Muglia writes, Hadoop was conceived as "a batch processing system for search data; it was not designed for high-speed, interactive analytics and reporting." It's hard to argue today, though, that there's a dearth of reporting options. In fact, quite the opposite is true.

When I was researching and writing Too Big to Ignore in 2012, there was no way to get even near-real-time data from Hadoop. Today there are plenty of ways to access streaming data via Hadoop. (To see the technical details behind Impala, click here.)

What's more, five years ago one could not use Structured Query Language (SQL) to extract data from Hadoop. Most industry folks expected this domino to topple pretty quickly, and it has. By way of background, the Hadoop database – aka HBase – is non-relational. Many people forget the following:

HBase is a NoSQL database.
NoSQL stands for Not Only SQL, not no SQL (this is not merely a semantic difference).

Not long after the book's publication, the claim that SQL could not be used to extract data from Hadoop also stopped holding water. The last few years have seen no shortage of SQL interfaces designed to do exactly that.

In a similar vein to AWS, Hadoop and its complementary suite of products have evolved to meet users' needs.

If I look back over the past five years, though, perhaps the biggest change is under the hood. That is, Hadoop introduced YARN in version 2.0 and enhanced it in version 3.0. From the 2.7 release notes:

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a directed acyclic graph (DAG) of jobs.
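For readers who think better in code, the split described in that passage can be sketched as a toy model. To be clear, this is not YARN's actual API. The class names echo the quote, but every method and parameter here (allocate, release, run, total_containers) is a hypothetical stand-in, and a "container" is reduced to a simple counter. The only point is to show one global component handing out resources while each application's own master decides the order of its jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ResourceManager:
    """Global component: owns cluster capacity, knows nothing about job logic."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        # Grant as many containers as are free, up to the request.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application component: schedules and monitors one application's jobs."""

    def __init__(self, name, dag, rm):
        self.name = name
        self.dag = dag  # maps each job to the set of jobs it depends on
        self.rm = rm

    def run(self):
        finished = []
        # static_order() yields each job only after all of its dependencies.
        for job in TopologicalSorter(self.dag).static_order():
            if self.rm.allocate(1) != 1:
                raise RuntimeError(f"{self.name}: no free container for {job}")
            finished.append(job)  # stand-in for doing the real work
            self.rm.release(1)
        return finished


rm = ResourceManager(total_containers=4)

# An application can be a single job...
single_job = ApplicationMaster("wordcount", {"count": set()}, rm)

# ...or a DAG of jobs (here a simple three-step chain).
pipeline = ApplicationMaster(
    "etl",
    {"extract": set(), "transform": {"extract"}, "load": {"transform"}},
    rm,
)

print(single_job.run())
print(pipeline.run())
```

Notice that the ResourceManager never looks inside an application, and the ApplicationMaster never tracks cluster-wide capacity. That separation of concerns is the heart of the change.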

Perhaps John and Jane Q. Analyst didn't notice this, but make no mistake: this was a big architectural change designed to handle increasingly large and complex streams of data.

Simon says: The Hadoop world is still shaking out.

As for the future of Hadoop, it still seems misty, to borrow a word from Haifeng Li. The landscape is littered with a confusing array of complementary products: Tez, Pig, Hive, Impala, Kudu, Accumulo, Flume, Sqoop, Falcon, Samza, etc.

By the time this post sees the light of day, I'm sure that a few more projects will have entered the fray. Unlike with AWS, Hadoop's open-source nature means that just about anyone can create a new project or utility designed to extend its capabilities. This is at once very democratic and very confusing. As executives at Hortonworks know, however, making money in this environment is easier said than done.