In my last two posts, we reached two conclusions. First, because completing a JOIN query in Hadoop may require broadcasting data across the internal network, JOINs over files distributed via HDFS carry a potential for performance degradation. Second, some hand-engineered techniques, such as replicating tables or first rewriting queries as semi-joins, can alleviate some of those performance bottlenecks.
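To make the semi-join rewrite concrete, here is a minimal in-memory sketch in Python. The table names, columns, and data are purely illustrative; the point is the shape of the optimization: ship only the distinct join keys of the small, filtered side across the network, prune the large side locally, and only then perform the full join on the much smaller remainder.

```python
# Illustrative tables (hypothetical names and data).
customers = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
    {"id": 3, "region": "EU"},
]
orders = [
    {"cust_id": 1, "total": 50},
    {"cust_id": 2, "total": 75},
    {"cust_id": 3, "total": 20},
    {"cust_id": 1, "total": 10},
]

# Step 1: the small, filtered side yields just its join keys.
eu_keys = {c["id"] for c in customers if c["region"] == "EU"}

# Step 2: broadcast only that key set and prune the large side in place,
# instead of shipping full rows of "orders" across the network.
pruned_orders = [o for o in orders if o["cust_id"] in eu_keys]

# Step 3: the final join now touches far fewer rows.
by_id = {c["id"]: c for c in customers}
result = [{**by_id[o["cust_id"]], **o} for o in pruned_orders]
```

In a real cluster the savings come from step 2: the key set is typically orders of magnitude smaller than the rows it disqualifies, so far less data crosses the network.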
In Hadoop 1.0, the tight coupling of task management, resource management, and the MapReduce execution model constrains how many of these optimizations can be introduced automatically within the execution model. In YARN (Hadoop 2.0), however, the execution model, resource management, and task management are all decoupled, which lets developers build components that are much better at both recognizing opportunities for optimization and orchestrating tasks to achieve them.
A good example is Impala, Hadoop provider Cloudera’s high-performance SQL-on-Hadoop engine, which delivers significantly faster query response times. Impala leverages three key factors to achieve these performance improvements: decoupling from MapReduce, native development on top of YARN, and in-memory processing.
Another example is Stinger, Hortonworks’ entry in the high-performance SQL-on-Hadoop space. Key factors in Stinger’s performance improvements include a query optimizer that seeks to reduce or eliminate redundant computation, as well as pushing record filtering down to the storage-layer nodes to reduce the number of records the SQL engine itself must access. Lastly, Stinger can perform what Hortonworks refers to as vectorized query execution using Tez, yet another Hadoop component built on top of YARN. Tez parallelizes execution tasks more efficiently, reducing the lock-step dependency on MapReduce while increasing simultaneous execution wherever tasks have no explicit temporal or data dependencies.
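The two Stinger ideas above can be sketched in a few lines of Python. This is not Stinger's or Hive's actual API, just a toy model under assumed names (`pushdown_scan`, `vectorized_sum`, and the sample batch): predicate pushdown means the storage layer filters rows before the engine ever sees them, and vectorized execution means operators work on whole column batches rather than one row at a time.

```python
# A columnar batch, as a storage layer might hand it to the engine
# (hypothetical column names and data).
batch = {
    "year":  [2011, 2012, 2013, 2012, 2013],
    "sales": [100,  200,  300,  400,  500],
}

def pushdown_scan(batch, predicate):
    """Apply the filter inside the 'storage layer' and return a pruned
    batch, so disqualified rows never reach the SQL engine."""
    mask = [predicate(y) for y in batch["year"]]
    return {col: [v for v, keep in zip(vals, mask) if keep]
            for col, vals in batch.items()}

def vectorized_sum(column):
    """Consume a whole column vector in one operator call rather than
    looping row-by-row through the engine's operator tree."""
    return sum(column)

pruned = pushdown_scan(batch, lambda year: year >= 2013)
total = vectorized_sum(pruned["sales"])  # only 2 of the 5 rows reach the engine
```

Real engines get the vectorization payoff from tight loops over contiguous column memory (cache locality, fewer per-row function calls); the sketch only shows where the filtering and batching happen in the pipeline.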
Clearly, the developer community recognizes the inherent performance bottlenecks of the Hadoop 1.0 execution model and has embraced a more flexible approach that will enable much faster internal componentry and improved query optimization. The implication for the less technical data consumer is that it is only a matter of time before the SQL-on-Hadoop tools in the Hadoop ecosystem mature into a real contender for more than just data lake storage augmentation and batch algorithmic execution.
Want to try a data management tool designed for the Hadoop environment? Take a free trial of Data Loader for Hadoop.