Using Hadoop: Emerging options for improved query performance


In my last two posts, we concluded two things. First, because completing a JOIN query in Hadoop can require broadcasting data across the internal network, JOINs over files distributed using HDFS are prone to performance degradation. Second, some techniques that can be hand-engineered (such as replicating tables or rewriting your queries as semi-joins first) can alleviate some of these performance bottlenecks.
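To make the semi-join rewrite concrete, here is a minimal sketch in plain Python (not Hadoop code, and the table names are invented for illustration): rather than shuffling every row of a large table across the network, only the small table's join keys are broadcast first, the large table is filtered locally, and just the surviving rows are shipped for the final join.

```python
# Toy illustration of the semi-join rewrite: broadcast only the small
# table's join keys, filter the large table locally, then join the
# (much smaller) surviving row set.

orders = [  # large table: (order_id, customer_id, amount)
    (1, "c1", 250), (2, "c2", 75), (3, "c9", 310), (4, "c1", 40),
]
customers = [  # small table: (customer_id, region)
    ("c1", "EU"), ("c2", "US"),
]

# Step 1: broadcast only the join keys of the small table.
customer_keys = {cid for cid, _ in customers}

# Step 2: filter the large table locally against those keys (the semi-join).
surviving_orders = [row for row in orders if row[1] in customer_keys]

# Step 3: perform the full join on the reduced row set.
region_by_customer = dict(customers)
joined = [
    (order_id, cid, amount, region_by_customer[cid])
    for order_id, cid, amount in surviving_orders
]
```

In this sketch, only three of the four order rows survive the key filter, so only those rows would need to move across the network for the final join.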

In Hadoop 1.0, the tight coupling of task management, resource management, and the MapReduce execution model makes it difficult to introduce some of these optimizations automatically within the execution model. In YARN (Hadoop 2.0), however, the execution model, resource management, and task management are all decoupled, which allows developers to devise components that are much better at both recognizing opportunities for optimization and orchestrating the tasks to achieve those optimizations.

A good example is Impala, Hadoop provider Cloudera’s high-performance SQL-on-Hadoop engine, which delivers significantly faster query response times. Impala leverages three key factors to achieve these performance improvements: decoupling from MapReduce, native development on top of YARN, and in-memory processing.

Another example is Stinger, Hortonworks’ entry into the high-performance SQL-on-Hadoop arena. Key factors in Stinger’s performance improvements include a query optimizer that seeks to reduce or eliminate redundant computation, as well as pushing record filtering down to the storage-layer nodes to reduce the number of records that the SQL engine itself must access. Lastly, Stinger can perform what is referred to as vectorized query execution using Tez, yet another Hadoop component built on top of YARN, which parallelizes execution tasks more efficiently, both reducing the lock-step dependency on MapReduce and increasing simultaneous execution as long as there are no explicit temporal or data dependencies among the tasks.
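The idea behind vectorized query execution can be sketched in plain Python (this is an illustration of the general technique, not Stinger or Tez code; the functions and data are invented): instead of evaluating the predicate one row per operator call, the engine evaluates it over a whole batch of column values at a time, amortizing per-row interpretation overhead.

```python
# Toy illustration of row-at-a-time versus vectorized (batch-at-a-time)
# execution of a simple filter-and-sum query.

amounts = [250, 75, 310, 40, 120, 15]

def row_at_a_time(rows, threshold):
    # Classic execution: the predicate is evaluated once per row.
    total = 0
    for amount in rows:
        if amount > threshold:
            total += amount
    return total

def vectorized(batches, threshold):
    # Vectorized execution: each call processes a whole batch of values,
    # applying the predicate across the batch in one pass.
    total = 0
    for batch in batches:
        selected = [a for a in batch if a > threshold]
        total += sum(selected)
    return total

batch_size = 3
batches = [amounts[i:i + batch_size] for i in range(0, len(amounts), batch_size)]
```

Both paths compute the same answer; the batch-oriented organization is what lets a real engine apply the predicate with tight, cache-friendly loops over columnar data.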

Clearly, the developer community recognizes the inherent performance bottlenecks of the Hadoop 1.0 execution model and has embraced a more flexible approach that will enable much faster internal componentry and improved query optimization. The implication for the less technical data consumer is that it is just a matter of time before the SQL-on-Hadoop tools in the Hadoop ecosystem mature enough to become a real contender for more than just data lake storage augmentation and batch algorithmic execution.

Want to try a data management tool designed for the Hadoop environment? Take a free trial of Data Loader for Hadoop.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at
