Using Hadoop: Query optimization


In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed to be compared with all the records in a second data set. The need to look at the cross-product of the two data sets lead to a data latency nightmare, requiring broadcasts of each data chunk to all the computing nodes that flood the network.

There is no doubt that the culprit is not just the data distribution but the ways the data chunk are accessed. There are some ideas that can be adopted to minimize the impacts of increased data latency, including:

  • Table replication – If the issue is that all the chunks of one of the tables need to be broadcast to all of the compute nodes to execute a specific JOIN, then you will probably have prior knowledge of the need for the broad-based data access. If one of the tables is comparatively smaller than the other, instead of distributing the smaller table, it might make sense to store a full copy of it aligned with each of the compute nodes. That way, that table is always physically close to where the data needs to be, reduces the need for broadcasting, and eliminates the network bottleneck.
  • Semi-joins – The naïve approach presumes that all of the records need to be accessed by each compute node. However, this is often not really the case. Most JOIN query conditions filter the result set based on the inspected data attributes, and you can reduce the number records that need to be shared if you can precomputed the filter condition. One way to do this is by breaking the JOIN into composite SELECTs: pull out the unique values of the data attributes in the condition, then ship just that reduced column to where the chunks of the second table are stored, and only select those records where that part of the JOIN condition is true, and then send only those filtered records to the compute nodes. This process, called a semi-join, can dramatically reduce the amount of data injected into the network, which reduces latency as well as diminishes the impact of the network bottleneck.
  • Pipelining – Rather, what I mean here is some kind of means of overlapping the exchange of the chunks of the table needed for the next phase of executing the query with the execution of the current phase of the query. In regular intervals, each compute node receives one specific chunk of the second table and begins to execute that part of the JOIN. At the same time, the compute node shifts that same chunk to its neighboring compute node, which can cache it until the current phases is done executing. At that point, the required data has already been delivered. Even though data is still being communicated over the network, overlapping the data exchange with ongoing execution masks the latency and reduces its overall impact.

These are some strategies for tweaking your reporting and analysis on Hadoop. However, as the Hadoop ecosystem becomes more sophisticated, new components are being developed (as I will discuss next time) that will relieve the developer from the burden of hands-on query optimization.

Learn more about a technology designed to allow business users to manage big data - and sign up for a free trial.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at

Related Posts

1 Comment

  1. Pingback: Hadoop Happenings: Market Rumblings - Qubole

Leave A Reply

Back to Top