Using Hadoop: Query optimization

In my last post, I pointed out that an uninformed approach to running queries on top of data stored in Hadoop HDFS may lead to unexpected performance degradation for reporting and analysis. The key issue had to do with JOINs in which all the records in one data set needed to be compared with all the records in a second data set. The need to look at the cross-product of the two data sets lead to a data latency nightmare, requiring broadcasts of each data chunk to all the computing nodes that flood the network.

There is no doubt that the culprit is not just the data distribution but the ways the data chunk are accessed. There are some ideas that can be adopted to minimize the impacts of increased data latency, including:

Table replication – If the issue is that all the chunks of one of the tables need to be broadcast to all of the compute nodes to execute a specific JOIN, then you will probably have prior knowledge of the need for the broad-based data access. If one of the tables is comparatively smaller than the other, instead of distributing the smaller table, it might make sense to store a full copy of it aligned with each of the compute nodes. That way, that table is always physically close to where the data needs to be, reduces the need for broadcasting, and eliminates the network bottleneck.
Semi-joins – The naïve approach presumes that all of the records need to be accessed by each compute node. However, this is often not really the case. Most JOIN query conditions filter the result set based on the inspected data attributes, and you can reduce the number records that need to be shared if you can precomputed the filter condition. One way to do this is by breaking the JOIN into composite SELECTs: pull out the unique values of the data attributes in the condition, then ship just that reduced column to where the chunks of the second table are stored, and only select those records where that part of the JOIN condition is true, and then send only those filtered records to the compute nodes. This process, called a semi-join, can dramatically reduce the amount of data injected into the network, which reduces latency as well as diminishes the impact of the network bottleneck.
Pipelining – Rather, what I mean here is some kind of means of overlapping the exchange of the chunks of the table needed for the next phase of executing the query with the execution of the current phase of the query. In regular intervals, each compute node receives one specific chunk of the second table and begins to execute that part of the JOIN. At the same time, the compute node shifts that same chunk to its neighboring compute node, which can cache it until the current phases is done executing. At that point, the required data has already been delivered. Even though data is still being communicated over the network, overlapping the data exchange with ongoing execution masks the latency and reduces its overall impact.

These are some strategies for tweaking your reporting and analysis on Hadoop. However, as the Hadoop ecosystem becomes more sophisticated, new components are being developed (as I will discuss next time) that will relieve the developer from the burden of hands-on query optimization.