Hadoop is increasingly being adopted as the go-to platform for large-scale data analytics. However, it is not necessarily clear that Hadoop is always the optimal choice for traditional data warehousing, reporting, and analysis, especially in its “out of the box” configuration. That is because Hadoop itself is not a database, even though some data organization methods are firmly ingrained within its distributed architecture.
The first is the distributed file organization itself – the Hadoop Distributed File System, or HDFS. While HDFS is designed to provide linear scalability for capturing large data volumes, aspects of its data organization will impact the performance of reporting and analytical applications.
Hadoop is deployed across a collection of computing and data nodes, and your data files are correspondingly distributed across the different data nodes. Because fault tolerance is one of Hadoop’s foundational aspects, there is an expectation of potential component failure. To mitigate this risk, HDFS not only distributes your file, it also replicates the different chunks across different nodes, so that if one node fails the data is still accessible at another one. In fact, by default the data is stored redundantly three times.
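You can see this layout for yourself through the HDFS client API, which reports where each chunk (block) of a file lives and how many replicas it carries. The following is a minimal sketch, assuming a client configured to reach your cluster; the class name and file path are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and related settings from core-site.xml /
        // hdfs-site.xml on the classpath; assumes the client can reach the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path -- substitute a file on your own cluster.
        Path file = new Path("/data/warehouse/orders.csv");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation describes one chunk of the file and the set of
        // data nodes holding a replica of that chunk.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        System.out.println("Replication factor: " + status.getReplication());
    }
}
```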
Redundancy leads to two performance impacts. The first is from a sizing perspective – your storage requirement will be three times the size of your data set! The second involves the time for data loading; because the data is replicated, it has to be written to disk three times, increasing the time it takes to store the file. The impacts of data redundancy are, for the most part, static, and can be addressed by changing the default replication factor.
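As a rough sketch of what “changing the default” looks like: dfs.replication is the actual Hadoop property that controls the replica count (normally set cluster-wide in hdfs-site.xml), while the file path below is just a placeholder. The replication factor can be lowered for new writes through the client configuration, or for an existing file through the FileSystem API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files written by this client will carry two replicas instead of three;
        // the cluster-wide default lives in hdfs-site.xml under dfs.replication.
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(conf);

        // For a file that already exists, the replica count can be changed in place.
        // (Hypothetical path -- substitute your own.)
        fs.setReplication(new Path("/data/warehouse/orders.csv"), (short) 2);
    }
}
```

Keep in mind that dropping from three replicas to two trades away some fault tolerance in exchange for roughly a one-third reduction in storage and write volume, so it is a judgment call rather than a free win.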
Hadoop also provides data organization for reporting: Hive is Hadoop’s standard SQL interface for running queries. However, it is important to recognize that despite the promise of high-performance execution, data distribution has a more insidious performance impact on reporting because of data access latency.
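For context, a typical way to issue such queries is through HiveServer2’s JDBC driver. The sketch below is illustrative only; the host, port, database, user, and table names are assumptions, not anything specific to a particular cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the host, port, and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // A hypothetical reporting table and query.
            ResultSet rs = stmt.executeQuery(
                    "SELECT order_id, amount FROM sales LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```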
Data access latency (sometimes just referred to as “latency”) is the time it takes for data to be accessed from its storage location and brought to the computation location. For most simple queries (that is, filtering data sets using conditional SQL SELECT queries), data access latency is not an issue. Because each computation node works on the chunks of the data set stored locally – that is, the computation is aligned with the data distribution – the latency is limited to the time it takes for the records to be streamed from disk.
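Continuing with the Statement (stmt) from the hypothetical connection sketched above, a purely conditional filter is the benign case: assuming a plain table scan, Hive typically compiles it into a map-only stage in which each task reads the blocks local to it, so little beyond the final result crosses the network. The table and column names remain hypothetical.

```java
// A simple conditional filter -- the benign, data-local case.
// EXPLAIN prints the compiled plan; for a query like this there is
// typically a single map-only stage and no shuffle between nodes.
ResultSet plan = stmt.executeQuery(
        "EXPLAIN SELECT order_id, amount FROM sales WHERE amount > 1000");
while (plan.next()) {
    System.out.println(plan.getString(1));
}
```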
The problem arises with more complex queries (e.g., JOINs), where potentially all the records in one table are compared with all the records in another table. Consequently, each record in one of the tables has to be accessed by all of the computing nodes. If the distribution of the data chunks is not aligned with the computing nodes, data must be sent over the interconnection network from the data node to all of the computing nodes. Between the time it takes to access the data, package it for network transmission, and push it through the network, the result is dramatically increased latency, which is only made worse by the bursty style of the communications, which taxes the network bandwidth.
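By contrast, a join forces rows from both tables to be brought together on the same nodes. Under the same hypothetical connection and tables, the EXPLAIN below would surface the extra shuffle (reduce) stage that a common join introduces, which is where the network latency described above is paid.

```java
// A join between two distributed tables. Unless one side is small enough
// for a map-side join, Hive's common join repartitions both inputs on the
// join key, so rows are packaged and shipped across the network to the
// reducers -- exactly the latency cost described above.
ResultSet joinPlan = stmt.executeQuery(
        "EXPLAIN SELECT c.region, SUM(s.amount) " +
        "FROM sales s JOIN customers c ON s.customer_id = c.customer_id " +
        "GROUP BY c.region");
while (joinPlan.next()) {
    System.out.println(joinPlan.getString(1));
}
```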
In other words, aside from embarrassingly parallel queries, SQL-style reporting may not perform as well as might naively be expected from a high-performance analytics platform. In my next post, we will consider some ways to optimize queries so that they are less impacted by the data organization.