I recently discovered this technical white paper on SAS’ customer support site called Data Modeling Considerations in Hadoop and Hive, written by one of SAS’ R&D teams. I was intrigued by the team’s findings, so in this post, I want to share its highlights – without getting into the technical weeds.
As the paper points out, there’s a lot of technical information about Hadoop, but not a lot about how to effectively structure data in a Hadoop environment. The question is: Do we even need to structure (model) the data or can we just dump it all into Hadoop and figure it all out “later”?
Read on and see what this team discovered.
The objective of this project was to show that how data is modeled in a big data environment is just as important as it is in a traditional, relational environment.
Since the project team works for SAS, they used SAS’ customer support website (support.sas.com) as the data source and came up with test questions/queries typically used to analyze website traffic data.
The test environment included the following hardware and software components:
- Dell server rack with 10 blades
- Cloudera’s Distribution including Apache Hadoop version 4.2.1
- Base SAS 9.3
- A major relational database
The business questions
Here are the five business questions (SQL queries) based on SAS’ customer support site that were used in the experiments:
- For a given week, which top-level support directories get the most traffic? List the directory names and the visitor count.
- Which support pages get the most referrals from Google search? By month, list the pages and the visitor count.
- For a given year, what are most popular search terms on the support site? List the search terms and how many times each one is used.
- For a given week, how many visitors use the Safari browser to view each page? List the pages and the visitor count.
- For a given week, how many visitors spend more than 10 seconds viewing each page? List the pages and the visitor count.
The experiments and results
Below is a brief summary of the five experiments, along with results, that were conducted with each of the queries above:
- Experiment 1: Flat file versus star schema
- This experiment determined which structure – star schema or flat file – performs better in the RDBMS and Hadoop environments.
- Result: The star schema structures in both environments performed significantly faster than the flat file structures.
- Experiment 2: Compressed schema files
- This experiment only applied to the Hadoop environment. The data in Hadoop was converted from uncompressed text files to compressed sequence files to determine whether the file type made a difference in query performance.
- Result: The compressed sequence file format significantly improved query performance.
- Experiment 3: Indexes
- Indexes were applied to the appropriate columns in both the RDBMS and Hadoop environments.
- Result: With the exception of the query 3 (popular search terms) in the Hadoop environment, adding indexes provided a significant performance gain for all queries in both environments.
- Experiment 4: Partitioning
- In this experiment, partitioning was added to the same date column in both environments. A partition was created for every date value.
- Result: As in experiment 3, query 3 performed significantly slower with partitioning. The rest of the queries performed very well in both environments.
- Experiment 5: Cloudera Impala
- This experiment only applied to the Hadoop environment. It compared using Hadoop’s built-in processing engine, MapReduce, and Hive to Impala, a distributed query access engine developed by Cloudera. The queries were executed on compressed sequence files (see experiment 2) with indexes.
- Result: The results were mixed. Three queries ran much faster (queries 1, 2 and 4) and the other two ran slower, especially query 3.
The key takeaways
And, finally, here’s a summary of the conclusions drawn from this project:
- Structuring data in Hadoop/Hive is as important as it is in a RDBMS environment.
- Storing data in a compressed sequence file format alone can improve performance by more than 1,000 percent.
- Indexing and partitioning can significantly improve performance in a Hadoop/Hive environment (like it does in a RDBMS environment) since it reduces the amount of data to be processed.
- We need to understand our data and the underlying technology in Hadoop to effectively tune our data structures.
It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
To model or not to model is no longer the question. How and when to model is.
If you’d like to see the technical details of this report, you can download the full, 26-page report here: Data Modeling Considerations in Hadoop and Hive. No registration is required. You’re welcome.