Many people perceive big data management technologies as a “cure-all” for their analytics needs. But I would be surprised if any organization that has invested in developing a conventional data warehouse – even on a small scale – would completely rip that data warehouse out and immediately replace it with a NoSQL database or a Hadoop cluster. More likely, early adopters will create a hybrid environment that incorporates a variety of technologies for specialized purposes.
Alternatively, you could adapt different technologies to support and improve existing platform components. An interesting scenario is using Hadoop to augment storage for an existing data warehouse. Despite the size of many enterprise-class data warehouses, most queries access only a small percentage of the data. Here's a practical example. A team member recently studied a set of queries against a data warehouse comprising approximately 150 tables. He reported that, over a specific time period, more than 67% of the queries touched only three of those tables, more than 95% touched no more than eleven tables, and 99% touched no more than twenty-four tables. In other words, the vast majority of accesses involved a very small percentage of the tables in the entire data warehouse.
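To make that kind of study concrete, here is a rough sketch of how you might compute those coverage numbers from a query log. It assumes the log has already been parsed into the set of tables each query touched; the query_log structure and the sample values are purely illustrative and not taken from the actual study:

```python
from collections import Counter

# Hypothetical input: one entry per query, each entry is the set of
# warehouse tables that query touched (parsed from the query log).
query_log = [
    {"orders", "customers"},
    {"orders"},
    {"orders", "customers", "products"},
    # ... thousands more entries in a real study
]

# Count how often each table is touched across all queries.
table_hits = Counter(t for tables in query_log for t in tables)

# Rank tables from most to least frequently touched.
ranked = [t for t, _ in table_hits.most_common()]

# For the top-k hottest tables, report what fraction of queries
# could be answered entirely from those tables.
for k in (3, 11, 24):
    hot = set(ranked[:k])
    covered = sum(1 for tables in query_log if tables <= hot)
    print(f"top {k:>2} tables cover {covered / len(query_log):.0%} of queries")
```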
I've seen other analyses with similar results, suggesting that the overall average query execution time will drop dramatically if you adopt technology components that speed up access to the tables that are queried most often. One approach is to move the most frequently accessed data into memory, because accessing data that sits in memory is much faster than accessing data stored on disk. Columnar data layouts are often optimized to align data in ways that enable compression, which in turn lets you hold a lot of data in memory.
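As a minimal illustration of why a sorted columnar layout compresses so well, here is a toy run-length encoding sketch; the column values are made up and no particular in-memory database is implied:

```python
from itertools import groupby

# Hypothetical low-cardinality column, as it might appear after sorting
# a columnar extract of a fact table.
region_column = ["EAST"] * 40_000 + ["NORTH"] * 25_000 + ["WEST"] * 35_000

# Run-length encode: store each distinct value once, with its run length.
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(region_column)]

print(encoded)  # [('EAST', 40000), ('NORTH', 25000), ('WEST', 35000)]
print(len(region_column), "values compressed to", len(encoded), "runs")
```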
In-memory databases
The in-memory database is one big data innovation that can be used to increase performance. But despite the lowered cost of main memory, the cost may still be prohibitive if you're considering moving entire tables into this faster medium. In addition, some parts of a table may be accessed more frequently than others (for example, certain filters applied to the queries isolate a set of frequently touched records). And even though a large percentage of the queries touch a small number of tables, there will still be queries that access less frequently viewed parts of the data warehouse. In other words, some parts of the data warehouse are “hotter” than others, and the really hot data can be moved into memory.
But what about the “warm, cool or cold” data? Where should that data be stored?
If you're using a dedicated data warehouse appliance, it may be engineered with its own high-speed disks (perhaps even solid-state drives that are much faster than conventional disk drives). That may be the right place for warm data. Cooler data can be moved to a platform for storage augmentation, such as a Hadoop cluster composed of many disk-backed nodes. The benefit of using Hadoop is that reasonable performance can still be maintained by transforming queries that access less frequently touched data into Hadoop tasks (for example, rewriting the query as a series of MapReduce jobs).
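To illustrate what "rewriting the query as MapReduce jobs" means, here is a toy, single-process sketch of the map/shuffle/reduce pattern that a simple GROUP BY aggregation over cold data could be rewritten into. It is a conceptual sketch, not actual Hadoop code, and the sample rows are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows from a rarely queried archive table, as they might be
# read off a Hadoop cluster: (customer_id, order_total).
rows = [("c1", 120.0), ("c2", 75.5), ("c1", 30.0), ("c3", 200.0), ("c2", 10.0)]

# Map phase: emit a (key, value) pair per input row.
mapped = [(customer_id, total) for customer_id, total in rows]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values -- the equivalent of
# SELECT customer_id, SUM(order_total) ... GROUP BY customer_id.
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)  # {'c1': 150.0, 'c2': 85.5, 'c3': 200.0}
```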
Hybrid models
The implications? A good approach to increasing data warehouse performance may involve blending: deploying conventional databases on a data warehouse appliance and pairing them with both in-memory and Hadoop data management components. This hybrid model not only speeds up queries touching the most frequently accessed tables – it also provides a scalable architecture for maintaining high-performance data warehouse execution.
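One way to picture the hybrid model is as a simple router that sends each query to the slowest tier holding any of its tables. The tier names, the table-to-tier mapping, and the route_query helper below are hypothetical, just to show the idea:

```python
# Hypothetical mapping of tables to storage tiers, e.g. derived from the
# access-frequency study above.
TABLE_TIER = {
    "orders": "in_memory",       # hot: most frequently queried
    "customers": "in_memory",
    "shipments": "appliance",    # warm: fast disks on the appliance
    "orders_archive": "hadoop",  # cold: Hadoop storage augmentation
}

# Slowest-wins ordering: a query runs on the slowest tier any of its tables lives on.
TIER_RANK = {"in_memory": 0, "appliance": 1, "hadoop": 2}

def route_query(tables):
    """Return the tier that should execute a query touching the given tables."""
    tiers = [TABLE_TIER.get(t, "hadoop") for t in tables]  # unknown tables default to cold
    return max(tiers, key=TIER_RANK.__getitem__)

print(route_query({"orders", "customers"}))       # in_memory
print(route_query({"orders", "orders_archive"}))  # hadoop
```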