Using big data techniques to increase database performance


Many people perceive big data management technologies as a “cure-all” for their analytics needs. But I would be surprised if any organization that has invested in developing a conventional data warehouse – even on a small scale – would completely rip that data warehouse out and immediately replace it with a NoSQL database or a Hadoop cluster. More likely, early adopters would create a hybrid environment that incorporates a variety of technologies for specialized purposes.

Alternatively, you could adapt different technologies to support and improve existing platform componentry. An interesting scenario is using Hadoop to augment storage for an existing data warehouse. Despite the size of many enterprise-class data warehouses, most of the queries access only a small percentage of the data. Here's a practical example. A team member recently did a study of a number of queries against a data warehouse comprising approximately 150 tables. He reported that over a specific time period, more than 67% of the queries touched only three tables; cumulatively, more than 95% of the queries were satisfied by a set of eleven tables, and 99% by a set of twenty-four tables. This showed that most of the accesses touched a very small percentage of the tables in the entire data warehouse.
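The kind of analysis described above can be sketched with a few lines of Python. This is a minimal, illustrative example – the table names and the tiny query log are invented, and a real study would parse the warehouse's actual query history – but it shows how to measure what share of queries is covered by the top-k most-accessed tables:

```python
from collections import Counter

# Hypothetical query log: each entry is the set of tables one query touched.
query_log = [
    {"orders"}, {"orders"}, {"orders", "customers"},
    {"customers"}, {"orders"}, {"products", "orders"},
    {"inventory"}, {"orders"}, {"customers"}, {"orders"},
]

# Count how many queries touch each table, then rank tables by frequency.
hits = Counter(t for tables in query_log for t in tables)
ranked = [t for t, _ in hits.most_common()]

def coverage(k):
    """Fraction of queries answerable using only the k most-accessed tables."""
    top = set(ranked[:k])
    covered = sum(1 for tables in query_log if tables <= top)
    return covered / len(query_log)

print(coverage(1))  # share of queries confined to the single hottest table
print(coverage(2))  # share confined to the top two tables
```

Running this kind of coverage curve over a real query log is what reveals the skew the study found: a small set of tables covers the vast majority of queries.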

I've seen other analyses with similar results, suggesting that the overall average time for executing queries will drop dramatically if you adopt technology components that help speed access time for tables that are queried the most. One approach to doing this is to move the most frequently accessed data into memory – because accessing data that sits in memory is much faster than accessing data stored on disk. Columnar data layouts are often optimized to align data in ways that enable compression. In turn, this lets you store a lot of data in memory.

In-memory databases

The in-memory database is one big data innovation that can be used to increase performance. But despite the lowered cost of main memory, its cost may still be prohibitive if you're considering moving entire tables into this faster medium. In addition, some parts of a table may be accessed more frequently than others (for example, certain filters applied to the queries refine the set of frequently touched records). And even though a large percentage of the queries touch a small number of tables, there will still be queries that access less frequently viewed parts of the data warehouse. In other words, different parts of the data warehouse are “hotter” than others. Really hot data could be moved into memory.
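The idea of keeping only the hot portions of a table in memory can be sketched as follows. The partition names and access counts are hypothetical; the point is simply to pin in memory the smallest set of partitions that covers a target share (here, 90%) of accesses:

```python
from collections import Counter

# Hypothetical log of which date partition each query's filter touched.
partition_hits = Counter({
    "2024-Q4": 5400,   # recent data: very hot
    "2024-Q3": 1200,
    "2024-Q2": 150,
    "2023":    30,     # older data: rarely touched
})

total = sum(partition_hits.values())

# Pin in memory only the partitions accounting for ~90% of accesses.
in_memory, covered = [], 0
for part, n in partition_hits.most_common():
    if covered / total >= 0.90:
        break
    in_memory.append(part)
    covered += n

print(in_memory)  # the hot partitions worth the cost of main memory
```

In this invented distribution, two of four partitions cover well over 90% of accesses – which is exactly why paying the memory cost for the whole table would be wasteful.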

But what about the “warm, cool or cold” data? Where should that data be stored?


If you're using a dedicated data warehouse appliance, it may be engineered with its own high-speed disks (perhaps even solid-state drives that are much faster than conventional disk drives). That may be the right place for warm data. Cooler data can be moved to a platform for storage augmentation, such as a Hadoop cluster composed of many disk nodes. The benefit of using Hadoop is that performance levels can still be maintained by transforming queries that access less frequently touched data into Hadoop tasks (such as transforming the query into a series of MapReduce jobs).
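A simple routing rule ties the tiers together: a query runs on the slowest tier that any of its tables lives on, so queries over cold data are handed off to Hadoop. This is an illustrative sketch – the tier map and table names are invented, and a real optimizer would use cost models rather than a static lookup:

```python
# Hypothetical tier map produced by an access-frequency analysis.
TIER = {
    "orders": "memory", "customers": "memory",
    "order_items": "appliance",
    "clickstream_2019": "hadoop",
}

def route(tables):
    """Pick the execution tier for a query over the given tables.

    A query must run where its coldest table lives; unknown tables
    default to the Hadoop tier (e.g., executed as MapReduce jobs).
    """
    order = ["memory", "appliance", "hadoop"]
    return max((TIER.get(t, "hadoop") for t in tables), key=order.index)

print(route({"orders", "customers"}))         # all-hot query stays in memory
print(route({"orders", "clickstream_2019"}))  # cold join is offloaded to Hadoop
```

The payoff is that the hot path stays fast: the overwhelming majority of queries never leave the in-memory tier, while the occasional cold query still completes, just on cheaper storage.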

Hybrid models

The implications? A good approach to increasing database performance may involve blending – that is, blending conventional databases deployed on a data warehouse appliance with both in-memory and Hadoop data management components. This hybrid model not only speeds queries touching the most frequently accessed tables – it also provides a scalable architecture to maintain high-performance data warehouse execution.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
