The traditional data warehouse and Hadoop

Data warehouse (DWH) environments have typically been the standard for supporting analytics. Many systems may support a particular modeling or analytical group, and because these groups have varying data requirements, the replicated data they rely on is maintained – the transition to new storage and computing environments doesn’t happen overnight.

A relational DWH is designed with data models and schemas that can address a business’s particular requirements, keeping in mind the data needs of the intended consumers. Database views are created and optimized for performance based on anticipated queries.

If requirements change enough, the data model must be re-architected and the data reloaded, causing system downtime. It’s a time-consuming, costly process. As data grows in volume and variety, the cost of the DWH increases – or storing it all in the warehouse becomes impractical.

"We have a lot more data processing workloads today than ever before. We have newer workloads around real-time, analytics, and unstructured data that the data warehouse was not designed for, but that's okay because you can have secondary platforms within the extended data warehouse environment that are well suited to those workloads," said Philip Russom, research director for data management with TDWI Research, in an interview in BI This Week.

The bottom line is that DWH data models must evolve with ever-changing business requirements, including the need to address mobile computing, social media interactions, and sensor or machine data.

Hadoop can serve as an extension to a data warehouse. It’s naturally suited to store any type of data – and lots of it. It can also handle operations-related problems. Organizations in the financial, retail, telecom, entertainment and oil and gas industries need a way to store all their data (high volume with variable structure) that’s generated from different devices, sensors and mediums. Hadoop fits the bill because it’s flexible enough to store big data in an extended DWH environment, and it can do so at a very low cost.
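To make this concrete, here is a minimal sketch of what extending the warehouse with Hadoop can look like from the SAS side, assuming SAS/ACCESS Interface to Hadoop is licensed and a Hive server is reachable. The host, credentials, paths and table schema shown are purely illustrative.

/* Assign a library to Hive tables in the extended DWH environment
   (server, port, schema and credentials are hypothetical) */
libname hdp hadoop server="hivenode01" port=10000 schema=sales
        user="sasdemo" password="XXXXXXXX";

/* Land a raw, variably structured extract directly in HDFS */
proc hadoop username="sasdemo" password="XXXXXXXX" verbose;
   hdfs mkdir="/user/sasdemo/sensor_landing";
   hdfs copyfromlocal="/local/extracts/sensor_feed.csv"
        out="/user/sasdemo/sensor_landing/sensor_feed.csv";
run;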

Today, with vast volumes of data, businesses may need all of their data ready and available for analytics. Traditionally, though, data scientists query data from the DWH and develop models to perform complex analyses using many of the SAS analytical procedures. Data subsets are typically used for model development and testing, and the resulting models are then deployed in batch mode, in-database or in near-real time.
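In simplified form, that traditional workflow might look something like the sketch below; the dwh libref, table and variable names are made up, and PROC LOGISTIC stands in for whichever SAS analytical procedure the modeler actually uses.

/* dwh is assumed to be a libref to the relational warehouse (for example,
   via a SAS/ACCESS engine); table and variable names are hypothetical */
proc sql;
   create table work.model_subset as
   select customer_id, tenure, balance, churn_flag
   from dwh.customer_fact
   where snapshot_date = '31DEC2014'd;   /* a subset, not all the data */
quit;

/* Develop and test the model on the subset */
proc logistic data=work.model_subset;
   model churn_flag(event='1') = tenure balance;
run;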

Unfortunately for the analyst (and the business), sometimes the data needed for modeling should have been part of the upfront requirements when the DWH was built, but wasn’t (or, in some cases, it changed or grew in size). Ad hoc query requests can help, but they’re time-consuming and expensive.

Using Hadoop data with current in-memory SAS technology, analysts and data scientists can conduct exploratory data analysis on all of the data at once rather than only a subset, find relationships across a variety of data sources, build out analytical base tables, develop models, and deploy those models in batch, in-database, in near real time or in-memory. That gives businesses a competitive edge while allocating the costs of data storage and model development efficiently.
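As a rough sketch of that approach, and assuming SAS High-Performance Analytics is configured to run alongside the Hadoop cluster, high-performance procedures can work on the Hadoop-resident tables directly rather than on a pre-extracted subset. The libref, tables and variables below are again hypothetical.

/* Explore all of the data at once rather than a sample
   (hdp is the Hive libref from the earlier sketch) */
proc hpsummary data=hdp.sensor_readings;
   var temperature pressure vibration;
   output out=work.sensor_profile;
run;

/* Develop the model in memory, distributed across the cluster */
proc hplogistic data=hdp.customer_events;
   class region;
   model churn_flag(event='1') = tenure balance region;
   performance nodes=all details;
run;

Because the work is distributed across the nodes of the cluster, both the exploratory pass and the model fit can use all of the data rather than a sample, and the resulting model can then be scored in batch, in-database or in memory as described above.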

About Author

Charlotte Crain

Solution Architect, Americas Technology Practice

Charlotte has worked with many SAS customers in the government, financial, retail and education sectors as well as with system integrators and business partners. Her areas of expertise include information management, data quality and integration, data management methodology/architecture, data governance, SAS architecture, business analytics, and SAS programming. She also has experience with energy demand statistical modeling, time series analysis and forecasting, credit risk modeling and applications development in the areas of web applications/interfaces and automation. She holds an M.S. in mathematics with an emphasis in numerical analysis, linear and non-linear statistical data modeling.
