We’ve been talking about data recently at the Analytic Hospitality Executive. I’ve advocated to use whatever data you have, big or small, to get started today on analytic initiatives that will help you avoid big data paralysis. In this blog, I’m going to get a bit more technical than usual because I have recently been learning about some innovations in data management that I believe will dramatically change the game for analytic hospitality executives. I think it’s important that business users have a high level understanding of these issues so you can help your IT departments to put the right data management infrastructure in place.
Regardless of the size of the data, one of the biggest challenges in hospitality has always been that disparate systems collect and manage all of the wide variety of data that we need to gain insights about our business. These systems speak different languages and collect data in different ways and in different formats. In order to effectively analyze data from disparate systems, the data needs to be integrated (meaning combined to form one, unified, view). This involves extracting data from source systems, transforming that data (transposing columns, standardizing values), and loading it into a data storage area. This process is known as ETL (extraction, transformation, loading). It involves detailed knowledge of where all the data is, an extensive amount of coding, and needs to be changed every time an upgrade to a system is made or a system is added or replaced.
Many companies invest in a data warehouse to integrate and store data from disparate operational systems. The benefits of data warehouses are:
- All of your data in one place – the data warehouse integrates data from the disparate systems into one location according to a pre-defined schema.
- Speed - data warehouses are very good at quickly extracting, transforming and loading (ETL) data from the transactional system and can quickly render reports on historical data
- Reduces the processing burden on operating systems – instead of hitting the transaction system directly when you need data, you make the request from the EDW. The data is pulled from the transaction system at some scheduled interval, so it can focus its energy on executing transactions instead of delivering data.
However, data warehouses also have their drawbacks.
- Relatively inflexible:
- They have a fixed data schema, so any new data or changes to data collection in source systems needs to be recoded.
- They are optimized for reporting but not necessarily for analytics. Analytics typically require wide tables (a lot of different information about one entity for predictive purposes). Reporting requires long tables (many instances of total sales by period). Analytical resources need to write code to restructure data in formats that are appropriate for analytics and probably store the results somewhere as well.
- Batch processing: the ETL processes for a data warehouse typically operate in batch (all data transferred at once with less frequency, say once a day or once an hour). This means that data in the data warehouse is only updated periodically.
- Processing intensive: The ETL processes can also be very processing intensive. Large amounts of data are moved around, and transformations can be extensive depending on how diverse data formats are and how “dirty” the data is.
This inflexibility means that adding data or creating new views, tables or analyses requires a lot of coding, which breaks every time something new is added to the system (and we never add new technology or new data to the hospitality infrastructure, right?). This is time and resource intensive. Processing takes time, slowing down access, increasing time to results and consuming computing resources that could be used for analytics or reporting.
Enter data federation. Data federation is a data management mechanism that treats autonomous data stores as one large data store. The rules for how the data relate to each other are kept in the federation layer, and data integration is done “on the fly”. This means that data is stored in its original format in the individual systems and then only integrated when the user wants to access it. It can also mean that the data is available in “real-time” – whatever the source system is holding currently is available, rather than waiting for the batch to run.
The benefit of data federation is that with reduced movement of data there are fewer chances for data errors. There is a significant reduction in the workload associated with moving data around, especially if some of it is not ever going to be used. This frees up computing resources. Data federation also increases the speed of access to the data for the users, as data is available closer to “real time”.
Typically, data virtualization goes hand in hand with data federation, so you might have heard this term as well. Data virtualization is defined as any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data like how it is formatted or where it is physically located. Virtualization facilitates data access, because the user doesn’t need to know where the data is stored, or what format it is in to access and use it. The virtualization layer takes care of that. It can also provide some data cleansing, data profiling and data modeling capabilities. (Note that you can have federation without virtualization, or virtualization without federation, but they most typically operate together for maximum benefit. You really don’t want me to get into that, although some of it is quite logical).
The biggest benefit of data virtualization is provides much easier data access for business users. The location and characteristics of the data are transparent to the business user who wants to access the data for reporting, exploration or analytics. They don’t have to understand technology architecture or write complex code. The second benefit is a dramatic reduction of coding burden on IT. IT does not have to write special code every time the user has a unique need, and for some technical reasons that are not important to us, the ETL coding burden is lesser as well.
There are a few things to consider with both data federation and data virtualization.
- Impact on transactional systems: Data federation applications can still burden transactional systems with too many requests for data, so you may still need a data warehouse to store data from certain transactional systems.
- Data Governance: A unified approach to data management like this will require different, and stricter, data governance rules. IT will need help from the business to understand who uses the information and how, so you need to be prepared to establish strong data governance (which is a good idea anyway)
- Historical information: With a data federation method, you can only access the data that is in the source systems at the moment you ask for it. This means that if the source systems aren’t keeping historical data or if they write over history, you need to store that information elsewhere (like in a data warehouse).
We may never get away from the need for EDWs (enterprise data warehouses), but we may be able to get away with smaller versions in an environment that still facilitates access to data by business users. Implementing data management technology like I describe above will require investment and business process change, but it should dramatically streamline the data management process, helping business users get to their data when they need to.
The goal of this blog was to help you get a high level understanding of data management options. My hope is that information like this will help you to have more informed conversations with IT as you are planning your data and analytics strategy. This e-book describes data virtualization in a bit more detail, but also in business language.
Thanks for the article Kelly. A good high level overview. Can you please provide a link to the e-book that's mentioned?
Thanks Michelle, I think that our link was lost along the way. I have updated Kelly's post. Here it is for your reference: Data Virtualization: Federated data checks in E-book