In several recent posts, we have discussed the issues that surround big data, largely looking at the need to access data from numerous sources of varying structure and format. From the perspective of the analytical environment, this not only complicates the population of data warehouses in a timely and consistent manner, but also impacts the ability to ensure that the performance requirements of downstream systems are met.
The barriers to success combine the complexity of extracting and transforming data from numerous sources with the timing and synchronization characteristics of data loading, which can expose inadvertent inconsistencies between the analytical platforms and the original source systems. Data virtualization tools and techniques have matured to address these concerns, providing some key capabilities:
1) Federation: They enable federation of heterogeneous sources by mapping a standard or canonical data model to the access methods of each of the sources comprising the federated model (illustrated in the sketch after this list).
2) Caching: By managing accessed and aggregated data within a virtual (“cached”) environment, data virtualization reduces data latency and improves system performance.
3) Abstraction: Together, federation and virtualization abstract the access methods and couple them with the application of standards for data validation, cleansing, and unification.
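To make the federation and caching ideas concrete, here is a minimal sketch in Python. It is not modeled on any particular virtualization product: the canonical Customer model, the CrmAdapter and BillingAdapter sources, and their field names are all hypothetical, and the cache is a simple in-memory dictionary standing in for a managed virtual environment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class Customer:
    """Canonical customer record; these field names are illustrative only."""
    customer_id: str
    name: str
    email: str


class SourceAdapter(Protocol):
    """Maps one source's native structure onto the canonical model."""
    def fetch_customers(self) -> List[Customer]: ...


class CrmAdapter:
    """Hypothetical CRM source whose rows use its own field names."""
    def __init__(self, rows: List[Dict[str, str]]):
        self._rows = rows

    def fetch_customers(self) -> List[Customer]:
        # Structural mapping plus light cleansing (normalize the email).
        return [Customer(r["cust_no"], r["full_name"], r["email_addr"].lower())
                for r in self._rows]


class BillingAdapter:
    """Hypothetical billing source with a different native layout."""
    def __init__(self, rows: List[Dict[str, str]]):
        self._rows = rows

    def fetch_customers(self) -> List[Customer]:
        return [Customer(r["acct_id"], r["acct_name"].title(), r["contact_email"].lower())
                for r in self._rows]


class VirtualCustomerView:
    """Federates the adapters behind one canonical view and caches the result."""
    def __init__(self, adapters: List[SourceAdapter]):
        self._adapters = adapters
        self._cache: Optional[Dict[str, Customer]] = None

    def customers(self) -> List[Customer]:
        if self._cache is None:                          # cold cache: go to the sources
            merged: Dict[str, Customer] = {}
            for adapter in self._adapters:               # federation: query every source
                for c in adapter.fetch_customers():
                    merged.setdefault(c.customer_id, c)  # unify on the canonical key
            self._cache = merged
        return list(self._cache.values())                # warm cache: no source access

    def invalidate(self) -> None:
        """Drop the cached data, e.g. after a known refresh in a source system."""
        self._cache = None


view = VirtualCustomerView([
    CrmAdapter([{"cust_no": "42", "full_name": "Ada Lovelace",
                 "email_addr": "ADA@EXAMPLE.COM"}]),
    BillingAdapter([{"acct_id": "42", "acct_name": "ada lovelace",
                     "contact_email": "ada@example.com"}]),
])
print(view.customers())  # one canonical Customer, whichever source supplied it
```

Each adapter owns the mapping from its source's native layout to the canonical model, so a consumer of VirtualCustomerView never sees source-specific field names or locations; that is the abstraction described next.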
Through that abstraction, data virtualization simplifies data access for end users and business applications, since they need not be aware of source data locations, integration logic, or the application of business rules. While a straightforward approach to virtualization relies on structural mappings, aligning a standard canonical model with the underlying data sources, virtualization can be expanded to enable a much richer set of access services by incorporating improved semantic resolution when coupled with a master data index. We will look at that in the next set of posts.