How do you balance the costs and benefits of copying data? It seems like a simple (or perhaps simplistic) question, but the answer can actually provide perspective on the performance measures that influence a system design. For example, the objective of creating a data mart to support many individual queries each day as part of a workflow process requiring up-to-date data is very different from that of creating a data mart for generating reports on the previous day’s transactions.
In the first case, one performance measure might be the average query response time amortized over the total number of queries each day. A second measure of usability is data currency: how up-to-date the delivered results are. In this case, your architecture would optimize for those two variables – acceptable response time and an appropriate degree of data currency.
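To make these two measures concrete, here is a minimal sketch of how they might be computed from a day’s query log. It is purely illustrative; the record fields and function names (QueryRecord, avg_response_time, worst_staleness) are assumptions, not references to any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueryRecord:
    response_time: timedelta   # elapsed time to answer the query
    delivered_as_of: datetime  # source timestamp of the data returned
    answered_at: datetime      # when the result was delivered


def avg_response_time(log: list[QueryRecord]) -> timedelta:
    """Average response time amortized over all queries in the day's log."""
    return sum((r.response_time for r in log), timedelta()) / len(log)


def worst_staleness(log: list[QueryRecord]) -> timedelta:
    """Data currency measure: the largest gap between the data's source
    timestamp and the moment that data was delivered to the user."""
    return max(r.answered_at - r.delivered_as_of for r in log)
```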
In the second case, one performance measure might be the time it takes to generate and deliver the report. The currency of the data is somewhat secondary: since the reports aggregate data from the previous day, the business requirements can be met as long as the data is current as of the end of the previous day’s transactions.
But what happens if you need to support both of these business usage scenarios? A system that optimizes for data currency would benefit from not copying the data, since a copy creates pressure to update it continuously every time the source changes. On the other hand, optimizing for report generation would benefit from a copy, especially when the need for synchronization is low, since you don’t want to query multiple source systems repeatedly to collect the data needed for the report.
This is a common occurrence, and it is ably addressed by techniques such as data federation, which essentially create a hybrid approach. Data federation and virtualization abstract the methods of data access by layering a logical model on top of the sources, caching some of the data, and monitoring source systems to stream updates to the cached versions when necessary. These methods combine knowledge of user needs with analysis of usage patterns to hide the underlying access details from consumers. Ultimately, much of the data can be left in its original sources in a way that balances the combined performance expectations for currency, synchronization, and data access speed.
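As a rough sketch of that caching pattern, the fragment below layers a single logical access point over several sources and refreshes a cached entry only when its source reports a newer version. It assumes each source can report a last-modified timestamp per entity; all of the class and method names are illustrative, not drawn from any specific federation product.

```python
from datetime import datetime
from typing import Protocol


class Source(Protocol):
    """A physical source sitting behind the logical model (hypothetical interface)."""
    def last_modified(self, entity: str) -> datetime: ...
    def fetch(self, entity: str) -> dict: ...


class FederatedLayer:
    """Logical access layer: caches results per entity and refreshes a
    cached entry only when its owning source has newer data."""

    def __init__(self, sources: dict[str, Source]):
        self.sources = sources  # maps each entity to its owning source
        self.cache: dict[str, tuple[datetime, dict]] = {}

    def query(self, entity: str) -> dict:
        source = self.sources[entity]
        current = source.last_modified(entity)
        cached = self.cache.get(entity)
        if cached and cached[0] >= current:
            return cached[1]            # cached copy is still current: no source hit
        data = source.fetch(entity)     # miss or stale: go back to the source
        self.cache[entity] = (current, data)
        return data
```

In a production setting, the “is the cache current?” check would typically be driven by change-data-capture or streamed update notifications rather than a poll on every query, but the decision logic – serve the copy when it is fresh enough, reach back to the source when it is not – is the same trade-off described above.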