Looking for some best practices for data management when you’re doing analytics? Experience has shown me that data management best practices should encompass the areas of governance, quality and storage. I’ll share a few examples.
The other day I was working on a project in a data warehouse environment where the analytics team wanted to add new data to the semantic layer, but not to the core layer. In a case like this, you need to consider several key questions:
- How can I rebuild the semantics if the data is not in the core layer?
- Where does that data come from?
- How can I get it again?
- How can I rebuild history if it’s not in the core layer?
- Is all the data in the semantic layer supposed to have data governance applied to it? And are there corporate data governance policies that include third-party vendor data?
This is not the first time I’ve dealt with a requirement where data is needed in the semantic layer for a specific report or analysis. For example, let’s say a company purchased data from a vendor. You may want to join that data with your existing data, but only in the semantic layer of the data warehouse, and only for a specific report (joined on geographic area, for instance). In that case, why would you need to bring the data from staging into the core layer of the data warehouse at all? Skipping the core layer rests on some assumptions about the data:
- You do not govern third-party data using your policies. This data is for analysis ONLY. (Don't forget to consider what data quality measures will be required on the third-party data.)
- You must keep persistent history on this data in staging or a file structure if you ever want to rebuild the history in the semantic layer.
- The data is required for analysis, and no other group will need this data.
Another discipline in data warehousing says that all data in the semantic layer must be able to be recreated from the core layer of the data warehouse. This discipline requires more integration and relationship work within the core layer, but it may pay off later when you want to rebuild the semantics. The goal should be to roll and re-roll semantics any way business users need them. Storing more data will require more design work and more storage space, because the data would reside in the staging, core and semantic layers of the data warehouse.
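One way to check whether a warehouse follows this discipline is a simple lineage check. The sketch below assumes you can list, for each semantic table, the tables it is built from; the function name `find_unrebuildable` and every table name are hypothetical. It reports semantic tables that depend on anything outside the core layer.

```python
def find_unrebuildable(semantic_sources, core_tables):
    """Map each semantic table to the source tables it uses that are not in core."""
    core = set(core_tables)
    violations = {}
    for table, sources in semantic_sources.items():
        outside_core = sorted(set(sources) - core)
        if outside_core:
            violations[table] = outside_core
    return violations


# Hypothetical lineage: one report joins staging data directly.
semantic_sources = {
    "sem_sales_by_region": ["core_sales", "stg_vendor_demographics"],
    "sem_customer_summary": ["core_customer", "core_sales"],
}
core_tables = ["core_sales", "core_customer"]

violations = find_unrebuildable(semantic_sources, core_tables)
print(violations)
```

An empty result means every semantic table could, in principle, be rebuilt from core alone. A non-empty result is not necessarily wrong; it shows where you are relying on staging or files to rebuild history.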
Best practices for analytics reside within the corporate data governance policy and should be based on the requirements of the business community. There will come a time when you must address a requirement like some of those listed above. When this happens, your objective is to be as flexible as possible in meeting business needs quickly, without jeopardizing corporate data governance policies.
Some companies have committees or councils that analyze the data governance requirements of any new data that’s brought into the enterprise. Not a bad idea – as long as it doesn’t create a bottleneck for meeting business needs quickly.
What you don’t want is for the business to bring data into the enterprise without first knowing how to manage, store and use it. It’s as simple as that. Consider today’s world, where data streams constantly. You’ll need to be very flexible about storing and using this data to meet new business requirements. Consider including governance and policies on streaming data now – and be willing to enhance existing policies over time, to meet ever-changing corporate needs.