Preparing data for analytical purposes involves many different forms of data, each of which must be handled in a particular way. This preparation can differ greatly from standard integration jobs, which typically include the following:
- Profiling for structural analysis. This reads files or data dictionaries and creates metadata that shows table structure, data element types, how many nulls each element contains, the frequency of like codes, and so on.
- Using a tool to understand relationships across databases and files. These mapper-like products do a good job of analyzing relationships and can visualize them in a way that aids conversations with business personnel.
- Integrating data on a common key (if possible). For example, an account number may be common across all of the enterprise asset systems and data stores. If so, any use of this data requires an understanding of these alternate keys. Indexing on these alternate keys should, in turn, speed up reports and queries.
- Considering the enterprise perspective of data using guidelines and principles. This entails decoding and normalizing data to be reported from an enterprise perspective, usually from enterprise data assets like a data warehouse, operational data store or operational data hub. This can include financial reporting, upper management reporting and corporate reporting that integrates multiple source systems. Data elements are renamed and transformed into other data types based on data governance guidelines, and data is integrated based on business requirements.
- Doing financial balancing – because sometimes the enterprise requires standardized data that includes financial information. If that's the case, the financial data may require balancing back to the source of the data in a way that allows reporting and auditing.
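The profiling step described above can be sketched with a small example. This is a minimal illustration using only the Python standard library; the column names and sample rows are hypothetical, and a real profiling job would read from actual files or data dictionaries.

```python
import csv
import io
from collections import Counter

# Hypothetical extract; a real job would read from a file or data dictionary.
raw = io.StringIO(
    "account_id,status,balance\n"
    "A100,active,250.00\n"
    "A101,,75.50\n"
    "A102,active,\n"
)

rows = list(csv.DictReader(raw))
profile = {}
for column in rows[0].keys():
    values = [r[column] for r in rows]
    profile[column] = {
        "nulls": sum(1 for v in values if not v),       # empty or missing cells
        "distinct": len(set(v for v in values if v)),   # count of distinct codes
        "top": Counter(v for v in values if v).most_common(1),
    }

print(profile["status"])
```

The resulting metadata (null counts, distinct codes, most frequent values per element) is the kind of structural profile the bullet describes.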
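Integrating data on a common key, as described above, can be sketched as a simple join. The two system extracts below are invented for illustration; the only assumption is that both share an account-number key.

```python
# Hypothetical records from two asset systems sharing an account-number key.
billing = {
    "A100": {"balance": 250.00},
    "A101": {"balance": 75.50},
}
crm = {
    "A100": {"owner": "Pat"},
    "A102": {"owner": "Lee"},   # no matching billing record
}

# Join on the shared key; keep every account seen in either system.
all_keys = sorted(set(billing) | set(crm))
merged = {
    key: {**billing.get(key, {}), **crm.get(key, {})}
    for key in all_keys
}

print(merged["A100"])  # {'balance': 250.0, 'owner': 'Pat'}
```

In practice this join runs in the database, where indexing the alternate keys speeds up the lookup, but the logic is the same.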
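The financial balancing step above amounts to reconciling loaded totals against a source control total. This is a minimal sketch with made-up figures; a real balancing job would pull both totals from the source system and the warehouse.

```python
# Hypothetical totals: detail rows loaded into the warehouse vs. the
# control total reported by the source system.
source_control_total = 12500.00
warehouse_rows = [4000.00, 6000.00, 2500.00]

loaded_total = sum(warehouse_rows)
difference = round(source_control_total - loaded_total, 2)

# A balancing job would log this result to support reporting and auditing.
balanced = difference == 0.0
print(balanced)
```

When the difference is nonzero, the discrepancy is what gets flagged back to the source for audit.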
Requirements for analytical data preparation
Analytical data preparation comes in many different forms. It may require any of the following:
- Profiling for structural analysis. As explained above, structural analysis and cataloging of the data sources is extremely important for some analytics, especially on big data platforms. The data is cataloged, analyzed and used by big data analytics users to bring all the raw data together for analysis.
- Raw data – that is, data from the sources. This is data in its purest format, prior to normalization, governance or an enterprise perspective.
- Enterprise standard data. The standardized data used by integration jobs may also be required for analytics; periodically we want to see the raw data alongside the conformed data for analysis. Little preparation is required for this data other than integration with the raw data.
- Different formats of data. Raw data used for analytics can come in the form of files (such as XML or .csv) or connections to specific databases (Oracle, DB2, Teradata, SQL Server, Access, MongoDB, Hadoop, etc.).
- Near-real-time data, or streaming data. Some of the analytics we do today are near-real-time, with data streamed from the sources into the analytics database. The analytics do not stagnate; they are refreshed with each iteration, based on the streaming frequency.
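The "different formats" point above can be sketched by pulling raw data from a flat file and a database connection into one structure. This example uses only the Python standard library, with SQLite standing in for the enterprise databases named above; all table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Raw data can arrive as flat files such as .csv...
csv_source = io.StringIO("account_id,amount\nA100,250.00\nA101,75.50\n")
file_rows = list(csv.DictReader(csv_source))

# ...or via a database connection (an in-memory SQLite store stands in
# here for Oracle, DB2, Teradata, SQL Server, etc.).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A102", 10.0)])
db_rows = [
    {"account_id": account_id, "amount": amount}
    for account_id, amount in conn.execute(
        "SELECT account_id, amount FROM accounts"
    )
]
conn.close()

# Combine both feeds into one raw-data structure for analysis.
raw_data = file_rows + db_rows
print(len(raw_data))  # 3
```

Note that the file feed arrives as strings while the database feed is typed; reconciling such differences is exactly the preparation work the section describes.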
I see a need for standardized data for analytics as well as a need for raw data analytics. The two together can be very powerful for an organization – but it's important to know when and how each is used. In considering this, you may want to ask the following:
- Are reports for external partners written against raw data analytics?
- What is the security of the raw data platform?
- How is this audited or managed?
- Which platform do I use if I want standardized or structured data as well as unstructured and raw data?
- Where do I put the results of my analysis?
The important thing is to ensure that analytic users have access to both types of data, and that integration or set-up is as easy as possible.