Data preparation: Managing data for analytics


What data do you prepare to analysis?  Where does that data come from in the enterprise?  Hopefully, by answering these questions, we can understand what is required to supply data for an analytics process.

Data preparation is the act of cleansing (or not) the data required to meet the business needs specified in the analytic requirements. Much like any other requirements document, the sources of the data must be identified.  Then, you have to identify the data out of these sources.

A variety of sources can play a role in data preparation tasks. It can include the data warehouse, an operational data store, or some application system that houses historical data. It could include log files from the internet showing navigation, transaction files from the source on-line sales application that could show fraudulent activity or any other data store.

After the data you need has been identified, you have to gather all the data (usually into one place or database) and format it in a way that makes sense to the analytics application. Note that this assumes you have already installed or implemented this application.

It is also a best practice to use a data profiling tool on the sources and analyze the data prior to trying to load the data.  Look especially for strange codes, nulls or blanks. These can identify areas that may need extra work along the way. For a more detailed explanation of data profiling, download The Practitioner's Guide to Data Profiling.

Besides profiling tools, I have seen people use Microsoft Excel, ETL (extract, transform and load), and any other means of data transport. I've downloaded all the data I needed into excel, formatted it and loaded it into an application for analysis. I call this the "poor man's" approach to data management, but it can be effective, if slow.

One thing to consider, if the process to load the analytics application is going to be on-going, consider automating these processes.  It will save time and money, as multiple human resources will not be required.  However, control and audit of these processes will be required on an on-going basis to consider future usage.



About Author

Joyce Norris-Montanari

President of DBTech Solutions, Inc

Joyce Norris-Montanari, CBIP-CDMP, is president of DBTech Solutions, Inc. Joyce advises clients on all aspects of architectural integration, business intelligence and data management. Joyce advises clients about technology, including tools like ETL, profiling, database, quality and metadata. Joyce speaks frequently at data warehouse conferences and is a contributor to several trade publications. She co-authored Data Warehousing and E-Business (Wiley & Sons) with William H. Inmon and others. Joyce has managed and implemented data integrations, data warehouses and operational data stores in industries like education, pharmaceutical, restaurants, telecommunications, government, health care, financial, oil and gas, insurance, research and development and retail. She can be reached at

Leave A Reply

Back to Top