What is Data Preparation and why does it matter?
Data preparation covers all the tasks involved in collecting, processing, and cleansing data for use in analytics and business intelligence. It therefore includes accessing, loading, and structuring data, purging bad records, unifying (joining) sources, adjusting data types, checking fields to ensure that only valid values are present, and checking for duplicates and inconsistent data (for example, two birth dates recorded for one person).
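The tasks above can be sketched in a few lines of pandas. This is a minimal illustration with invented customer records; the column names and values are assumptions, not a prescribed schema.

```python
import pandas as pd

# Hypothetical customer records, merged from two sources (illustrative data only)
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "birth_date": ["1980-05-01", "1980-05-01", "1975-11-30",
                   "1990-02-14", "1991-02-14"],  # customer 3 has two birth dates
    "age": ["44", "44", "49", "34", "33"],
})

# Adjust data types: parse date strings, cast numeric strings
df["birth_date"] = pd.to_datetime(df["birth_date"])
df["age"] = pd.to_numeric(df["age"])

# Check fields for valid values (e.g. age must be plausible)
assert df["age"].between(0, 120).all()

# Remove exact duplicate rows
df = df.drop_duplicates()

# Flag inconsistent data: one person recorded with more than one birth date
conflicts = df.groupby("customer_id")["birth_date"].nunique()
inconsistent_ids = conflicts[conflicts > 1].index.tolist()
print(inconsistent_ids)  # customer 3 has conflicting birth dates
```

In practice each of these steps would be driven by business rules (which fields are mandatory, which ranges are valid), but the shape of the work is the same.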
As the amount of data and the number of sources increase, good data preparation is becoming both more costly and more complex. It is this shift that is reshaping the market: data preparation has, in effect, become self-service data management. Traditional data management processes can take data only so far; the dynamic, fine-grained, last-minute work is now done in a self-service way using data preparation tools.
What is clear is that shaping the data and getting it right for analytics is becoming more and more important. Many more companies are now data-driven: businesses make decisions based on data, and it is vital to be able to access data quickly and prepare it for analysis. In big data environments such as Hadoop, it is often impractical to move the data elsewhere for preparation. Instead, it is important to process big data in place and combine the resulting data with other sources as part of preparing data for analytics.
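The "process in place, then combine" idea can be sketched with an in-memory SQLite database standing in for a large remote store such as a Hadoop/Hive table; the table name and figures are invented for illustration. The aggregation is pushed to the data, and only the small summary is pulled back and joined with a local source.

```python
import sqlite3
import pandas as pd

# A SQLite database stands in for a large remote store; the principle is
# the same: push the heavy aggregation to where the data lives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (region TEXT, visits INTEGER)")
conn.executemany("INSERT INTO web_logs VALUES (?, ?)",
                 [("EU", 120), ("EU", 80), ("US", 300)])

# Process in place: only the small aggregate crosses the wire
summary = pd.read_sql(
    "SELECT region, SUM(visits) AS visits FROM web_logs GROUP BY region",
    conn)

# Combine the result with another, local source
targets = pd.DataFrame({"region": ["EU", "US"], "target": [150, 250]})
combined = summary.merge(targets, on="region")
```

With a real big data platform the query would go to the cluster's SQL engine instead of SQLite, but the data that moves is still just the aggregate.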
Data preparation is therefore an essential part of any analytics project. Getting the right data, and preparing it right, means that it is possible to get good answers to analytical questions. Use poor quality data, or data that have been badly prepared, and the results of your analysis are unlikely to be reliable.
Understanding data preparation in the analytics lifecycle
There are two main phases in the analytics lifecycle: discovery and deployment. The discovery phase is driven by asking business questions that can lead to innovation. The first step is therefore to define what the business needs to know. That business question must then be translated into a representation of the problem that can be solved using predictive analytics.
And, of course, to use predictive analytics requires suitable data, prepared appropriately. Technologies like Hadoop and faster, cheaper computers have made it possible to store and use more data, and more types of data, than ever before. This, however, has only amplified the need to join data in different formats from different sources and transform raw data so that it can be used as input for predictive modeling. With new data types from connected devices, such as machine sensor data or web logs from online interactions, the data preparation stage has become even more challenging. Many organisations still report that they spend an inordinate amount of time, sometimes up to 80 percent, dealing with data preparation tasks.
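As a small sketch of what transforming raw data of this kind involves, the snippet below parses semi-structured web-log lines into a table and joins them with a structured customer source. The log format, field names, and customer table are all invented for illustration.

```python
import re
import pandas as pd

# Hypothetical raw web-log lines, as they might arrive from online interactions
raw_logs = [
    "2024-03-01T09:15:02 user=42 action=view page=/home",
    "2024-03-01T09:15:40 user=42 action=click page=/pricing",
    "2024-03-01T09:16:05 user=7 action=view page=/home",
]

pattern = re.compile(
    r"(?P<ts>\S+) user=(?P<user>\d+) action=(?P<action>\w+) page=(?P<page>\S+)")

# Transform raw text into structured records
records = [pattern.match(line).groupdict() for line in raw_logs]
logs = pd.DataFrame(records)
logs["ts"] = pd.to_datetime(logs["ts"])
logs["user"] = logs["user"].astype(int)

# Join the semi-structured logs with a structured source
customers = pd.DataFrame({"user": [42, 7], "segment": ["trial", "paid"]})
model_input = logs.merge(customers, on="user")
```

Real log and sensor feeds are messier (malformed lines, missing fields, clock skew), which is precisely why this stage consumes so much of an analytics team's time.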
Data preparation is an ongoing process
Exploring the data involves using interactive, self-service visualisation tools. These need to serve a wide range of users, from the business analyst with no statistical knowledge to the analytically savvy data scientist. They must enable these users to search for relationships, trends and patterns to gain deeper understanding of the data. This step therefore refines the question and the approach formed in the initial “ask” phase of the project, and develops and tests ideas about how to address the business problem. It may, however, be necessary to add, delete or combine variables to create more focused models, and this, of course, involves data preparation again.
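Adding, deleting, and combining variables can be as simple as the sketch below; the column names and the decision to drop a field are assumptions made for illustration, not a recommended recipe.

```python
import pandas as pd

# Illustrative modelling table (column names and values are invented)
df = pd.DataFrame({
    "income": [30000, 45000, 60000],
    "debt": [15000, 9000, 12000],
    "postcode": ["AB1", "CD2", "EF3"],
})

# Combine two variables into one that may be more predictive
df["debt_to_income"] = df["debt"] / df["income"]

# Delete a variable judged (for this hypothetical model) to add noise, not signal
df = df.drop(columns=["postcode"])
```

Each such change feeds back into the exploration step: a derived variable is worth keeping only if it genuinely sharpens the model's answer to the business question.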
In the modelling stage, analytical and machine-learning algorithms are used to find the best way to represent the relationships in the data and answer the business question. Analytical tools search for a combination of data and modelling techniques that reliably predicts a desired outcome. There is no single algorithm that always performs best: the "best" algorithm for solving the business problem depends on the data. Experimentation is key to finding the most reliable answer, and automated model building can help minimise the time to results and boost the productivity of analytical teams. Again, more data may be added.
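The experimentation loop can be illustrated with a deliberately tiny example: fit several candidate rules on training data, score them on held-out data, and keep the best. The two "algorithms" and the data here are toy inventions; real work would use a proper modelling library and cross-validation.

```python
# Toy illustration of algorithm experimentation (invented data and rules)
data = [  # (feature, label) pairs
    (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (2, 1),
]
train, test = data[:6], data[6:]

def majority_rule(train):
    # Always predict the most common training label
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def threshold_rule(train):
    # Predict 1 when the feature exceeds the midpoint of the training range
    xs = [x for x, _ in train]
    mid = (min(xs) + max(xs)) / 2
    return lambda x: 1 if x > mid else 0

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

candidates = {"majority": majority_rule, "threshold": threshold_rule}
scores = {name: accuracy(build(train), test)
          for name, build in candidates.items()}
best = max(scores, key=scores.get)
```

Automated model building tools run essentially this loop at scale, across many algorithms and parameter settings, which is why they shorten time to results.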
Of course, once you have built your models, you then need to implement or deploy them. But even then, data preparation does not stop. A model is only as good as the data it uses, so models (and data) must be kept as up to date as possible. Data preparation and management is very much an ongoing process.