This post is the second in a series on data preparation based on a webinar about its role in the analytical life cycle. The first discussed how data preparation fits into the analytical life cycle. This post considers some trends in data preparation and some of the structures and processes that have evolved as a result. Two main issues have driven current data preparation patterns: customer demand and data quality.
Customer demand
The current state of data preparation is largely driven by expanding quantities of data and data sources. The advent of big data, along with a wider range of data formats and new data sources such as social media and machine sensor data, has meant that it is harder to store and use data. At the same time, organisations recognise that it is becoming more and more essential to use data effectively to support decision making.
Users want more and more data, and they want to be able to include their own and external data in their analyses. Self-service is popular because it is more flexible, more independent, cheaper, faster and easier to control. It also creates less work for other departments.
Gartner has commented: “The self-service data preparation software market is expected to reach $1 billion by 2019, with a 16.6% annual growth. Adoption is currently 5% of potential target users and is expected to grow to more than 10% by 2020. Vendors must understand the market opportunities when planning their business strategy.” However, this expansion in self-service creates a headache for data scientists. Self-service requires high-quality data preparation, and unfortunately, that is time-consuming and there are few shortcuts.
The transition between data preparation and analytics is critical. It needs strong analytics and visualisation, but it also needs strong data management so that the data can quickly reveal the required information. Fast markets need agile companies!
A new role has therefore emerged in many companies: the data engineer, who is responsible for data preparation or software engineering. Data engineers do the work before the data is handed over to data scientists for analytical modelling.
The importance of data quality
The emergence of the data engineer role recognises that data quality is essential. Data management, in other words, is not just about collecting and formatting data, but also about ensuring that its quality is suitable. Data quality is therefore becoming an essential theme in the data preparation world.
SAS has been ahead of the curve in this area for some time. We recognised quite a long time ago that data preparation is more than just reading data: it also needs to address data quality. Analytical models need to be fed with high-value data. If the data is not clean and high quality, then the output will be correspondingly bad, because this is very much a case of "garbage in, garbage out."
Inspection
In SAS® Data Preparation, for example, you can see data that has already been imported and is available for use. You can view a sample of the data to get a feel for it, and at first glance everything may look fine. However, you can also look in a little more detail at the data profile. This may show that data has been stored in a variety of ways – some full-form and some with abbreviations, for example. This can create serious problems for analytical models and needs to be resolved before the different sources of data are brought together.
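To make the idea of a data profile concrete, here is a minimal sketch in plain Python. It is not how SAS® Data Preparation works internally; the sample records, the `state` column and the `profile_column` helper are all hypothetical and purely for illustration. It counts missing values and shows how often each spelling of a value occurs, which is the kind of view that exposes a mix of full-form and abbreviated entries.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"customer": "Acme", "state": "North Carolina"},
    {"customer": "Bolt", "state": "NC"},
    {"customer": "Crux", "state": "N.C."},
    {"customer": "Dyna", "state": "California"},
    {"customer": "Echo", "state": "CA"},
    {"customer": "Fold", "state": ""},
]


def profile_column(rows, column):
    """Return simple profile statistics for one column of a list of dicts."""
    values = [row.get(column) for row in rows]
    # Treat None and blank strings as missing.
    present = [str(v).strip() for v in values if v is not None and str(v).strip()]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "frequencies": Counter(present),
    }


profile = profile_column(records, "state")
for value, n in profile["frequencies"].most_common():
    print(f"{value!r}: {n}")
print("missing:", profile["missing"], "distinct:", profile["distinct"])
```

A sample of these six rows looks fine at first glance, but the profile reports five distinct spellings for what are really two states, plus one missing value.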
Fundamentally, therefore, the data now needs to be corrected and standardised. There are several possibilities available for this. For example, for time series analysis, we can filter data and remove any items with missing data. Where there are inconsistencies in how the data is written, these need to be corrected, and the data cleansed to remove anomalies and duplications. All this is a key part of data preparation, and is becoming both more widely recognised and more essential.
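The three steps described here (standardising inconsistent values, filtering out items with missing data, and removing duplicates) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the method used by any particular tool: the `STATE_ALIASES` lookup, the field names and the sample rows are assumptions made up for the example.

```python
# Hypothetical lookup from variant spellings to one canonical form.
STATE_ALIASES = {
    "nc": "North Carolina",
    "n.c.": "North Carolina",
    "north carolina": "North Carolina",
    "ca": "California",
    "california": "California",
}


def standardise_state(value):
    """Map a raw value to its canonical form; None if missing or unrecognised."""
    if value is None:
        return None
    return STATE_ALIASES.get(value.strip().lower())


def clean(rows):
    """Standardise the state field, drop incomplete rows and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        customer = (row.get("customer") or "").strip()
        state = standardise_state(row.get("state"))
        if not customer or state is None:
            continue  # filter out rows with missing or unrecognised data
        key = (customer.lower(), state)
        if key in seen:
            continue  # same customer and state already kept: a duplicate
        seen.add(key)
        cleaned.append({"customer": customer, "state": state})
    return cleaned


raw = [
    {"customer": "Acme", "state": "North Carolina"},
    {"customer": "acme", "state": "NC"},    # duplicate once standardised
    {"customer": "Bolt", "state": "N.C."},
    {"customer": "Crux", "state": ""},      # missing state, filtered out
    {"customer": "Dyna", "state": "CA"},
]
cleaned_rows = clean(raw)
for row in cleaned_rows:
    print(row)
```

Dropping unrecognised values is a deliberate simplification. In practice you might prefer to route them to a review queue rather than discard them silently.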
Moving towards the future
These two areas have very much driven recent developments in the data preparation and data management world, and in the tools that are available. Self-service tools are increasingly widespread, and are coupled with elements that ensure data quality, giving the best of all worlds.
In my next blog post I will look at new regulations and governance requirements for data management.