Companies often see data preparation as a difficult and risky job, one best left to IT. Business departments, however, rarely want to wait for their data, so thick SQL books and spreadsheet applications are proliferating in most offices. However you look at it, this does not really make sense.
There is little clean data out there …
Few, if any, projects or organisations start with perfect data. This is true of almost any data held by organisations and companies, and it is particularly true if you have to scrape data from websites. For effective self-service analytics, you also need to be able to do at least some sensible self-service data preparation.
There are other problems. For one thing, the steps needed to process the data recur again and again, yet each of us reinvents the wheel every time. Why do we do this? Is it fun to remove all spaces from a text field for the tenth time? To correct capitalization? To search for duplicates and remove them manually? I don’t think so. Mostly, it is because we don’t know any other way to do it. There must be an easier way, and that is self-service data preparation.
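To make the point concrete, the three chores mentioned above (surplus spaces, capitalization, duplicates) can be captured once and reused. The following is only a minimal sketch in plain Python; the function name `clean_names` and the sample values are hypothetical, and a real self-service tool would offer the same steps through its interface rather than through code.

```python
def clean_names(values):
    """Tidy a list of text values: fix spacing, fix capitalization, drop duplicates."""
    seen = set()
    cleaned = []
    for value in values:
        # Strip leading/trailing spaces and collapse repeated spaces into one.
        text = " ".join(value.split())
        # Normalise capitalization, e.g. "BEN MEYER" becomes "Ben Meyer".
        # Note: str.title() is a simplification and mishandles some names.
        text = text.title()
        # Keep only the first occurrence of each cleaned value.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = ["  anna   schmidt ", "Anna Schmidt", "BEN  MEYER"]
    print(clean_names(raw))
```

Once a routine like this exists in one place, nobody has to repeat the same clean-up by hand for each new file.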
What does this mean at the solution level?
We have gone too far to go back to treating data preparation as purely an IT task. That is not to say, however, that everyone should rush out to find proprietary or open-source data preparation software for themselves. Some of this, at least, should be an organisational responsibility. Specialist departments and businesses need to find a way to reduce duplication of effort while still allowing users to clean and prepare their own data.
Formal self-service data preparation puts business users in the position of being able to prepare their own data for analysis without detailed knowledge of SQL or data integration. It does, however, require effort. Users need tools for self-service data preparation that are fit for purpose, which in practice also means easy to use without significant training.
The wide availability of tools means that almost anything is possible. Do you want all the place names from an address field in one column and all the postal codes in another? Perhaps even to derive gender automatically from a list of first names? Editing steps like these are standard and can be applied with a click of the mouse. It does, however, take time and effort to try them out for yourself, and users may need a bit of encouragement and support, at least in the early days. This is where IT comes back into the equation.
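For readers curious about what such a click does behind the scenes, splitting an address field might look roughly like the sketch below. It rests on a loud assumption that is not from any particular tool: every address follows the layout "street, five-digit postal code, city". The names `ADDRESS_PATTERN` and `split_address` are hypothetical, and real address data is usually messier than this.

```python
import re

# Assumed layout (hypothetical): "<street>, <5-digit postal code> <city>"
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?),\s*(?P<postal_code>\d{5})\s+(?P<city>.+)$"
)


def split_address(address):
    """Split one address string into street, postal code and city columns.

    Returns None when the address does not follow the assumed layout,
    so that unusual rows can be reviewed by hand instead of guessed at.
    """
    match = ADDRESS_PATTERN.match(address.strip())
    if match is None:
        return None
    return {name: part.strip() for name, part in match.groupdict().items()}


if __name__ == "__main__":
    print(split_address("Hauptstrasse 5, 10115 Berlin"))
    print(split_address("no postal code here"))
```

Returning `None` for rows that do not fit is a deliberate choice in this sketch: it is safer to flag an odd address than to put the wrong value in a column.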
Organisations need to take time to create an efficient, consistent and repeatable data preparation process, and then put effort into supporting its adoption. This is the role of IT – to select the tools and set out the process for others to follow, and then provide the necessary support to ensure that it happens. If the right self-service tools are used, employees in specialist departments can independently gather information from data, and IT can concentrate on its core tasks. That way, the entire enterprise becomes more productive.