Data quality and data preparation in self-service

The rise of self-service analytics, and the idea of the ‘citizen data scientist’, has also brought a number of issues to the fore in organizations. In particular, two common areas of discussion are the twin pillars of data quality and data preparation.

There is no doubt that good-quality, well-prepared data is essential for any analytics process, particularly self-service. Without it, nobody can trust the results, and the whole exercise becomes largely pointless. One of my colleagues likened self-service analytics on data of unknown quality to driving at 100mph in fog: it’s unlikely to turn out well, although a better driver may be able to handle it more easily, and for longer. And a better driver equipped with good tools, such as high-quality navigational aids, is in an even better position. In other words, quality matters not just for the data, but also for the analyst and the tools.

Whose job is data quality?

But whose responsibility is data quality? And how do data quality and data preparation actually fit together? Many business users would say that both are an IT responsibility. The IT department, after all, sets the rules on data governance, and (probably) only IT has the ability to clean the data properly and ensure it is of the necessary quality. But I think this is an abdication of responsibility by business users.

Business users are the primary owners of their own data. They generate it, they know it well, and they understand best when something is wrong. It is far easier for them than for the IT department to recognize a problem with data quality. Consider a well-known example: coding data in hospitals (the information that records a patient’s diagnosis and any procedures or treatments). Who is more likely to recognize that a code is incorrect: the person who entered it, who knows what it should say and will spot that (say) nobody under Department X should have procedure Y, or the technical specialist who manages the IT system and is nominally responsible for it?

The answer is clearly the person who provided the input. So does this mean that the data quality and preparation process should also be self-service? I think the answer is yes, to a certain extent.
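
To make the point concrete, here is a minimal sketch, in Python with pandas, of how that kind of domain rule could be captured as an automated check once the business owner has articulated it. The column names and the Department X / procedure Y rule are hypothetical placeholders, not a real coding scheme.

    import pandas as pd

    # Hypothetical business rule, supplied by the person who owns the data:
    # "nobody under Department X should have procedure Y".
    records = pd.DataFrame({
        "patient_id": [101, 102, 103],
        "department": ["X", "Cardiology", "X"],
        "procedure": ["Y", "Z", "W"],
    })

    # Flag the rows that violate the rule. The domain expert would spot
    # these instantly; encoding the rule lets the system spot them too.
    violations = records[(records["department"] == "X") &
                         (records["procedure"] == "Y")]
    print(violations)

A check like this only exists because someone who knows the data wrote the rule down; the IT department could implement it, but could never have invented it.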

Once you introduce self-service analytics, business users can really start to see the benefits of good quality data. They see that you cannot get reliable answers with poor data, and that gives them the incentive to take action to assure the quality of their own data. This process is, and should be, a key part of the data preparation that needs to happen before any analytics.


Making good choices

But data quality, and getting the best out of analytics (especially self-service analytics), is about far more than assuring the quality of the data inputs. Users also need to make good choices about the data that they analyze. As one of my colleagues has pointed out, you can take the highest-quality salmon and the best ice cream, but they still will not make a good combination on the same plate. This is where the quality of the ‘driver’, or analyst, comes in.

Statisticians used to warn about ‘comparing apples and pears’, but at least those are both fruit. The danger with self-service analytics is that it raises the possibility of comparing apples with frogs. This is perhaps the strongest argument for adding data preparation to the self-service portfolio: users gain a better understanding of the nature of the data they are analyzing. But it is also an argument for providing good support to any citizen data scientists. They need to understand what they are using, and be able to obtain help where necessary.

And what about tools? One way of managing data preparation in a self-service world is through data virtualization, an option raised more and more as Internet of Things (IoT) data becomes prevalent. The sheer volume of that data demands new solutions, especially since users want answers ever faster. Data virtualization removes several sources of error, and therefore improves data quality: because the data never has to be physically transferred, for example, no transmission errors can occur. It also makes real-time access to the data much easier.
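
As an illustration of the idea, here is a minimal sketch in plain Python of what a ‘virtual view’ does: it queries the underlying sources on demand and joins them at read time, rather than copying the data into a separate store. The source systems and fields are invented for the example; a real data virtualization layer would do this declaratively and at scale.

    # Stand-ins for live queries against two source systems.
    def fetch_orders():
        return [{"order_id": 1, "customer_id": 7, "amount": 120.0}]

    def fetch_customers():
        return [{"customer_id": 7, "name": "Acme Corp"}]

    def virtual_order_view():
        """Join the sources at query time: no copy is made, so there is
        no transfer step in which transmission errors could creep in,
        and the result is always as current as the sources themselves."""
        customers = {c["customer_id"]: c for c in fetch_customers()}
        for order in fetch_orders():
            yield {**order, "customer": customers[order["customer_id"]]["name"]}

    for row in virtual_order_view():
        print(row)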

A three-pronged attack

Self-service analytics, therefore, needs three elements to succeed. It requires work on the data itself, because data quality and preparation are key. But companies also need to ensure that their staff have access to expertise and good support, and the right tools for the job. One of the best self-service approaches to data preparation is sometimes called data wrangling. This newer approach is based not on complex processes driven by old-fashioned ETL tools, but on an easier way to “change” and “clean” data through a powerful visual interface (as in SAS Viya), as sketched below.
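
For readers who want to see what such a tool does under the hood, here is a minimal sketch, in Python with pandas, of the kinds of “change” and “clean” steps a visual data wrangling interface applies. The data and column names are made up for the example.

    import pandas as pd

    # Messy input: stray whitespace, inconsistent casing, a missing
    # value, a duplicated entity and numbers stored as text.
    raw = pd.DataFrame({
        "customer": [" Acme ", "acme", None, "Globex"],
        "revenue": ["1200", "1,200", "950", "n/a"],
    })

    clean = (
        raw
        .dropna(subset=["customer"])                  # drop incomplete rows
        .assign(
            customer=lambda d: d["customer"].str.strip().str.title(),
            revenue=lambda d: pd.to_numeric(
                d["revenue"].str.replace(",", ""), errors="coerce"),
        )
        .drop_duplicates(subset=["customer"])         # collapse duplicates
    )
    print(clean)

A visual wrangling tool presents each of these operations as a point-and-click step, but the underlying transformations are the same.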

The question now is how confident you are that clean, well-prepared data is being used within your organization. Take five minutes to complete the data management maturity assessment and find out.

About Author

Andrea Negri

Senior Principal Technical Support Engineer

Andrea Negri is a Senior Principal Technical Support Engineer helping SAS customers address critical issues in the areas of architecture, performance and integration. On a personal level, he is a self-confessed technology addict, which helps him identify new technologies, ideas and solutions to help customers solve problems. His skills include SAS Platform installation, configuration, administration and optimization. He is currently working on SAS Platform Governance with a focus on knowledge sharing.

1 Comment

  1. Until front-end entry systems are updated with validation and business rules, data quality and the resultant self-service analytics will always be an issue. As they say, Garbage In, Garbage Out (GIGO). In the meantime, it is as you state: "It requires work on the data itself, because data quality and preparation are key."
