Data quality on Hadoop: The easy way


Bigger doesn’t always mean better. And that’s often the case with big data. Your data quality (DQ) problem – no denial, please – often only magnifies when you get bigger data sets.

Having more unstructured data adds another level of complexity. The need for data quality on Hadoop is shown by user feedback in the latest TDWI Best Practices Report "Hadoop for the Enterprise." 55% of the respondents plan to integrate DQ in the next three years. So how to take care of big data quality?

Well, there is good news. You don’t have to learn Java and implement complex MapReduce code to fix quality issues in the data within a Hadoop cluster. SAS Data Loader for Hadoop comes with data quality directives that help business users detect and repair data problems quickly and easily.

Data Loader Directives-DQ
SAS Data Loader for Hadoop including the data quality directives

As a first step, it is a good idea to profile the data set to get an understanding of how bad the situation is and which columns need some rework. The "Profile Data" directive in SAS Data Loader for Hadoop presents the result as an easy-to-understand report.

Profile Report 1+2
SAS Data Loader for Hadoop - Profile Report

The profiling report helps you understand the strengths and weaknesses of your data. After you define the data quality issues, the "Cleanse Data in Hadoop" directive offers many functions to automatically clean the flawed entries. It is possible to combine the transformation steps, and each one will ask for specifics when configured.

Since the actual cleaning of the data is done within the Hadoop cluster, this directive will work even on extremely large amounts of data. It’s fast, too. All processing is distributed and in parallel.

SAS Data Loader for Hadoop - Cleanse Data directive

The SAS Quality Knowledge Base (used on the Hadoop data nodes to clean the data) contains a vast amount of information and logic that define data management operations performed without any data movement. For example, the Quality Knowledge Base knows that the name “Andrea” is probably a female name in Germany, but might be a male name in Italy. It knows what emails and phone numbers look like in the (many) supported locales and is able to correct the format without any manual intervention.

It's also possible to reliably separate first name and last name or street, number, postal code and city into their own columns. That makes it much easier to perform reporting or analytics on this data.

Cleanse Data 2

Based on HTML 5, the modern, intuitive user interface can also be used from mobile devices. After all, what’s more fun than spending a day on the beach with your tablet, cleaning your 5 TB customer database?

If you are a longtime SAS programmer, you can always launch the data quality functionality out of Base SAS code as well. This way you can make data quality part of your (big data) ETL (or, more appropriately, ELT) processes.

If you want more information regarding data quality on Hadoop, visit SAS Data Quality and the SAS Data Loader for Hadoop. While you're there, sign up for a free trial of SAS Data Loader for Hadoop, and learn how to make big data quality a reality.


About Author

Guido Oswald

Follow Guido @guidooswald on Twitter Guido Oswald, TOGAF zertifizierter Diplom-Informatiker und MBA-Absolvent, ist begeistert davon, Szenarien und Lösungen rund um Business Analytics und Big Data gemeinsam mit Kunden zu entwickeln und umzusetzen. Als Senior Solution Architect bei SAS Schweiz hat er dafür ausreichend Gelegenheit. Guido Oswald arbeitet seit Juli 2011 bei SAS. Er startete seine Karriere bei IBM und war zuletzt beim Schweizer Software-Unternehmen NEXThink beschäftigt. Guido Oswald, a TOGAF certified, graduate computer scientist and MBA Grad, is excited to develop and implement scenarios and solutions around business analytics and Big Data together with customers. As a Senior Solution Architect at SAS Switzerland he has plenty of opportunities to do this. Guido Oswald has worked at SAS since July 2011. He began his career at IBM, and was most recently employed by the Swiss software company NEXThink. Also find me on LinkedIn and XING More posts here: SAS DACH Blog "Mehr Wissen" SAS Voices The Data Roundtable


  1. Guido,
    A useful reminder that data quality is a key success factor in big data analytics, however, it is also worth remembering that there are different aspects/dimensions to data quality. The approach described can help with invalid data and also, to a limited extent, can provide a means to overcome some aspects of missing data. The data quality dimension that these tools will struggle to address, and is arguably the most important, is data accuracy.
    Data that is valid and appears plausible will not necessarily be flagged up as a data quality problem if it is inaccurate. Therefore, analysis using such data may produce spurious outputs. This indicates that you will still need to undertake some form of data accuracy sampling in order to confirm that analysis outputs are correct.

    Julian Schwarzenbach

    • Hi Julian,

      you are right - taking care of DQ in Data-Prep is no substitute for validating data later on in the analytical process.
      Actually both are equally important and it turns out that dealing with large amounts of data (trying to avoid Big Data here.. ;) can become an issue in the data preparation phase. The Data Loader for Hadoop can help here a lot and speed up things. This saved time can be better used in the later phases of the analytical process - e.g. validating results as you correctly suggested.


  2. Pingback: Data Management y hadoop: claves para una integración de datos exitosa - SAS Colombia

Leave A Reply

Back to Top