Bigger doesn’t always mean better – and that’s often the case with big data. Your data quality (DQ) problems – no denial, please – only magnify as your data sets grow.
Having more unstructured data adds another level of complexity. The need for data quality on Hadoop is confirmed by user feedback in the latest TDWI Best Practices Report, "Hadoop for the Enterprise": 55% of respondents plan to integrate DQ within the next three years. So how do you take care of big data quality?
Well, there is good news. You don’t have to learn Java and implement complex MapReduce code to fix quality issues in the data within a Hadoop cluster. SAS Data Loader for Hadoop comes with data quality directives that help business users detect and repair data problems quickly and easily.
As a first step, it is a good idea to profile the data set to get an understanding of how bad the situation is and which columns need some rework. The "Profile Data" directive in SAS Data Loader for Hadoop presents the result as an easy-to-understand report.
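To make the idea concrete, here is a minimal sketch of what a column profile computes, in plain Python. The function name and the sample rows are made up for illustration; the actual "Profile Data" directive runs distributed inside the Hadoop cluster and reports far more measures.

```python
# Toy column profiler (illustration only - the real directive runs in Hadoop).
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, null count, distinct values, top value."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1),
    }

rows = [
    {"city": "Hamburg"}, {"city": "hamburg"}, {"city": ""}, {"city": "Berlin"},
]
print(profile_column([r["city"] for r in rows]))
# The case-inconsistent "Hamburg"/"hamburg" pair counts as two distinct values,
# hinting at a standardization issue worth fixing in the cleansing step.
```

Even this toy version surfaces the kinds of findings a profiling report highlights: missing values, suspicious cardinality, and inconsistent spellings.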
The profiling report helps you understand the strengths and weaknesses of your data. Once you have identified the data quality issues, the "Cleanse Data in Hadoop" directive offers many functions to automatically clean the flawed entries. You can combine multiple transformation steps, and each one prompts for the specifics it needs when you configure it.
Since the actual cleansing is done within the Hadoop cluster, this directive works even on extremely large amounts of data. It’s fast, too: all processing is distributed and runs in parallel.
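Conceptually, chaining cleansing steps looks like the following sketch. The standardization rules here are deliberately simplistic assumptions; the real directive applies locale-aware Quality Knowledge Base rules, pushed down to the data nodes.

```python
# Toy cleansing pipeline: each step is a (column, function) pair applied in order.
def standardize_case(value):
    # Trim whitespace and normalize casing.
    return value.strip().title()

def standardize_phone(value):
    # Keep digits only; a real rule would reformat per locale.
    return "".join(ch for ch in value if ch.isdigit())

steps = [("city", standardize_case), ("phone", standardize_phone)]

def cleanse(row, steps):
    """Apply each transformation step to its column and return the cleaned row."""
    cleaned = dict(row)
    for column, fn in steps:
        cleaned[column] = fn(cleaned[column])
    return cleaned

row = {"city": "  hamburg ", "phone": "+49 (40) 1234-567"}
print(cleanse(row, steps))  # {'city': 'Hamburg', 'phone': '49401234567'}
```

In the directive you assemble steps like these visually instead of writing code, and the generated job runs on every row in the cluster.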
The SAS Quality Knowledge Base (used on the Hadoop data nodes to clean the data) contains a vast amount of information and logic that define data management operations performed without any data movement. For example, the Quality Knowledge Base knows that the name “Andrea” is probably a female name in Germany, but might be a male name in Italy. It knows what email addresses and phone numbers look like in the many supported locales and can correct the format without any manual intervention.
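The "Andrea" example boils down to locale-sensitive lookup logic. Here is a toy illustration with a tiny hand-made table; the Quality Knowledge Base ships curated, per-locale definitions covering far more names and rules.

```python
# Toy locale-aware gender inference (illustration only, not the QKB's logic).
GENDER_BY_LOCALE = {
    ("Andrea", "DEU"): "F",  # probably female in Germany
    ("Andrea", "ITA"): "M",  # might be male in Italy
}

def infer_gender(name, locale):
    """Return F, M, or U (unknown) for a name in a given locale."""
    return GENDER_BY_LOCALE.get((name, locale), "U")

print(infer_gender("Andrea", "DEU"))  # F
print(infer_gender("Andrea", "ITA"))  # M
```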
It's also possible to reliably separate first name and last name – or street, house number, postal code and city – into their own columns. That makes it much easier to perform reporting or analytics on this data.
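A hedged sketch of what such a parse step does, using a single regular expression for one assumed German-style address format. The Quality Knowledge Base does this with locale-specific parse definitions rather than a hand-written pattern, so it handles many more variants.

```python
# Toy address parser for one assumed format: "<street> <number>, <postal> <city>".
import re

ADDRESS = re.compile(
    r"^(?P<street>.+?)\s+(?P<number>\d+\w*),\s*(?P<postal>\d{5})\s+(?P<city>.+)$"
)

def parse_address(value):
    """Split a combined address field into named columns, or None if no match."""
    match = ADDRESS.match(value)
    return match.groupdict() if match else None

print(parse_address("Musterstrasse 12a, 20095 Hamburg"))
# {'street': 'Musterstrasse', 'number': '12a', 'postal': '20095', 'city': 'Hamburg'}
```

Once each component lands in its own column, grouping by city or joining on postal code becomes straightforward.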
Based on HTML 5, the modern, intuitive user interface can also be used from mobile devices. After all, what’s more fun than spending a day on the beach with your tablet, cleaning your 5 TB customer database?
If you are a longtime SAS programmer, you can also invoke the data quality functionality from Base SAS code. That way you can make data quality part of your (big data) ETL – or, more appropriately, ELT – processes.
If you want more information regarding data quality on Hadoop, visit SAS Data Quality and the SAS Data Loader for Hadoop. While you're there, sign up for a free trial of SAS Data Loader for Hadoop, and learn how to make big data quality a reality.