Data quality issues don’t go away just because you have more data.
Big data is sometimes considered exempt from the requirement to be integrated, cleansed and standardized. Unfortunately, chances are that the more data you have, the worse its quality will become.
Big data is all about access to huge amounts of data across various types of data sets. Data scientists are tasked with bringing these huge data troves in-house and finding useful information in them. But can you trust the data? Data scientists waste a lot of valuable time analyzing and crunching data without knowing if the quality of the data is adequate for analytical purposes.
In the early years of big data, loading data sets into Hadoop clusters to facilitate analysis was a major undertaking. Data scientists needed to learn special scripting tools such as Flume or Sqoop to load or unload data to and from Hadoop clusters. In recent years, new tools have made these processes easier – so data scientists can concentrate on analyzing large data sets rather than spending time manipulating them.
Now that it’s easier to load large amounts of data into Hadoop clusters, it’s time to address the second challenge of big data – namely, data quality.
One of the biggest challenges in the Hadoop environment is the lack of an update command. This raises the question of how you can apply and enforce data quality in big data environments at all.
Hadoop clusters are mainly created to support analytics. It’s easy to see that the lack of an update command affects how you design your “facts” and “dimensions,” especially concepts such as slowly changing dimensions. There are several ways to address this issue. The easiest way is by inserting new data and taking care of the time variance through the front end, then filtering the old data out of your result set. Another option is to partition your dimensions by date.
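The insert-and-filter approach can be sketched in a few lines of plain Python. The field names (customer_id, effective_date) are illustrative assumptions, not a specific schema:

```python
# Sketch of the insert-and-filter approach to slowly changing dimensions:
# every change is appended as a new row (no update in place), and the
# "current" view is derived by keeping only the newest row per key.
# Field names (customer_id, effective_date) are illustrative assumptions.

def current_dimension_rows(rows):
    """Return only the latest row per customer_id, filtering out older versions."""
    latest = {}
    for row in rows:
        key = row["customer_id"]
        if key not in latest or row["effective_date"] > latest[key]["effective_date"]:
            latest[key] = row
    return list(latest.values())

appended_rows = [
    {"customer_id": 1, "city": "Boston", "effective_date": "2013-01-01"},
    {"customer_id": 1, "city": "Austin", "effective_date": "2014-06-15"},  # newer version, appended rather than updated
    {"customer_id": 2, "city": "Denver", "effective_date": "2013-03-10"},
]

print(current_dimension_rows(appended_rows))
```

In a real cluster the same filtering would typically be pushed into the query layer (or handled by date-based partitions), but the principle is the same: writes only ever append.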
These same methods could be used to address data quality. Data that is cleansed could be treated as new data and inserted into new partitions. However, for a period of time this would mean that dirty data could get into your Hadoop cluster and could be included in reports.
Better big data quality
A better way of applying data quality to Hadoop clusters is to perform data cleansing before data is written into Hadoop. This could happen either by using “staging” data lakes or by on-the-fly application of data quality rules.
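An on-the-fly quality rule can be as simple as a cleansing function applied to each record before it is written out. The rules and field names below are illustrative assumptions, not a specific product's API:

```python
# Sketch of applying data quality rules on the fly, before records are
# written into the cluster. The rules and field names (id, email) are
# illustrative assumptions, not any specific tool's API.

def cleanse(record):
    """Standardize fields; return None to reject a record that fails a rule."""
    email = record.get("email", "").strip().lower()
    if "@" not in email:
        return None  # dirty record rejected before it ever lands in the cluster
    record["email"] = email
    return record

def ingest(records):
    """Keep only records that pass cleansing."""
    return [r for r in (cleanse(dict(rec)) for rec in records) if r is not None]

raw = [
    {"id": 1, "email": "  Alice@Example.COM "},
    {"id": 2, "email": "not-an-email"},
]
print(ingest(raw))
```

Because rejected records never reach the cluster, reports built on it can't pick up the dirty data in the first place, which is exactly the advantage over the insert-new-partitions approach described earlier.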
The nature of big data forces us to think differently about data quality rules and how they’re applied. Consider that big data is possible only because of the distributed nature of networks and computer systems.
One of the fundamental theorems of distributed computing, the CAP theorem, states that it is impossible to simultaneously provide consistency, availability and tolerance to network partitions. This theorem has led to an argument against traditional data management practices and relational databases. Along these lines, a lot of companies have sacrificed consistency for the sake of availability and tolerance to partition failure. Technologies like Hadoop and NoSQL databases were developed to ensure that data is available at all times across distributed nodes. Consistency was secondary.
This eventual consistency of data increases the importance of data quality in big data environments. The job of a data scientist is to explore data sets and come up with innovative ways to analyze the data. Consequently, any governance process must be agile enough to allow data quality rules to be applied without preventing data scientists from doing their work.
Without a doubt, it’s important to think through both your data quality approach and your data governance processes before tackling your big data problems. Big data quality is a big job!