Utilizing big data analytics is currently one of the most promising strategies for businesses to gain competitive advantage and secure future growth. But as we saw with “small data analytics,” the success of “big data analytics” relies heavily on the quality of its source data. In fact, when combining “small” and “big” data for analysis, neither should lack quality. That raises the question: how can companies ensure the quality of big data?
Big data quality challenges
Organizations strive to untangle the knowledge hidden in big data, but at the same time they express concerns about its quality. Unlike most in-house data sources, big data sources are often external, and their quality is outside of the business's control. With big data, you get things like incomplete machine or sensor data caused by technical failures. Data from text sources often lacks the descriptive metadata needed to provide context. Unlike structured database fields, textual data is ambiguous, and its meaning is highly context dependent.
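To make the sensor-data problem concrete, here is a minimal sketch of a completeness check, assuming a hypothetical CSV of readings with timestamp, sensor_id, and value columns:

```python
import pandas as pd

# Hypothetical sensor feed; the file name and column names are assumptions.
readings = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Share of records where the measurement itself is missing,
# e.g. because a failing sensor reported nothing.
missing_ratio = readings["value"].isna().mean()

# Gaps larger than the expected reporting interval hint at outages
# rather than merely bad values.
expected_interval = pd.Timedelta(minutes=1)
gaps = (
    readings.sort_values("timestamp")
            .groupby("sensor_id")["timestamp"]
            .diff()
            .gt(expected_interval)
            .sum()
)

print(f"missing values: {missing_ratio:.2%}, reporting gaps: {gaps}")
```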
Since most of the content is generated outside of the organization, it is very unlikely that you can go back to the creator for a correction. But to make business decisions based on big data, organizations need to trust that data. To build confidence, companies have to make sure in-house “small data” is of high quality and that big data is of acceptable quality. Therefore, organizations have to put effort into understanding the effect of bad data on analysis results. But how can companies build the required confidence in big data?
Building trust in vast amounts of semi-structured and unstructured data
Organizations have always measured their data quality by combining technology with proven data quality standards. The same methods can be applied to big data today, but the results may be rated differently than with small data.
Since quality standards are very difficult to enforce for typical big data sources, companies must deal with whatever level of data quality they get. Fortunately, with vast amounts of data, lower levels of data quality are acceptable, because the sheer volume of big data mitigates the effects of relatively small numbers of incorrect data records. After all, if petabytes of data are analyzed to identify historical trends, a few kilobytes of bad data will barely influence the results.
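As a back-of-envelope illustration of that dilution effect (the record size and byte counts below are assumptions, not measurements):

```python
# Rough estimate: how much can a few kilobytes of bad records move an
# average computed over a petabyte of analyzed data?
record_size_bytes = 1_000          # assumed average record size
total_bytes = 10**15               # one petabyte of analyzed data
bad_bytes = 50 * 1024              # a few kilobytes of bad records

bad_fraction = (bad_bytes / record_size_bytes) / (total_bytes / record_size_bytes)
print(f"bad records make up {bad_fraction:.1e} of the data set")
# Even if every bad value were off by a factor of 100, the computed average
# would shift by roughly 100 * bad_fraction, which is orders of magnitude
# below any historical trend the analysis is looking for.
```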
New insight gained from big data is traded against some acceptable fuzziness caused by incorrect or incomplete data. The focus therefore shifts to validating the quality of big data at the time of consumption, to make sure it is not too bad. This requires the analyst to determine the actual level of quality and assess the data's suitability, which makes data quality assessment a necessary step in every big data analytics activity.
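One way to picture such an assessment at the time of consumption is a simple quality gate that measures a few metrics and compares them against thresholds the analyst considers acceptable; the metrics, column names, and thresholds below are illustrative assumptions:

```python
import pandas as pd

def assess_quality(df: pd.DataFrame, required: list,
                   max_missing: float = 0.05, max_duplicates: float = 0.01) -> dict:
    """Return a small quality report and a verdict on the data's suitability."""
    missing_ratio = df[required].isna().any(axis=1).mean()
    duplicate_ratio = df.duplicated().mean()
    return {
        "missing_ratio": missing_ratio,
        "duplicate_ratio": duplicate_ratio,
        "acceptable": missing_ratio <= max_missing and duplicate_ratio <= max_duplicates,
    }

# Hypothetical usage right before the analysis step:
# report = assess_quality(clickstream, required=["user_id", "event", "timestamp"])
# if not report["acceptable"]:
#     ...  # flag the source, lower expectations, or exclude it from the analysis
```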
Big data projects cannot ignore data quality
To perform data quality checks and assess big data quality, companies will have to invest in data quality technology, if they have not done so already. With data quality assessments being part of the data preparation step, it is crucial for the data analyst to have an efficient yet easy-to-use data quality solution that eases daily work. Newly built data quality solutions that integrate with and leverage the performance and scalability of Hadoop will become a “must have” for data analysts when it comes to assessing vast amounts of data.
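As a rough sketch of what such a Hadoop-integrated check might look like, here is a PySpark example that profiles null values across a large data set; the HDFS path, column names, and validity rule are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-quality-profile").getOrCreate()

# Hypothetical event data stored on the cluster.
events = spark.read.parquet("hdfs:///data/raw/events")

# Completeness profile: null counts per column, computed in parallel on the cluster.
null_counts = events.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in events.columns]
)
null_counts.show()

# A simple validity rule: records must carry a positive amount.
invalid = events.filter(F.col("amount") <= 0).count()
print(f"records violating the validity rule: {invalid}")
```

Because the profiling runs where the data lives, the check scales with the cluster rather than with the analyst's workstation.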
Data quality matters for big data for two reasons. First, when leveraging big data it becomes even more important for companies to place emphasis on high-quality master data (think of it as in-house “small data”). When structured in-house data about customers or products is linked with big data for analysis, incorrect master data has an unpredictable effect on the results. Implementing strong data governance and data quality standards for “small” data sets the foundation for a successful big data initiative.
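A small sketch of why this matters when linking the two; the customer master data and transaction feed below are purely illustrative:

```python
import pandas as pd

# In-house master data with one bad record: customer 3 is really the same
# enterprise customer as customer 1, but is classified as "retail".
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["enterprise", "retail", "retail"],
})

# External transaction feed (the "big data" side) is itself correct.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 3, 2],
    "amount": [500, 700, 900, 40],
})

# Revenue per segment is skewed: 900 of enterprise revenue is attributed to
# "retail" solely because the master data is wrong, not the big data.
revenue = (
    transactions.merge(customers, on="customer_id")
                .groupby("segment")["amount"]
                .sum()
)
print(revenue)
```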
The second reason is related to assessing the quality of big data while using it. Data quality checks are mandatory in big data analytics to make sure big data is used only when it is of acceptable quality. Poor-quality big data may affect analytic results in many different ways. Assessing data before it is used is essential to minimize the effects of incorrect data from external big data sources.
Data quality technology designed for big data eases the effort of deploying data quality routines every time big data is processed and helps ensure that the appropriate quality level is met. The characteristics of big data make data quality assessment a mandatory activity prior to any big data analysis. It also serves as a preliminary step for an impact assessment that determines the effect of the remaining bad data on the result. This builds confidence in big data and makes sure the impact of incorrect data is well understood and taken into account when decisions are made.
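A minimal sketch of such an impact assessment, assuming the bad records have already been flagged by the quality routine, is to compare the analysis result with and without them:

```python
import pandas as pd

def impact_assessment(df: pd.DataFrame, value_col: str, is_bad: pd.Series) -> dict:
    """Quantify how much the known-bad records shift a simple aggregate (the mean)."""
    result_with_bad = df[value_col].mean()
    result_without_bad = df.loc[~is_bad, value_col].mean()
    return {
        "bad_share": is_bad.mean(),
        "result_with_bad_records": result_with_bad,
        "result_without_bad_records": result_without_bad,
        "shift_caused_by_bad_data": result_with_bad - result_without_bad,
    }

# Hypothetical usage: 'failed_checks' is a boolean Series produced by the
# data quality routine that ran before the analysis.
# report = impact_assessment(events, "order_value", is_bad=failed_checks)
```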