Big data quality


Utilizing big data analytics is currently one of the most promising strategies for businesses to gain competitive advantage and ensure future growth. But as we saw with “small data analytics,” the success of “big data analytics” relies heavily on the quality of its source data. In fact, when combining “small” and “big” data for analysis, neither should lack quality. That raises this question: how can companies assure the quality of big data?

Big data quality challenges

Organizations strive to untangle the knowledge hidden in big data, but at the same time they express concerns about its quality. Unlike most in-house data sources, big data sources are often external, and their quality is outside the business's control. With big data, you get things like incomplete machine or sensor data due to technical failures. Data from text sources lacks the descriptive metadata needed to provide context. Unlike structured database fields, textual data is ambiguous, and its meaning is highly context-dependent.

Since most of the content is generated outside of the organization, it is very unlikely that you can go back to the creator for corrections. But to make business decisions based on big data, organizations need to trust their data. To build that confidence, companies have to make sure in-house “small data” is of high quality and that big data is of acceptable quality. They also have to put effort into understanding the effect of bad data on analysis results. But how can companies build the required confidence in big data?

Building trust in vast amounts of semi-structured and unstructured data

Organizations have always measured their data quality by combining technology with proven data quality standards. The same methods can be applied to big data today, but the results may be rated differently than with small data.

Since quality standards are very difficult to enforce on typical big data sources, companies must deal with the levels of data quality they get. Fortunately, with vast amounts of data, lower levels of data quality are acceptable, because the sheer volume of big data mitigates the effect of a relatively small number of incorrect data records. After all, if petabytes of data are analyzed to identify historical trends, a few kilobytes of bad data will barely influence the results.
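This dilution effect can be illustrated with a toy sketch. The numbers, field semantics and corruption rate below are illustrative assumptions, not a real workload:

```python
import random

random.seed(42)

# Simulate a large batch of sensor readings around a true mean of 20.0.
n = 100_000
readings = [random.gauss(20.0, 2.0) for _ in range(n)]

# Corrupt a tiny fraction (0.01%) with wildly wrong values,
# mimicking occasional failures in an external sensor feed.
for i in random.sample(range(n), n // 10_000):
    readings[i] = random.uniform(-1000.0, 1000.0)

observed_mean = sum(readings) / n
print(f"observed mean: {observed_mean:.2f}")  # remains close to 20.0
```

Even with individual corrupted values a hundred times larger than any real reading, the aggregate barely moves, because the bad records are drowned out by the volume of good ones.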

The new insight gained from big data is traded for some acceptable fuzziness caused by incorrect or incomplete data. The focus therefore shifts to validating the quality of big data at the time of consumption, to make sure it is not too bad. It requires the analyst to determine the actual level of quality and assess the data's suitability for the task at hand. This makes data quality assessment a necessary step in every big data analytics activity.
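A minimal sketch of such a consumption-time check might look as follows; the record layout, field names and acceptance threshold are hypothetical assumptions:

```python
# Hypothetical records from an external sensor feed (field names are assumed).
records = [
    {"sensor_id": "A1", "temp": 21.5},
    {"sensor_id": "A2", "temp": None},   # missing reading
    {"sensor_id": "A3", "temp": 19.8},
    {"sensor_id": None, "temp": 22.1},   # missing identifier
    {"sensor_id": "A5", "temp": 20.4},
]

def assess_quality(rows, required_fields, threshold=0.7):
    """Return the share of complete rows and whether it meets the threshold."""
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in rows
    )
    rate = complete / len(rows)
    return rate, rate >= threshold

rate, acceptable = assess_quality(records, ["sensor_id", "temp"])
print(f"completeness: {rate:.0%}, acceptable: {acceptable}")
```

The point is not the specific metric but the pattern: the analyst measures quality at the moment the data is consumed and decides, against an agreed threshold, whether it is fit for the analysis at hand.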

Big data projects cannot ignore data quality

To carry out data quality checks and assess big data quality, companies will have to invest in data quality technology – if they haven't already. With data quality assessment being part of the data preparation step, it is crucial for the data analyst to have an efficient yet easy-to-use data quality solution to ease daily work. Newly built data quality solutions that integrate with and leverage the performance and scalability of Hadoop will become a “must have” for the data analyst when it comes to assessing vast amounts of data.

The importance of data quality for big data has two reasons. First, when leveraging big data it becomes even more important for companies to place emphasis on high-quality master data (think of it as in-house “small data”). When structured in-house data about customers or products is linked with big data for analysis, incorrect master data has an unpredictable effect on the results. Implementing strong data governance and data quality standards for “small” data sets the foundation for a successful big data initiative.

The second reason is related to assessing the quality of big data while using it. Data quality assessment is mandatory in big data analytics to make sure big data is only used when it is of acceptable quality. Poor-quality big data may affect analytic results in many different ways. Assessing data before it is used is essential to minimize the effects of incorrect data from external big data sources.

Data quality technology designed for big data can ease the effort of deploying data quality routines every time big data is processed and can help ensure the appropriate quality level is met. The characteristics of big data make data quality assessment a mandatory activity prior to any big data analysis. It also serves as a preliminary step for an impact assessment that determines the effect of the remaining bad data on the result. This builds confidence in big data and makes sure the impact of incorrect data is well understood and taken into account when decisions are made.
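One simple way to picture such an impact assessment, using made-up figures, is to compute a result both with and without the flagged records and report the difference:

```python
# Hypothetical daily sales figures; -1.0 marks records flagged as bad.
values = [120.0, 135.5, -1.0, 128.0, 140.2, -1.0, 131.7, 126.3]

clean = [v for v in values if v >= 0]

mean_all = sum(values) / len(values)
mean_clean = sum(clean) / len(clean)

# The impact of the remaining bad data on the result:
impact = abs(mean_all - mean_clean)
print(f"mean with bad records:    {mean_all:.2f}")
print(f"mean without bad records: {mean_clean:.2f}")
print(f"impact of bad data:       {impact:.2f}")
```

Reporting the result alongside its sensitivity to the flagged records lets decision makers judge whether the remaining bad data materially changes the conclusion.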


About Author

Helmut Plinke

Principal Business Solutions Manager

Helmut Plinke acts as Principal Business Solutions Manager for SAS, focusing on data quality and data governance technologies. Helmut is an enthusiast of data quality technologies that improve the fitness of data and thereby help businesses improve efficiency and gain competitive advantages. In his current role, Helmut supports customers in designing enterprise data management solutions based on SAS technology. He has specialized in data quality and data integration technologies for many years and has recently been part of some of the major SAS data quality and data governance projects in DACH and the Netherlands. With over 15 years of experience across multiple industries, Helmut has also gained a wealth of knowledge in technologies like business intelligence, content management and enterprise application integration from past roles with other companies. Helmut has published in IS Report and speaks about data management at SAS and public conferences, sharing his project experience and knowledge.
