The bigness of your data is likely not its most important characteristic. In fact, it probably doesn’t even rank among the top three data issues you have to deal with. Data quality, the integration of data silos, and handling and extracting value from unstructured data are still the most fertile fields for making your data work for you. [And if I were to list a fourth data management priority it would be, as I described in a previous post (“External data: Radar for your business”), the integration of external data sources into your business decision support process.]
Data Quality: The bigger the data, the bigger the garbage-in problem, which grows right along with data volume. Before you can extract value from the bigness of the data, you need to address the quality of the data itself. If you haven’t been employing robust, scalable data quality tools, now would be the time.
Have we gotten any better at data quality? My personal, one-sample survey would indicate that we have not. My last name, Sadovy, is relatively unusual, and although it is only six letters long, I’ve seen it misspelled over two dozen different ways in my life. I thought I’d seen them all by my mid-40s. But once my three children became college-aged and started receiving daily credit card offers in the mail, several new ways to misspell my name came to light, a credit to the creativity of today’s automated processing systems. Even being a Smith/Smythe or Jones/Joens doesn’t leave you immune to a misplaced bit or byte.
Without a focus on data quality, big data just gives you that many more customer names to get wrong.
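To make the name problem concrete, here is a minimal sketch of how misspelled variants might be flagged against a known-good spelling, using the `SequenceMatcher` class from Python's standard `difflib` module. The function names and the 0.75 threshold are assumptions chosen for illustration, not a recommendation, and real data quality tools use far more sophisticated matching than this.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_name(candidate: str, canonical: str, threshold: float = 0.75) -> bool:
    """Guess whether `candidate` is a misspelling of `canonical`.

    The threshold is an illustrative assumption; it would need tuning
    against real customer records.
    """
    return similarity(candidate, canonical) >= threshold


def group_variants(names, canonical, threshold=0.75):
    """Collect the names that look like variants of the canonical spelling."""
    return [n for n in names if likely_same_name(n, canonical, threshold)]


if __name__ == "__main__":
    incoming = ["Sadovey", "Smith", "Sadovi"]
    # Names judged close enough to "Sadovy" to be reviewed as possible duplicates.
    print(group_variants(incoming, "Sadovy"))
```

A simple ratio like this will both miss real variants and flag unrelated names, which is part of why the garbage-in problem gets harder as the number of records grows.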
Data Integration: If you’ve got a data silo problem, and who doesn’t, then all big data contributes is bigger silos, which makes the eventual data integration exercise that much more of a challenge.
Enterprise big data comes at you from a dizzying array of directions – from mainframes and ERP systems, from transactional and BI databases, from sensors and social media, from customers and suppliers. To make matters worse, each of these various sources and applications has its own, sometimes proprietary, data model.
And we’re still not finished with the complexities of this issue, because enterprise data has one more endearing quality that makes integration difficult – it’s decentralized and distributed. Extracting value from its bigness by creating one humongous, centralized, homogeneous data warehouse is simply out of the question. If Sartre had been a philosopher of data science he might have said, “Integration precedes value extraction”.
Unstructured Data: Depending on which study you prefer, it’s claimed that 70 to 90 percent of all data generated is unstructured. This unstructured bigness doesn’t readily fit into predefined rows, columns, data entry forms or relational database fields. Customer feedback, emails, contracts, Web documents, blogs, Twitter feeds, warranty claims, surveys, research studies, client notes, competitive intelligence, often in different languages and dialects … the list goes on. Who has the time to read all this, let alone find an efficient way to extract the latent value from it?
Unstructured data may be both big and bad, but again, with the right tools, it’s not unmanageable. Text mining, sentiment analysis, contextual analysis – there are automated machine learning and natural language processing techniques available today to deal with the volume and ferret out the insights.
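As a toy illustration of what automated sentiment analysis does, the sketch below scores free text with a simple lexicon-based approach: count the positive words, subtract the negative ones. The word lists and function names are made up for this example, and production tools rely on trained language models rather than a hand-built list, so treat it only as a way to see the basic idea.

```python
import re
from collections import Counter

# Tiny hand-made word lists, assumed purely for illustration.
POSITIVE = {"great", "love", "fast", "helpful", "reliable"}
NEGATIVE = {"broken", "slow", "refund", "terrible", "late"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text: str) -> int:
    """Return (positive word count) minus (negative word count)."""
    counts = Counter(tokenize(text))
    positives = sum(counts[word] for word in POSITIVE)
    negatives = sum(counts[word] for word in NEGATIVE)
    return positives - negatives


def label(text: str) -> str:
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for comment in [
        "Great service, fast and helpful!",
        "The unit arrived broken and late.",
    ]:
        print(label(comment), "|", comment)
```

Even a crude scorer like this can triage thousands of comments in seconds, which is the point: nobody has to read everything for the obvious complaints to surface.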
‘Big Data’ is of course a relative term, but when I think ‘big data’ one of the following three data categories seems to be in play:
- High transaction volumes: Millions of customers, billions of transactions (e.g. ATMs or POS), or tens of thousands of SKUs crossed with other attributes such as retail locations, cost and/or service levels.
- Temporally dense: Sensor data, audio.
- Spatially dense: Video, satellite imagery.
The business issue becomes – what do you want to do with all this data? And the place to start is not with the data, or with its bigness, but with the business problems you want to solve, the business insights you want to gain, and the business decisions you want to support. Starting from there and working backwards to the data means running squarely into the issues of data quality, data integration and unstructured text analytics. It’s only after you get a handle on this trio of capabilities that you can begin to effectively tap the big data spigots.
Extracting tangible value and insights from high-quality, integrated data, no matter its volume, velocity or variety, is where the payoff lies. Getting to this payoff in an environment where your data is growing exponentially in all dimensions requires an investment in robust data management tools. The consumers of this data, the business users, don’t know or care about its bigness – they just want the right data applicable to their particular business problem, and they want to be able to trust that data. Trust, access and insights – it’s got “quality” and “integration” and “analytics” written all over it.