What type of big is your data?

The basic big data problem is simple to understand: we create too much data to store and analyze it all.

The problem gets bigger, however, when you consider the related factors: our problems themselves are getting bigger, the analytics needed to solve them are more complex and the data is coming at us faster than ever before.

"Big data is more than just volume or how much data you have," explains David Shamlin, a Senior Director in SAS R&D. "It’s also about how quickly that data might be changing. It’s the variety of the structure of the data. How much does it look like a row and a column versus a Web stream, if you will? And what’s the complexity of work that you want to do with that data?"

Essentially, David breaks the "bigness" of the big data problem into these four important categories:

Volume: This is the basic data volume problem. You have too much of it. How do you store, integrate and analyze it all?
Velocity: The data is coming at you too fast. Think about mobile phone providers with millions of customers making calls, downloading apps and accessing features all day long. How do you make predictions and react to customer changes in real time when the data is coming at you so quickly?
Variety: Data comes in many formats. It is not always presented to us neatly in rows and columns. Sometimes it’s XML, completely unstructured text data or even streaming voice data. One customer David worked with had 50 terabytes of archived email that they wanted to search and analyze. 50 terabytes in a relational database might be managed easily – but 50 terabytes of unstructured data is a whole different problem.
Complexity: Combining two or more of the above issues gives us a complex data problem indeed. The types of analytics you need add to that complexity: the most sophisticated problems require optimization or forecasting algorithms that scan the data many times and recalculate different segments of the data to find an answer.
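
To make the velocity point concrete, here is a minimal Python sketch of summarizing data as it arrives instead of storing it all first. It uses Welford's online algorithm, a standard one-pass technique; the class name and the sample call durations are hypothetical and are not taken from any SAS product.

```python
import math


class RunningStats:
    """Running count, mean and standard deviation (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Fold one new observation into the summary; nothing else is stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


# Hypothetical stream of call durations, in minutes.
stats = RunningStats()
for duration in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(duration)

print(stats.count, stats.mean, stats.std)
```

A summary like this needs only one scan and constant memory. The optimization and forecasting algorithms described under Complexity cannot usually be reduced to a single pass, which is part of what makes them harder at scale.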

What does this mean for organizations that want answers to their big data problems? First, you need to understand what type of big data problem you have – and what types of larger business problems you’re trying to solve.

For example, a recent McKinsey report says retailers can use big data to increase operating margins by more than 60 percent. Consider these additional possibilities that the McKinsey report cites for health care:

If US health care were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US health care expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal location data could capture $600 billion in consumer surplus.

At SAS, our customers have seen success applying big data solutions to risk problems in the banking sector and markdown optimization problems in retail.

Because of the complexity of these problems, however, they cannot be solved simply by purchasing more high-powered servers and hardware to store more and more data. Rather, “High performance computing is about advanced deployment strategies and options for applying the core business analytics capabilities of data integration, analytics and reporting to big data problems,” says David Pope, a Principal Solutions Architect for SAS.

Those advanced options include enterprise architecture solutions that structure your analytics platform ideally for big data, in-database offerings that put SAS analytics inside partner databases, and in-memory solutions, including the new high-performance analytics partnerships with Teradata and Greenplum.

To learn more about the different types of high-performance architectures, revisit the article Four Ways to Divide and Conquer by SAS CTO Keith Collins.

2 Comments

Rick Wicklin on May 25, 2011 11:04 am

Within the "Volume" category there are subcategories according to the type of analysis that you want to perform. Obtaining the means and standard deviations of 100,000 variables is simple. Computing a complex regression model with 100,000 variables is much more challenging! Finding the absolute shortest possible route for a salesman traveling between 100,000 cities is known to be so difficult that various algorithms have been developed that compute an approximate solution. Consequently, to truly understand how to analyze big data you must consider not only the volume of the data but also the computational complexity of the analysis.
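
As an editorial illustration of the approximate approach the comment above mentions, the sketch below builds a salesman's route with the nearest-neighbor heuristic. This is only one of many approximation methods, it is not claimed to be the one the commenter had in mind, and the city coordinates are hypothetical. It assumes Python 3.8 or later for `math.dist`.

```python
import math


def nearest_neighbor_tour(cities):
    """Greedy tour: start at city 0 and always visit the closest unvisited city."""
    if not cities:
        return []
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        last = cities[tour[-1]]
        # Pick the unvisited city closest to the current one.
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def tour_length(cities, tour):
    """Total length of the closed route, including the return to the start."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


# Hypothetical city coordinates (corners of a 2-by-1 rectangle).
cities = [(0, 0), (0, 1), (2, 1), (2, 0)]
tour = nearest_neighbor_tour(cities)
print(tour, tour_length(cities, tour))
```

The heuristic runs in time proportional to the square of the number of cities and gives no guarantee that the route is the shortest one. That is the trade-off: a usable answer at a manageable cost instead of an exact answer that may be out of reach.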