I’ve heard it said that the only thing you can count on in life is change. The same can be said of technology. Change is certain, and the rate of change seems to accelerate with each passing year.
Change requires us to adapt, but as we race to keep up with the latest advances, we need to make sure that we aren’t throwing out important bits of wisdom that have been accumulated along the way.
Take, for example, the analysis of data. The practice of analyzing data has been around for hundreds of years. The principles established in the early days of the profession were limited, at least in part, by the technology available at the time. For example, the “computers” employed by statisticians in the late 1800s were rooms full of women sitting at counting machines. In that environment, the iterative optimization algorithms needed for techniques like maximum likelihood estimation weren’t really feasible. Collecting and analyzing data on the entire population wasn’t even in the realm of possibility. So a whole theory was developed for collecting samples and using those samples to make inferences about the entire population.
Fast forward to 2012. The advent of big data computing has certainly changed the game when it comes to analyzing data. The ability to store massive amounts of data and to apply complex analytical techniques has opened up new frontiers and has even given rise to a new breed of analysts commonly referred to as “data scientists.”
Today’s analysts are constantly being presented with new ways to analyze data, giving them the opportunity to push the boundaries of what’s possible. But what can these modern-day analysts learn from the pioneers who came before them? What are the best practices that should guide big data analytics? Which philosophies stand the test of time, and what are the new principles that need to be established? Here are a few things for you to consider:
- Data quality is still key. The “garbage in, garbage out” principle still applies in the world of big data analytics – and maybe even more so. As storage becomes cheaper, it’s tempting to save every piece of data coming in from every channel, regardless of the quality. However, your analyses – and your ability to make decisions based on your analysis – are only as good as the data that back them up. Basic principles such as identifying anomalies, removing duplicate information, and treating missing values are vital to ensuring the validity of your results.
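Those three basics – duplicates, anomalies, missing values – can be sketched in a few lines of plain Python. The record layout, the plausibility threshold, and mean imputation are all illustrative assumptions; a real pipeline would use a dataframe library and domain-specific rules.

```python
# Hypothetical raw records: one duplicate, one missing value, one outlier.
records = [
    {"id": 1, "revenue": 120.0},
    {"id": 2, "revenue": None},          # missing value
    {"id": 1, "revenue": 120.0},         # duplicate of the first record
    {"id": 3, "revenue": 9_999_999.0},   # implausible outlier
]

# 1. Remove duplicates (keep the first occurrence of each id).
seen, deduped = set(), []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# 2. Identify anomalies with a simple plausibility threshold (assumed here),
#    and set them aside before computing any statistics from the data.
cleaned = [r for r in deduped if r["revenue"] is None or r["revenue"] <= 1_000_000]

# 3. Treat missing values (here: impute with the mean of the remaining values).
observed = [r["revenue"] for r in cleaned if r["revenue"] is not None]
mean_rev = sum(observed) / len(observed)
for r in cleaned:
    if r["revenue"] is None:
        r["revenue"] = mean_rev

print(len(cleaned))  # → 2
```

Note the ordering: anomalies are screened out before imputation, so one implausible value doesn’t contaminate the statistic used to fill the gaps.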
- Sampling. “But, wait a minute,” you might say – “I thought high performance analytics meant that I don’t have to sample anymore – I can analyze all of my data all of the time.” And that’s true. You are no longer REQUIRED to sample due to limitations in your computing environment, but you may CHOOSE to sample for the sake of your analysis. Sampling techniques were developed in part because it was difficult, expensive, or impossible to collect data on the entire population. With the advent of massively parallel computing environments, many of those restrictions have been lifted. And, in many cases, the emphasis has shifted from inference to prediction. However, just because you don’t have to sample doesn’t mean you shouldn’t. There are situations where clever sampling can actually yield a better predictive model than one built on all of the data. For example, in the case of a rare event, a model built on all of the data may not be effective at identifying the event. If the event only occurs 1% of the time, a model that predicts every observation to be a non-event will give you 99% accuracy – but it’s a useless model. By oversampling the event, you can create a model that is better able to differentiate between events and non-events and that will be much more effective in catching those events of interest.
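The 99%-accuracy trap and the oversampling fix can both be shown with a toy example. The event rate and sample sizes below are made up purely for illustration:

```python
import random

random.seed(42)

# Toy population: 1% events (label 1), 99% non-events (label 0).
population = [1] * 100 + [0] * 9_900

# A naive "model" that predicts non-event for every observation.
predictions = [0] * len(population)
accuracy = sum(p == y for p, y in zip(predictions, population)) / len(population)
print(f"accuracy = {accuracy:.2%}")   # → accuracy = 99.00%

# ...yet it never catches a single event of interest.
events_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, population))
print("events caught:", events_caught)  # → events caught: 0

# Oversampling for training: keep every event and draw an equal number of
# non-events, so the model trains on a balanced 50/50 set.
events = [y for y in population if y == 1]
non_events = [y for y in population if y == 0]
training = events + random.sample(non_events, len(events))
print(f"training event rate = {sum(training) / len(training):.0%}")  # → 50%
```

With the event now half of the training data instead of 1%, a model has enough signal to learn what distinguishes events from non-events (predicted probabilities are then typically adjusted back to the true event rate before scoring).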
- Use your noggin. Access to a high performance analytics infrastructure means that you have more time to think. You can run your models faster, which means that you can try more techniques. You can iteratively turn the dials and knobs to get the most predictive power from your model. You can analyze data at different levels of granularity. The value of high performance analytics is not just in being able to analyze all of your data. It’s in being able to do things differently than you’ve ever done them before. High performance analytics gives you the opportunity to challenge the status quo. After some experimentation, you may still find that your best model is your tried-and-true logistic regression – but now your degree of confidence is higher because you’ve explored all of the other possibilities.
- Don’t forget about deployment. You can create the best possible statistical model, but if you can’t “sell the value” of the model to your management – and if you can’t then use it to drive decisions within your organization – is it really of any value? As you build out a modern architecture for analytics, you need to ensure a smooth handoff that will enable you to take the results of your models and deploy them into the operational systems that run your organization. That’s where the return on your investment in high performance analytics will be realized and recognized.
In this high performance world that we’re living in, there are huge opportunities to move analytics forward within your organization. However, I would caution you not to throw away the wisdom that’s been handed down through the years. Take a critical look at what’s been successful in the past – pick out those pieces that still apply today – and then blaze new frontiers that will take your analysis to the next level. If you learn from the past while looking to the future, you’ll be amazed at what you can accomplish!