As you may have heard by now, Hadoop is a popular, open-source framework for storing and analyzing big data in a distributed computing environment. But what does that mean? Essentially, it is a two-part system designed specifically to 1) easily store a lot of data and 2) quickly process that data by making multiple, simultaneous passes over it.
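To make the two-part idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (`split_into_blocks`, `map_block`, `reduce_counts`, `word_count`) are hypothetical, and local threads stand in for the machines of a cluster. It only illustrates the general pattern: split the data into blocks, process the blocks at the same time, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Storage side: divide the data set into fixed-size blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Processing side, step 1: count words in one block, independently of the rest."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Processing side, step 2: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, block_size=2, workers=4):
    """Run the simultaneous passes: one worker per block, then merge."""
    blocks = split_into_blocks(lines, block_size)
    # Threads are an assumption for illustration; a real cluster would
    # spread the blocks across many machines instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data big", "data elephant", "big"]
    print(word_count(sample))
```

Because each block is counted on its own, adding more workers (or, in a real cluster, more machines) lets more of the data be processed at once.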
Because it’s open source, it’s readily available for anyone to use, provided you have the hardware and the skills to make it work. When I asked Tony Hamilton, the new big data expert at SAS, to offer some tips in this area, he suggested we talk about “helping the baby elephant become enterprise ready.”
That baby elephant, of course, is a reference to Hadoop’s namesake: a yellow stuffed elephant belonging to the child of Hadoop co-creator Doug Cutting. From Hamilton’s point of view, the adorable yellow elephant is more than a good name. It’s an apt metaphor for a young but powerful system.
Just as you can’t take a baby elephant out of the jungle and put him right to work without some planning and training, you can’t necessarily download Hadoop from the Web and start using it immediately for business decisions.
“A lot of people think open source is open source: you take it from the web site and run your business. That isn’t quite true,” explains Hamilton. “You have to have all these other features and functions working around it first, especially if you’re going to be business compliant and use it to make sound business decisions.”
With that in mind, let’s review the five areas Hamilton suggests you work on to make the baby elephant enterprise-ready:
- Data access. Data in Hadoop should be accessible in the same way other data sources are accessed for analysis in the enterprise. Unified access is important for big data problems, says Hamilton. Your access and connectivity tools should extend into the Hadoop framework in a way that works with other systems, such as ERP and CRM.
- Security. Make sure there are safety measures built around your Hadoop framework. “If you’re going to run a business on this framework, you don’t want just anyone on it,” says Hamilton. “And you want to make sure it doesn’t break down and block you from your data.” This can be done through multiple data management layers with gatekeeping security features between the different layers. Security will also help establish ground rules for the Hadoop environment to interoperate with your traditional environment.
- Performance. Yes, Hadoop is built for big data, but you still have to manage it for optimum performance. Consider how you will meet service level agreements, and make sure you understand the capabilities of the environment and the software you’re using. Balancing capacity for workloads and expectations is still important with Hadoop. “You wouldn’t let a baby elephant carry 50 cut-down trees,” says Hamilton. “He’s not the same size as a mature elephant.” You have to train Hadoop to carry the right-sized load the same way you would train that elephant.
- Integration. Make sure there’s a solid understanding of how Hadoop connects to the other environments in your compute, storage and I/O infrastructure. It is important to understand how the hardware fits with the growth of the workload. Make sure it’s performing at the expected levels, understand what information is coming in and be confident about what is going out. “The smartest businesses are blending ideas and insight paradigms for big data with their traditional data sources,” says Hamilton. This will help your analysis become more granular and your decision makers become more confident.
- Real-time. In the Hadoop world, you’re moving from batch processing to real time and making results visible on mobile devices. Everyone’s got a cell phone or a tablet, and we expect instant results on these devices, says Hamilton. A lot of preparation is required, from data management to model development and visualization, to make sure the outputs from your Hadoop environment are on time, current and mature.
Finally, Hamilton likes to talk about the well-worn path of technology development. This might be your first time raising a baby elephant, but it’s likely not your first time implementing a new technology. Consider the lessons you learned from the early days of ERP or Linux. A lot of these projects began for specific purposes or reasons, and then over time governance was applied. Hadoop is in a similar position today. The more you can apply lessons learned from earlier implementations – and from Hamilton’s list above – the more ready you will be to take advantage of Hadoop.
“Like all good software solutions, having a positive relationship with the providers and trainers makes a world of difference. It is key to know who can help you with support and continued education as well as industry trends and usage ideas,” says Hamilton. After all, he adds, “Elephants never forget and when we work with them we want to make sure it is to everyone’s benefit.”