Guest Post: "Being Data Driven with Hadoop"


We asked Lars George, EMEA Chief Architect at Cloudera, to share his opinion about Hadoop, Big Data and future market trends in Business Analytics. For all those who want to know more about Hadoop and how to apply Big Data Analytics, we recommend this TDWI whitepaper.

The last few years have seen a rise of data in many forms and shapes, generated by humans in the ever-growing social and business networks of vast Internet properties, but also by machines in the so-called Internet of Things - a proliferation of smart devices that report their status or their interactions with humans or other machines at a very high rate. Take, for example, the now ubiquitous smartphone that we all like to use to surf the web, receive emails, and - if time permits - play games. Each of these activities creates a flurry of events that are reported and logged at the service provider to form a profile of user behaviour - usually not at the individual but at the group level. And would we not rather get offers from vendors that target our needs and wishes? The alternative is to be swamped with useless requests and seemingly (well, actually) unrelated information.

All of this leads us to the new age of data driven organisations: those that use this information to improve the service to their customers and eventually yield a better return on their investment, bolstering the bottom line through better sales numbers and more revenue. It is in effect a win-win situation: customers are better served and feel personally addressed, while businesses can sell more products and services. There is a new technology that drives this approach home, usually described as “big data”, though that term is as ambiguous as “cloud” or any other such buzzword. There is more to being data driven than handling a lot of data; it is more aptly described using the “three V’s of Big Data”.

The Three V’s of Being Data Driven

Note that I am refraining from using “big data” in the section title, as it implies only one part of the equation. The ultimate goal is to be data driven, and that means being able to process data that can be described by the following characteristics:

Volume - This is the “big” in big data technology. You may have accumulated terabytes, petabytes, or even exabytes of data and need to process it efficiently. It is not enough to move this data onto cheaper, long-term storage, as it would then no longer be readily accessible. Here we are rather talking about data that is available for analysis without any further preparation. That requires a storage system that can hold the amount of data while also being able to process it.

Velocity - Switch on the Large Hadron Collider and you may have to collect one petabyte of data per second (source: http://www.lhc-closer.es/1/3/12/0)! Obviously this is an extreme example, but some users report having to store 150TB per day - which, spread over the 86,400 seconds of a day, works out to roughly 1.7GB per second - just from listening to the above-mentioned Internet of Things. These are events generated by smart devices and the many users employing them. How can you keep up with such an influx of raw data that needs to be preprocessed and prepared for subsequent analysis?

Variety - The third major aspect of the new age of data is its nature, or structure, type, and form. Here we often hear the term “unstructured data”. I like to describe this as: “Not my data.” It really means that the data arrives in formats that might well be structured, just not in a way that suits my current setup or relational schemas. An elaborate and time-consuming ETL process would have to force the data into whatever shape is needed before moving on. It is the difference between “schema on read” and “schema on write”: the latter is often expensive and requires having the questions that will be asked of the data ready up front (see the sketch after this list).

There are further V’s that have been added over time (and by different vendors), mostly aiming to explain what the purpose of all of this is. The most suitable one here might be Value. Overall, the new approach and technology around big data is about just that: driving up the value per stored byte (sometimes referred to as “cost per byte”) by being able to analyse the data over and over again, asking new questions as they arise, while still being able to access the data in full fidelity.
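To make “schema on read” concrete, here is a minimal sketch using Spark’s Scala API of that generation (the SQLContext-based calls, the HDFS path, and the field name device_type are illustrative assumptions, not details from this article): the raw JSON events are stored exactly as they arrive, and a schema is only derived at the moment a question is asked.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SchemaOnReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-on-read-sketch"))
    val sqlContext = new SQLContext(sc)

    // Raw events were landed on HDFS as-is, with no upfront ETL.
    // A schema is inferred at read time from the JSON structure.
    val events = sqlContext.read.json("hdfs:///data/raw/events/2014-11-01/")
    events.printSchema()

    // Ask a question now; a different question tomorrow needs no reload or reshaping.
    events.registerTempTable("events")
    val topDevices = sqlContext.sql(
      "SELECT device_type, COUNT(*) AS cnt FROM events GROUP BY device_type ORDER BY cnt DESC")
    topDevices.show()

    sc.stop()
  }
}
```

The point is that tomorrow’s question needs no new ETL run: the same raw files are simply read again with whatever structure that question requires.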

Hadoop - An Ecosystem

Existing technologies - especially around the traditional Enterprise Data Warehouse - have started to show limitations in handling one or more of the V’s of big data, so smart companies set out in the late nineties to overcome those boundaries. One approach was to scale up existing technologies even further, changing storage layouts to speed up analytical workloads or increasing hardware I/O channels to move data faster within the infrastructure. Most of these measures amount to increasing the complexity of the system and come at a cost - usually price. Google set out on a different route, implementing a scalable, reliable, fail-safe alternative based on commodity hardware: 19” rack servers with standard CPUs, disks, and memory. The driving force was Moore’s law, finally providing powerful yet affordable computing hardware that could be used to build something new. Releasing technical publications that explained the technology at a high level led to the ideas being picked up in the open-source world, most notably by Doug Cutting.

Doug started the Hadoop project in the mid-2000s out of the need to store web crawl data, just as Google did. Naming the project after a toy elephant his son owned, he set out to replicate the same ideas: a file system that is “web scale” and the use of the CPUs in each machine to run distributed computations as MapReduce jobs. Other large engineering companies such as Yahoo! and Facebook helped further the cause, and today we have an entire ecosystem representing the Hadoop idea.

Many auxiliary projects joined over time, helping to ingest data from various sources (Sqoop for relational data, Flume for event data), process it in many ways (Hive and Impala for SQL access, Pig for imperative scripting, Spark for programmatic access), provide a visual interface (Hue), enable random access (HBase), and define complex workflows (Oozie). Over time some parts will wane and eventually be replaced by newer, more flexible or more powerful abstractions. This is all part of a healthy ecosystem, which develops and flourishes over time.
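To make one of these components a little more tangible, here is a minimal sketch of the random access HBase contributes, using the HBase client API of that generation (the table name user_profiles, the column family d, and the row key are made-up assumptions for illustration):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

object RandomAccessSketch {
  def main(args: Array[String]): Unit = {
    // Assumes an existing HBase table "user_profiles" with column family "d".
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "user_profiles")
    try {
      // A single-row lookup by key - the kind of low-latency random access
      // that batch-oriented HDFS files alone do not offer.
      val result = table.get(new Get(Bytes.toBytes("user-42")))
      val lastSeen = Bytes.toString(
        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_seen")))
      println(s"last_seen = $lastSeen")
    } finally {
      table.close()
    }
  }
}
```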

An interesting fact, though, is that there is virtually no competition to Hadoop. Pretty much every vendor has adopted the standard, and the few attempts to compete with it later on have since been abandoned. What is also normal these days is that you do not install Hadoop as a collection of source code packages and try to make them work in tandem - something that is quite a challenge. As in the Linux world, there are now distributions available that help you install Hadoop from scratch in a very short amount of time, providing you with proven, tested, and carefully selected components that work out of the box. One such distribution is Cloudera’s CDH (link: http://www.cloudera.com), which includes all of the above components and more.

Analytics

Part of the evolution of Hadoop is its recent move from being purely batch driven - which was part of the original design - to something more timely. As with all data processing technology, in the end it is all just physics. In other words, to improve performance you have to reduce the amount of data being transferred and handled to as little as possible: the best I/O is the one you don’t do. To that end, data is laid out in better file formats that follow a columnar approach, reducing the scanned data to just the projected columns (Parquet), or data is cached in memory for iterative processing (Spark) - something that is difficult to achieve with the original MapReduce approach as introduced by Google and Hadoop in succession.
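As a small illustration of both techniques, the following Spark (Scala) sketch - with made-up paths and column names - reads a Parquet-backed data set so that only the projected columns are scanned from disk, and caches the result in memory for repeated passes:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ColumnarAndCachedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("columnar-sketch"))
    val sqlContext = new SQLContext(sc)

    // Parquet is columnar: selecting two columns means only those two
    // columns are read from disk, not the whole record.
    val events = sqlContext.read.parquet("hdfs:///data/warehouse/events_parquet")
    val slim = events.select("user_id", "session_length")

    // Cache the projection so repeated, iterative passes (aggregations,
    // model fitting, ...) avoid re-reading from disk.
    slim.cache()

    println(slim.count())                  // first pass materialises the cache
    slim.groupBy("user_id").count().show() // subsequent passes read from memory

    sc.stop()
  }
}
```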

Search has also been added to CDH because “savvy people know SQL, but everyone knows search”. This enables exploration of the data across the entire organisation, whose members then ask new questions, some of which are subsequently implemented in SQL for more specific processing, or in a native API language (Java or Scala) to gain ultimate control. Automated ingestion keeps the search indexes in sync as data arrives and reduces the wait time to mere moments.

What is also exciting is how the world of data analysis is opening up its tools to the data stored in Hadoop. SAS, the leading force in analytics, is working closely with the Hadoop community and with Cloudera as a partner to bring its expertise to the ecosystem (link: http://www.sas.com/en_us/software/hadoop-big-data.html). New developments enable long-term SAS users to run their existing algorithms on this new source of data and eventually deploy those very algorithms into the Hadoop framework to run close to where the data resides.

Other open-source choices are still in flux, starting new efforts and/or merging with others while slowly converging towards more powerful libraries. Especially in the world of predictive analytics there is still a lot of development going on, but stable algorithms have been identified and tried, and these are making their way into tools like Oryx as well as Spark MLlib. Processing abstractions such as Crunch have been ported from plain MapReduce front-ends to also run on top of emerging (though not yet fully matured) in-memory processing frameworks (Spark).
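To give a flavour of what running such an algorithm close to the data looks like, here is a minimal Spark MLlib sketch in Scala (the input path, the feature layout, and the parameter choices are assumptions for illustration only): k-means clustering over feature vectors read straight from HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Hypothetical input: one comma-separated row of numeric features per event.
    val features = sc.textFile("hdfs:///data/features/events.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache() // k-means is iterative, so keep the vectors in memory

    // Cluster the events into 10 groups, running at most 20 iterations.
    val model = KMeans.train(features, 10, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```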

Big Data Engineering

Another important part of this new world of Hadoop is the automation of data processing, or productionising workflows. This exercise can be described as the core of “big data engineering”. There is an inherent need to first identify the actual use case, prototype it, and approve it. Usually that is part of a proof-of-concept project that brings together decision makers from the organisation with domain knowledge and optionally (though ideally) external experts with experience in the technology (link: http://www.cloudera.com/content/cloudera/en/products-and-services/professional-services.html). But once the approach is chosen and the data pipeline components are selected, the crucial part is to move this idea into production. This requires building the workflows and automating them within an “information architecture” (IA) that describes not only the current use case but an overall approach to sharing the Hadoop cluster resources across many users and organisational units. Again, you can reach out to experienced service providers in this space to solve the technology part, though it is advisable for any truly data driven organisation to build up this knowledge in-house to guarantee long-term success with Hadoop.

Times are exciting and Hadoop is the clear leader in the new field of big data, a.k.a. being data driven. The platform offers many tried and tested components, and with the integration of established solutions from companies like SAS the opportunities are plentiful. While parts of the technology are still developing, now is the best time to start gaining more insight from your data, using a proper big data architecture as well as the exploratory and operational analytics offered by SAS and Cloudera as strong partners in this space. Live long and prosper!

Links

•" Building a Hadoop Data Warehouse: Hadoop 101 for EDW Professionals” by Dr Ralph Kimball (http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html)


