“One does not discover new lands without consenting to lose sight of the shore for a very long time.” - André Gide
Ever heard of OpenOffice, Hadoop, Android, Firefox or MySQL? If so, can you identify the common denominator between these software tools and applications? If you answered, “They’re all open source,” you’re right!
While open source software has been around a long time, many organizations have been somewhat slow on the draw to integrate open source into their enterprise infrastructure. A lot of companies have considered open source solutions for initiatives such as BI/DW and have compared them with proprietary solutions on functionality and cost. Yet the hard truth is: We’ve known about these open source solutions, and still we’ve been able to get by without them on a large scale.
A Big Data Best Practice for Open Source Adoption
With the rapid growth of big data solutions these last few years, open source has taken a significant step forward into the enterprise space. Conversely, more and more enterprise-level organizations have begun to participate in and contribute code to the open source community. The time has come to take open source seriously for big data platforms.
A key reason is that many – if not most – of your current software vendors have integrated a variety of open source projects into their own proprietary big data solutions. Not only has Hadoop been integrated, but also many Hadoop-related projects (see list below). Vendors are also partnering with key big data service providers, such as Cloudera, Hortonworks, MapR, and other niche shops to support their customers’ emerging big data needs.
It’s important to note that while open source software is free and (typically) easy to install, commercial open source vendors have to make their money somehow. They can do this in a variety of ways:
- Develop custom/proprietary code to enhance the free software;
- Provide custom design and development services;
- Provide a development sandbox;
- Host software installations; and
- Offer technical support and training.
Like proprietary software, open source software requires ongoing support and maintenance. The software may be free, but it won’t always be cheap.
Popular Open Source Projects for Big Data
In the world of open source software, what we typically call a product or application is called a project. Just like products and applications, some open source projects are highly robust and complex, while others are quite simple and straightforward. In addition, many projects are built to play well with others. Such is the case with Hadoop.
The Apache Hadoop Project. This project is managed by the Apache Software Foundation and has two primary functions: to store and process data. Compared with traditional relational databases, Hadoop can store and process all types of data (not just structured data), often in a fraction of the time and at a fraction of the cost. No wonder it’s so popular.
Apache Hadoop includes four components:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS) – stores data on commodity hardware that scales easily and cheaply
- Hadoop MapReduce – a programming model for large-scale data processing
- Hadoop YARN (Yet Another Resource Negotiator) – a resource management platform
Of these four components, HDFS (the storage component) and MapReduce (the processing component) are what we hear about the most.
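If you’re curious what the MapReduce programming model looks like in practice, here is a minimal sketch in plain Python. It is not Hadoop code and doesn’t use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are made up for illustration. It simply shows the idea: a “map” step turns each piece of input into key/value pairs, and a “reduce” step combines the values for each key.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = []
    # Here the documents are mapped one after another on a single machine;
    # the model is designed so that this work could be spread across many.
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input, for illustration only.
counts = word_count(["big data is big", "open source big data"])
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'open': 1, 'source': 1}
```

Because each document is mapped independently of the others, the map work can be divided among many inexpensive machines, which is the property that makes the model a good fit for large data sets.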
Hadoop-Related Projects. In addition to Apache Hadoop, there are dozens (if not hundreds) of open source projects that have been built to expand and extend its functionality. Listed below are some of the more popular Hadoop-related projects:
- Apache Flume – collects, aggregates and moves large amounts of streaming event data
- Apache HBase – a distributed, scalable, non-relational database (modeled after Google’s Bigtable)
- Apache Hive – provides a data warehouse-like structure and SQL-like access (called HiveQL) to data stored in Hadoop HDFS
- Apache Mahout – a machine-learning and data mining library
- Apache Pig – a high-level platform that allows users to create MapReduce programs
- Apache Spark – a newer, faster, data-processing engine (a MapReduce alternative)
- Shark (a UC Berkeley AMPLab project, not an Apache one) – uses Spark to run Hive queries, reportedly up to 100x faster in memory (or 10x on disk)
- Apache Sqoop – transfers bulk data between Hadoop and relational databases
- Apache ZooKeeper – a centralized service for configuration management and synchronization
Database Technologies. A question often asked about Hadoop is, “Is Hadoop a database – like Oracle or Teradata?” And the simple answer is “No.” Hadoop is not a database technology; it’s a framework and a file system. However, there are several database technologies out there that do support big data initiatives, such as:
- Apache Cassandra – provides rapid access to structured or semi-structured data
- Apache Solr – a search engine that provides full text indexing of documents
- MongoDB – stores large collections of documents
- Neo4j – stores graph-type data, such as social networks
- Redis – an in-memory key-value store that provides rapid access to data
For marketers, the important point is that when you hear the term Hadoop, the discussion is most likely about the Hadoop ecosystem (which includes any and all projects listed above, plus the projects, technologies and services not listed), not just the Apache Hadoop project (i.e., HDFS and MapReduce). As you can see, to address your big data requirements with open source software, you will need more than just the standalone Apache Hadoop project. You will most likely need an ecosystem of big data-related projects.
Make no mistake: We’re in geek heaven right now. Open source solutions are here to stay in the big data world.
Key Takeaways for Marketers
- If you’re talking about big data technology, you’re probably talking about open source software.
- Hadoop was built by developers for developers, not marketers. Don’t tackle it alone.
- Open source software is free, but that’s where “free” stops. It costs to implement.
- Visit the graveyard near the southern tip of the island. It has open gravesites that you can explore.
- Find out what your company’s position is on open source software and big data. One will most likely inform the other.
- To take this all one step further for marketing, check out this paper, Six Tips for Turning Big Data into Huge Insights, featuring viewpoints from Catalina Marketing's ex-CIO Eric Williams.
This is the 4th post in a 10-post series, “A marketer’s journey through the Big Data Archipelago.” This series explores 10 key best practices for big data and why marketers should care. Our next stop is the Location Isle, where we’ll talk about allowing data to reside where it will provide value.