Hadoop and the future of ETL


I've lived in Las Vegas for two and a half years now. When I decided to move here, I was oblivious to the downtown revitalization taking place as well as the burgeoning tech ecosystem. It turns out that Vegas is an increasingly attractive place for start-ups. Thank 24-7 gambling, low real estate prices and no state income taxes.

When I meet someone working at a start-up, not much time typically passes before the two of us are talking about tech and data. It's just how I'm wired. I often ask how Company XYZ gets its data. Most of the time, APIs are involved. Either the company uses existing APIs (think Twitter, Google, Facebook and LinkedIn) or it has created one for itself, developers and potential partners.

Tellingly, in all of my conversations with nascent start-ups, the term ETL has never been uttered. Not once. Ever. This raises the question: Has traditional ETL jumped the shark?

It's an interesting question, and I have no doubt that countless mature organizations will continue to use ETL for a long time. Still, when it comes to moving data around, ETL is no longer the only game in town. For the foreseeable future, more and more organizations will have to juggle multiple means of accessing data. Because of their power, speed and flexibility, APIs have grown in popularity.

Yet ETL may see a renaissance of sorts with the rising popularity of Hadoop – particularly Hadoop 2.0. As Tamara Dull writes in her post "How Hadoop can help... even if you don’t have big data":

What if you used Hadoop to handle your ETL processing? You could write MapReduce jobs to load the application data into HDFS, transform it and then send the transformed data to the data warehouse. The bonus? Because of the low cost of Hadoop storage, you could store both versions of the data in HDFS: the before application data and the after transformed data. Your data would all be in one place, making it easier to manage, re-process and possibly analyze at a later date.
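To make that flow concrete, here is a minimal sketch in plain Python rather than an actual Hadoop job. It mimics the shape of a MapReduce-style ETL: a mapper emits key/value pairs, a reducer aggregates them, and both the "before" and "after" data stay in one store. Everything named here is an assumption for illustration: the sample order data, the function names and the dictionary standing in for HDFS are all hypothetical, and a real job would run on a cluster and write to HDFS paths instead.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export from an application: one order per line.
RAW_ORDERS = """order_id,customer,amount
1,alice,10.50
2,bob,4.25
3,alice,5.00
"""


def extract(raw_text):
    """Parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def map_phase(records):
    """Emit (customer, amount) pairs, as a MapReduce mapper would."""
    for rec in records:
        yield rec["customer"], float(rec["amount"])


def reduce_phase(pairs):
    """Sum the amounts per customer, as a MapReduce reducer would."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_etl(raw_text, store):
    """Run the transform and keep both versions of the data in one store."""
    # Keep the untouched "before" data.
    store["raw/orders.csv"] = raw_text
    # Transform it and keep the "after" data alongside the original.
    totals = reduce_phase(map_phase(extract(raw_text)))
    store["transformed/customer_totals"] = totals
    # These totals are what would be sent on to the data warehouse.
    return totals


store = {}  # stand-in for HDFS (an assumption of this sketch)
warehouse_rows = run_etl(RAW_ORDERS, store)
print(warehouse_rows)
```

Because the raw export is kept next to the transformed totals, the transform can be re-run later with different logic without going back to the source application, which is the bonus the quote describes.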

Simon says: consider all options

I certainly can't predict the future. Maybe Hadoop will embolden more organizations to use ETL as opposed to APIs. What's more, I’m loath to claim that APIs are inherently “better” than ETL as a means of accessing and moving data. However, let me say this: If an organization can build or use an API while concurrently addressing security, regulatory and technology concerns, then it should strongly consider doing so. ETL is no longer the only game in town.


What say you?


About Author

Phil Simon

Author, Speaker, and Professor

Phil Simon is a keynote speaker and recognized technology expert. He is the award-winning author of eight management books, most recently Analytics: The Agile Way. His ninth will be Slack For Dummies (April 2020, Wiley). He advises organizations on matters related to strategy, data, analytics, and technology. His contributions have appeared in The Harvard Business Review, CNN, Wired, The New York Times, and many other sites. He teaches information systems and analytics at Arizona State University's W. P. Carey School of Business.
