As the point person for SAS joining the new Open Data Platform (ODP) initiative, I want to make it clear why SAS is involved with ODP, and why we think it’s important to our customers and to the Hadoop and big data ecosystem as a whole.
SAS is not in it to choose sides on Hadoop distribution vendors. We support all five major distributions -- Cloudera, Hortonworks, IBM, MapR and Pivotal -- with our applications, and requests continue to pour in for more support of region-specific distributions. SAS will continue our collaboration with all Hadoop vendors.
Anyone else working with multiple distributions of Hadoop will understand the challenges involved. Here are three revealing examples from the last few months, each from a different (unnamed) vendor:
- Calling an HDFS API to see if an HDFS directory exists. Some distributions don’t throw an exception and return a null for the directory. Others throw an exception. Which one is right? Who knows, it’s just different. Oh, and by the way, both of these are run-time behaviors, and thus the only way to catch them is very thorough testing.
- Setting a baseline of Hive 0.13 so we get access to some new syntax. Try it on one distribution, and it works great and we are able to do some really innovative stuff. Try it on another that says it also has Hive 0.13, and we get “syntax error.” Huh? Turns out, you need Hive 0.13 plus a certain patch level to get it to work. Who knew?
- Trying to be a good ecosystem citizen, leveraging the HCAT APIs for accessing shared metadata. All is good. Get the latest “dot” release from the vendor, and guess what, they changed the package name of the class used to get the information. Code change necessary. Oh, and by the way, the latest “dot” release did mark the API deprecated, and then removed it three to six months later when they came out with the release.
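To make the first example concrete, here is a minimal sketch of how an application might normalize the two behaviors behind one check. It is an illustration only: `list_dir`, `client_that_raises` and `client_that_returns_none` are hypothetical stand-ins, not real Hadoop APIs, and the sketch assumes a missing directory surfaces either as a null (`None`) result or as a `FileNotFoundError`.

```python
from typing import Callable, Optional, Sequence


def directory_exists(
    list_dir: Callable[[str], Optional[Sequence[str]]], path: str
) -> bool:
    """Return True if `path` exists, whichever way the client reports absence.

    `list_dir` is a hypothetical stand-in for a distribution's
    directory-listing call; it is not a real Hadoop API.
    """
    try:
        entries = list_dir(path)
    except FileNotFoundError:
        # Behavior A: the client raises when the directory is missing.
        return False
    # Behavior B: the client returns None (null) when the directory is missing.
    # An empty list still counts as an existing, empty directory.
    return entries is not None


# Two simulated clients that disagree on how to report a missing directory.
def client_that_raises(path: str) -> Sequence[str]:
    if path != "/data":
        raise FileNotFoundError(path)
    return ["part-00000"]


def client_that_returns_none(path: str) -> Optional[Sequence[str]]:
    return ["part-00000"] if path == "/data" else None
```

The wrapper hides the difference, but it does not remove the testing burden: someone still has to discover, per distribution, which behavior applies.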
You could pick any of these and say “not a big deal” or “simple bug” and you won’t get an argument from me. But as an analytics application and solution vendor, this death by a thousand paper cuts forces SAS to spend time and effort on “making sure it still works” instead of “making it work better.” A goal of the ODP effort is to let everyone in the ecosystem -- the Hadoop distribution vendors and the application and solution vendors like SAS -- innovate.
At SAS, we’re not naïve enough to think we are perfect. In this wild west of ever-moving software, like others we often “code it 'til it works” and then hope it's right (after numerous Google searches, etc.). The “write once, test everywhere” approach is alive and well. ODP can help by providing validation tests that demonstrate standard coding practices.
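What such a validation test might look like is not specified by ODP here, so the following is only a sketch under assumptions. It probes each distribution for the syntax an application depends on instead of trusting the reported version number, which is the failure mode in the Hive example above. `PROBE_SQL`, `engine_with_patch` and `engine_without_patch` are hypothetical placeholders; a real test would run the actual statement against each cluster.

```python
from typing import Callable, Dict

# Hypothetical probe statement; a real test would use the exact syntax
# the application relies on.
PROBE_SQL = "SELECT new_syntax_example"


def supports(run_query: Callable[[str], object], probe_sql: str) -> bool:
    """Return True if the engine accepts the probe statement."""
    try:
        run_query(probe_sql)
    except Exception:
        # Any failure counts as "not supported" in this sketch; a real test
        # would distinguish syntax errors from connectivity problems.
        return False
    return True


def compatibility_report(
    engines: Dict[str, Callable[[str], object]], probe_sql: str
) -> Dict[str, bool]:
    """Run the same probe against every engine and collect the results."""
    return {name: supports(run, probe_sql) for name, run in engines.items()}


# Two simulated engines that report the same version but behave differently.
def engine_with_patch(sql: str) -> str:
    return "ok"


def engine_without_patch(sql: str) -> str:
    raise ValueError("syntax error")


report = compatibility_report(
    {"distribution_a": engine_with_patch, "distribution_b": engine_without_patch},
    PROBE_SQL,
)
print(report)
```

Probing behavior rather than comparing version strings is a design choice: it keeps working when a vendor backports patches and the version number alone no longer says what is supported.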
In joining ODP, SAS is supporting our customers and the Hadoop ecosystem. As an ODP member, SAS remains committed to ensuring our applications work with and exploit the Hadoop distribution of our customers’ choice, while being able to bank on the stability and quality expected in demanding business environments. We see ODP as a way to do this.
By fostering collaboration across all members of the Hadoop and big data ecosystem, ODP will help reduce the time spent fixing compatibility issues and boost innovation focused on solving our customers’ biggest challenges.
Thank you for this great initiative. Even though I have only been working with Hadoop for a few months and have dealt with only one vendor so far, I already agree completely with what you said about “code it 'til it works” and then hope it is right (after numerous Google searches, etc.).
With Cloudera and MapR abstaining from ODP, it basically means that it'll "just work" - except for customers of those two distributions - i.e., it won't really work, because it won't improve the situation for poor Cloudera or MapR customers.
For example, the issue you describe with Hive 0.13 plus a required patch seems highly likely to refer to Cloudera, who backport a selection of patches on top of a Hive version that otherwise lags behind Hortonworks while trying to maintain compatibility and integration with Impala. As a result, the Hive versions are no longer directly comparable between CDH and other distributions.
Why don't the ODP vendors simply run on HDP and sell only what's unique to them on top of that? ODP seems like a very thin abstraction over what Hortonworks already provide.
Of course Cloudera and MapR don't want to join ODP - it forces them to the lowest common denominator - basically standard Hortonworks - and makes it very hard for them to sell their own proprietary components such as Cloudera Manager/Navigator or MapR-FS, which are competing alternatives to the standardized open source platform.
Big Data Consultant (ex-Cloudera)