As the big data era continues to evolve, Hadoop remains the workhorse of distributed computing. MapReduce has long been the dominant processing framework in Hadoop, but Spark, thanks to its in-memory performance, is seeing rapid adoption. As the Hadoop ecosystem matures, users need the flexibility to choose either traditional MapReduce or Spark for data processing.
Forrester predicts total market saturation for Hadoop within two years, and a growing number of users are turning to Spark for its performance advantage over MapReduce.
Enter the latest release of SAS Data Loader for Hadoop, which includes support for Spark. Users are now able to leverage Spark alongside SAS and Hadoop. We took a few moments to talk with Brian Kinnebrew, a Senior Solutions Architect at SAS, who answered some questions about how SAS supports organizations that have embraced Spark. If you’re coming to Strata Hadoop World in San Jose March 29-31, you’ll have a chance to meet Brian and the other members of the SAS team.
As I mentioned in my previous post, we’re using this blog series to introduce some of the key technologies SAS will be highlighting at Strata Hadoop World. Each Q&A features the thought leaders you’ll be able to meet when you stop by the SAS booth #1022. Next up is Brian Kinnebrew, who explains how new enhancements to SAS Data Loader for Hadoop support Spark.
How can SAS Data Loader for Hadoop support Apache Spark?
Brian Kinnebrew: If your Hadoop cluster supports the Apache Spark runtime target, some Data Loader directives can gain performance by using Spark's in-memory distributed processing. The following directives in Data Loader 2.4 support Apache Spark:
- Cleanse data – Allows you to perform data quality transformations on Hadoop data such as standardization, parsing, matching, and gender analysis.
- Transform data – Allows you to filter, manage, and summarize Hadoop data.
- Cluster-survive data – Uses rules to create clusters of similar records, then applies survivorship rules to construct a single surviving record containing the best values from the cluster. The surviving record replaces the cluster of records in the target. This directive requires Spark; a minimal sketch of the pattern follows this list.
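To picture why this kind of work maps well onto Spark, here is a minimal sketch of clustering and survivorship written against the PySpark RDD API (which matches the Spark 1.2/1.3 era discussed below). This is not how Data Loader implements the directive; the record layout, the `cluster_key` matching rule, and the "longer non-empty value wins" survivorship rule are all hypothetical.

```python
# A minimal sketch of the cluster-survive idea in plain PySpark.
# NOT the Data Loader implementation: field layout, cluster_key,
# and the survivorship rule are hypothetical illustrations.
from pyspark import SparkContext

sc = SparkContext(appName="cluster-survive-sketch")

# Each record: (name, email, phone). Records that describe the same
# entity may be incomplete in different fields.
records = sc.parallelize([
    ("Pat Smith",  "pat@example.com", ""),
    ("pat smith",  "",                "555-0100"),
    ("Ana Garcia", "ana@example.com", "555-0199"),
])

def cluster_key(rec):
    # Hypothetical matching rule: cluster on the normalized name.
    return rec[0].lower()

def survive(a, b):
    # Hypothetical survivorship rule: field by field, keep the
    # longer (i.e. more complete) value from the cluster.
    return tuple(x if len(x) >= len(y) else y for x, y in zip(a, b))

surviving = (records
             .map(lambda rec: (cluster_key(rec), rec))
             .reduceByKey(survive)   # one surviving record per cluster
             .values())

print(surviving.collect())
# e.g. [('Pat Smith', 'pat@example.com', '555-0100'),
#       ('Ana Garcia', 'ana@example.com', '555-0199')] (order may vary)
```

In Data Loader, clustering and survivorship are driven by rules you configure in the directive rather than hand-written code; the point of the sketch is simply that the pattern reduces to key-based, in-memory operations, which is where Spark's performance advantage comes from.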
How will saved directives work with Impala and Spark in SAS Data Loader 2.4 for Hadoop?
Kinnebrew: Directives that were created and saved in a prior version of SAS Data Loader for Hadoop will continue to execute in Hive, even after Impala and Spark are enabled in SAS Data Loader 2.4 for Hadoop. To run those older directives in Impala or Spark, you must create new directives. By contrast, saved directives created for the Hive environment in SAS Data Loader 2.4 for Hadoop can be upgraded to execute in Impala or Spark. Complete details can be found in the SAS Data Loader 2.4 for Hadoop User's Guide.
What versions of Apache Spark are supported by SAS Data Loader 2.4 for Hadoop?
Kinnebrew: Spark 1.2 and 1.3 are supported; however, the specific version depends on your Hadoop distribution. Complete details can be found in the SAS Data Loader 2.4 for Hadoop: System Requirements.
Does this mean SQL-based engines are no longer necessary?
Kinnebrew: No. In addition to Spark, the latest release of Data Loader for Hadoop provides support for Impala SQL. The following directives in Data Loader 2.4 support Impala SQL:
- Query or join data – Query a table or use joins to combine multiple tables.
- Sort and de-duplicate data – Provides support for (a) grouping rows based on selected columns and summarizing numeric columns for each group, (b) removal of duplicate rows from an existing table, (c) removing, repositioning, and renaming columns in the target table, and (d) sorting target rows based on ascending or descending values in a particular column.
- Run a Hadoop SQL program – Create jobs that execute SQL programs in Hadoop. This directive enables you to browse available SQL functions, obtain syntax and usage information, and click to add function syntax into the directive’s text editor. You can also copy and paste existing SQL programs directly into the text editor (a sketch of this kind of SQL follows below).
These directives can be executed in the Impala SQL environment or with the default HiveQL functions. Directives that do not support Impala SQL continue to use Hive as the default SQL environment.
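To make that concrete, the kind of statement a sort-and-de-duplicate or Hadoop SQL program directive boils down to is ordinary grouping-and-sorting SQL. The query below is a generic illustration: the `orders` table, its columns, and the host name are invented, and the open-source impyla client is shown only as one common way to submit SQL to Impala from Python; Data Loader generates and submits its SQL for you.

```python
# A hedged sketch: the sort/de-duplicate directive conceptually reduces
# to GROUP BY + aggregation SQL like this. Table, columns, and host are
# hypothetical; impyla is just one open-source way to reach Impala.
from impala.dbapi import connect

sql = """
    SELECT customer_id,
           MAX(last_order_date) AS last_order_date,
           SUM(order_total)     AS total_spend
    FROM   orders
    GROUP  BY customer_id       -- collapses duplicate rows per key
    ORDER  BY total_spend DESC  -- sorts the target rows
"""

conn = connect(host="impala-host.example.com", port=21050)  # default Impala port
cur = conn.cursor()
cur.execute(sql)
for row in cur.fetchall():
    print(row)
```

The same grouping-and-sorting statement is also valid HiveQL, which is why these directives can run in either SQL environment, as described above.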
If you’re looking to learn more about SAS Data Loader for Hadoop, read the post Self-Service data preparation transforms data professionals into data rock stars, or jump right in with a free trial of the software.
Stop by the SAS booth #1022 to chat with Brian, pick up a “Data Dude” or “Data Diva” t-shirt, and meet the rest of the SAS team. You also won’t want to miss Paul Kent and Patrick Hall’s presentation on March 30 at 4:20, A survival guide to machine learning: Top 10 tips from a battle-tested solution.
It’s not too late to register for Strata Hadoop World. Use the discount code SAS20 to receive 20% off your registration.