Jedi SAS Tricks - Maximum Warp with Hadoop


I'm gearing up to teach the next "DS2 Programming Essentials with Hadoop" class and thinking back to Warp Speed DATA Steps with DS2, where I first demonstrated parallel processing using threads in base SAS. But how about DATA step processing at maximum warp? For that, we'll need a massively parallel processing (MPP) platform - like Hadoop.

Hadoop is an amazingly flexible platform for inexpensively storing and processing massive amounts of all types of data. Pair a well-provisioned Hadoop cluster with SAS and you can squeeze out even more processing speed. I have access to a small Hadoop cluster with the SAS Embedded Process software components installed, plus SAS on Windows licensed for the SAS/ACCESS Interface to Hadoop and the SAS In-Database Code Accelerator for Hadoop. With this arrangement, it's possible to run DS2 DATA step and thread code directly in Hadoop. If you are reading from and writing to Hadoop files, the DS2 code goes in and processes in Hadoop, and nothing comes out but the log! Reducing the need to push data to the compute platform should definitely improve processing speed.
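If you want to try this yourself, a quick PROC SETINIT will list the products licensed at your site in the SAS log - you'll want to see both the SAS/ACCESS Interface to Hadoop and the SAS In-Database Code Accelerator for Hadoop there before expecting in-database execution to work. (This little check is just a convenience, not part of the timing experiment.)

/* List licensed products in the SAS log */
proc setinit;
run;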

I set out to compare processing data with DS2 threads in base SAS to processing the same data in-database in Hadoop. Here is the code I used for my experiment:

LIBNAME hdp HADOOP SERVER="" 
        DATABASE=JediData USER=SASJedi PASSWORD=WarpFactor9;
/* Create the data */
%let MaxObs=1000000;
data t;
   call streaminit(123456);
   do id=1 to &maxobs;
      ru=ceil(rand('UNIFORM')*10);
      rn=ceil(rand('NORMAL',1000,200));
      output;
   end;
run;
 
/* Load the data into Hadoop */
proc delete data=hdp.t;
run;
proc copy in=work out=hdp;
   select t;
run;
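Before timing anything, it's worth a quick sanity check that the table actually landed in Hadoop with the expected number of rows - something like this (not part of the timed runs):

/* Quick sanity check: row count in Hadoop should match &MaxObs */
proc sql;
   select count(*) as nobs from hdp.t;
quit;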
 
proc ds2;
thread hdp.T_thread/overwrite=yes;
   vararray double score[0:100] score0-score100;
   method run();
      dcl int i;
      set hdp.t;
      do i=LBOUND(SCORE) to hbound(score);
         Score[i]= (SQRT(((ru * rn) / (rn + ru))*ID))*(SQRT(((ru * rn) / (rn + ru))*rn));
      end;
   end;
endthread;
run;
quit;

Next, I executed the thread in base SAS:

proc ds2;
/*Threaded Alongside*/
data hdp.T_alongside/overwrite=yes;
   dcl thread hdp.T_thread t();
   method run();
   set from t threads=4;
   end;
enddata;
run;
quit;

This produced the following resource utilization stats in the SAS log:

NOTE: PROCEDURE DS2 used (Total process time):
      real time           1:59.04
      cpu time            1:07.43

Next, I ran the DS2 data program and thread in-database with the DS2ACCEL= option on the PROC DS2 statement:

proc ds2 ds2accel=yes;
/*Threaded In-Database*/
data hdp.T_indb/overwrite=yes;
   dcl thread hdp.T_thread t();
   method run();
   set from t;
   end;
enddata;
run;
quit;

This produced the following resource utilization stats in the SAS log:

NOTE: Running THREAD program in-database
NOTE: Running DATA program in-database
...
NOTE: PROCEDURE DS2 used (Total process time):
      real time           1:09.59
      cpu time            0.15 seconds

I managed to cut the elapsed time by more than 40% - from roughly 119 seconds down to about 70 seconds - even with my puny Hadoop test cluster! It makes a real difference when you can take the code to the data, instead of having to bring the data to the code.
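Of course, faster only matters if both runs produce the same results. Here's a rough sketch of how you might verify that - since row order isn't guaranteed coming back from Hadoop, sort both output tables by ID before comparing:

/* Sketch: pull both result tables back in ID order and compare values */
proc sort data=hdp.t_alongside out=work.alongside; by id; run;
proc sort data=hdp.t_indb out=work.indb; by id; run;

proc compare base=work.alongside compare=work.indb criterion=1e-9;
   id id;
run;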

I'm not going to post a ZIP file for this blog entry, because I can't give you my Hadoop environment to play with. But if you'd like to take DS2 and Hadoop for a test drive, you can see this and lots of other really amazing SAS & Hadoop technology by checking out the SAS Data Loader for Hadoop trial download. Better yet, join me for the next "DS2 Programming Essentials with Hadoop" class and we'll take a deep dive together. Or, if you would rather see a great introduction to Hadoop and an overview of all the ways it interacts with SAS, try our "Introduction to SAS and Hadoop" course, and I think you'll agree: SAS and Hadoop - it's a wonderful thing :-)

Until next time, may the SAS be with you!
Mark


About Author

SAS Jedi

Principal Technical Training Consultant

Mark Jordan (a.k.a. SAS Jedi) grew up in northeast Brazil as the son of Baptist missionaries. After 20 years as a US Navy submariner pursuing his passion for programming as a hobby, in 1994 he retired, turned his hobby into a dream job, and has been a SAS programmer ever since. Mark writes and teaches a broad spectrum of SAS programming classes, and his book, "Mastering the SAS® DS2 Procedure: Advanced Data Wrangling Techniques," is in its second edition. When he isn’t writing, teaching, or posting “Jedi SAS Tricks”, Mark enjoys playing with his grand- and great-grandchildren, hanging out at the beach, and reading science fiction novels. His secret obsession is flying toys – kites, rockets, drones – and though he usually tries to convince Lori that they are for the grandkids, she isn't buying it. Mark lives in historic Williamsburg, VA with his wife, Lori, and Stella, their cat. To connect with Mark, check out his SAS Press Author page, follow him on Twitter @SASJedi or connect on Facebook or LinkedIn.

2 Comments

  1. I've done a bit of work with DS2 threads and the Code Accelerator. I have a question on your THREAD and DATA statements; you actually save them in the Hadoop library. Is that significant? I've never done that, and was wondering if there's any added advantage over the default WORK library for DS2 compilation.

    The ways of the Jedi are often mysterious...

    Carl

    • SAS Jedi

      Carl,
      I have never seen a difference in performance based on where the thread was stored. By the time it's submitted to the EP for compilation and execution, it has been retrieved as source code anyway. I would think a one-time read of code wouldn't make much of a performance blip. I thought it would be fun to demonstrate that you can store a thread anywhere you can access with a LIBNAME statement.

      Stay SASy, my friend!
      Mark
