I'm gearing up to teach the next "DS2 Programming Essentials with Hadoop" class, and I've been thinking about my earlier post, "Warp Speed DATA Steps with DS2", where I first demonstrated parallel processing using threads in base SAS. But how about DATA step processing at maximum warp? For that, we'll need a massively parallel processing (MPP) platform, like Hadoop.
Hadoop is an amazingly flexible platform for inexpensively storing and processing massive amounts of all types of data. With a well-provisioned Hadoop cluster and SAS, even more processing speed can be achieved. I have access to a small Hadoop cluster with the SAS Embedded Process software components installed, and a SAS installation on Windows that includes licenses for the SAS/ACCESS Interface to Hadoop and the SAS In-Database Code Accelerator for Hadoop. With this arrangement, it's possible to run DS2 DATA step and thread code directly in Hadoop. If you are reading from and writing to Hadoop files, the DS2 code goes in and processes in Hadoop, and nothing comes out but the log! Reducing the need to move data to the compute platform should definitely improve processing speed.
I set out to compare processing data with DS2 threads in base SAS to processing the same data in-database in Hadoop. Here is the code I used for my experiment:
LIBNAME hdp HADOOP SERVER="" DATABASE=JediData USER=SASJedi PASSWORD=WarpFactor9;

/* Create the data */
%let MaxObs=1000000;
data t;
   call streaminit(123456);
   do id=1 to &maxobs;
      ru=ceil(rand('UNIFORM')*10);
      rn=ceil(rand('NORMAL',1000,200));
      output;
   end;
run;

/* Load the data into Hadoop */
proc delete data=hdp.t;
run;
proc copy in=work out=hdp;
   select t;
run;

/* Define the thread and store it in the Hadoop library */
proc ds2;
thread hdp.T_thread/overwrite=yes;
   vararray double score[0:100] score0-score100;
   method run();
      dcl int i;
      set hdp.t;
      /* Compute 101 scores per row to give the thread some real work to do */
      do i=lbound(score) to hbound(score);
         score[i]=(sqrt(((ru * rn) / (rn + ru))*id))*(sqrt(((ru * rn) / (rn + ru))*rn));
      end;
   end;
endthread;
run;
quit;
Next, I executed the thread in base SAS:
proc ds2;
/*Threaded Alongside*/
data hdp.T_alongside/overwrite=yes;
   dcl thread hdp.T_thread t();
   method run();
      set from t threads=4;
   end;
enddata;
run;
quit;
This produced the following resource utilization stats in the SAS log:
NOTE: PROCEDURE DS2 used (Total process time):
      real time           1:59.04
      cpu time            1:07.43
Next, I ran the DS2 data program and thread in-database with the DS2ACCEL= option on the PROC DS2 statement:
proc ds2 ds2accel=yes;
/*Threaded In-Database*/
data hdp.T_indb/overwrite=yes;
   dcl thread hdp.T_thread t();
   method run();
      set from t;
   end;
enddata;
run;
quit;
This produced the following resource utilization stats in the SAS log:
NOTE: Running THREAD program in-database
NOTE: Running DATA program in-database
...
NOTE: PROCEDURE DS2 used (Total process time):
      real time           1:09.59
      cpu time            0.15 seconds
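I managed to cut the elapsed time by roughly 40% (from 1:59.04 down to 1:09.59), even with my puny Hadoop test cluster! Notice the CPU time reported by SAS, too: it dropped from over a minute to 0.15 seconds, because the heavy lifting happened in Hadoop instead of in my SAS session. It makes a real difference when you can take the code to the data, instead of having to bring the data to the code.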
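If you'd like to check the arithmetic behind that comparison, here is a minimal sketch in a plain DATA step. The two real-time values are copied from the log excerpts above; the variable names (base_sas, in_db, saved, pct_less, speedup) are just ones I made up for this illustration.

```sas
/* Compare the two elapsed (real) times reported in the logs above. */
/* Variable names are illustrative only.                            */
data _null_;
   base_sas = 1*60 + 59.04;            /* threads=4 in base SAS: 1:59.04   */
   in_db    = 1*60 +  9.59;            /* DS2ACCEL=YES in Hadoop: 1:09.59  */
   saved    = base_sas - in_db;        /* seconds saved                    */
   pct_less = 100 * saved / base_sas;  /* percent reduction in real time   */
   speedup  = base_sas / in_db;        /* how many times faster            */
   put 'Seconds saved: ' saved 8.2;
   put 'Reduction (%): ' pct_less 8.1;
   put 'Speedup (x):   ' speedup 8.2;
run;
```

The results land in the SAS log. Keep in mind these are single runs on one small test cluster, so treat the numbers as an illustration rather than a benchmark.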
I'm not going to post a ZIP file for this blog entry, because I can't give you my Hadoop environment to play with. But if you'd like to take DS2 and Hadoop for a test drive, you can see this and lots of other really amazing SAS and Hadoop technology by checking out the SAS Data Loader for Hadoop trial download. Better yet, join me for the next "DS2 Programming Essentials with Hadoop" class and we'll take a deep dive together. Or, if you would rather see a great introduction to Hadoop and an overview of all the ways it interacts with SAS, try our "Introduction to SAS and Hadoop" course. I think you'll agree: SAS and Hadoop - it's a wonderful thing :-)
Until next time, may the SAS be with you!
Mark
2 Comments
I've done a bit of work with DS2 threads and the Code Accelerator. I have a question about your THREAD and DATA statements: you actually save them in the Hadoop library. Is that significant? I've never done that, and was wondering if there's any added advantage over the default WORK library for DS2 compilation.
The ways of the Jedi are often mysterious...
Carl
Carl,
I have never seen a difference in performance based on where the thread was stored. By the time it's submitted to the EP (the SAS Embedded Process) for compilation and execution, it has been retrieved as source code anyway. I would think a one-time read of code wouldn't make much of a performance blip. I thought it would be fun to demonstrate that you can store a thread anywhere you can access with a LIBNAME statement.
Stay SASy, my friend!
Mark