In this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power becomes of the essence.
However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. Way too often good old sequential processing just does not cut it anymore and different computing methods are required. One such method is parallel processing.
In my previous post, Using shell scripts for massively parallel processing, I demonstrated a script-centered technique for running multiple independent SAS processes in parallel in SAS environments lacking SAS/CONNECT.
In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.
Problem: monthly data ingestion use case
For simplification, we are going to use a scenario similar to one in the previous blog post:
Each month, shortly after the end of the previous month, we need to ingest a number of CSV files pertinent to that month's transactions and produce daily SAS data tables, one for each day of the month. This time we go a step further and combine all those daily tables into a single monthly table.
Solution: combining sequential and parallel processing
The solution comprises three major components:
- Shell script running the main SAS program.
- Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
- Single-thread SAS program responsible for ingesting the data for a single day.
1. Shell script running main SAS program
The shell script mainprog.sh below runs the main SAS program mainprog.sas:
#!/bin/sh

# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &

now=$(date +%Y.%m.%d_%H.%M.%S)

# getting YYYYMM as a parameter in script call
ym=$1

pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log

sas $pgmname -log $logname -set inDate $ym -set logname $logname
The script runs in background mode, as indicated by the ampersand at the end of its invocation command:
nohup sh /path/mainprog.sh YYYYMM &
We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.
When we call the SAS program mainprog.sas within the script, we specify the name of the SAS log file to be created (-log $logname) and pass two parameters: inDate (-set inDate $ym, which carries the same YYYYMM value given in the script calling command) and logname (-set logname $logname). As you will see further, we use both parameters within the mainprog.sas program.
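Because the script hands its first argument straight to SAS, a malformed value would only surface later, inside mainprog.sas. Below is a minimal sketch of a guard that could run before SAS is called. The function name is_valid_ym is hypothetical and not part of the original script; it only checks the shape of the argument (four digits followed by a month from 01 to 12).

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original mainprog.sh):
# returns 0 when the argument looks like YYYYMM with a month between 01 and 12.
is_valid_ym() {
    case "$1" in
        [0-9][0-9][0-9][0-9]0[1-9] | [0-9][0-9][0-9][0-9]1[0-2]) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use near the top of mainprog.sh, before the sas command:
# is_valid_ym "$1" || { echo "Usage: mainprog.sh YYYYMM" >&2; exit 1; }
```

With such a guard, a call like mainprog.sh 202113 would stop with a usage message instead of launching a SAS session that fails later.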
2. Main SAS program
Here is an abridged version of the mainprog.sas program:
/* ======= pre-processing ======= */

/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);

/* year and month */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);

/* output data library */
libname SASDL '/data/target';

/* number of days in month mm of year yyyy */
/* (day of the month's last date via INTNX, which also works for December) */
%let days = %sysfunc(day(%sysfunc(intnx(month,%sysfunc(mdy(&mm,1,&yyyy)),0,e))));

/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
   waitfor _all_ &tasklist;
%mend loop;
%loop

/* ======= post-processing ======= */

/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;
The key highlights of this program are:
- We capture values of the parameters passed to the program (inDate and logname).
- Based on these parameters, assign source directory and target data library SASDL.
- Calculate number of days in a specific month defined by year and month.
- Create a directory to hold the SAS logs of all parallel threads; the directory name matches the log name of mainprog.sas.
- Capture the WORK library location of the main SAS session running mainprog.sas as:
%let workpath = %sysfunc(pathname(WORK));
We use that location in the thread sessions to pass back to the main session data produced by the thread sessions.
- Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:
systask command "&cmd" taskname=DAY&i;
The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.
Restriction: SYSTASK statement is not supported on the CAS server.
- We also build a cumulative list of the task names assigned to the thread sessions:
%let tasklist = &tasklist DAY&i;
- Outside the macro %do-loop we use the WAITFOR statement, which suspends execution of the main SAS session until the specified tasks finish executing. Since we built a list of all daily thread sessions (&tasklist), this synchronizes all our parallel threads and resumes the mainprog.sas session only when all of them have finished executing.
- At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.
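The two %substr expressions that derive the thread-log directory and the timestamp from the log name are compact but not easy to read. The shell sketch below reproduces the same string arithmetic so the resulting names are easy to see. The log name used here is an assumed example value (the timestamp is made up for illustration), and the variable names simply mirror the macro variables in mainprog.sas.

```shell
#!/bin/sh
# Assumed example value; in mainprog.sh the timestamp comes from the date command.
logname=/path/saslogs/mainprog_2021.07.01_09.30.00.log

# Directory for thread logs: the log name without ".log", plus "_logs"
# (mirrors %substr(&logname,1,%length(&logname)-4)_logs).
looplogdir="${logname%.log}_logs"

# Timestamp: the 19 characters that precede ".log"
# (mirrors %substr(&logname,%length(&logname)-22,19)).
base="${logname%.log}"
logdt="${base##*mainprog_}"

# Log name for the day 1 thread (mirrors &looplogdir/thread_&z._&logdt..log).
z=$(printf '%02d' 1)
threadlog="$looplogdir/thread_${z}_${logdt}.log"

echo "$looplogdir"   # /path/saslogs/mainprog_2021.07.01_09.30.00_logs
echo "$logdt"        # 2021.07.01_09.30.00
echo "$threadlog"    # /path/saslogs/mainprog_2021.07.01_09.30.00_logs/thread_01_2021.07.01_09.30.00.log
```

Each thread log therefore carries both the zero-padded day number and the timestamp of the main run, which makes it easy to match thread logs to the main log.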
Using a SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration generates a single global SYSTASK statement and pushes it out of the data step boundaries, where the statements are executed sequentially (just as in the macro implementation). Since NOWAIT is the default option for SYSTASK statements, their corresponding OS commands still run in parallel even though the statements themselves are launched one after another.
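For readers who think in shell terms, the launch-then-wait pattern resembles starting background jobs and then calling wait. The sketch below is only an analogy, not a description of how SAS implements SYSTASK: the function run_day is a hypothetical stand-in for one thread session, and the one-second sleep stands in for real work.

```shell
#!/bin/sh
# Analogy only: "command &" plays the role of SYSTASK COMMAND with NOWAIT,
# and "wait" plays the role of WAITFOR _ALL_.
tmpdir=$(mktemp -d)

# Hypothetical stand-in for one thread session: pretend to work, then write a result.
run_day() {
    sleep 1
    echo "day $1 done" > "$tmpdir/day_$1.out"
}

# Launch the tasks one after another; each returns immediately and runs in the background.
for i in 1 2 3 4 5; do
    run_day "$i" &
done

# Block until every background task has finished.
wait

# Post-processing: combine the per-day results into one file.
cat "$tmpdir"/day_*.out > "$tmpdir/all.out"
lines=$(wc -l < "$tmpdir/all.out" | tr -d ' ')
echo "combined $lines results"

rm -rf "$tmpdir"
```

The loop itself is sequential, but because each task is started without waiting for the previous one, the tasks overlap in time, and the combining step runs only after all of them are done.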
3. Single thread SAS program
Here is an abridged version of the thread.sas program:
/* inDate parameter */
%let inDate = %sysget(inDate);

/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";

/* thread number */
%let i = %sysget(i);

/* year and month */
%let yyyy = %substr(&inDate,1,4);
%let mm = %substr(&inDate,5,2);

/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;

/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' '
   from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;

/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..csv";

/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i;
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;
This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To make that table available to the main SAS session, the MAINWORK library is defined here at the same physical location as the WORK library of the parent SAS session.
We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.
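Since every thread expects its own daily file, it can be worth confirming that all files are in place before the main program spawns the threads. Below is a minimal sketch of such a check. The function name missing_files is hypothetical, and the file naming (source_data_YYYYMM_dayN.csv) is taken from the filename statement in thread.sas, assuming the .csv extension.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original programs):
# prints the daily source files that thread.sas would expect but that do not exist.
# Arguments: source directory, YYYYMM, number of days in the month.
missing_files() {
    srcdir=$1
    ym=$2
    days=$3
    i=1
    while [ "$i" -le "$days" ]; do
        f="$srcdir/source_data_${ym}_day$i.csv"
        [ -f "$f" ] || echo "$f"
        i=$((i + 1))
    done
}

# Example call (paths follow the pattern used in thread.sas):
# missing_files /datapath/2021/06 202106 30
```

An empty result means all expected files are present; any printed path points to a day whose thread would fail to find its input.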
Additional resources
- Using shell scripts for massively parallel processing
- Running SAS programs in parallel using SAS/CONNECT®
- Parallel Processing Your Way to Faster Software and a Big Fat Bonus: Demonstrations in Base SAS®. SGF 2017, by Troy Martin Hughes
- Parallel Processing with Base SAS. SGF 2018, by Jim Barbour
Thoughts? Comments?
Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.
8 Comments
Can I do this in SAS Studio?
Hi Martin, most likely you would not be able to use SAS Studio for this. The technique Leonid describes has you using SYSTASK to launch multiple SAS sessions, and SYSTASK to call outside processes is disabled in most SAS Studio environments.
Your example is run in a Unix environment. Can this technique be used to run parallel SAS jobs on a PC based installation of SAS Enterprise Guide?
Thank you for your question, Audrey. Could you please clarify your SAS installation? Where is your SAS server? On the same PC as Enterprise Guide, on a different machine (Unix, Windows)? Do you have SAS/CONNECT installed in your environment?
I have used the technique you describe above with phenomenal success in a Linux SAS environment, e.g. run times of 24 hours or more get reduced to 3 hours.
I've recently migrated to a new job running SAS jobs with excessive run times remotely via SAS Enterprise Guide and Microsoft Remote Desktop. I do not know whether the remote installation has SAS Connect or not. Can you advise what technique to use to set up parallel jobs remotely with or without SAS Connect?
My understanding is that you run SAS EG on a Microsoft Windows Remote Desktop. However, behind the scenes SAS EG (which is just a client) connects to a SAS server where your EG code is executed. If you run
proc setinit; run;
in EG, the SAS log will provide information about your SAS server, including its operating system, and will list all SAS products licensed on that server. If SAS/CONNECT is licensed, you can use the method described in my earlier post Running SAS programs in parallel using SAS/CONNECT®. If it is not licensed (highly unlikely), you can check whether you can execute OS commands from within your SAS session. Just run
proc options; run;
and the SAS log will show option XCMD or NOXCMD. The first one (XCMD) means that you can run OS commands from SAS and therefore you can use the method described in this post. If you have NOXCMD you may reach out to your SAS admin and try persuading them to change that to XCMD; however, in some environments, security policies may not allow that. Please let me know your findings and the outcome, and do not hesitate to ask more questions.
And remember, the xcmd option must be set before launching the SAS session.
Thank you, Jerry. Good point, enabling XCMD option is a must for executing Operating System commands from a SAS session (using any of the following: X statement, SYSTEM function, CALL SYSTEM routine, SYSTASK statement, PIPE option in a FILENAME statement, %SYSEXEC command).