I've got a new trick that you can use to track progress in a long-running SAS program, while using SAS Enterprise Guide.
I've previously written about the SYSECHO statement and how you can use it to update the Task Status window with custom messages. SYSECHO is a "global" statement in the SAS language, which works great within the boundaries of a PROC step to annotate your Task Status messages. For certain interactive PROCs (like SQL), you can pepper in SYSECHO statements to give more frequent and detailed updates.
But the DATA step, which is often used for long-running operations, cannot use multiple SYSECHO statements. When SAS compiles the DATA step it processes all global statements up front, so that only one use of the SYSECHO statement takes effect. This means that if you want to use multiple SYSECHO calls to report which observation or values are being processed as the DATA step runs, you can't.
That is, you couldn't, until now. With SAS 9.3 Maintenance 2 you can use a new function, named DOSUBL, to process any SAS statements (including global statements) "just in time" as your DATA step runs. DOSUBL accepts one or more SAS statements and executes them immediately, inline with the currently running DATA step. The statements must represent a "complete" segment of a SAS program: no partial PROCs or partial DATA steps. But it can execute any PROC step (bounded by RUN or QUIT) or a DATA step (bounded by RUN) or -- key for our example -- a SAS global statement.
NOTE: DOSUBL is an experimental function in SAS 9.3, and you really need the M2 revision or later for it to work well. It's on target for a production release with SAS 9.4. It's also the topic of a paper by Rick Langston at SAS Global Forum 2013 (here's a preview from Jason's paper last year).
If you have SAS 9.3M2, try running this from your SAS Enterprise Guide session. Note how I put a condition around the DOSUBL call; I don't want to process the statement for every record, as that might get too chatty in the status window. Instead, I call it once for every 5 records.
/* Tracking where you are in the DATA step iterations */ data _null_; set sashelp.class; if mod(_n_,5)=0 then rc = dosubl(cats('SYSECHO "OBS N=',_n_,'";')); s = sleep(1); /* contrived delay - 1 second on Windows */ run; |
Here's another example that reports on progress, but instead of triggering the update based on record count, I've triggered it based on the current BY value in process:
proc sort data=sashelp.cars out=sortedcars; by make; run; /* Tracking which "category" value is being processed */ data _null_; set sortedcars; by make; if first.make then do; rc = dosubl(cats('SYSECHO "OBS N=',_n_,' MAKE=',make,'";')); end; run; |
With a little imagination, you can devise some sophisticated SYSECHO statements that report on other aspects of the DATA step processing: the timings, the program data vector, and so on. It might become a useful debugging technique (in lieu of the DATA step debugger, which can't run in a client application like SAS Enterprise Guide which now exists in SAS Enterprise Guide 7.13!).
Other references:
Tracking progress in SAS programs with SAS Enterprise Guide
Executing a PROC from a DATA step
26 Comments
This is great. My first crack at wrapping a data step "progress report" in a macro is below. I would love to see some more experienced coders take a crack. The macro is at the end.
To report back the number of rows processed and time elapsed after every five rows, would use this:
To report back the above stats, plus percent completion and estimated time remaing:
The option above will only work if you set nobs=pants (can change pants to shorts or whatever) for the set statement. The estimates for percent completion and time left will only be accurate if there is no where clause (because you wouldn't be processing all the rows) and sas processes every x rows of data in about the same amount of time. There are likely other limitations.
Thanks for the contribution, Grant!
I added some line breaks in your code to make it easier to read on the blog -- hopefully added them in the correct spots.
Is there an acronym or nerdy etymology behind the name DOSUBL? Curious.
Yes, it stands for Do Our Submission Until Brad Leaves.
Just kidding - it's not that exciting. I'm guessing that since DOSUB and DOSUBL work sort of like SUBMIT blocks (remember from SAS/CONNECT, SCL, and other SAS sub-languages), that's the origin of the name. DOSUB submits a program via file reference. DOSUBL submits whatever "lines" of SAS program you include. And SAS function names do not exceed 7 characters in length (thank you mainframe file system), so we must compress our creative names.
Ok, its dōō-sŭb-ĕl (Do-Sub-L) not dŏs-ōō-bĕl (Dos-u-bel). Got it. thanks
Pingback: What's new in the SAS programming language?
Chris, I absolutely love this blog and the information you provide. I have an issue with this SYSECHO code, though. Our enterprise system recently upgraded from a standard SAS server environment, with basically 1 node to a SAS Grid system with multiple nodes. Since the SAS Grid upgrade, I have been unable to use SYSECHO. Are you familiar with this issue? Is this a limitation w/ a node/SAS Grid system? How can I work around it?
Thank you,
Matt,
Thanks for the kind words. Regarding your question, I think this is a limitation of running in the grid that has been addressed in SAS 9.4.
EG connects to a SAS session (we call that the SAS Workspace), and normally when you run your programs EG receives notification for various events: PROC starts, data created/modified, and global statements processed like SYSECHO.
When your program runs via a grid node, the SAS Workspace is farming your job off to another SAS session on the grid. Some of the events are propagated back to your main SAS Workspace and thus back to EG, but perhaps SYSECHO isn't among them. I did a little research and found that this is a known issue, and was fixed in SAS 9.4 (but I don't have a SAS 9.4 grid environment to verify).
If you have a choice, perhaps you can change the way your program/project is processed so that it doesn't use the SAS Grid. You would not want to circumvent the configuration prescribed by your IT department (and thus sabotage the load-balancing benefit of the grid), but if you have just one or two programs that use this capability, perhaps it's okay. SAS Enterprise Guide allows you to set "grid policy" in Tools->Options->SAS Programs, in File->Project Properties, and in the Properties for individual programs (see Code Submission tab).
Hi,
I just played with this and was able to create an ASCII progress bar that fills itself depending on the number of observations processed by a DATA step. Cool! (in case anyone is interested, here is the line of code:
rc = dosubl(cat('SYSECHO "Step 1: [',repeat("X",floor(i/n*10)),repeat("_",10-floor(i/n*10)),"] ",put(i/n, percent.),' / Total: ',put(n,comma4.),' obs.";'));
you can replace i and n for _N_ and NOBS respectively or adjust as needed. I'm trying now to change the X and _ characters for ASCII extended codes 176 and 178, which resemble a half filled square and 70% filled square or whatever and would hence look much better).
Question, does anyone have a suggestion on how to track and show progress (using SYSECHO) for a PROC SQL JOIN operation? I believe for a DATA MERGE operation a similar trick as above would
Thanks.
Creative use of this trick! Here's a complete example that shows it in action.
proc sort data=sashelp.cars out=sortedcars; by make; run; data _null_; /* NOBS= captures total records to process */ set sortedcars nobs=nobs; by make; if first.make then do; rc = dosubl(cat('SYSECHO "Processing CARS: [', repeat("*",floor(_N_ /NOBS*10)), repeat("_",10-floor(_N_ /NOBS*10)),"] ", put(_N_/NOBS, percent.),' / Total: ', put(NOBS,comma4.),' obs.";')); x=sleep(250,.001); end; run;
I don't think you can get it to work with SQL JOIN in the same way, as there is no implicit loop to hook into as there is in the DATA step.
Hmm. This DOSUBL looks interesting. Say one were to combine it with something like log4SAS, then one could presumably write out timings (et cetera) to a log file from within PROC's (et cetera) while simultaneously SYSECHO'ing the same information to SAS EG. Might be handy for efficiency tuning if nothing else.
One could also presumably run something like a PROC SQL from within a DATA step. If one were to encounter a variable, say something that could be used as a primary key to some kind of SQL table, then one could invoke PROC SQL after identifying a particular value of said key and thereafter, presumably utilize the results set from the SQL query *without* leaving the aforementioned DATA step. Nice.
DATA some_data; Really Brilliant SAS code goes here that identifies a value of a primary key; DOSUBL( execute a PROC SQL, supplying the above identified primary key); Even more brilliant SAS code goes here that uses Returned_Rows_From_Query; RUN;
The above could be used, in effect, to create an SQL "windowing" function. In particular, I'm processing records by a Transaction Identifier Number (TIN) for a particular person identified by a Person Identifier Number (PIN). I only want to look at persons who have certain types of transactions, but when such a transaction is identified, I then need to look at all TIN's for that PIN -- even if I have already read past those other TIN's in the sequential processing that occurs in a DATA step.
However, with DOSUBL, if I encounter a "trigger" TIN (a TIN of the type I'm looking for -- thus identifying a PIN of interest), I could then use DOSUBL to submit a SQL query to gather all TINS for that PIN (something like select TIN, data1, data2, ... from PIN_Table where PIN = PIN_of_interest) and then process the results set of TIN's using a DO loop, gathering data from the other TIN's that are related to the "trigger" TIN -- TIN's that I had already read past during my DATA step.
Yes, there are other ways to do it, but this seems like it might be easier to manage than having to programmatically save off TIN records as I process each PIN, managing the flushing of TIN's once I read to the next PIN (or processing the saved TIN's if I identify a "trigger" TIN).
Not sure what the performance impacts might be. Potentially, doing a lot of little mini SQL queries for a significant percentage of the records in a large data set could get "expensive". Interesting to try though.
Jim
I'm running EG 4.3 with an AIX server running SAS 9.3 (9.03.01M2P08152012). I've encountered some "interesting" effects on macro variables when executing a DOSUBL in code that was executed through a DOSUBL. In particular, the following code when executed in a block invoked by another DOSUBL makes macro variable values go all screwy:
if mod(_n_,5)=0 then
rc = dosubl(cats('SYSECHO "OBS N=',_n_,'";'));
I've written up what I've found on the SAS EG community pages:
https://communities.sas.com/t5/SAS-Enterprise-Guide/DOSUBL-Bug/m-p/254668#U254668
Jim
It appears that the above described problem may be limited to a version or a few versions of SAS. For example SAS 9.3 M2 does display the behavior described in https://communities.sas.com/t5/SAS-Enterprise-Guide/DOSUBL-Bug/m-p/254668 On the other hand, SAS 9.4 M3 does not.
For those wishing to use DOSUBL with macro variables, perhaps careful testing is advised for versions of SAS other than 9.4 M3.
Jim
Jim -- I'm very impressed with the energy that you've put into this! Thanks for sharing your experiments and results.
Hi, Chris.
I've been accused of being thorough. :)
I will say that in roughly 20 years of working with SAS, this is the first real bug that I've encountered in the software -- and to be fair DOSUBL is experimental in v9.3. Perhaps though my "experiments" might suggest that people be judicious in their use of DOSUBL on versions prior to 9.4. My testing with SAS 9.4 M3 appears to indicate that it is working properly with macros. Perhaps my code snippets could be used to validate/invalidate other version/maintenance combinations and a technical note or some such issued delineating the potential macro variable pitfalls of certain version/maintenance combinations. God willing, any problems would be caught in testing, but sometimes it's nice to know a particular route is a dead end before one embarks on the journey.
Jim
Dear Chris:
I've been playing with SYSECHO submitted through DOSUBL on one of my projects. It appears that there is a significant amount of overhead associated with SYSECHO submitted through DOSUBL. I ran the code that immediately follows against raw data files containing approximately 20,000,000 records.
IF MOD(_Trade_Count,&Increment) = 0 THEN
_RC = DOSUBL(CATS('SYSECHO "Reading T _Trade_Count=',PUT(_Trade_Count,COMMA11.),'";'));
For the value of &Increment, runs were made with values of 100, 1000, 10000, 100000, and 1000000. Below, I'll list my results. Note that the time of execution, both "real" and CPU, varies greatly even though I'm performing the very same operations on the very same data in each run. The only thing that changed run to run was the value of &Increment. In terms of "real" time, note that the run with Increment=100 took approximately seventeen times longer than the runs with Increment=100000 or more.
The "cost" of SYSECHO submitted through DOSUBL seems to be fairly significant. Perhaps one rule of thumb might be to divide the magnitude of the maximum number of records by 100 and use the result as one's SYSECHO increment. For example, if one is running with records numbering in the 10's of millions, a good increment might be 100,000.
I've conducted multiple tests. While the results vary run to run, the progression seen below seems to hold (small increment = slow run; large increment = fast run). Note that this was run on one particular implementation of SAS 9.4 on a particular AIX machine using SAS EG 7.13 HF5 (64 bit). Results in other environments may vary although I suspect that they will be along these lines.
Thank you for your time,
Jim
Increment=100
NOTE: DATA statement used (Total process time):
real time 17:38.42
cpu time 4:37.42
Increment=1,000
NOTE: DATA statement used (Total process time):
real time 4:32.70
cpu time 53.81 seconds
Increment=10,000
NOTE: DATA statement used (Total process time):
real time 1:23.65
cpu time 15.56 seconds
Increment=100,000
NOTE: DATA statement used (Total process time):
real time 57.55 seconds
cpu time 12.28 seconds
Increment=1,000,000
NOTE: DATA statement used (Total process time):
real time 56.96 seconds
cpu time 12.22 seconds
SYSECHO raises an event that EG processes, and the interaction could have a small cost. Might be interesting to run your test in a Base SAS environment where EG isn't in the mix. Also, SYSECHO is simply a global statement for SAS to process, so you could try with another such statement (like FOOTNOTE;) and see whether just processing a global statement adds overhead.
For my part, I've always positioned this as a trick to get some feedback via EG and not necessarily a best practice for production jobs. I think it's handy, and I've used it to instrument SAS programs under development. I usually remove the statement when running in Batch, but not sure it has an impact there.
Chris,
Yes, that would be interesting, to run where there is no EG involved. I will run on our UNIX/AIX box on Monday.
Chris, weird question: Is there any way in SAS to submit a process, much like DOSUBL, but *not* wait for the process to finish before proceeding on? I'm aware of DOSUBL, in which the process must finish, and I'm aware of CALL EXECUTE which can submit a process or processes that will execute serially after the current DATA step finishes, but is there a way to "fire and forget" a process, i.e. to not have the originating DATA step involved in the timings? I'm looking to launch completely independent processes. I've Googled about, but I don't see anything of this sort.
Thanks,
Jim
If you have SAS Grid Computing, or even just SAS/CONNECT, you might be able to spawn another SAS session from your current session to perform a piece of work. SAS Grid Computing includes a number of macros and functions to help fork off multiple processes, manage which can run in parallel, and tell your SAS code to "wait" until all are complete before running a finalized section of code. The code analyzer in EG can help you to arrange your code for grid computing.
Chris,
I ran a quick test on UNIX/AIX. I only ran for 1000, 10000, 100000, and 1000000. 100 is too painful. :)
The results show a similar pattern. See below.
Jim
1000
NOTE: DATA statement used (Total process time):
real time 3:57.31
cpu time 40.44 seconds
10000
NOTE: DATA statement used (Total process time):
real time 2:24.81
cpu time 18.59 seconds
100000
NOTE: DATA statement used (Total process time):
real time 2:00.69
cpu time 15.76 seconds
1000000
NOTE: DATA statement used (Total process time):
real time 1:56.22
cpu time 14.63 seconds
Okay, good to know!
And, since everyone was just dying to know, I ran it with Increment=100 over lunch. All things considered, it seems like 10000 is a happy medium although with truly large files (100+ million), perhaps 100000 is a better choice.
Jim
100
NOTE: DATA statement used (Total process time):
real time 31:55.86
cpu time 4:24.38
1000
NOTE: DATA statement used (Total process time):
real time 4:09.70
cpu time 41.51 seconds
10000
NOTE: DATA statement used (Total process time):
real time 1:50.68
cpu time 18.03 seconds
100000
NOTE: DATA statement used (Total process time):
real time 1:40.14
cpu time 15.56 seconds
1000000
NOTE: DATA statement used (Total process time):
real time 1:37.27
cpu time 15.42 seconds
Chris:
I notice that SYSECHO has kind of been hidden in Enterprise Guide version 8.1 which I've just started using (liking that Ignite theme). Yes, the messages from SYSECHO can be read if one opens the Status pane, but a) the Status pane takes up more real estate than I like and b) one has to manually open it. Is there any way to restore the full glory :) of SYSECHO so that one can see it in the status bar at the bottom of the Enterprise Guide screen? I've check what documentation I can find, but I've come up empty so far.
Thanks,
Jim
Jim, I'm not aware of a trick here. The Status pane was improved to be more helpful in general, but it might have been at the cost some of the SYSECHO experience.
Rats. :)
Well, thank you, Chris.
Overall version 8.1 is a nice upgrade -- although things have been pretty sweet since pretty 7.1.3 (I started on version 4, and current users just don't know how good they've got it).
I'll continue to hold out hope for some future 8.x.x release that restores full ease of use for SYSECHO. :)
Jim
Chris,
Two other people made the (unsolicited) comment to me that they missed the SYSECHO feature displaying at the bottom of the screen (everyone is just getting started with version 8.1 here).
By the way, the DOSUBL command has some nice functionality. With it, we are able to execute a SAS procedure from within the run boundaries of a DATA step. We read a file in our DATA step and then line-by-line pass values to a SAS procedure that is executed via a macro submitted via DOSUBL. The results of the procedure are passed back to the DATA step through the Global Symbol Table and retrieved via a SYMGET.
I don't think I would have been able to solve this task had I not been made aware of DOSUBL via your post. My thanks.
Jim