Avoid the pitfalls of parallel jobs

4

parallelI recently received a call from a colleague that is using parallel processing in a grid environment; he lamented that SAS Enterprise Guide did not show in the work library any of the tables that were successfully created in his project.

The issue was very clear in my mind, but I was not able to find any simple description or picture to show him: so why not put it all down in a blog post so everyone can benefit?

Parallel processing can speed up your projects by an incredible factor, especially when programs consist of subtasks that are independent units of work and can be distributed across a grid and executed in parallel. But when these parallel execution environments are not kept in sync, it can also introduce unforeseen problems.

This specific issue of “disappearing” temporary tables can happen using different client interfaces, because it does not depend upon using a certain software, but rather on the business logic that is implemented. Let’s look at two practical examples.

SAS Studio

We want to run an analysis – here a simple proc print – on two independent subsets of the same table. We decide to use two parallel grid sessions to partition the data and then we run the analysis in the parent session.  The code we submit in SAS Studio could be similar to the following:

%let rc = %sysfunc( grdsvc_enable(_all_, server= SASApp));
signon grid1;
signon grid2;
proc datasets library=work noprint;
delete sedan SUV;
run;
rsubmit grid1 wait=no ;
data sedan;
set sashelp.cars;
where Type=”Sedan”;
run;
endrsubmit;
rsubmit grid2 wait=no ;
data SUV;
set sashelp.cars;
where Type=”SUV”;
run;
endrsubmit;
waitfor _ALL_ grid1 grid2;
proc print data=sedan;
run;
proc print data=SUV;
run;

After submitting the code by pressing F3 or clicking Run, we do not get the expected RESULT window and the LOG window shows some errors:

ThePitfallsofParallelJobs

SAS Enterprise Guide

Suppose we have a project similar to the following.

ThePitfallsofParallelJobs2

Many items can be created independently of others. The orange arrows illustrate the potential tasks which can be executed in parallel.

Let’s try to run these project tasks in parallel. Open File, Project Properties. Select Code Submission and flag “Allow parallel execution on the same server.”

ThePitfallsofParallelJobs3

This property enables SAS Enterprise Guide to create one or more additional workspace server connections so that parallel process flow paths can be run in parallel.

Note: Despite the description, when used in a grid environment the additional workspace server sessions do not always execute on the same server. The grid master server decides where these sessions start.

After we select Run, Run Process Flow to submit the code, SAS Enterprise Guide submits the tasks in parallel respecting the required dependencies.  Unfortunately, some tasks fail and red X’s appear over the top right corner of the task icons. In the log summary we may find the following:

ERROR: File WORK.QUERY_FOR_PRODUCTS.DATA does not exist.

What's going on?

The WORK library is the temporary library that is automatically defined by SAS at the beginning of each SAS session or job. The WORK  library stores temporary SAS files that are written by a data step or a procedure and then read as input of subsequent steps. After enabling the parallel execution on the grid, tasks run in multiple SAS sessions, and each grid session has its own dedicated WORK library that is not shared with any other grid session, or with the parent work session which started it.

In the SAS Studio example, the data steps outputs their results – the SEDAN and SUV tables – in the WORK library of a SAS session, then the PROC PRINT tries to read those tables from the WORK library of a different SAS session. Obviously the tables are not there, and the task fails.

ThePitfallsofParallelJobs4

This is a quite common issue when dealing with multiple sessions – even without a grid. One simple solution is to avoid using the WORK library and any other non-shared resources. It is possible to assign a common library in many ways, such as in autoexec files or in metadata.
The issue is solved:

ThePitfallsofParallelJobs5

No shared resources, am I safe?

Well, maybe not. Coming back to the original issue presented at the opening of this post, sometimes we oversee what ‘shared’ means. I’ll show you with this very simple SAS Enterprise Guide project: it’s just a simple query that writes a result table in the WORK library. You can test it on your laptop, without any grid.

ThePitfallsofParallelJobs6

After running it, the FILTER_FOR_AIR table appears in the server’s pane:

ThePitfallsofParallelJobs7

Now, let’s say we have to prepare for a more complex project and we follow again the steps to “Allow parallel execution on the same server. “ Just to be safe, we resubmit the project to test what happens. All seems unchanged, so we save everything and close SAS Enterprise Guide. Say I forgot to ask you to write down something about the results. We reopen the project and knowing the result table in the WORK library was temporary, we rerun the project to recreate it.

This time something is wrong.

ThePitfallsofParallelJobs8

The Output Data pane shows an error, and the Servers pane does not list the FILTER_FOR_AIR table anymore.
Even if we rerun the project, the table will not reappear.

The reason lies, again, in the realm of shared v.s. local libraries.

As soon as we enable “Allow parallel execution on the same server,” SAS Enterprise Guide starts at least one additional SAS session to process the code, even if there is nothing to parallelize. Results are saved only there, but SAS Enterprise Guide always tries to read them from the original, parent session. So we are again in the trap of local WORK libraries.

ThePitfallsofParallelJobs9

Why didn’t we uncover the issue the first time we ran the project? If you run your code, at least once, without the “Allow parallel execution on the same server” option, the results are saved in the parent session. And, they remain there even after enabling parallelization. As a result, we actually have two copies of the FILTER_FOR_AIR table!
As soon as we close SAS Enterprise Guide, both tables are deleted. So, on the next run, there is nothing in the parent session WORK library to send to SAS Enterprise Guide!

The solution? Same as before – only use shared libraries.

Is this all?

As you might have guessed, the answer is no. Libraries are not the only objects that should be shared across sessions. Every local setting – be it the value of an option, a macro, a format – has to be shared across all parallel sessions. Not difficult, but we have to remember to do it!

 

 

 

 

 

 

 

Share

About Author

Edoardo Riva

Principal Technical Architect

Edoardo Riva is a Principal Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division.

4 Comments

  1. John Bentley on

    This is a well-written paper with great examples. For me, it's a great reminder that there are will always be SAS users who don't know concepts and techniques that some other users take for granted... I did a few papers on SAS and parallel processing at SAS Conferences in the early 2000's.

    • Thank you John for your comment. You are right, there are some concepts that you can never take for granted. Even experienced users always find themselves dealing with something new, and you cannot imagine how many people who never used parallel programming fall on issues like these.
      If you are interested, I will be presenting this and other techniques for an effective usage of SAS Studio with SAS Grid Manager at the upcoming SAS Global Forum 2016 in Las Vegas with my paper SAS6422 - Deep Dive with SAS® Studio into SAS® Grid Manager 9.4

  2. Alvaro Alcala on

    Thank you very much Edoardo for your post, it is very clarifying on what the problem is.
    Nevertheless, using a shared library for parallel execution has the problem of different projects (or users) writing in the same table (as EG by default automatically names the tables very similar).
    Isn't there any way of doing parallel processing in EG using temporary libraries?

    Thank you very much.

    • Hello Alvaro,
      using shared libraries does not mean using a single library across the whole enterprise. You can have libraries that are shared across multiple sessions, but still unique by user or project or... whatever you chose.
      Just to give you an example, you can define in metadata a library "mylib" as pre-assigned, and as its library path use "/shared/$USER" (assuming you are on Unix, adjust the syntax on Windows). This way you will have different paths per user, but in everybody's projects they will always appear as "mylib". You may similarly create an environment variable to identify different business projects, the possibilities are endless !

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top