Is your new grid behaving oddly?


New default parameter values for Platform Suite for SAS

Sometimes, when your kids grow older, they change their habits and you don’t recognize their behaviors any more. “We play this game every year at the beach. Don’t you like it anymore?” you ask. “Dad, I’m not seven years old any more”.

Well, Platform Suite for SAS is not seven any more. And its default behavior has changed, too.

Recognizing the release

Platform Suite for SAS ships with the SAS Grid Manager offering, and (almost) every SAS maintenance release changes the bundled version. The suite includes different products that do not share the same numbering sequence. As of 9.4M2, we are shipping Platform Suite for SAS version 8.1, which includes LSF 9.1.1.

Any new release adds features, expands the list of supported operating systems and increases the flexibility in configuring your environments. But the default values of the main parameters, which characterize how the software behaves out of the box, are usually left untouched. Until now.

A faster start

After installing a new environment, we suggest submitting many jobs, all at once, to check that LSF dispatches them correctly. If you have done this more than once, you surely remember that jobs take a while to start. The following screenshot shows the results of a bhist command issued about a minute after submitting 15 jobs on a newly installed 3-node grid that uses LSF 8.0.1 (SAS 9.4 and 9.4M1).

15 jobs submitted to a LSF 8 grid


You can see that jobs are kept pending (column highlighted in red) while LSF decides which host is best suited to run them. LSF starts jobs in successive batches every 20 seconds, and after a minute some jobs have still not started (column highlighted in green).

Here is another screenshot that shows the same bhist command, this time issued only thirty seconds after submitting 15 jobs on a newly installed 3-node grid that uses LSF 9.1.1 (SAS 9.4M2). Can you spot the difference? All jobs start almost immediately, without spending any time in the pending state:

15 jobs submitted to a LSF 9 grid


Isn’t this good?

Well, it depends. Platform LSF is a system built and tuned for batch processing. As such, many components need some “think time” before they can react. After all, when you submit a 2-hour-long batch job, does it matter whether it takes 1 or 25 seconds to start?

Things change when we use LSF to manage interactive SAS workloads. End users do care whether a SAS session takes 5 or 25 seconds to start when they submit a project from SAS Enterprise Guide. If it takes more than 60 seconds, the object spawner may even time out. Our usual practice is to tune some LSF parameters, as shown in this blog post, to reduce the sleep times of the grid services so that interactive sessions start faster.

The results above, which show how fast jobs start with LSF 9.1.1, hint that the new release ships with these parameters already tuned by default. That’s good! Isn’t it?

Comparing the values

To understand which parameters have new default values, you can compare the Platform LSF Configuration Reference Version 9.1 with the Platform LSF Configuration Reference Version 8.0.1. I verified what I found there by checking the actual LSF configuration files in two grid installations, and then built the following table to compare the main parameters we usually tune:

New default parameter values for Platform Suite for SAS

The “Default” column reports values that are automatically set in configuration files after a default deployment. When a parameter is not defined in the configuration files, it takes the value listed under the “Undefined” column. As you can see, all of the actual values have been lowered.
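If you want to repeat this check on your own grid, the lookup can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the parameters live in the `Begin Parameters` / `End Parameters` section of an lsb.params-style file, the sample text and its values are placeholders rather than the defaults from the table, and the fallback for an undefined parameter has to be supplied by you from the Configuration Reference.

```python
def read_parameters(text):
    """Return {name: value} for the Parameters section of lsb.params-style text."""
    params = {}
    inside = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop trailing comments
        if not line:
            continue
        marker = " ".join(line.split()).lower()  # normalize inner whitespace
        if marker == "begin parameters":
            inside = True
        elif marker == "end parameters":
            inside = False
        elif inside and "=" in line:
            name, value = line.split("=", 1)
            params[name.strip()] = value.strip()
    return params


def effective_value(params, name, undefined_value):
    """Use the value from the file if present, else the caller-supplied fallback."""
    return params.get(name, undefined_value)


# Placeholder content: these numbers are NOT the documented defaults.
SAMPLE = """\
Begin Parameters
MBD_SLEEP_TIME = 20      # placeholder value
JOB_ACCEPT_INTERVAL = 1  # placeholder value
End Parameters
"""

params = read_parameters(SAMPLE)
for name in ("MBD_SLEEP_TIME", "SBD_SLEEP_TIME", "JOB_ACCEPT_INTERVAL"):
    print(name, "=", effective_value(params, name, "(undefined, see the reference)"))
```

In the sample, SBD_SLEEP_TIME is not set in the file, so the sketch falls back to the value you pass in, which mirrors the role of the “Undefined” column.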

What does this mean?

With the new default LSF 9 values, a SAS 9.4M2 grid is more responsive to interactive users and can accept more jobs that are submitted all at once, increasing the overall job throughput. Grid-launched workspace servers now start almost immediately (if there are enough resources to run them, obviously) with no timeouts or long waits.

… BUT …

There is one problem with this configuration.

If you are familiar with LSF tuning, you may remember this note from the official documentation:

JOB_ACCEPT_INTERVAL: If 0 (zero), a host may accept more than one job. By default, there is no limit to the total number of jobs that can run on a host, so if this parameter is set to 0, a very large number of jobs might be dispatched to a host all at once […]  It is not recommended to set this parameter to 0.

Wait a minute. It is not recommended to set this parameter to 0, and the default for LSF 9.1.1 is 0?

If you check what actually happens inside the grid, you will find that yes, jobs start faster, but this comes at a price. LSF doesn’t have time to check how these new jobs affect server load. It simply dispatches all jobs to the SAME server until the node is full. Only then does it send jobs to another server. The following screenshot shows how all jobs end up running on the same server until it is full (same colors).

New default parameter values for Platform Suite for SAS
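The packing effect can be reproduced with a toy model. The sketch below is not LSF's scheduler. It assumes that the scheduler always walks the hosts in the same order (because it has no fresh load information), that jobs never finish during the burst, and that each host has 8 job slots. The host names and the slot count are hypothetical.

```python
from collections import Counter


def dispatch(num_jobs, hosts, slots_per_host, accept_interval):
    """Toy model: return one (cycle, host) placement per job."""
    if num_jobs > len(hosts) * slots_per_host:
        raise ValueError("toy model: jobs never finish, so they must all fit")
    running = {host: 0 for host in hosts}
    last_accept = {host: None for host in hosts}
    placements = []
    pending = num_jobs
    cycle = 0
    while pending > 0:
        for host in hosts:  # fixed order: load information is stale
            while pending > 0 and running[host] < slots_per_host:
                recently_used = (
                    last_accept[host] is not None
                    and cycle - last_accept[host] < accept_interval
                )
                if recently_used:
                    break  # host must wait before accepting another job
                running[host] += 1
                last_accept[host] = cycle
                placements.append((cycle, host))
                pending -= 1
        cycle += 1
    return placements


hosts = ["node1", "node2", "node3"]  # hypothetical host names
for interval in (0, 1):
    placed = dispatch(15, hosts, slots_per_host=8, accept_interval=interval)
    per_host = Counter(host for _, host in placed)
    cycles = placed[-1][0] + 1
    print(f"JOB_ACCEPT_INTERVAL={interval}: {dict(per_host)} in {cycles} cycle(s)")
```

Under these assumptions, an interval of 0 places 8 jobs on node1 and 7 on node2 in a single cycle and leaves node3 idle, while an interval of 1 places 5 jobs on each node over 5 cycles.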

Imagine that the jobs are grid-launched workspace servers. Many sessions will land on the same host all at once. The same users who were happy because their sessions started immediately will soon complain, because they are contending with each other for the resources of a single server.

How can I change this?

To solve this issue, set the JOB_ACCEPT_INTERVAL parameter back to 1. As with LSF 8, you may also want to lower MBD_SLEEP_TIME and SBD_SLEEP_TIME a bit, but that depends on the actual load on each grid environment. This screenshot shows the same environment running the same jobs after being tuned. Jobs take a bit longer to start, but they are now distributed evenly across all machines.

New default parameter values for Platform Suite for SAS
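As an illustration of the change itself, here is a small Python sketch that sets a parameter inside the `Begin Parameters` / `End Parameters` section of lsb.params-style text. It is a sketch under assumptions, not a supported tool: the helper name `set_parameter` is made up for this example, the sample text is a stripped-down stand-in for a real file, and any trailing comment on a rewritten line is dropped. Work on a copy of your configuration, and remember that LSF has to be reconfigured (typically with `badmin reconfig`) before an edited lsb.params takes effect.

```python
import re


def set_parameter(text, name, value):
    """Return the text with `name = value` set inside the Parameters section."""
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*=")
    out = []
    inside = False
    found_section = False
    done = False
    for line in text.splitlines():
        marker = " ".join(line.split()).lower()
        if marker == "begin parameters":
            inside = True
            found_section = True
        elif marker == "end parameters":
            if inside and not done:
                out.append(f"{name} = {value}")  # parameter was not defined yet
                done = True
            inside = False
        elif inside and pattern.match(line):
            line = f"{name} = {value}"           # overwrite the existing value
            done = True
        out.append(line)
    if not found_section:
        raise ValueError("no Parameters section found")
    return "\n".join(out) + "\n"


# Stand-in for the relevant part of a real file.
BEFORE = """\
Begin Parameters
JOB_ACCEPT_INTERVAL = 0
End Parameters
"""

AFTER = set_parameter(BEFORE, "JOB_ACCEPT_INTERVAL", 1)
print(AFTER)
```

The same helper appends the line just before `End Parameters` when the parameter is not defined yet, which is how you would add MBD_SLEEP_TIME or SBD_SLEEP_TIME if you decide to tune them as well.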

The JOB_ACCEPT_INTERVAL parameter can also be set at the queue level, so a more advanced tuning could be to use different values (0, 1 or even more) based on the desired behavior of each queue. This second option is an advanced tuning that usually cannot be implemented during an initial configuration, as it requires careful design, testing and validation.
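To reason about per-queue values, a tiny calculation helps. The sketch below assumes, following the LSF Configuration Reference as I understand it, that a queue-level JOB_ACCEPT_INTERVAL overrides the cluster-level one and that the wait before a host accepts a second job is JOB_ACCEPT_INTERVAL multiplied by MBD_SLEEP_TIME. The queue names and all numbers are hypothetical.

```python
def seconds_between_dispatches(mbd_sleep_time, cluster_interval, queue_interval=None):
    """Seconds a host waits before accepting a second job (assumed product rule)."""
    # Assumption: a queue-level value, when defined, wins over the cluster value.
    interval = cluster_interval if queue_interval is None else queue_interval
    return interval * mbd_sleep_time


MBD_SLEEP_TIME = 10   # hypothetical value, in seconds
CLUSTER_INTERVAL = 1  # cluster-wide JOB_ACCEPT_INTERVAL

# Hypothetical queues: None means the queue does not define the parameter.
QUEUES = {"interactive": 0, "batch": None, "heavy": 2}

for queue, queue_interval in QUEUES.items():
    wait = seconds_between_dispatches(MBD_SLEEP_TIME, CLUSTER_INTERVAL, queue_interval)
    print(f"{queue}: {wait} s between two jobs on the same host")
```

With these made-up numbers, the interactive queue gets no wait at all, the batch queue inherits the cluster-wide 10 seconds, and the heavy queue waits 20 seconds. This is the kind of trade-off that needs the careful design and testing mentioned above.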

Changes in SAS 9.4M3

Did you think this long blog was over? Wait, there is more. With SAS 9.4M3 things changed again. This release bundles Platform Suite for SAS version 9.1, which includes LSF 9.1.3. It’s a small change in the absolute number, but the release includes a new parameter:

LSB_HJOB_PER_SESSION: Specifies the maximum number of jobs that can be dispatched in each scheduling cycle to each host. LSB_HJOB_PER_SESSION is activated only if the JOB_ACCEPT_INTERVAL parameter is set to 0.

Now you can configure JOB_ACCEPT_INTERVAL=0 to achieve increased grid responsiveness and job throughput, and at the same time put a limit on how many jobs are sent to the same server before LSF starts dispatching them to a different node.
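The effect of such a cap can be sketched with a toy model. Again, this is not LSF's scheduler. It assumes JOB_ACCEPT_INTERVAL=0, a fixed host order, jobs that never finish during the burst, 8 slots per host and a cap of 2 jobs per host per scheduling cycle. The host names, slot count and cap are all hypothetical.

```python
from collections import Counter


def dispatch_with_cap(num_jobs, hosts, slots_per_host, max_per_cycle):
    """Toy model: each host accepts at most max_per_cycle jobs per cycle."""
    if max_per_cycle < 1:
        raise ValueError("max_per_cycle must be at least 1")
    if num_jobs > len(hosts) * slots_per_host:
        raise ValueError("toy model: jobs never finish, so they must all fit")
    running = {host: 0 for host in hosts}
    placements = []
    pending = num_jobs
    cycle = 0
    while pending > 0:
        for host in hosts:  # fixed order: load information is stale
            free = slots_per_host - running[host]
            batch = min(pending, free, max_per_cycle)
            placements.extend((cycle, host) for _ in range(batch))
            running[host] += batch
            pending -= batch
        cycle += 1
    return placements


hosts = ["node1", "node2", "node3"]  # hypothetical host names
placed = dispatch_with_cap(15, hosts, slots_per_host=8, max_per_cycle=2)
per_host = Counter(host for _, host in placed)
print(dict(per_host), "in", placed[-1][0] + 1, "cycle(s)")
```

Under these assumptions the 15 jobs end up as 6, 5 and 4 per node within 3 cycles: not perfectly even, but far from the single-host pile-up, and quicker than dispatching one job per host per cycle.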

Or you can simply accept the default values for all the parameters: they have changed again, but this time they all fit our recommended practice, as shown in this final table:

New default parameter values for Platform Suite for SAS

About Author

Edoardo Riva

Principal Technical Architect

Edoardo Riva is a Principal Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division.

2 Comments

  1. Scott Vodicka on

    Edoardo,

    Very good information. I agree that you have to be very careful setting JOB_ACCEPT_INTERVAL=0, which as you stated uses all the slots on one server. That would defeat the load balancing feature of LSF (grid).

    Are there recommended settings for the parameters in lsb.params for SAS Grids that support a large number of interactive users (EG & Studio)? The values I'm most interested in are: MBD_SLEEP_TIME, SBD_SLEEP_TIME, and JOB_SCHEDULING_INTERVAL.

  2. Jan Klaverstijn on

    Hi Eduardo,

    Thanks for the concise overview. We are searching for ways to shave time off the startup of grid launched workspace servers. The combo of JOB_ACCEPT_INTERVAL=0 and LSB_HJOB_PER_SESSION= seems promising. We are at LSF 9.1.3 (speaking of confusing version numbers). I will do some experimentation but are there any considerations you'd like to share on a good value for this parameter?
