One grid to rule them all – tuning your environment for SAS Enterprise Guide


Most organizations enjoy a plethora of SAS user types—batch programmers and interactive users, power users and casual—and all variations in between. Each type of SAS user has its own needs and expectations, and it’s important that your SAS Grid Manager environment meets all their needs.

One common solution to this dilemma is to set up separate configurations based on a mix of requirements for departments, client applications and user roles. The grid options set feature in SAS 9.4 makes this task much easier. A grid options set is a convenient way to name a collection of SAS system options, grid options and required grid resources that are stored in metadata.

Why it’s important to tune SAS Grid Manager for interactive users

SAS Enterprise Guide users running interactive programs typically expect results to be returned almost immediately. However, the out-of-the-box grid options are tuned for long-running batch jobs. These options include a latency of 20 seconds on the start of every server session, so SAS Enterprise Guide users may experience frustrating delays.

The good news is that SAS Enterprise Guide and other SAS software products are grid-aware. Once the optimum grid options set is defined and named, it is applied automatically whenever a user accesses the application and submits a job.

In this post, I’ll use Platform RTM for SAS to walk you through a few simple steps and provide a set of options that you can use as a baseline for tuning your SAS grid for your SAS Enterprise Guide users.

1) Reduce grid services sleep times.

The first tuning to perform is usually at the cluster level, to reduce grid services sleep times so that the interactive session starts faster. In Platform RTM, select Config►LSF►Batch Parameters and edit these settings:

MBD_SLEEP_TIME

SBD_SLEEP_TIME

MBD_REFRESH_TIME

JOB_SCHEDULING_INTERVAL

Never set these values to 0. You should tailor the actual values to your grid, considering factors such as number of nodes, number of concurrent users, patterns of utilization and so forth. You may need multiple iterations to tune performance to suit the needs of your SAS user type. Figure 1 shows a recommended starting point.

Figure 1.  Reduce grid services sleep time
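
As a sanity check before entering values in Platform RTM, the "never set these values to 0" rule can be sketched in Python. This is a minimal, hypothetical helper, not part of LSF or SAS: the name check_sleep_settings is invented, the values are placeholders rather than the Figure 1 recommendations, and it assumes that every value is given in whole seconds.

```python
# Hypothetical helper: sanity-check proposed values before entering them in RTM.
PARAMETERS = (
    "MBD_SLEEP_TIME",
    "SBD_SLEEP_TIME",
    "MBD_REFRESH_TIME",
    "JOB_SCHEDULING_INTERVAL",
)


def check_sleep_settings(settings):
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    for name in PARAMETERS:
        raw = settings.get(name)
        parts = str(raw).split() if raw is not None else []
        if not parts:
            problems.append(f"{name} is not set")
            continue
        # MBD_REFRESH_TIME may carry two space-separated values, so check each one.
        for part in parts:
            if not part.isdigit() or int(part) == 0:
                problems.append(f"{name} has an invalid value: {part!r}")
    return problems


# Placeholder values for illustration only; they are NOT the Figure 1 values.
proposed = {
    "MBD_SLEEP_TIME": "10",
    "SBD_SLEEP_TIME": "7",
    "MBD_REFRESH_TIME": "10 1",
    "JOB_SCHEDULING_INTERVAL": "0",
}

for problem in check_sleep_settings(proposed):
    print(problem)
```

With these placeholder values, the only complaint is about JOB_SCHEDULING_INTERVAL, which is set to 0.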


2) Increase the number of job slots.

SAS Enterprise Guide and SAS Add-In for Microsoft Office are designed to keep the server session open for the full duration of the client session unless a user explicitly chooses to disconnect from the server. For SAS Grid Manager, this open session means that one job slot on that server is taken.

Therefore, for SAS Enterprise Guide use, you have to increase the number of job slots for each machine (use the MXJ parameter) from a default of 1 per core up to 5 or even 10 per core, depending on volume of usage. This step will increase the number of simultaneous SAS sessions on each grid node.

Interactive workloads are usually sporadic and intermittent, with short CPU bursts followed by periods of inactivity when the user is reviewing the results or exploring the data. Because these jobs are not I/O- or compute-intensive like large batch jobs, more jobs can be safely run on each machine.
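
The sizing arithmetic behind this step is simple enough to sketch in Python. The function name max_job_slots and the 16-core host are assumptions made for illustration; only the per-core ratios (1, 5 and 10) come from the text.

```python
def max_job_slots(cores, slots_per_core):
    """Job slots to configure on one host (the MXJ value)."""
    if cores < 1 or slots_per_core < 1:
        raise ValueError("cores and slots_per_core must be at least 1")
    return cores * slots_per_core


cores = 16  # hypothetical host size
for slots_per_core in (1, 5, 10):
    slots = max_job_slots(cores, slots_per_core)
    print(f"{slots_per_core} per core -> {slots} job slots")
```

On this hypothetical host, moving from the default to 5 slots per core raises the number of simultaneous SAS sessions from 16 to 80.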

3) Implement CPU utilization thresholds for each machine.

Next, it is advisable to implement CPU utilization thresholds for each machine to prevent servers from being overloaded. With this limit in place, even if many users submit CPU-intensive work at the same time, SAS Grid Manager can manage the workload by suspending some jobs and resuming them when resources are available.

The changes in Steps 2 and 3 are made at the host level. In RTM, select Config►LSF►Batch Hosts►default, edit the Max Job Slots value and add the Advanced Attribute ut. See Figure 2.

Figure 2.  Increase the number of job slots and set CPU utilization thresholds.
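
To illustrate the suspend-and-resume idea from Step 3, here is a deliberately simplified Python model. It is not LSF's actual scheduling algorithm: the function next_state, the two thresholds (0.70 and 0.90) and the utilization samples are all assumptions chosen for the example.

```python
def next_state(state, cpu_utilization, resume_below, suspend_above):
    """Return "running" or "suspended" for one job.

    cpu_utilization is the host's CPU utilization as a fraction (0.0 to 1.0).
    """
    if state == "running" and cpu_utilization > suspend_above:
        return "suspended"
    if state == "suspended" and cpu_utilization < resume_below:
        return "running"
    return state


state = "running"
history = []
# Hypothetical utilization samples taken on one host over time.
for ut in (0.50, 0.95, 0.85, 0.60):
    state = next_state(state, ut, resume_below=0.70, suspend_above=0.90)
    history.append(state)

print(history)
```

Using a lower threshold for resuming than for suspending keeps a job from flipping between the two states when utilization hovers around a single limit.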


4) Create dedicated queues.

Even with this tuning, one user can easily use up all of the slots of a grid by starting many SAS Enterprise Guide sessions or by writing code that uses all the available slots for a single SAS session. When a machine runs out of slots, it is closed for use and work is routed to the next available slot. If all machines are closed and no machine has a free slot, no user can get another workspace. It doesn’t matter that the user with many open sessions is not actually using the resources. That user might go to lunch, leaving a session open on a results page while using no CPU and no I/O on the server.

The best way to prevent this is to create a dedicated queue called EGDefault with a suitably low UJOB_LIMIT parameter (for example, 3 slots, as shown in Figure 3). Each user is then limited to 3 concurrent server sessions, whether started from the same client or from different SAS Enterprise Guide instances. If users rely on SAS Enterprise Guide parallel features, the value of UJOB_LIMIT should be higher, provided that proper server sizing has been performed to accommodate the additional resources required.

In RTM, you can create this queue by selecting Config►LSF►Queues►Add. To make this the default queue for SAS Enterprise Guide users, all you have to do is create a grid options set in SAS Management Console and add this EGDefault queue as a grid option to it.

Figure 3.  Set job limits in an EGDefault queue.
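
The effect of a per-user limit can be sketched with a few lines of Python. This is only a toy model of the bookkeeping, assuming the limit of 3 from the example; the function try_start_session and the user names are invented, and it does not model what LSF does with the extra request (for instance, whether it waits or is rejected).

```python
from collections import Counter

UJOB_LIMIT = 3  # per-user limit used in the example in the text


def try_start_session(active_sessions, user, limit=UJOB_LIMIT):
    """Count a new session for user if they are under the limit.

    Returns True when the session is counted, False when the limit is reached.
    """
    if active_sessions[user] >= limit:
        return False
    active_sessions[user] += 1
    return True


active = Counter()

# One user tries to open four sessions; only the first three are accepted.
results = [try_start_session(active, "alice") for _ in range(4)]
print(results)

# A second user is unaffected by the first user's sessions.
bob_started = try_start_session(active, "bob")
print(bob_started)
```

Because the count is kept per user, one person holding idle sessions can no longer take every slot on the grid.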


5) Create other grid options sets as needed.

There will always be ad hoc users or projects that do not fit into default categories (for example, they might be running jobs that have a high priority or jobs that require a large number of computing resources). For users whose jobs need a higher priority or more computing resources, it is just a case of defining a new queue such as EGPower. To prevent misuse, it's common to limit access to this special queue to selected users.

In previous releases, access to such a queue would have been restricted by defining a special user group and then adding it to the USERS parameter in the queue definition. While effective, this has the disadvantage of duplicating user-related management both in metadata and in grid configuration files. With SAS 9.4, it is possible to apply metadata security to grid options sets and keep everything in one place: in metadata.

6)  Set options for other interactive and batch queues.

Finally, if you have other queues, for example ones dedicated to SAS® Data Integration Studio users or to batch processing, put job slot limits there, too, to compensate for the large increase in the Max Job Slots parameter that we made for default hosts. Figure 4 shows the Advanced Attribute PJOB_LIMIT added to a batch queue to enforce the limit of one batch job per physical core on every host.

Figure 4.  Set job slots parameter for batch queue.
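
To see how a per-core limit on the batch queue offsets the larger host-level slot count, consider this small Python sketch. The function batch_slot_cap and the host figures (16 cores, 80 job slots) are assumptions for illustration; only the "one batch job per physical core" ratio comes from the text.

```python
def batch_slot_cap(cores, pjob_limit, host_max_slots):
    """Slots the batch queue may use on one host under a per-core limit."""
    return min(int(cores * pjob_limit), host_max_slots)


cores = 16           # hypothetical host size
host_max_slots = 80  # hypothetical Max Job Slots after Step 2 (5 per core)

batch_slots = batch_slot_cap(cores, pjob_limit=1, host_max_slots=host_max_slots)
print("batch slots on this host:", batch_slots)
print("slots left for interactive sessions:", host_max_slots - batch_slots)
```

On this hypothetical host, batch work stays at 16 concurrent jobs even though the host now advertises 80 slots.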


When you have all queues defined, your final configuration may look like the following:

Figure 5.  All queues at a glance in RTM.


For more details about using SAS Enterprise Guide in a SAS Grid Manager environment, you can refer to my SAS Global Forum 2014 presentation: Effective Use of SAS® Enterprise Guide® in a SAS® 9.4 Grid Manager Environment.

Refer to "Working with Grid Options Sets" in the Grid Computing in SAS® 9.4 documentation for more information on creating grid options sets.

 


About Author

Edoardo Riva

Principal Technical Architect

Edoardo Riva is a Principal Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division.

8 Comments

  1. That is good information on technical issues on using a grid.
    Missing is a design question for when to choose a grid approach.

    Today there are two trends:
    - virtualization has become a goal in its own right.
    There are projects that are proud just of the number of OS boxes being delivered.
    The real business benefits to achieve are a world away.
    - hardware costs are dropping, size is decreasing and capacity is increasing. The speed per processor does not grow, but the number of (logical/virtual) CPUs and internal memory capacity are still growing (Moore's law).

    The ethical question:
    As a relatively cheap box (hardware) can often easily run all the heavy load needed for analytics, why should you split that up into many virtual ones and then add the grid complexity to get those resources back again? Why?
    Does it not make more sense to use a grid when the hardware building blocks are not powerful enough to deliver the needed resources, so you have to combine them?

    • Edoardo Riva

      Hi Jaap,
      thanks for your comment. A full answer may be worth another post on its own, but I can still summarize my thoughts in a few lines here.

      First, sometimes it may be better to segregate some services 'outside' of the heavy-duty compute machine(s), just because their usage of resources is totally different. Think, for example, about the SAS Metadata Server, SAS Middle tier components, SAS Grid Control Server services, etc. In this situation, virtualization may be a good candidate.

      Regardless of this consideration, SAS Grid Manager software can be useful even with a single heavy-analytics box, whether that is a big, expensive, old-style unix server or a modern relatively-cheap multicore linux machine. In this scenario, the added value does not come from aggregating resources, but controlling them. Actually, the part of the name of our software that I like the most is "Manager". SAS Grid Manager can give a SAS Admin complete control over the services and jobs running in the environment, without spending hours on the phone with the IT dept to understand what's going on.

      But, as I wrote, that's worth another post 🙂

      Edoardo

  2. Hi Edoardo,

    Very interesting article. Thank you.

    One thing I tend to do, rather than hard code the MXJ values for hosts, is to create an entry in lsb.resources like this:

    Begin Limit
    NAME=SlotsPerProcessor
    PER_HOST=all
    SLOTS_PER_PROCESSOR=25
    End Limit

    This will allow all hosts to have 25 slots for every CPU core. Thus if you add CPU cores, LSF will automatically adjust and you don't have to manually change the configuration each time.

    I always use the UT settings in the host definitions as you have described to ensure that a host won't be overloaded.

    Cheers
    Paul

    • Edoardo Riva

      Hi Paul,
      thanks for reading and adding your comment.
      It's a suggestion worth adding to the bag of tricks of every SAS Grid Administrator!

      Edoardo

  3. Great post. I want to point out one thing about (1). When you change MBD_SLEEP_TIME, you have to keep in mind that MAX_SBD_FAIL (default 3) kind of depends on it. This is the maximum number of retries before it considers a host unreachable and assumes that all jobs running on that host have exited, and any rerunnable jobs are scheduled to be rerun on another host. Now the interval between the retries is by default set to MBD_SLEEP_TIME/10. So when a network blip or something happens, and the 3 retries happen in 3 secs, there is a good chance that it will consider a host unreachable. So to maintain the same interval as before, one can increase the number of retries itself. For example, change MAX_SBD_FAIL to 15 or so. Hope that helps.
    Krish Putta

    • Edoardo Riva

      Hi Krish,
      thanks for sharing this detail. You are absolutely right.
      I just checked the official documentation and it seems this is a new behavior of the latest LSF release.
      SAS Grid Manager 9.4 and 9.4M1 include LSF version 8.01 and its documentation reads: "The interval between retries is defined by MBD_SLEEP_TIME.".
      SAS Grid Manager 9.4M2 includes LSF version 9.11 and the new documentation reads: "The minimum interval between retries is defined by MBD_SLEEP_TIME/10.".
      Edoardo

      • Jan Klaverstijn

        Edoardo,

        reducing MBD_SLEEP_TIME helped us a great bit in reducing the time it takes to start a grid launched workspace server. Before that the users would be looking at the hourglass for over 15 seconds before EGuide would have connected to a server. Not a good experience.

        Thing is, it seems to be the second part of the parameter (10 1) that makes the difference. I cannot find documentation of that second subparameter. Any clues?

        • Edoardo Riva

          Jan,
          I'm glad that you were able to tune your environment.
          MBD_SLEEP_TIME accepts only one value and, as you have witnessed, it affects how fast jobs are dispatched.
          Another parameter that influences SAS Enterprise Guide is MBD_REFRESH_TIME and this accepts two values. I think this is the one you set to "10 1". MBD_REFRESH_TIME influences how often client software is informed that a job has actually been started. By default its value is "60" , which is equivalent to "60 10". The second value is the one that impacts us. The default value would mean that, even if a workspace server is started immediately, LSF does not notify SAS Enterprise Guide for 10 seconds, so the client keeps waiting.
          You can find a very detailed description of each parameter in the official documentation available at http://support.sas.com/rnd/scalability/platform/index.html. Look for the "Platform LSF Configuration Reference" corresponding to the software release that you have.
          Edoardo
