New year resolution: Don't let old stuff slow down your SAS Grid

As another year goes by, many people think about New Year's resolutions. It's probably the third year in a row that I've promised myself I'd start exercising in the fabulous SAS gym. Of course, I blame the recently concluded holiday season: with all the food-focused events, I couldn't resist, ate way too much, and now feel like I can barely move. The same can happen to a SAS Grid environment, as I learned while helping some colleagues who were debugging a SAS Grid that "ate" too many jobs and often refused to move. In this blog I'll share that experience, in the hope that you can learn how to keep your SAS Grid from slowing down.

The issue

The symptoms our customer encountered were random "freezes" of their entire environment, with no predictable pattern: even with few or no jobs running on the SAS Grid, the LSF daemons stopped responding for minutes at a time. When it happened, not only did new jobs fail to start, it was also impossible to query the environment with simple commands:

$ bhosts

LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...

Then, as unpredictably as the problem started, it also self-resolved and everything went back to normal… until the next time.

Eventually, we were able to find the culprit. It all comes down to the way mbatchd, the LSF Master Batch Daemon, manages its internal events file: lsb.events.

Let's see what this file is and why it can cause trouble.

The record keeper

You can find details about the LSF events file in the official documentation.

What is it?

The LSF batch event log file lsb.events is used to display LSF batch event history and for mbatchd failure recovery. Whenever a host, job, or queue changes status, a record is appended to the event log file.

How is this file managed?

Use MAX_JOB_NUM in lsb.params to set the maximum number of finished jobs whose events are to be stored in the lsb.events log file. Once the limit is reached, mbatchd starts a new event log file. The old event log file is saved as lsb.events.n, with subsequent sequence number suffixes incremented by 1 each time a new log file is started. Event logging continues in the new lsb.events file.

The official documentation does not state much more, but additional online research reveals an interesting detail:

lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. The mbatchd never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files occasionally.

So what?

Well, by default LSF rolls over the events file every 1,000 records (this is the default value up to LSF version 8). This small value is a legacy of the past, when an LSF restart could take forever if the value was too high. In our situation, high job throughput caused the events file to roll over so often that mbatchd produced tons and tons of old events files. And on each rollover, it had to rename every single lsb.events.n file in the log directory to lsb.events.n+1. This became such an overwhelming task that mbatchd stopped responding to everything for minutes, just to finish it.
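To get a feel for the cost, here is a small, self-contained shell simulation of that rename cascade. It uses a temporary directory seeded with fake files; the file count and the cascade loop are an illustration of the behavior described above, not mbatchd's actual code:

```shell
#!/bin/sh
# Simulate the rollover behavior: each rollover renames EVERY old
# lsb.events.n file to lsb.events.n+1, so the work grows with the
# number of old files left in the log directory.
logdir=$(mktemp -d)

# Seed 500 fake archived event files: lsb.events.1 .. lsb.events.500
i=1
while [ "$i" -le 500 ]; do
  : > "$logdir/lsb.events.$i"
  i=$((i + 1))
done

# One rollover: shift every file up by one, highest suffix first
n=500
while [ "$n" -ge 1 ]; do
  mv "$logdir/lsb.events.$n" "$logdir/lsb.events.$((n + 1))"
  n=$((n - 1))
done
: > "$logdir/lsb.events.1"   # the freshly rotated current file

# 500 renames plus one file creation, for a single rollover
count=$(ls "$logdir" | wc -l)
echo "$count"
rm -rf "$logdir"
```

With thousands of leftover files and a rollover every 1,000 records, frequent rollovers can keep mbatchd busy doing little but renames, which matches the freezes we observed.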

Do not fall into the same trap

To avoid the problem we encountered, I encourage you to implement a good maintenance practice: monitor the LSF log directory (by default <LSF_TOP>/work/<cluster name>/logdir) and periodically archive or delete older files.
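As a sketch of such a maintenance job, the snippet below compresses rotated event files older than 30 days. The directory here is a temporary one seeded with fake files for demonstration; in practice you would point it at your real logdir, and the 30-day threshold is just an example, not an official recommendation:

```shell
#!/bin/sh
# Demo setup: a temporary stand-in for <LSF_TOP>/work/<cluster name>/logdir
LOGDIR=$(mktemp -d)
for i in 1 2 3 4 5; do : > "$LOGDIR/lsb.events.$i"; done
# Make two of the fake archives look old (January 2023)
touch -t 202301010000 "$LOGDIR/lsb.events.4" "$LOGDIR/lsb.events.5"

# The actual maintenance step: compress rotated lsb.events.n files
# older than 30 days, leaving the live lsb.events file untouched.
find "$LOGDIR" -name 'lsb.events.*' -mtime +30 -exec gzip {} \;

gzcount=$(ls "$LOGDIR" | grep -c '\.gz$')
echo "$gzcount archives compressed"
rm -rf "$LOGDIR"
```

A cron entry running a command like this keeps the directory small, so even frequent rollovers stay cheap.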

With older versions of LSF, this issue can be prevented with the tuning described in this blog. All of the versions of LSF that have shipped with SAS 9.4 have solved this problem, according to the LSF documentation:

Switching and replaying the events log file, lsb.events, is much faster. The length of the events file no longer impacts performance.

That's why the latest documentation suggests keeping a higher number of records in the events file before triggering a rollover: set MAX_JOB_NUM in lsb.params to 10000, or even 100000 for high-throughput environments.
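As a sketch, the setting goes in the Parameters section of lsb.params; the value 100000 below is the documentation's suggestion for high-throughput environments, not a universal recommendation:

```
Begin Parameters
MAX_JOB_NUM = 100000
End Parameters
```

After editing lsb.params, the change can be applied with `badmin reconfig`.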

With a bit of maintenance, your SAS Grid can perform much better. I really have to do the same and start the new year at the gym!


About Author

Edoardo Riva

Principal Technical Architect

Edoardo Riva is a Principal Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division.
