This post continues an earlier post entitled “To grid or not to grid?” One of the reasons given there to say “yes to SAS Grid” is the chance to improve performance by converting your existing SAS processes to a distributed processing format. If improving the performance of individual SAS applications is one of your reasons to implement SAS Grid Manager, please read on.
To start with, all of your existing SAS jobs can run on your new SAS Grid, but not all of your existing SAS jobs can be turned into distributed processing applications that can be run simultaneously across multiple nodes of the SAS Grid. For example, processes that rely heavily on OLAP processing do not lend themselves to parallelization. Statistical analysis tasks that need to create a matrix in memory cannot be parallelized either.
When identifying SAS applications to modify for distribution across a SAS Grid, the first thing to look for is jobs that take many hours or even days to complete. In addition to long execution times, there are several profiles of SAS applications or jobs that are good candidates for running in parallel across a SAS Grid.
One profile is a SAS job (or a combination of multiple SAS jobs) that takes an extraordinary amount of time to execute because it is processing a large amount of data (hundreds of gigabytes, approaching terabytes). This type of application requires running the same SAS task or tasks, over and over, on all of the data, on different subsets of the data, or both.
There are many examples of SAS tasks that fit this profile. Here are a few:
- Large steps within a SAS job that process hundreds of millions of rows, for example, large DATA steps, statistical simulations such as Monte Carlo methods, or mining of massive data files.
- Any SAS task that does BY GROUP processing, in particular forecasting steps, frequency counts and many statistical tasks.
In this profile, you would break up the large SAS task into multiple subtasks, start a distributed SAS process for each subtask and run them simultaneously. A join data step at the end then combines the subtask outputs into the final results. You will need to experiment with how many subtasks to run so that the time required for the join data step does not negate the performance gains of the simultaneous subtasks.
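The split, run and join pattern described above can be sketched in a few lines. This is a language-neutral illustration in Python, not SAS Grid Manager code: the sample rows, the function names and the thread pool standing in for grid nodes are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: (region, product) rows standing in for a large table.
ROWS = [
    ("east", "a"), ("east", "b"), ("east", "a"),
    ("west", "a"), ("west", "c"),
    ("north", "b"),
]

def split_by_group(rows):
    """Partition rows by the first column, much like a BY variable."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)

def frequency_count(item):
    """Subtask: count the values within a single group."""
    key, values = item
    return key, Counter(values)

def run_split_join(rows, max_workers=3):
    groups = split_by_group(rows)
    # Each group is an independent subtask, so the subtasks can run simultaneously.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(frequency_count, groups.items()))
    # Join step: combine the per-group outputs into one final result.
    return {key: dict(counts) for key, counts in partials}

RESULT = run_split_join(ROWS)
```

In this sketch, `max_workers` plays the role of the number of simultaneous subtasks. The join step touches every partial result, so its cost grows as the work is split more finely, which is why the number of subtasks is worth tuning by experiment.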
Jobs with independent tasks
Another profile is a SAS job made up of many independent SAS tasks that run sequentially by default but could easily be run in parallel.
- A good example of this is all of the modeling done with a SAS Enterprise Miner project. SAS Enterprise Miner is SAS Grid aware and will generate the code to do your model training in parallel using SAS Grid.
- SAS Enterprise Guide users can also take advantage of the SAS Grid in the flow of their jobs. For more details on how to do that, please review this SAS Global Forum paper: Effective Usage of SAS® Enterprise Guide® in a SAS® 9.4 Grid Manager Environment.
- SAS Risk Dimensions can be deployed to run independent tasks in parallel.
- SAS Data Integration Studio is also SAS Grid aware and can therefore generate the SAS code to automatically process your data transformations in parallel using SAS Grid.
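To make the idea of independent tasks concrete, here is a minimal Python sketch of submitting tasks that share no inputs or outputs and collecting their results as they finish. It is not what any of the SAS products above generate: the three task functions and the `run_independent` helper are hypothetical, and a thread pool again stands in for grid nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical independent tasks: none reads what another writes.
def score_segment_a():
    return sum(range(1, 11))

def score_segment_b():
    return max(3, 7, 5)

def score_segment_c():
    return len("independent")

TASKS = {
    "segment_a": score_segment_a,
    "segment_b": score_segment_b,
    "segment_c": score_segment_c,
}

def run_independent(tasks, max_workers=4):
    """Run every task at once and return {task name: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its task name so results can be labeled.
        futures = {pool.submit(fn): name for name, fn in tasks.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

PARALLEL = run_independent(TASKS)
SEQUENTIAL = {name: fn() for name, fn in TASKS.items()}
```

Because the tasks are independent, the parallel run returns the same results as the sequential run; only the elapsed time differs.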
SAS Code Analyzer
For both of the above profiles, you can use the SAS Code Analyzer, released with SAS 9.2, to help determine whether the SAS code you are running is a candidate for distributed processing. The SCAPROC procedure assists with the difficult and tedious task of determining which steps can be run in parallel. This is especially helpful with legacy SAS jobs. Details on how to use this procedure can be found in this SAS Global Forum paper: Introducing the SAS® Code Analyzer.
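The underlying question of which steps can run in parallel comes down to which steps depend on the outputs of others. The Python sketch below illustrates that reasoning only. It does not use or reproduce the SCAPROC procedure, and the step names, data set names and the `parallel_waves` helper are hypothetical. As a simplification, it ignores the case where one step overwrites a data set that another step still needs to read.

```python
# Hypothetical job: each step lists the data sets it reads and writes.
STEPS = [
    {"name": "extract_sales", "inputs": set(), "outputs": {"sales"}},
    {"name": "extract_costs", "inputs": set(), "outputs": {"costs"}},
    {"name": "summarize_sales", "inputs": {"sales"}, "outputs": {"sales_sum"}},
    {"name": "summarize_costs", "inputs": {"costs"}, "outputs": {"costs_sum"}},
    {"name": "report", "inputs": {"sales_sum", "costs_sum"}, "outputs": {"report"}},
]

def parallel_waves(steps):
    """Group steps into waves; all steps in one wave can run at the same time."""
    produced_by_steps = set().union(*(s["outputs"] for s in steps))
    available = set()  # data sets written by steps that have already run
    remaining = list(steps)
    waves = []
    while remaining:
        # A step is ready when each input is already written, or is an
        # external input that no step in the job produces.
        ready = [
            s for s in remaining
            if all(i in available or i not in produced_by_steps for i in s["inputs"])
        ]
        if not ready:
            raise ValueError("circular dependency between steps")
        waves.append([s["name"] for s in ready])
        for s in ready:
            available |= s["outputs"]
        remaining = [s for s in remaining if s not in ready]
    return waves

WAVES = parallel_waves(STEPS)
```

Here the two extract steps form the first wave, the two summarize steps the second, and the report step runs alone at the end because it needs both summaries.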
In summary, all existing SAS jobs can run on your SAS Grid. However, not all SAS processes are good candidates for distributed processing due to the inherent sequential nature of the analytics or flow of the program logic. Please work with your SAS account team to have a technical review of your planned SAS applications, especially if improved performance of individual SAS applications is your primary reason to go to a SAS Grid.