Sorting data in SAS: can you skip it?

TL;DR

The next time that you find yourself writing a PROC SORT step, verify that you're working with the SAS Base engine and not a database. If your data is in a database, skip the SORT!

The details: When to skip the PROC SORT step

Many SAS procedures allow you to group multiple analyses into a single step through use of the BY statement. The BY statement groups your data records by each unique combination of your BY variables (yes, you can have more than one), and performs the PROC's work on each distinct group.

When using the SAS Base data engine (that's your SAS7BDAT files), BY-group processing requires data records to be sorted, or at least pre-grouped, according to the values of the BY variables. The reason for this is that the Base engine accesses data records sequentially. When a SAS procedure is performing a grouped analysis, it expects to encounter all records of a group in a contiguous sequence. What happens when records are out of order? You might see an error like this:

ERROR: Data set SASHELP.CLASS is not sorted in ascending sequence. 
The current BY group has Sex = M and the next BY group has Sex = F.

I first described this in 2010: Getting out of SORTs with SAS data.

In a recent post, Rick Wicklin discussed a trick you can use to tell SAS that your data are already grouped, but the group values might not be in a sorted order. The NOTSORTED option lets you avoid a SORT step when you can promise that SAS won't encounter different BY group values interleaved across the data records.

Sorting data is expensive. In data tables that have lots of records, the sort processing requires tremendous amounts of temporary disk space for its I/O operations -- and I/O usually the slowest part of any data processing. But here's an important fact for SAS programmers: a SORT step is required only for SAS data sets that you access using the Base engine*. If your data resides in database, you do not need to sort or group your data in order to use BY group processing. And if you do sort the data first (as many SAS programmers do, out of habit), you're wasting time.

I'm going to use a little SAS trick to illustrate this in a program. Imagine two copies of the ubiquitous CLASS data set, one in a Base library and one in a database library. In my example I'll use the SPDE engine as the database, even though it's not a separate database server. (Yes! You can do this too! SPDE is part of Base SAS.)

/* these 3 lines will create a temp space for your */
/* SPDE data */
/* See: https://blogs.sas.com/content/sasdummy/use-dlcreatedir-to-create-folders/ */
options dlcreatedir;
libname t "%sysfunc(getoption(WORK))/spde"; 
libname t clear;
 
/* assign an SPDE library. Works like a database! */
libname spde SPDE "%sysfunc(getoption(WORK))/spde";
/* copy a table to the new library */
data spde.class;
 set sashelp.class;
run;
 
/* THIS step produces an error, because CLASS */
/* is not sorted by SEX */
proc reg data=sashelp.class;
    by sex;
    model age=weight;
run;
quit;
 
/* THIS step works correctly.  An implicit          */
/* ORDER BY clause is pushed to the database engine */
proc reg data=spde.class;
    by sex;
    model age=weight;
run;
quit;

Why does the second PROC REG step succeed? It's because the requirement for sorted/grouped records is passed through to the database using an implicit ORDER BY clause. You don't see it happening in your SAS log, but it's happening under the covers. Most SAS procedures are optimized to push these commands to the database. Most databases don't really have the concept of sorted data records; they return records in whatever sequence you request. Returning sorted data from a database doesn't have the same performance implications as a SAS-based PROC SORT step.

How does SAS Enterprise Guide generate optimized code?

Do you use SAS Enterprise Guide tasks to build your analyses? If so, you might have noticed that the built-in tasks go to great lengths to guarantee that your task will encounter data that are properly sorted. Consider this setup for the Linear Regression task:

Here's an example of the craziness that you'll see from the Linear Regression task when you have a BY variable in a Base SAS data set. There is "defensive" keep-and-sort code in there, because we want the task to work properly for any data scenario.

/* -------------------------------------------------------------------
   Determine the data set's type attribute (if one is defined)
   and prepare it for addition to the data set/view which is
   generated in the following step.
   ------------------------------------------------------------------- */
DATA _NULL_;
 dsid = OPEN("SASHELP.CLASS", "I");
 dstype = ATTRC(DSID, "TYPE");
 IF TRIM(dstype) = " " THEN
  DO;
  CALL SYMPUT("_EG_DSTYPE_", "");
  CALL SYMPUT("_DSTYPE_VARS_", "");
  END;
 ELSE
  DO;
  CALL SYMPUT("_EG_DSTYPE_", "(TYPE=""" || TRIM(dstype) || """)");
  IF VARNUM(dsid, "_NAME_") NE 0 AND VARNUM(dsid, "_TYPE_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_TYPE_ _NAME_");
  ELSE IF VARNUM(dsid, "_TYPE_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_TYPE_");
  ELSE IF VARNUM(dsid, "_NAME_") NE 0 THEN
   CALL SYMPUT("_DSTYPE_VARS_", "_NAME_");
  ELSE
   CALL SYMPUT("_DSTYPE_VARS_", "");
  END;
 rc = CLOSE(dsid);
 STOP;
RUN;
 
/* -------------------------------------------------------------------
   Sort data set SASHELP.CLASS
   ------------------------------------------------------------------- */
PROC SORT
 DATA=SASHELP.CLASS(KEEP=Age Height Sex &_DSTYPE_VARS_)
 OUT=WORK.SORTTempTableSorted &_EG_DSTYPE_
 ;
 BY Sex;
RUN;
TITLE;
TITLE1 "Linear Regression Results";
PROC REG DATA=WORK.SORTTempTableSorted
  PLOTS(ONLY)=ALL
 ;
 BY Sex;
 Linear_Regression_Model: MODEL Age = Height
  /  SELECTION=NONE
 ;
RUN;
QUIT;

This verbose code drives experienced SAS programmers crazy. But unlike a SAS programmer, the SAS Enterprise Guide code generator does not understand all of the nuances of your data, and thus can't guess what steps can be skipped. (SAS macro programmers: you know what I'm talking about. Think about all of the scenarios you have to code for/defend against in a generalized macro.)

And when running the same task with a database table? SAS Enterprise Guide detects the table source is a database, and builds a much more concise version:

TITLE;
TITLE1 "Linear Regression Results";
PROC REG DATA=SPDE.CLASS
  PLOTS(ONLY)=ALL
 ;
 BY Sex;
 Linear_Regression_Model: MODEL Age = Weight
  /  SELECTION=NONE
 ;
RUN;
QUIT;

Consider the data source, and -- if you can -- skip the SORT!

* The Base engine is not the only sequential data engine in SAS, but it's the most common.

15 Comments

Rick Wicklin on February 4, 2016 12:44 pm

Good points, Chris. You say that you can "skip the sort" because many databases can serve up ordered BY groups to the SAS procedures. In addition, if you aren't reading the data sequentially,the NOTSORTED option doesn't even make sense! As more data are stored in distributed file systems (such as hadoop and "cloud computing" platforms) the new normal will be that data does not have an "intrinsic" order. Perhaps we are seeing the beginning of the end for the NOTSORTED option.
- Chris Hemedinger on February 4, 2016 12:48 pm
  
  Thanks Rick -- as with all things SAS, I think we'll see PROC SORT and the NOTSORTED option in our SAS programs for years to come. But we'll see an increasing number of scenarios where it's not needed, and the savvy SAS programmer will need to know when that makes sense and when it's a waste.
Peter on February 5, 2016 9:01 am

Chris

If an analysis request would fail without the sort why not allow that failure - to advise the analyst to sort his input to the analysis?
As a side issue, I notice your example using the value of the work option (to retrieve that path). Why is this more appropriate than the pathname() function ( - which would seem more appropriate to me)

Peter
- Chris Hemedinger on February 5, 2016 9:36 am
  
  Peter, if I understand you you're suggesting that the SAS error message tell the user exactly what steps to take to avoid the error. Good idea! But I think it strays into SAS trying to guess what you're after, and that's not always obvious.
  
  Re: the pathname function - good point! I think it's a habit I have to reference the WORK option value. Pathname might be a better practice.
  - Peter on February 9, 2016 6:55 pm
    
    I tried to suggest - have the standard error message could tell the analyst that the data are not in the required order and that would be "good enough".
    Perhaps I misunderstand your blog - Does it really indicate SAS Enterprise Guide seeks to satisfy an analyst's request to use BY- processing on data which is not in that order? (without checking with the customer, it will sort that data!)
    .
    - Chris Hemedinger on February 9, 2016 9:38 pm
      
      Hi Peter,
      
      Yes, SAS Enterprise Guide tasks will sort the data if needed to make BY groups work -- whether that's what you want or not. EG tasks don't want non-programmers to stumble over errors that could be confusing, so the code here is ultra-defensive.
      
      If your data is in a DBMS (or almost anything other than a Base SAS library), EG tasks will NOT sort the data for you in a PROC SORT/SQL preamble, because the ORDER BY is implicitly specified to the database engine.
Peter Lancashire on February 9, 2016 7:17 am

PROC SQL can read SAS data sets in much the same way as from a relational database. SAS can index its data sets. That begs the question: why do we need to sort manually at all? Why does SAS not just access its own data sets in the same way it does via PROC SQL when it sees a BY statement? Let SAS figure out whether a physical sort or a temporary index is better. Like a database.

Anything that makes SAS a less "resistant" coding environment has to be good. Give the dinosaurs an option to make it harder; the rest of us can just breathe a sigh of relief that SAS works slightly more like the rest of the software we use. You could even have a new statement "ORDER BY", familiar to SQLers, to do all this and leave the antique BY in place.
- Chris Hemedinger on February 9, 2016 9:34 am
  
  Hi Peter,
  
  The notion of "observation sequence" is intrinsic in many SAS procedures and of course in the DATA step. Since SAS has its roots as statistical software, but later evolved into a powerful data manipulation language, it's not surprising that the concept of "sequence" is so ingrained in the language. Ryan K from Zencos recently posted a nice summary of DATA step vs. SQL.
  
  If you haven't already, you might want to check out PROC DS2, which brings a database-savoir to the concept of data processing.
Philip C on February 15, 2016 3:53 pm

So SPDE does "implicit sorting". ( http://support.sas.com/rnd/scalability/spde/syntax.html ) I take it this "implicit sorting" is faster than a regular sorting. Where can I learn more about implicit sorting.
- Chris Hemedinger on February 15, 2016 4:07 pm
  
  Philip, I think the "implicit" means that the engine will ORDER the data as needed if BY processing is involved. Not sure it's "faster" -- just automatic when your SAS program calls for BY groups. And since the sorting takes place in the SPD engine, you're not dealing with moving data records to scratch space to rearrange them.
  - Philip C on February 15, 2016 4:17 pm
    
    ok, but you've made me curious about SPDE.
    - Chris Hemedinger on February 15, 2016 4:19 pm
      
      Excellent. Then my work is done here.
Douglas Dame on March 7, 2016 1:14 pm

(1) It's not ALWAYS true that pushing a sort to the database will save time. In my previous job ... meaning I can't get to the doco, so this is from memory ... I had one proc sql data extract from a SQL-Server database that seemed unexpectedly slow, with a list of 6 vars in the Order By. I removed the Order By, and sorted the return dataset (maybe 4 million records) in SAS, at about 1/4 of the elapsed wall time. Surprised by these results, I tested it both ways three times, and the results were consistent. Taking out the Order By was the only change I made to the proc sql code. I remain rather baffled by this.

(2) AFAIK, you can still avoid a physical sort on a large SAS dataset by indexing it first, then using a BY statement on whatever you do next. This incurs two costs, one to index the data and then a second because Mr SAS has to jump around when retrieving the data instead of doing its normal very fast "full table scan" approach. If you are only reading the data a few times, this usually works faster than a sort. On a dataset that's used a lot, at some point the cumulative effect of the jump-around read ends up consuming more time (than one sort and then a bunch of ful table scans.)

But sometimes a dataset is just too big to sort with the skimpy resources your IT dept will give you, so it's good to know work-arounds.
Rahul Gupta on November 14, 2016 2:16 am

Hi Chris

How can I stop database to sort the data. I want SAS to sort my data. I do not want implicit sorting.
Is there any option to do so?

Regards
Rahul
- Chris Hemedinger on November 14, 2016 8:57 am
  
  If you submit a PROC with a BY statement, the "order by" directive will push to the database automatically. To keep the sorting in SAS (which you might want to do if you want a specific collating sequence), you would need to pull the working set into SAS first (with a PROC SQL, for example), and then sort the data set with PROC SORT and whatever options you need. Also, remember that some grouping operations do not require sorted data, such as PROC SUMMARY/MEANS with the CLASS statement. In general, if the PROC offers a CLASS statement you can group your calculations across one or more categories without first sorting the data.

Blogs