Getting out of SORTs with SAS data

5

Why does this SAS program produce an error?

proc means data=sashelp.cars
   mean median min max;
by origin;
run;

It's because SASHELP.CARS is a SAS data set, and SAS data sets observations are stored and processed sequentially, and a BY group operation requires that the observations are already grouped and sorted by the BY group values. Here is the error, which pretty much explains it:

ERROR: Data set SASHELP.CARS is not sorted in ascending sequence.
The current BY group has Origin = USA and the next BY group has Origin = Asia.

One remedy is to sort the data first:

proc sort data=sashelp.cars
  out=work.cars;
by origin;
proc means data=work.cars
  mean median min max;
by origin;
run;

That works great, but has two downsides. First, it's an extra step that takes time. For small data it's not a big deal, but sorting a large data source can be expensive. The other problem is that it creates a copy of the data set (temporary, in this example). If the data set is large, the copy will also be large and can occupy a lot of storage space.

Here's a variation that creates an SQL data view instead of an actual sorted copy of the data. It solves the storage problem, because a data view is really a set of instructions for how to access the data, and doesn't actually contain a copy of the data. In this example, the view definition says to access the data observations in a sequence sorted by the values of the ORIGIN column:

proc sql;
create view work.cars as
  select * from sashelp.cars
    order by origin;
quit;
proc means data=work.cars
  mean median min max;
by origin;
run;

But when you access the view, SAS still has to sort the data according to the view definition, so you still have sorting work that must be performed in SAS, and that takes time. In fact, this is why SAS documentation advises against using ORDER BY in a view definition.

Now, what if your data source is in a database, such as Teradata or Oracle or even Microsoft Access? Database systems do not typically store and access records sequentially as SAS does, so you do not need to sort the data prior to performing a grouping operation. When you access the data source using a SAS/ACCESS product (for Oracle, Teradata, or ODBC for example), SAS automatically passes the correct ORDER BY instruction to the database when you run a SAS procedure. For example, when you run this program:

proc means data=ORACLE2.CARS
  mean median min max;
by origin;
run;

the SAS/ACCESS engine converts this to the proper database query, like this (and you can see this for yourself by enabling SASTRACE options):

SELECT "MANUFACTURER", "MODEL", "TYPE", "PRICE", "CITYMPG", "HIGHWAYMPG", "ENGINESIZE", "HORSEPOWER", "FUELTANK", "PASSENGERS", "WEIGHT", "ORIGIN" FROM egtest.CARS_1993 ORDER BY "ORIGIN" NULLS FIRST
This is important, because database tables can be very large and you really don't want to ask SAS to pull the table, sort it, and then perform the analysis.

The mechanism that SAS uses to "delegate" such work to the database system is called "pass-through" or sometimes "push-down". It's how SAS gets your database to do the heavy I/O work that the database is made for, before SAS conducts the analysis work that SAS is designed for.

Note that SAS has some special offerings for certain databases that make the database perform even more work, even SAS analytics, without having to move the data around.

Tags SAS tips
Share

About Author

Chris Hemedinger

Director, SAS User Engagement

+Chris Hemedinger is the Director of SAS User Engagement, which includes our SAS Communities and SAS User Groups. Since 1993, Chris has worked for SAS as an author, a software developer, an R&D manager and a consultant. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies

5 Comments

  1. another downside of PROC SORT seems to be, that if you sort a permanent dataset and add a WHERE conditions without specifying a OUT= dataset, then you end up with a permanent dataset, that is missing the levels not included in the WHERE statement. --> very dangerous. once I had to read in all my data again into the permanent dataset from the original data sources because of this.

  2. Yes, the "sort in place" syntax is handy, but dangerous in this way. You can protect yourself by making the important data set read-only either in the file system, or by assigning the libname with ACCESS=READONLY.

  3. I am facing problem while downloading data sets at http://ftp.sas.com/
    samples/A56936. The link is from SAS.publishing.SAS.9.2.SQL.Procedure.Users.Guide.Mar.2008.pdf
    the link saved as directed in folder crook and the procedure in editor window is;
    //libname new 'C:PARULSAS';
    filename trans 'C:PARULcrook';

    proc cimport library=new infile=trans;
    run;//
    the error in logged is:
    Libref NEW was successfully assigned as follows:
    Engine: V9
    Physical Name: C:PARULSAS
    2 filename trans 'C:PARULcrook';
    3
    4 proc cimport library=new infile=trans;
    5 run;

    ERROR: Invalid file, C:PARULcrook.
    NOTE: The SAS System stopped processing this step because of errors.
    NOTE: PROCEDURE CIMPORT used (Total process time):
    real time 0.00 seconds
    cpu time 0.00 seconds

  4. Pingback: Implement BY processing for your entire SAS program - The SAS Dummy

  5. Pingback: Sorting data in SAS: can you skip it? - The SAS Dummy

Back to Top