Why does this SAS program produce an error?
proc means data=sashelp.cars mean median min max; by origin; run;
It's because SASHELP.CARS is a SAS data set, and SAS data sets observations are stored and processed sequentially, and a BY group operation requires that the observations are already grouped and sorted by the BY group values. Here is the error, which pretty much explains it:
ERROR: Data set SASHELP.CARS is not sorted in ascending sequence.
The current BY group has Origin = USA and the next BY group has Origin = Asia.
One remedy is to sort the data first:
proc sort data=sashelp.cars out=work.cars; by origin; proc means data=work.cars mean median min max; by origin; run;
That works great, but has two downsides. First, it's an extra step that takes time. For small data it's not a big deal, but sorting a large data source can be expensive. The other problem is that it creates a copy of the data set (temporary, in this example). If the data set is large, the copy will also be large and can occupy a lot of storage space.
Here's a variation that creates an SQL data view instead of an actual sorted copy of the data. It solves the storage problem, because a data view is really a set of instructions for how to access the data, and doesn't actually contain a copy of the data. In this example, the view definition says to access the data observations in a sequence sorted by the values of the ORIGIN column:
proc sql; create view work.cars as select * from sashelp.cars order by origin; quit; proc means data=work.cars mean median min max; by origin; run;
But when you access the view, SAS still has to sort the data according to the view definition, so you still have sorting work that must be performed in SAS, and that takes time. In fact, this is why SAS documentation advises against using ORDER BY in a view definition.
Now, what if your data source is in a database, such as Teradata or Oracle or even Microsoft Access? Database systems do not typically store and access records sequentially as SAS does, so you do not need to sort the data prior to performing a grouping operation. When you access the data source using a SAS/ACCESS product (for Oracle, Teradata, or ODBC for example), SAS automatically passes the correct ORDER BY instruction to the database when you run a SAS procedure. For example, when you run this program:
proc means data=ORACLE2.CARS mean median min max; by origin; run;
the SAS/ACCESS engine converts this to the proper database query, like this (and you can see this for yourself by enabling SASTRACE options):
SELECT "MANUFACTURER", "MODEL", "TYPE", "PRICE", "CITYMPG", "HIGHWAYMPG",
"ENGINESIZE", "HORSEPOWER", "FUELTANK", "PASSENGERS", "WEIGHT",
"ORIGIN" FROM egtest.CARS_1993 ORDER BY "ORIGIN" NULLS FIRST
This is important, because database tables can be very large and you really don't want to ask SAS to pull the table, sort it, and then perform the analysis.
The mechanism that SAS uses to "delegate" such work to the database system is called "pass-through" or sometimes "push-down". It's how SAS gets your database to do the heavy I/O work that the database is made for, before SAS conducts the analysis work that SAS is designed for.
Note that SAS has some special offerings for certain databases that make the database perform even more work, even SAS analytics, without having to move the data around.