One great thing about being a SAS programmer is that you never run out of new things to learn. SAS often gives us a variety of methods to produce the same result. One good example of this is the DATA step and PROC SQL, both of which manipulate data. The DATA step is extremely powerful and flexible, but PROC SQL has its advantages too. Until recently, my knowledge of PROC SQL was pretty limited. But for the sixth edition of The Little SAS Book, we decided to move the discussion of PROC SQL from an appendix (who reads appendices?) to the body of the book. This gave me an opportunity to learn more about PROC SQL.
When developing my programs, I often find myself needing to calculate the mean (or sum, or median, or whatever) of a variable, and then merge that result back into my SAS data set. That would generally involve at least a couple PROC steps and a DATA step, but using PROC SQL I can achieve the same result all in one step.
Consider this example using the Cars data set in the SASHELP library. Among other things, the data set contains the 2004 MSRP for over 400 models of cars of various makes and car type. Suppose you want a data set which contains the make, model, type, and MSRP for the model, along with the median MSRP for all cars of the same make. In addition, you would like a variable that is the difference between the MSRP for that model, and the median MSRP for all models of the same make. Here is the PROC SQL code that will create a SAS data set, MedianMSRP, with the desired result:
*Create summary variable for median MSRP by Make; PROC SQL; CREATE TABLE MedianMSRP AS SELECT Make, Model, Type, MSRP, MEDIAN(MSRP) AS MedMSRP, (MSRP - MEDIAN(MSRP)) AS MSRP_VS_Median FROM sashelp.cars GROUP BY Make; QUIT;
The CREATE TABLE clause simply names the SAS data set to create, while the FROM clause names the SAS data set to read. The SELECT clause lists the variables to keep from the old data set (Make, Model, Type, and MSRP) along with specifications for the new summary variables. The new variable, MedMSRP, is the median of the old MSRP variable, while the new variable MSRP_VS_Median is the MSRP minus the median MSRP. The GROUP BY clause tells SAS to do the calculations within each value of the variable Make. If you leave off the GROUP BY clause, then the calculations would be done over the entire data set. When you run this code, you will get the following message in your SAS log telling you it is doing exactly what you wanted it to do:
NOTE: The query requires remerging summary statistics back with the original data.
The following PROC PRINT produces a report showing just the observations for two makes – Porsche and Jeep.
PROC PRINT DATA = MedianMSRP; TITLE '2004 Car Prices'; WHERE Make IN ('Porsche','Jeep'); FORMAT MedMSRP MSRP_VS_Median DOLLAR8.0; RUN;
Here are the results:
Now PROC SQL aficionados will tell you that if all you want is a report and you don’t need to create a SAS data set, then you can do it all in just the PROC SQL step. But that is the topic for another blog!