If you've ever tried to use PROC FREQ to create a frequency table of two character variables, you know that by default the categories for each variable are displayed in alphabetical order. A different order is sometimes more useful. For example, consider the following two-way table for the smoking status and weight status of 5,167 patients in a medical study, as recorded in the Sashelp.Heart data set, which is distributed with SAS software:
proc freq data=sashelp.heart; tables Smoking_Status*Weight_Status / norow nocol nopct; run;
The alphabetical order of the categories is not the best choice for these data. A better table would order the weight categories as "Underweight," "Normal," and "Overweight." Similarly, the smoking categories should be ordered from 'Non-smoker' to "Very Heavy (> 25)."
In most introductory SAS courses—and at almost every SAS conference—you will hear recommendations about how to encode ordinal variables when you create a SAS data set. Two common ways to make specify the order of categories are:
- Create (or sort) the data in the order that you want the frequency table to appear. Use the ORDER=DATA option on the PROC FREQ statement to instruct the procedure that it should order categories as they appear in the data set.
- Encode the data as a numerical variable with values 1, 2, 3, ..., and use a SAS format to provide the text that will be displayed for each category. The FREQ procedure will order the (numerical) variables by 1, 2, 3, ..., but will display the formatted values as the headers for the frequency tables.
The first approach is simple, but it has limitations. Often the data set is handed to you; you don't get to create it. You can use PROC SORT to create a sorted copy of the data, but you still need to provide variables with values such as 1, 2, 3, ..., in order to sort the data. Furthermore, if some joint categories are empty (for example, there are no underweight non-smokers in the data), then the ORDER=DATA option might not accomplish what you want. (However, a workaround is to create a weight variable and using the ZEROS option in the WEIGHT statement to include empty levels.)
The second approach requires that you add sorting variables to the data set and apply a format to those variables. First, define the format:
proc format; value WtFmt 1 = 'Underweight' 2 = 'Normal' 3 = 'Overweight'; value SmFmt 1 = 'Non-smoker' 2 = 'Light (1-5)' 3 = 'Moderate (6-15)' 4 = 'Heavy (16-25)' 5 = 'Very Heavy (> 25)'; run;
Next, create new sorting variables and apply the formats to "recreate" the original variables:
data Heart / view=Heart; format Smoking_Cat SmFmt. Weight_Cat WtFmt.; set sashelp.heart; select (Weight_Status); when ('Underweight') Weight_Cat=1; when ('Normal') Weight_Cat=2; when ('Overweight') Weight_Cat=3; when (' ') Weight_Cat=.; end; select (Smoking_Status); when ('Non-smoker') Smoking_Cat=1; when ('Light (1-5)') Smoking_Cat=2; when ('Moderate (6-15)') Smoking_Cat=3; when ('Heavy (16-25)') Smoking_Cat=4; when ('Very Heavy (> 25)') Smoking_Cat=5; when (' ') Smoking_Cat=.; end; run;
I have created a data set view rather than a data set in order to save storage space, which might be important for large data sets. You can now call PROC FREQ, as follows:
proc freq data=heart; tables Smoking_Cat*Weight_Cat / norow nocol nopct; run;
By using this technique, you can specify any order for the categories of a contingency table. The technique extends to other analyses as well.
For more on using PROC FORMAT for data analysis, see the following articles:
- Specify the colors of groups in SAS statistical graphics by using data attributes. [Note: In SAS 9.4, an alternative is to use the STYLEATTRS statement to set the style attributes for groups for a single call to the SGPLOT procedure.]
- Count the number of missing values for each variable.