Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful: most of the data remain the same; only the recoded variables are different.
There is an alternative approach: You can use PROC FORMAT in Base SAS to define a custom SAS format. When you use PROC FORMAT, the data are never changed, but all the SAS reports and analyses can display the formatted values instead of the raw data values. You can use the same format for multiple data sets. You can even define multiple formats to analyze the same variable in multiple ways.
An example of using a format to recode a variable
In the simplest situation, a recoding of a variable converts each raw value to an easier-to-interpret value. For example, suppose that the gender variable for patients is recorded as a binary 0/1 variable. This is a terrible choice because it is not clear whether 0 represents males or females. The following example shows the typical IF-THEN logic for recoding 0 as "Female" and 1 as "Male" by creating a new data set and a new variable:
/* original data: Gender is binary variable, which is hard to understand! */ data Have; input Gender @@; datalines; 1 0 0 0 1 1 0 0 . 1 1 0 0 0 0 1 1 1 . 1 1 ; /* Recode by using IF-THEN or SELECT-WHEN. This can be inefficient. */ data HaveRecode; set Have; /* use IF-THEN logic to recode gender */ length Gender_Recode $6; if Gender=0 then Gender_Recode = "Female"; else if Gender=1 then Gender_Recode = "Male"; else Gender_Recode = " "; run; proc freq data=HaveRecode; tables Gender_Recode Gender; run;
The table for the Gender_Recode variable is shown. The data, which was originally coded as a binary indicator variable, has been duplicated by creating a character variable that contains the same information but is more understandable. Of course, now you have to use the new variable name to analyze the recoded data. If you have already written programs that refer to the Gender variable, you have to update the programs to use the new variable name. Yuck!
A more efficient choice is to use a custom-defined format. The beauty of using a format is that you do not have to change the data. Instead, you simply define a format that changes the way that the data are used and displayed in SAS procedures. (A data view is a third alternative, but formats have additional advantages.)
You can define the following format (called GenderFmt.), which displays the gender data as "Female" and "Male" without modifying the data set:
/* use a format to recode gender */ proc format; value GenderFmt 0 = "Female" 1 = "Male" other = " "; run; /* apply the format to original data; no need to create new data set */ proc freq data=Have; format Gender GenderFmt.; /* the name of the format includes a period */ tables Gender; run;
Notice that the analysis is run on the original data and use the original variable name. No additional data sets, views, or variables are created.
Use a format to recode a character variable
Did you know that you can use PROC FORMAT to define formats for character variables? Formats for character variables are used less often than formats for numeric variables, but the syntax is similar. The main difference is that the name of a character format starts with the '$' symbol.
In addition to recoding the values of a categorical variable, formats are useful because they enable you to merge or combine categories by defining a many-to-one mapping. For example, the following character format recodes values of the TYPE variable and also combines the 'SUV' and 'Wagon' categories into a single category. Although it is not needed for this example, notice that the format also includes an 'Other' category, which can be used to combine small groups. The 'Other' category will also handle invalid data.
/* Create sample data from Sashelp.Cars. Exclude hybrids. Optionally sort the data */ proc sort data=sashelp.cars(where=(Type^='Hybrid')) out=Cars; by MPG_City; run; proc format; value $CarTypeFmt 'Sedan' = 'Family Car' 'Sports' = 'Sports Car' 'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck' Other = 'Other'; run; proc freq data=Cars; format Type $CarTypeFmt.; /* the name the format includes a period at the end */ tables Type; run;
Using a format enables you to analyze the original data (omit the FORMAT statement) or apply the format (include the FORMAT statement). You can even define multiple formats if you want to slice and dice the data in various ways.
Use a format to bin numeric variables
One of my favorite SAS tricks is to use a format to bin numeric variables into categories. In the following example, the MPG_City variable is used to group vehicles into four categories based on how fuel-efficient the vehicles are. You can use this format to perform any computation that requires a classification variable. The example shows a two-way frequency analysis of the two variables for which we have defined custom formats:
proc format; value MPGFmt low -< 15 = "Gas Guzzler" /* < 15 */ 15 -< 20 = "Not Good" /* [ 15, 20) */ 20 -< 25 = "Good" /* [ 20, 25) */ 25 - high = "Great"; /* > 25 */ run; proc freq data=Cars order=data; format MPG_City MPGFmt. Type $CarTypeFmt.; tables MPG_City * Type / nocol norow nopercent; run;
Store and retrieve formats
Formats are stored in a catalog, which is stored separately from the data. By default, SAS stores the formats in a catalog named WORK.FORMATS. Like everything else stored in WORK, that catalog will vanish when you end the SAS session. Therefore, you need to store the formats to a permanent libref if you want to reuse the formats across SAS session.
SAS supports several features that help you to maintain a permanent library of formats. Here are two facts about format catalogs:
- You can use the LIBRARY= option on the PROC FORMAT statement to specify a libref in which to store the format catalog. By default, the catalog will be named FORMATS.
- SAS maintains a list of librefs to search through to find formats. By default, it looks in WORK and a special libref named LIBRARY.
These facts imply that you can do two simple things to create a permanent library of formats. First, define a permanent libref named LIBRARY (the name is important!) that will contain your catalog of formats. Second, specify the LIBRARY=LIBRARY option when you define the format, as follows:
libname library "C:/MyFormats"; /* the libref 'LIBRARY' has special significance! */ proc format library=library; /* adds format to the permanent catalog LIBRARY.FORMATS */ value $CarTypeFmt 'Sedan' = 'Family Car' 'Sports' = 'Sports Car' 'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck' Other = 'Other'; run;
When you start a new SAS session, you will need to define the LIBRARY libref again if you want to access the formats. For convenience, many people put the LIBNAME statement in their AutoExec.sas file. Because SAS searches for formats in the LIBRARY.FORMATS catalog, SAS will automatically find the $CarTypeFmt. format.
SAS provides many other options for storing formats and for specifying the search locations for formats. For details, see the SAS usage note "How can I permanently store and use formats that I have created?" or John Ladd's 2012 paper, "Yes, We Can... Save SAS Formats."
In summary, if you need to recode data, custom-defined formats provide an easy alternative to physically changing the data. This article discusses five advantages to using formats to recode data:
- The data do not change. You can use the original variable names in the analyses.
- You can apply formats to both character and numerical variables.
- You can use formats to merge categories and to bin numeric variables.
- You apply a format to multiple variables in multiple data sets.
- You can save formats in a permanent libref and use them across SAS sessions.
Do you maintain a library of SAS formats at your workplace? Leave a comment to share your experience and your best practices.