Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful: most of the data remain the same; only the recoded variables are different.
There is an alternative approach: You can use PROC FORMAT in Base SAS to define a custom SAS format. When you use PROC FORMAT, the data are never changed, but all the SAS reports and analyses can display the formatted values instead of the raw data values. You can use the same format for multiple data sets. You can even define multiple formats to analyze the same variable in multiple ways.
An example of using a format to recode a variable
In the simplest situation, a recoding of a variable converts each raw value to an easier-to-interpret value. For example, suppose that the gender variable for patients is recorded as a binary 0/1 variable. This is a terrible choice because it is not clear whether 0 represents males or females. The following example shows the typical IF-THEN logic for recoding 0 as "Female" and 1 as "Male" by creating a new data set and a new variable:
/* original data: Gender is binary variable, which is hard to understand! */ data Have; input Gender @@; datalines; 1 0 0 0 1 1 0 0 . 1 1 0 0 0 0 1 1 1 . 1 1 ; /* Recode by using IF-THEN or SELECT-WHEN. This can be inefficient. */ data HaveRecode; set Have; /* use IF-THEN logic to recode gender */ length Gender_Recode $6; if Gender=0 then Gender_Recode = "Female"; else if Gender=1 then Gender_Recode = "Male"; else Gender_Recode = " "; run; proc freq data=HaveRecode; tables Gender_Recode Gender; run;
The table for the Gender_Recode variable is shown. The data, which was originally coded as a binary indicator variable, has been duplicated by creating a character variable that contains the same information but is more understandable. Of course, now you have to use the new variable name to analyze the recoded data. If you have already written programs that refer to the Gender variable, you have to update the programs to use the new variable name. Yuck!
A more efficient choice is to use a custom-defined format. The beauty of using a format is that you do not have to change the data. Instead, you simply define a format that changes the way that the data are used and displayed in SAS procedures. (A data view is a third alternative, but formats have additional advantages.)
You can define the following format (called GenderFmt.), which displays the gender data as "Female" and "Male" without modifying the data set:
/* use a format to recode gender */ proc format; value GenderFmt 0 = "Female" 1 = "Male" other = " "; run; /* apply the format to original data; no need to create new data set */ proc freq data=Have; format Gender GenderFmt.; /* the name of the format includes a period */ tables Gender; run;
Notice that the analysis is run on the original data and use the original variable name. No additional data sets, views, or variables are created.
Use a format to recode a character variable
Did you know that you can use PROC FORMAT to define formats for character variables? Formats for character variables are used less often than formats for numeric variables, but the syntax is similar. The main difference is that the name of a character format starts with the '$' symbol.
In addition to recoding the values of a categorical variable, formats are useful because they enable you to merge or combine categories by defining a many-to-one mapping. For example, the following character format recodes values of the TYPE variable and also combines the 'SUV' and 'Wagon' categories into a single category. Although it is not needed for this example, notice that the format also includes an 'Other' category, which can be used to combine small groups. The 'Other' category will also handle invalid data.
/* Create sample data from Sashelp.Cars. Exclude hybrids. Optionally sort the data */ proc sort data=sashelp.cars(where=(Type^='Hybrid')) out=Cars; by MPG_City; run; proc format; value $CarTypeFmt 'Sedan' = 'Family Car' 'Sports' = 'Sports Car' 'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck' Other = 'Other'; run; proc freq data=Cars; format Type $CarTypeFmt.; /* the name the format includes a period at the end */ tables Type; run;
Using a format enables you to analyze the original data (omit the FORMAT statement) or apply the format (include the FORMAT statement). You can even define multiple formats if you want to slice and dice the data in various ways.
Use a format to bin numeric variables
One of my favorite SAS tricks is to use a format to bin numeric variables into categories. In the following example, the MPG_City variable is used to group vehicles into four categories based on how fuel-efficient the vehicles are. You can use this format to perform any computation that requires a classification variable. The example shows a two-way frequency analysis of the two variables for which we have defined custom formats:
proc format; value MPGFmt low -< 15 = "Gas Guzzler" /* < 15 */ 15 -< 20 = "Not Good" /* [ 15, 20) */ 20 -< 25 = "Good" /* [ 20, 25) */ 25 - high = "Great"; /* > 25 */ run; proc freq data=Cars order=data; format MPG_City MPGFmt. Type $CarTypeFmt.; tables MPG_City * Type / nocol norow nopercent; run;
Store and retrieve formats
Formats are stored in a catalog, which is stored separately from the data. By default, SAS stores the formats in a catalog named WORK.FORMATS. Like everything else stored in WORK, that catalog will vanish when you end the SAS session. Therefore, you need to store the formats to a permanent libref if you want to reuse the formats across SAS session.
SAS supports several features that help you to maintain a permanent library of formats. Here are two facts about format catalogs:
- You can use the LIBRARY= option on the PROC FORMAT statement to specify a libref in which to store the format catalog. By default, the catalog will be named FORMATS.
- SAS maintains a list of librefs to search through to find formats. By default, it looks in WORK and a special libref named LIBRARY.
These facts imply that you can do two simple things to create a permanent library of formats. First, define a permanent libref named LIBRARY (the name is important!) that will contain your catalog of formats. Second, specify the LIBRARY=LIBRARY option when you define the format, as follows:
libname library "C:/MyFormats"; /* the libref 'LIBRARY' has special significance! */ proc format library=library; /* adds format to the permanent catalog LIBRARY.FORMATS */ value $CarTypeFmt 'Sedan' = 'Family Car' 'Sports' = 'Sports Car' 'SUV','Wagon' = 'Big Car' 'Truck' = 'Truck' Other = 'Other'; run;
When you start a new SAS session, you will need to define the LIBRARY libref again if you want to access the formats. For convenience, many people put the LIBNAME statement in their AutoExec.sas file. Because SAS searches for formats in the LIBRARY.FORMATS catalog, SAS will automatically find the $CarTypeFmt. format.
SAS provides many other options for storing formats and for specifying the search locations for formats. For details, see the SAS usage note "How can I permanently store and use formats that I have created?" or John Ladd's 2012 paper, "Yes, We Can... Save SAS Formats."
In summary, if you need to recode data, custom-defined formats provide an easy alternative to physically changing the data. This article discusses five advantages to using formats to recode data:
- The data do not change. You can use the original variable names in the analyses.
- You can apply formats to both character and numerical variables.
- You can use formats to merge categories and to bin numeric variables.
- You apply a format to multiple variables in multiple data sets.
- You can save formats in a permanent libref and use them across SAS sessions.
Do you maintain a library of SAS formats at your workplace? Leave a comment to share your experience and your best practices.
Working as a clinical data manager, I started to fall in love at first with format, for the reason you mentioned but also because in some PROC, the 'preloadfmt' helps a lot.
However we often ends up using a triplet of coded/display/order (and sometimes more variables to drive the reports). In such case, I prefer to use extra dataset, and lost interest in format (I prefer 1 dataset compared to 3+ format).
You pointed to the use of views. Do you have some tips or use cases when to prefer format vs metadata dataset? Or any preferred article covering this tradeoff?
Thank you for your thoughts. DATA step views are programs that compute variables when they are run. That is, the variable values are dynamically generated, rather than stored as numbers. The article, "What is a DATA step view and why is it important?" discusses the advantages and disadvantages of DATA step views.
Even if you DO need to recode the underlying variable, you can still specify the encoding with PROC FORMAT, and then use the PUT function to assign the formatted value to a new variable. Often this makes your code more concise, more readable, and less repetitive, especially if you need to recode multiple variables.
Great point. In case it is not clear, Ryan is suggesting that you can replace the IF/THEN logic by using the PUT statement with a custom format:
Gender_Recode = put(Gender, GenderFmt.); /* apply custom format to raw data */
For instance, to divide a big dataset into two on the basis of whether some variable with a huge set of values; like flagging half the names in a giant database:
set a_long_list_of_values_to_go_in_the_bin end = eof;
start = name_of_the_variable_to_get_binned;
label = ‘Y';
fmtname = '$OK';
if eof then do;
start = 'Other';
label = 'N';
fmtname = '$OK';
proc format cntlin = fmt;
in_the_bin= put(name_of_the_variable_to_get_binned, $OK.);
One might even set the binned dataset up to be a dataset view.....
I use binning a lot by proc format. But as a lazy analyst I would like to be helped by that. For example put the variable into 5 equally sized bins, but with meaningful (rounded) limits that can be modified easily. Is there something available in SAS - apart from running Freq and decide myself?
Yes. I assume by "equally sized" you mean that you want approximately 1/5 of the observations in each bin? You can use GROUPS=5 on the PROC RANK statement to generate an indicator variable with values 1,2,3,4,5. You can then apply the format to make those values meaningful. You can also use the CNTLIN option to define the format by using a data set that contains the limits (quantile boundaries) for your user-defined format. Be aware, however, that if your data are rounded, you are not guaranteed to get 1/5 of the observations in each bin.
That looks good. Maybe I'll build a macro to transform it into a value statement for proc format. For a start.
AFAIK Proc Rank is CPU-bound, being single-threaded. Then there is also the little-known Proc HPBIN (9.4), which is multi-threaded - even distributed sometimes - thus scalable upon large datasets on multi-core machines. By the way, it would be very interesting to compare different ways of computing deciles or centiles in SAS, taking into account this 'scalability' factor : Procs univariate, stdize, means, rank or HPbin have their pros & cons, I guess. The percentiles bounds are frequently required in banking or financial regulatory analysis (e g stress testing) and not only the lower or upper values (1% 5% 95%, 99% etc.) but also the whole distribution.
Given your comment"Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to create a new data set that contains the new variable. For large data sets, this is wasteful:"
You should not forget the principle of VIEWs
This implements the recoding as data are read
Thanks for the reminder. For more information about the advantages of VIEWS and why you might want to use them to create temporary variables, see "What is a DATA step view and why is it important?"
While the LIBRARY option is extremely helpful for relatively static data/ formats, and don't forget INFORMATS, the FMTSEARCH option controlling the order that libraries are searched for formats can be helpful when using data that is not always loaded. I work on many projects that have durations of a couple months and it is nice that I can add a projects formats to the search path so I don't have to come up with different format names such as gender when the active project has gender coded differently (and is sometimes critical as one project has 6 "genders" in addition to "not recorded" codes.
Don't forget the custom INFORMATS that can be created so you can read the data, or convert data, with the added advantage of other and _error_ options to provide invalid data notes in the log.
And multilabel formats are sometimes a very slick answer to some complicated report table layouts.
Great tips. Thanks for writing and sharing.
As ballardw mentions, SAS also supports informats. There are several ways to define a custom informat. I recently wrote about how to use PROC FCMP to define a function that can be used to define an informat.
Pingback: How to use PROC HPBIN to bin numerical variables - The DO Loop
Pingback: The essential guide to binning in SAS - The DO Loop