Reshaping data from wide to long format

15

As Cat Truxillo points out in her recent blog post, some SAS procedures require data to be in a "long" (as opposed to "wide") format. Cat uses a DATA step to convert the data from wide to long format. Although there is nothing wrong with this approach, I prefer to use the TRANSPOSE procedure.

In a wide format, each subject is represented by one row and many variables. For example, variables for a customer in a bank might include the customer's name, checking account balance, savings account balance, mortgage amount, and whether the customer is part of a loyalty program.

data wide;
input Customer $  SavBal CheckBal Mortgage Loyalty;
datalines;
Rick 2000 1000 100000 1
Bob  2500  500 250000 0
Tom  8000 2000      . 1
;
 
proc print data=wide noobs;  run;

In a long format, each subject spans multiple rows. Each row contains at least three fields: a unique subject identifier, a variable identifier, and a value field. For example, the data in the SavBal, CheckBal, and Mortgage variables can also be represented by the following long format:

When I convert data from wide to long format, I use three sets of variables:

  1. A variable that uniquely identifies each subject
  2. Variables in the wide data that should be converted to rows in the long data
  3. Variables in the wide data set that should be retained and repeated for each new row in the long data.

The first variable uniquely identifies each subject and is used in the BY statement of the TRANSPOSE procedure. The banking data does not contain such a variable (a customer's name might not be unique), but you can add one by using the following DATA step:

/** 1. add a Subject variable (if necessary) **/
data wide;
   set wide;
   Subject+1; 
run;

The second set of variables are columns in the wide data that should be converted to rows in the long data. These variables will be used in the VAR statement of the TRANSPOSE procedure. In the banking example, the data in the SavBal, CheckBal, and Mortgage variables are converted from columns to rows.

/** 2. Transpose the variables from wide to long format **/
proc transpose data=wide 
   out=long1(rename=(Col1=Value)) name=Account;                    
   by Subject;                   /** for each subject **/
   var SavBal CheckBal Mortgage; /** make a row for these variables **/
run;

The third set of variables are columns in the wide data set that should be retained and repeated for each new row in the long data. For example, you might want to retain the Customer and Loyalty variables. (This step is optional, since sometimes you don't want to retain any variables.) You can accomplish this by using the MERGE statement in a DATA step:

/** 3. Merge with any variables that are not transposed **/
data long;
merge long1 
      wide(keep=Subject 
           Loyalty Customer); /** these variables are not transposed **/
by Subject; 
run;

The following table shows the long version of the banking data:

This is the same result that Cat achieved with the DATA step, with different data.

I use PROC TRANSPOSE because I'm lazy. By using PROC TRANSPOSE, I can locate an example that I've used successfully in the past and modify that example to convert my current data from wide to long format. I don't have to think about DATA step arrays. All I have to do is to identify the three kinds of variables mentioned in this blog post.

More than one person has weighed in on whether they prefer PROC TRANSPOSE or DATA step arrays. For more information on using PROC TRANSPOSE to manipulate data, search for "transpose" at www.lexjansen.com. You will find a variety of SUGI and SAS Global Forum papers the use PROC TRANSPOSE, many of them written by expert SAS programmers.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

15 Comments

  1. Oh, I just found out that the macros of my comment have already be mentioned in the Article from Cat Truxillo.

    br, Thomas

  2. Pingback: Readers’ choice 2011: The DO Loop’s 10 most popular posts - The DO Loop

  3. Pingback: Output percentiles of multiple variables in a tabular format - The DO Loop

  4. Pingback: Plotting multiple series: Transforming data from wide to long - The DO Loop

  5. How would the coding change if we had two headers that we wanted to convert to rows (i.e., if we had Account and Product). For example, a Product for SavBal would be Savings; similarly a Product for CheckBal would be Chequing).

  6. Pingback: Comparative histograms: Panel and overlay histograms in SAS - The DO Loop

    • Rick Wicklin

      Yes. Several ways, in fact. When going from long to wide, you need to consider whether there are an equal number of observations for each group. You can find a discussion of three methods and their advantages/disadvantages in Zdeb (2012). I tend to use the DATA step array method (pp 5-7) more then the other methods.

  7. Great example, very helpful. In your example, you showed how to make a new column called VALUES by choosing the variables:
    var SavBal CheckBal Mortgage; /** make a row for these variables **/
    What if I have another set of variables that I wanted to make another column out of ? How would I create another column and transpose the additional data which are characters?

  8. Pingback: Graph wide data and long data in SAS - The DO Loop

  9. Pingback: Automate the placement of reference lines in PROC SGPLOT - The DO Loop

Leave A Reply

Back to Top