Next week's blog entry will build on this one, so I want you to take notes, OK?
It's not headline news that in most cases, the best way to handle a repeated measures analysis is with a mixed models approach, especially for Normal reponses (for other distributions in the exponential family, GLIMMIX also has some advantages over GEEs. But that's another blog post for another day). You have more flexibility in modeling the variances and covariances among the time points, data do not need to be balanced, and a small amount of missingness doesn't spell the end of your statistical power.
But sometimes the data you have are living in the past: arranged as if you were going to use a multivariate repeated measures analysis. This multivariate data structure arranges the data with one row per subject, each time-based response a column in the data set. This enables the GLM procedure to set up H matrices for the model effects, and E and T matrices for testing hypotheses about those effects. It also means that if a subject has any missing time points, you lose the entire subject's data. I've worked on many repeated measures studies in my pre-SAS years, and I tell you, I've been on the phones, email, snail mail, and even knocked on doors to try to bring subjects back to the lab for follow-ups. I mourned over every dropout. To be able to use at least the observations you have for a subject before dropout would be consolation to a weary researcher's broken heart.
Enter the mixed models approach to repeated measures. But, your data need to be restructured before you can use MIXED for repeated measures analysis. This is, coincidentally, the same data structure you would use for a univariate repeated measures, like in the old-olden days of PROC ANOVA with hand-made error terms (well, almost hand-made). Remember those? The good old days. But I digress.
The MIXED and GLIMMIX procedures require the data be in the univariate structure, with one row per measurement. Notice that these procedures still use CCA, but now the "case" is different. Instead of a subject, which in the context of a mixed model can be many things at once (a person, a clinic, a network...), the "case" is one measurement occurence.
How do you put your wide (multivariate) data into the long (univariate) structure? Well, there are a number of ways, and to some extent it depends on how you have organized your data. If the multivariate response variable names share a prefix, then this macro will convert your data easily.
What if you want to go back to the wide structure (for example, to create graphs to profile subjects over time)? There's a macro for that as well.
What if your variables do not share a prefix, but instead have different names (such as SavBal, CheckBal, and InvestAmt)? Then you will need an alternative strategy. For example:
This needs some rearrangement, but there are two issues. First, there is no subject identifier, and I will want this in next week's blog when I fit a mixed model. Second, the dependent variables are not named with a common prefix. In fact, they aren't even measured over time! They are three variables measured for one person at a given time. (I'll explain why in next week's blog).
So, my preference is to use arrays to handle this:
Which results in the following:
I tip my hat to SAS Tech Support, who provide the %MAKELONG and %MAKEWIDE macros and to Gerhard Svolba, who authored them. If someone wants to turn my arrays into a macro, feel free. I'll tip my hat to you, too.
Tune in next week for the punchline to the joke:
"Three correlated responses walk into a bar..."
5 Comments
Rock on, Arthur! Seven years isn't really too long to wait for a handy macro! Thank you for sharing it here.
@Catherine: I'm seven years late in responding but, in my defense, I just discovered your blog entry a couple of weeks ago
Gerhard, Joe Matise, Matt Kastin and I presented a macro at this year's SGF that, when presented, came awfully close to the macro you suggested in the last paragraph of your blog. The only thing it was missing was the ability to create a variable that contained sequential subject numbers.
However, we agree with the utility of having that capability, so have added it to the macro. The macro doesn't use arrays, but does accomplish the major data crunching using a datastep. As such, on a relatively small file (i.e., 500,000 records), it runs 12.5 times faster than Gerhard's MAKELONG macro and more than 72 times faster than a less efficient use of PROC TRANSPOSE.
I think you'll really like the %untranspose macro! In addition to the preparation and performance time savings, it includes a few bonus capabilities:
• Less code thus less time to prepare and lower chance of user error
• Very Extensible
• Can handle any type of transposed variable name that proc transpose can create (i.e., wide file variable names can include any combination of a prefix, variable name, delimiter, id value, and suffix)
• Works with any combination of character and/or numeric variables
• You choose whether to output records that only contain missing values
• All variables maintain their original types, lengths, formats and labels
• Let’s you create a non-redundant file of metadata (if desired)
• The data and out parameters can include any desired data step options
• And, as of today, includes a parameter (create_byvar) that, if used, will create a variable that contains sequential record numbers
Of course, if you like it, you'll probably want to keep it as a stored compiled macro. However, to try it out, you might want to use:
data promoexperiment;
input promotion $ CashBak SavBal CheckBal InvestAmt;
cards;
Gift 75 106 100 106
Gift 75 104 101 102
Gift 75 106 103 107
Gift 75 110 102 106
Gift 75 101 104 107
Gift 75 76 99 100
Gift 75 109 101 102
Gift 75 95 99 100
Gift 75 88 100 100
Gift 75 84 103 96
Gift 75 88 101 100
Gift 75 84 94 97
Gift 75 92 105 102
Gift 75 69 95 82
Gift 75 95 100 100
Gift 75 98 99 99
Gift 75 80 103 100
Gift 75 96 107 104
Gift 75 103 110 113
Gift 75 88 103 103
Gift 75 104 101 105
Gift 75 91 110 107
Gift 75 105 103 102
Gift 75 89 101 102
Gift 75 93 104 105
Gift 125 111 102 105
Gift 125 109 99 103
Gift 125 111 101 108
Gift 125 115 103 107
Gift 125 106 101 106
Gift 125 81 102 104
Gift 125 114 102 108
Gift 125 100 99 102
;
filename ut url 'http://tiny.cc/untranspose_macro';
%include ut ;
%untranspose(data=promoexperiment, out=promouni,
var=SavBal CheckBal InvestAmt, copy=CashBak, create_byvar=subject);
The macro, paper, powerpoint, and a tip sheet can all be downloaded from:
https://github.com/art297/untranspose
If you’re uncomfortable using tiny urls, the actual url of the github raw file is:
https://raw.githubusercontent.com/art297/untranspose/master/untranspose.sas
If you discover that you like the macro, you'll probably also want to download it's counterpart for transposing from either long or wide to wider, namely %transpose. You can download it (both the paper, powerpoint, and macro) from https://github.com/art297/transpose
If you wanted to test it, here is an example using the github raw file:
data have;
format idnum 4.;
input idnum date var1 $;
informat date date9.;
format date yymon7.;
cards;
1 01jan2001 SD
1 01feb2001 EF
1 01mar2001 HK
2 01jan2001 GH
2 01apr2001 MM
2 01may2001 JH
;
filename tr url 'http://tiny.cc/transpose_macro';
%include tr ;
%transpose(data=have, out=want, by=idnum, var=var1, id=date,
format=yymon7., delimiter=_, sort=yes, guessingrows=1000)
Again, if you’re uncomfortable using tiny urls, the actual url of the github raw file is:
https://raw.githubusercontent.com/art297/transpose/master/transpose.sas
@Arthur Tabachneck
I'm 2 years late seeing this comment, but a BIG thank you for sharing this! I'm going to try it out today on something I'm working on right now.
Pingback: Reshaping data from wide to long format - The DO Loop
I like to use PROC TRANSPOSE to convert the data from wide to long format. For details, see
http://blogs.sas.com/iml/index.php?/archives/83-Reshaping-Data-from-Wide-to-Long-Format.html