These tips will also streamline your programming, even if you are not using arrays. I'll provide complete code examples for each tip at the end of each section, as well as explain how the code is constructed along the way.
Use arrays to zoom out for greater perspective
One of the biggest uses of arrays, of course, is to reshape your data from one observation per identifying variable per data point, to one observation per ID containing all the data points for that ID. For example, consider the following data set:
Now, you want the values of the EVENT variable to become the names of the variables in the output data set, as shown here:
How do you know which variables to create if your data set is more complicated than shown above?
Create a macro variable
You can use the SQL procedure to create a macro variable (called EVENTLIST here) that contains all the unique values of EVENT. You then can use that macro variable in your array definition statement:
proc sql noprint; select distinct event into :eventlist separated by ' ' from work.events; quit; |
When you run the following statement, the resulting log shows that &EVENTLIST contains the unique values for the EVENT variable:
%put &eventlist;
Here is the log information:
3365 %put &eventlist;
aaa bbb ccc ddd
You can then use the EVENTLIST macro variable in an ARRAY statement to define the variables that you are including:
array events_[*] &eventlist;
Use CALL MISSING to set all variables
However, you do not want the values for the previous ID to contaminate the next ID, so you need to reset the array variables to missing with each new ID. You can use a DO loop that uses FIRST.variable and LAST.variable processing in order to set each value to missing:
if first.id then do; do i = 1 to dim(events_); events_[i] = .; end; end; |
DO loops were commonly used in the past to initialize all elements of an array to missing. And it still needs to be used if you want to set all the elements to a non-missing value, such as 0. However, you can replace this code with one function call. The CALL MISSING routine sets all the variables in the array to missing in one statement:
if first.id then call missing(of events_[*]);
Compare via OF variable-list syntax
The OF variable-list syntax is another helpful feature. You can use OF with a list of variables or an array reference.
So, now you have the EVENT values as variable names. How are you going to compare the value of a variable in an observation with a variable name in the array? You can reference each element of the array that you created with EVENTS_[i], but that will return the value of that element. For this comparison, you need to obtain the variable name for each element in the array.
To return the name of each variable in the array, you can use the VNAME function:
if event = vname(events_[i]) then do;
Now you can find when the value of the EVENT variable in the original data set matches the name of the variable in the array.
Additional ways to extract variables information in arrays
Other variable information functions can also be used in the same way to extract information about each of the variables in the array.
In this example, a value that is read from each observation can match only one variable name in the array, so you want to stop when you achieve that instead of continuing the DO loop.
To stop the DO loop at that point, use the LEAVE statement, which stops processing the current loop and resumes with the next statement in the DATA step.
Because you want to output only one observation per ID, you must explicitly use the OUTPUT statement at the end of each ID to override the default output that occurs at the end of each iteration of the DATA step:
if last.id then output;
Combine above methods with DATA Step
Putting all these methods together, here is a short DATA Step that will reshape the original data set into the output data set that you want:
data events_out (drop =i event); set events; by id; array events_[*] &eventlist; retain &eventlist; if first.id then call missing(of events_[*]); do i = 1 to dim(events_); if event = vname(events_[i]) then do; events_[i] = 1; leave; end; end; if last.id then output; run; |
Bonus shortcuts using arrays
Want to use even more shortcuts with arrays? Here, your original data has a numeric variable called RATING, which is included in every observation:
You want to find the lowest rating and corresponding event and also the highest rating and corresponding event for each ID. You can start off by defining the arrays and reading each observation in the same way as in the previous example. In this case, you have two arrays, so you need to read both the EVENTS and RATINGS variables into their own arrays:
array ratings_[5];
array events_[5] $3.;
…more SAS statements…
ratings_[count]=rating;
events_[count]=event;
After you have read all the observations for an ID, how do you find the lowest rating? Of course, you can use a DO loop to loop through the array and check each value to see whether it is the lowest and keep track of which one it is. Or you can use the OF array syntax with the MIN and MAX functions:
lowest_rating=min(of ratings_[*]);
highest_rating=max(of ratings_[*]);
This code returns the value of the lowest and highest ratings for ID 1, which are 3 and 9, respectively. But you also want to know which events were rated the highest and lowest for this ID. First, you have to determine which elements in the RATINGS_ array had the lowest and highest values. You know the values, but you do not yet know which specific elements match those values. Use the WHICHN function to determine the indexes for those values in the array. Keep in mind that if there is more than one element that matches the value using the WHICHN function, the function returns only the index of the first element in the array to match:
min_index=whichn(lowest_rating, of ratings_[*]);
max_index=whichn(highest_rating, of ratings_[*]);
For ID 1, these functions return 3 and 5, respectively. Now you can use the indexes to retrieve the corresponding elements in the EVENTS_ array:
lowest_rated_event=events_[min_index];
highest_rated_event=events_[max_index];
Putting all these methods together, here is the complete code for returning the desired results:
data rating (keep = id lowest_rated_event lowest_rating highest_rated_event highest_rating); set events; by id; array ratings_[5]; array events_[5] $3.; retain ratings: events:; if first.id then call missing(of ratings_[*],of events_[*], count); count + 1; ratings_[count] = rating; events_[count]=event; if last.id then do; lowest_rating = min(of ratings_[*]); highest_rating = max(of ratings_[*]); min_index =whichn(lowest_rating,of ratings_[*]); max_index=whichn(highest_rating, of ratings_[*]); lowest_rated_event=events_[min_index]; highest_rated_event=events_[max_index]; output; end; run; |
Here is the resulting output:
You can expand the functionality and capability of arrays by incorporating other features of the SAS programming language in your DATA Step code. I hope these tips and links improve your array use and lead you to explore even more ways to work with arrays. The resource below complements this post and provides additional tips and tricks.
Adventures in Arrays | Learn more about array processing!
1 Comment
Helpful tips 🙂
Thanks for sharing ..