A SAS programmer wanted to find the name of the variable for each row that contains the largest value. This task is useful for wide data sets in which each observation has several variables that are measured on the same scale. For example, each observation in the data might represent a date, and the variables might represent the high temperature for certain cities on that date. The task is to find the name of the variable (the city) that had the highest temperature for each date. Of course, you could also find name of the variable that contains the snallest value for each observation.
This article shows two ways to program this problem in SAS.
Example data
Let's create some hypothetical data to illustrate the method. The following DATA set records the daily high temperature for four US cities (Miami, Phoenix, Dallas, and Las Vegas) on four dates.
/* temperatures in degrees F */ data HighTemps; informat Date anydtdte12.; format Date DATE10.; input Date Miami Phoenix Dallas Las_Vegas; datalines; 01JAN2023 77.0 65.9 56.9 56.6 01APR2023 82.8 81.6 73.8 74.8 01JUL2023 90.2 107.0 94.7 103.6 01OCT2023 88.1 95.8 85.0 95.9 ; |
For this small example, you can quickly identify the hottest city on each date. However, imagine trying to find the highest temperatures if there are 100 cities and 365 dates!
Use the SAS DATA step
A lot of SAS programmers reach for the DATA step for tasks like this. You can complete the task by using an array and three functions that deserve to be known more widely:
- Use the OF operator (sometimes called the OF keyword) to specify that the MAX function should operate across all columns specified in an array.
- Use the WHICHN function to return the index of the (first) value in the array that matches a specified value.
- Use the VARNAME function to return a string that contains the name of a variable in an array.
The following statements compute the name of the variable that contains the largest value in each row:
data High; set HighTemps; array Vars[*] Miami--Las_Vegas; maxTemp = max(of Vars[*]); /* for each row, the maximum value */ colIdx = whichn(maxTemp, of Vars[*]); /* for each row, first column with value */ City = vname(Vars[colIdx]); /* get name of variable */ run; proc print data=High noobs; var Date City colIdx maxTemp; run; |
For these data, Miami was the warmest city in the winter and spring, Phoenix was hottest in the summer, and Las Vegas was warmest in the fall.
Use the SAS IML language
The SAS IML language is often more convenient than the DATA step if you want to perform column-wise operations. For row-wise operations, like this task, they are equally convenient. The SAS IML language contains special subscript reduction operators that make this task easy:
- Use the <:> subscript operator to obtain the index of the column that contains maximum value in each row.
- Use the <> subscript operator to obtain maximum value in each row.
The following statements compute the name of the variable that contains the largest value in each row:
proc iml; Vars = {'Miami' 'Phoenix' 'Dallas' 'Las_Vegas'}; use HighTemps; read all var Vars into T; read all var "Date"; close; /* <:> is the index of the largest column (or row) */ maxTemp = T[, <>]; /* for each row, the maximum value */ colIdx = T[, <:>]; /* for each row, column with highest value */ City = Vars[colIdx]; print Date[F=DATE10.] City colIdx maxTemp; QUIT; |
Except for a minor difference in formats, the IML output is the same as the DATA step computation.
Row numbers of long data sets
Although the purpose of this article is to show how you can compute a maximum across columns, you can also use SAS to compute the maximum down rows by using PROC MEANS, PROC SQL, or PROC UNIVARIATE. And, of course, if you convert the data from a wide format to a long format, you can use these techniques to solve the problem by working on rows instead of columns. The implementation is longer, so I leave the details as an exercise for the motivated programmer.
Summary
In summary, you can use the DATA step or the SAS IML language to find the name of the variable that contains the maximum value for each row of data. One way to solve the problem in the DATA step is to use an array and three functions that deserve to be better known: the OF operator, the WHICHN function, and the VNAME function.
1 Comment
Thanks Rick, this was simple but elegant. The VNAME function and CALL VNAME are useful from time to time for miscellaneous tasks and worth knowing. I've used it when I loop through an array containing all numeric (or character) variables to check for problematic values and print a report that needs to include following for each problematic value: variable name (via VNAME()), observation number (via _N_), and value.