This blog post answers three questions about using PROC RANK with groups and ties. Note that the answer to the second question is a DATA step alternative for cases where PROC RANK cannot provide what you need.
- What does PROC RANK do behind the code when you use the GROUPS= option in the PROC RANK statement?
- What do you do if you want equal groups, regardless of tied values?
- What do you do if you want groups to start at 1 rather than 0?
What does PROC RANK do when you use the GROUPS= option?
Most of us just accept the ranked values that are calculated by PROC RANK. But have you ever tried to figure out the calculation that happens behind the code? When you use the GROUPS= option in the PROC RANK statement, the values are assigned to groups that range from 0 to the number-of-groups minus 1, based on tied values. If you have many tied values, you might not obtain the number of groups that you request because observations with the same value are assigned to the same group.
The formula for calculating group values is as follows:
FLOOR(rank*k/(n+1))
In this formula:
- rank is the data value's rank order
- k is the value of the GROUPS= option
- n is the number of nonmissing values
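To see how the formula behaves before ties enter the picture, here is a minimal sketch that applies it to untied ranks 1 through n in a DATA step. The values k=5 and n=8 and the data set name formula_check are illustrative assumptions, not anything that PROC RANK itself creates.

```sas
/* Sketch: apply FLOOR(rank*k/(n+1)) to untied ranks 1 through n. */
/* k=5 and n=8 are illustrative assumptions.                      */
data formula_check;
   k = 5;                                   /* value of the GROUPS= option */
   n = 8;                                   /* number of nonmissing values */
   do rank = 1 to n;
      group = floor(rank * k / (n + 1));    /* group value for this rank   */
      output;
   end;
run;

proc print data=formula_check noobs;
   var rank group;
run;
```

With these values, the group numbers run from 0 to 4, which is the "0 to number-of-groups minus 1" range described above.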
Consider the following example. If you want to see the original value as well as the ranking, use both the VAR and RANKS statements in your PROC RANK step. The RANKS variable contains the rank value. If you use a BY statement, the raw rankings start over in each BY group.
Example 1
data test;
   input x;
   cards;
1
1
1
1
1
1
8
8
;
run;

proc rank data=test groups=5 out=test_rank
          ties=mean;   /* also try ties=low and ties=high */
   var x;
   ranks rank_x;
run;

proc print data=test_rank;
run;
Output
x | rank_x |
---|---|
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
8 | 4 |
8 | 4 |
The following table shows the variable values, the raw ranks, the ties values, and the resulting ranks:
X | RAW_RANK | TIES=MEAN | TIES=LOW | TIES=HIGH | RANK_MEAN | RANK_LOW | RANK_HIGH |
---|---|---|---|---|---|---|---|
1 | 1 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 2 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 3 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 4 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 5 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 6 | 3.5 | 1 | 6 | 1 | 0 | 3 |
8 | 7 | 7.5 | 7 | 8 | 4 | 3 | 4 |
8 | 8 | 7.5 | 7 | 8 | 4 | 3 | 4 |
Using the formula shown earlier, k=5 and n=8. With TIES=MEAN, you sum the raw ranks of the tied values of X and divide by the number of tied observations. For X=1, the rank is (1+2+3+4+5+6)/6=21/6=3.5. For X=8, the rank is (7+8)/2=15/2=7.5. Similarly, with TIES=LOW, the rank is 1 for X=1 and 7 for X=8. Finally, with TIES=HIGH, the rank is 6 for X=1 and 8 for X=8. When you insert those values into the formula for the first observation, you obtain the following results:
TIES=MEAN: Floor(3.5*5/9)=Floor(1.9)=1
TIES=LOW: Floor(1*5/9)=Floor(0.5)=0
TIES=HIGH: Floor(6*5/9)=Floor(3.3)=3
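One way to check these hand calculations is to run PROC RANK once per TIES= setting and compare the results side by side. The sketch below assumes the Example 1 data; the output data set names (rank_mean, rank_low, rank_high, compare) are hypothetical names chosen for illustration. It relies on PROC RANK writing observations in the same order as the input, so a merge without a BY statement lines the rows up.

```sas
/* Recreate the Example 1 data so that this sketch is self-contained. */
data test;
   input x;
   cards;
1
1
1
1
1
1
8
8
;
run;

/* One PROC RANK step per tie-handling method. */
proc rank data=test groups=5 ties=mean out=rank_mean;
   var x;
   ranks rank_mean;
run;

proc rank data=test groups=5 ties=low out=rank_low;
   var x;
   ranks rank_low;
run;

proc rank data=test groups=5 ties=high out=rank_high;
   var x;
   ranks rank_high;
run;

/* One-to-one merge (no BY statement): rows are paired by position. */
data compare;
   merge rank_mean rank_low rank_high;
run;

proc print data=compare;
run;
```

The printed columns should correspond to the RANK_MEAN, RANK_LOW, and RANK_HIGH columns in the table above.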
What do you do if you want equal groups, regardless of tied values?
Suppose that you want to create groups that have roughly the same number of observations in each one, regardless of tied values. PROC RANK cannot do this, because it always assigns tied values to the same group. However, you can use the DATA step to accomplish this task.
You need to sort the data set by the ranking variable and then apply the same formula in the DATA step, using the observation number (_N_) in place of the rank, as shown below.
Example 2
proc sort data=test;
   by x;
run;

data ranks;
   set test nobs=numobs;                       /* numobs = number of observations    */
   group = floor(_n_ * 5 / (numobs + 1));      /* _n_ stands in for the rank         */
run;

proc print data=ranks;
run;
Output
x | group |
---|---|
1 | 0 |
1 | 1 |
1 | 1 |
1 | 2 |
1 | 2 |
1 | 3 |
8 | 3 |
8 | 4 |
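To confirm how many observations land in each group, you can tabulate the group variable with PROC FREQ. This is a minimal sketch that rebuilds the Example 2 data so that it runs on its own; the macro variable k is a hypothetical convenience for the requested number of groups and is not part of the original example.

```sas
%let k = 5;   /* hypothetical macro variable: requested number of groups */

/* Rebuild the example data; the trailing @@ reads several values per line. */
data test;
   input x @@;
   cards;
1 1 1 1 1 1 8 8
;
run;

proc sort data=test;
   by x;
run;

/* Same formula as Example 2, with the observation number as the rank. */
data ranks;
   set test nobs=numobs;
   group = floor(_n_ * &k / (numobs + 1));
run;

/* Count the observations in each group. */
proc freq data=ranks;
   tables group / nocum nopercent;
run;
```

For the eight observations here, the counts are 1, 2, 2, 2, and 1 for groups 0 through 4, so the groups are close to equal in size but not identical.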
What do you do if you want the groups to start at 1 rather than 0?
When you use the GROUPS= option, the values that are assigned to the groups start at 0. PROC RANK has no option that makes the groups start at 1. However, once you have the data set with the ranked values, you can add 1 in a DATA step, as shown in this example:
Example 3
data test;
   input y;
   cards;
11
22
10
15
25
;
run;

proc sort data=test;
   by y;
run;

proc rank data=test out=test_rank1 groups=3;
   var y;
   ranks rank_y;
run;

data test_rank1;
   set test_rank1;
   rank_y = rank_y + 1;   /* shift groups from 0-based to 1-based */
run;

proc print data=test_rank1;
run;
Output
y | rank_y |
---|---|
10 | 1 |
11 | 2 |
15 | 2 |
22 | 3 |
25 | 3 |
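One caveat applies when the ranked variable can contain missing values. PROC RANK leaves the rank of a missing value missing, and a sum statement of the form variable+1 treats a missing value as 0, so it would turn that missing rank into 1. The sketch below guards the increment instead. The data set names test_miss and rank_miss and the missing value in the data are assumptions added for illustration.

```sas
/* Illustrative data: the second observation has a missing y. */
data test_miss;
   input y;
   cards;
11
.
10
15
25
;
run;

proc rank data=test_miss out=rank_miss groups=3;
   var y;
   ranks rank_y;
run;

data rank_miss;
   set rank_miss;
   /* Shift only nonmissing group values so that a missing y keeps a missing group. */
   if not missing(rank_y) then rank_y = rank_y + 1;
run;

proc print data=rank_miss;
run;
```

With this guard, the observation with a missing y keeps a missing rank_y, while the remaining groups start at 1.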
Conclusion
PROC RANK has many statistical applications, such as helping you understand how your data is distributed. Hopefully, this post has given you a better understanding of how ranks are determined when tied values are present in the data and how those ranks are assigned to groups.
1 Comment
Thanks for the article. The problem of unequal groups occurs most often when the observations are rounded to the nearest unit, which often happens for age, length, weight, and other physical measurements. For an additional discussion of groups in the presence of tied values, see "Binning data by quantiles? Beware of rounded data."