In a previous blog post, I presented a short SAS/IML function module that implements the trapezoidal rule. The trapezoidal rule is a numerical integration scheme that gives the integral of a piecewise linear function that passes through a given set of points.
This article demonstrates an application of the trapezoidal rule: computing the area under a receiver operating characteristic (ROC) curve.
Many statisticians and SAS programmers who are familiar with logistic regression have seen ROC curves. The ROC curve indicates how well you can discriminate between two groups by using a continuous variable. If the area under an ROC curve is close to 1, the model discriminates well; if the area is close to 0.5, the model is no better than random guessing.
Let Y be the binary response variable that indicates the two groups. Let X be a continuous explanatory variable. In medical applications, for example, Y might indicate the presence of a disease and X might indicate the level of a certain chemical or hormone. For this blog post, I will use a more whimsical example. Let X indicate the number of shoes that a person has, and let Y indicate whether the person is female.
The following data indicate the results of a nonscientific survey of 15 friends and family members. Each person was asked to state approximately how many pairs of shoes (5, 10, ..., 30+) he or she owns. For each category, the data show the number of females in that category and the total number of people in that category:
data shoes;
input Shoes Females N;
datalines;
5 0 1
10 1 3
15 1 2
20 3 4
25 3 3
30 2 2
;
ods graphics on;
proc logistic data=shoes plots(only)=roc;
   ods select ROCcurve;
   model Females / N = shoes / outroc=roc;
run;
Notice that the graph has a subtitle that indicates the area under the ROC curve. If you want to check the result yourself, the points on the ROC curve are contained in the ROC data set:
proc print data=roc;
   var _1mSpec_ _Sensit_;
run;
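If you want to see where these points come from, the following DATA step is a rough sketch that rebuilds them by hand from the survey counts. It assumes that the predicted probability of being female increases with the number of shoes, so each cutoff classifies a person as female when he or she owns "this many pairs or more." The data set names (ShoesDesc, ROCByHand) and variable names are made up for this illustration, and the totals (10 females, 5 males) are hard-coded from the survey data.

```sas
/* Accumulate counts from the largest shoe category downward */
proc sort data=shoes out=ShoesDesc;
   by descending Shoes;
run;

data ROCByHand;
   set ShoesDesc;
   cumFemales + Females;           /* females who own at least this many pairs */
   cumMales   + (N - Females);     /* males who own at least this many pairs   */
   Sensitivity  = cumFemales / 10; /* true positive rate: 10 females in sample */
   OneMinusSpec = cumMales / 5;    /* false positive rate: 5 males in sample   */
run;

proc print data=ROCByHand noobs;
   var Shoes OneMinusSpec Sensitivity;
run;
```

Under that assumption, each row gives one (1 - specificity, sensitivity) pair, which you can compare with the values in the ROC data set.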
I used these points in my previous blog post. You can refer to that post to verify that, indeed, the area under the ROC curve is 0.88, as computed by the SAS/IML implementation of the trapezoidal rule.
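For readers who prefer to check the arithmetic directly, here is the trapezoidal sum written out. The points are the ones that I derive from the survey counts under the assumption that the predicted probability increases with the number of shoes: (0, 0.2), (0, 0.5), (0.2, 0.8), (0.4, 0.9), (0.8, 1), and (1, 1). The first two points share the same x value, so that segment contributes no area.

```latex
\begin{align*}
\mathrm{AUC} &= \sum_{i} (x_{i+1} - x_i)\,\frac{y_i + y_{i+1}}{2} \\
 &= 0.2\cdot\frac{0.5 + 0.8}{2} + 0.2\cdot\frac{0.8 + 0.9}{2}
  + 0.4\cdot\frac{0.9 + 1}{2} + 0.2\cdot\frac{1 + 1}{2} \\
 &= 0.13 + 0.17 + 0.38 + 0.20 = 0.88
\end{align*}
```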
By the way, the area under the ROC curve (AUC) is closely related to another statistic: the Gini coefficient. Murphy Choy writes about computing the Gini coefficient in SAS in a recent issue of VIEWS News, a newsletter published by members of the international SAS programming community. The two statistics are related by the formula G = 2 * AUC - 1, so you can extend the program in my previous post to compute the Gini coefficient by using the following SAS/IML statement:
Gini = 2 * TrapIntegral(x,y) - 1;
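To show how the pieces might fit together, here is a minimal sketch that reads the points from the ROC data set and computes both statistics. The TrapIntegral module below is my reconstruction of a trapezoidal-rule module and might differ in its details from the one in the earlier post. Sorting the points and prepending the point (0,0) are precautions that I have added for this sketch.

```sas
proc iml;
/* Sketch of a trapezoidal-rule module. Assumes that x and y are
   column vectors and that x is sorted in increasing order. */
start TrapIntegral(x, y);
   N = nrow(x);
   dx    = x[2:N] - x[1:(N-1)];           /* widths of the subintervals    */
   meanY = (y[2:N] + y[1:(N-1)]) / 2;     /* mean height on each interval  */
   return( dx` * meanY );                 /* sum of the trapezoid areas    */
finish;

use roc;                                  /* OUTROC= data set from PROC LOGISTIC */
read all var {"_1mSpec_" "_Sensit_"} into P;
close roc;

call sort(P, {1 2});                      /* order by x, then by y (precaution) */
x = 0 // P[,1];                           /* prepend the point (0,0)            */
y = 0 // P[,2];

AUC  = TrapIntegral(x, y);
Gini = 2*AUC - 1;
print AUC Gini;
quit;
```

For the shoe data, the AUC value should agree with the 0.88 that is shown in the ROC plot, which corresponds to a Gini coefficient of 2(0.88) - 1 = 0.76.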
How well does the number of shoes predict gender in my small sample? The answer is "moderately well." The logistic model for these data predicts that a person who owns fewer than 15 pairs of shoes is more likely to be male than female. A person with more than 20 pairs of shoes is likely to be female. The area under the ROC curve (0.88) is fairly close to 1, which indicates that the model discriminates between males and females fairly well.
The logistic model is summarized by the following plot (created automatically by PROC LOGISTIC), which shows the predicted probability of being female, given the number of pairs of shoes owned. Notice the wide confidence limits that result from the small sample size.