Quantile regression: Better than connecting the sample quantiles of binned data

I often see variations of the following question posted on statistical discussion forums:

I want to bin the X variable into a small number of values. For each bin, I want to draw the quartiles of the Y variable for that bin. Then I want to connect the corresponding quartile points from bin to bin.

In other words, the question asks how to create a plot like the following:

The plot attempts to show how the first quartile, second quartile (median), and third quartile of the response variable vary as the explanatory variable changes. However, I do not recommend using this plot when the explanatory variable is continuous. When I see a question like this, I sometimes respond by saying "you might want to look at quantile regression."

This article takes a quick look at quantile regression. What is quantile regression? How can you perform quantile regression in SAS? And how does it relate to the "binned quantile plot" that is shown above?

Start with the familiar: Standard regression

Before discussing quantile regression, let's introduce some data and think about a typical analysis. The data are salaries (in 1991) for 459 statistics professors in U.S. colleges and universities. A professor's salary (the response) is assumed to be related to his or her years of service (the explanatory variable). For these data, the explanatory variable is already binned into 25 discrete values. These data and this analysis are based on an example in the documentation for the QUANTREG procedure. You can download the data and the SAS statements that are used in this article.

A traditional regression analysis predicts the mean salary for a professor, given the years of service. In SAS, there are many regression procedures, such as the parametric GLM procedure and the nonparametric LOESS procedure. For these procedure, you can also call the regression directly from the SGPLOT procedure. For example, the following statements add a loess curve and a cubic regression curve to the data:

title "Salary by Experience";
proc sgplot data=salary;
   loess x=Year y=Salaries / smooth=0.5;       /* nonparametric regression */
   reg x=Year y=Salaries / degree=3 nomarkers legendLabel="Cubic Regression";
run;

For the loess model, salaries appear to increase with experience for the first six years and for years 12–18; the salaries appear to be flat for the other intervals. For the cubic regression model, the salaries appear to increase gradually, although they increase more quickly for the first 10 years.

These regression curves smooth the data. For a parametric model, you can sometimes interpret the parameters of the model in terms of physical quantities. Do you believe that the mean salary of a professor depends smoothly on the number of years of service? These models reflect that assumption.

Now think for a moment about what we did not do. We did not compute the sample mean for each year and "connect the dots." That would result in a jagged line, which does not smooth the data. "Connecting the dots" is not a model; it does not provide insight into the relationship between salary and years of service, not does it accomodate random errors in the data.

Moving from means to quantiles

Given the number of years of service, each of the previous curves predicts the mean salary. Statistically speaking, the regression curves model the "conditional mean" of the response variable. However, you can also design a regression procedure to model the conditional median of the response. In fact, you can model other quantiles, such as the upper quartile (75th percentile), the lower decile (10th percentile), or other values.

A model for a conditional quantile is known as quantile regression. In SAS, quantile regression is computed by using the QUANTREG procedure, the QUANTSELECT procedure (which supports variable selection), or the QUANTLIFE procedure (which support censored observations).

What is quantile regression?

There have been several introductory papers written on quantile regression:

I don't intend to duplicate these papers. Instead, let's use PROC QUANTREG to compute a quantile regression and compare it with the binned quantile plot:

title "Quantile Regression of Salary vs. Year";
ods graphics on;
proc quantreg data=salary ci=sparsity;
   model salaries = year year*year year*year*year /
                    quantile=0.25 0.5 0.75 plot=fitplot(showlimits);
run;

For these data, the model for the lower quartile increases for about 10 years, then levels off. The model for the conditional median is qualitatively similar to the cubic model for the conditional mean. The model for the upper quartile indicates a higher growth rate that for the median quartile. The quantile curves enable you to estimate how the inter-quartile range (the gap between the upper and lower quartiles) grows with time. For newly hired professors, only about $7,000 separates the relatively high salaries from the relatively low salaries. However, for professors with twenty or more years of experience, that gap has widened to more than $10,000.

When compared with the binned quantile plot at the beginning of this article, this quantile regression plot has several advantages:

  1. The curves smooth the data. Given the number of years of service, you can read off the predicted quartiles of salary from the plot.
  2. You can use the CI= option on the PROC QUANTREG statement to obtain confidence intervals (shown as shaded bands) for the predicted quartiles. See the PROC QUANTREG documentation for details.
  3. For parametric models (especially linear models), you might be able to interpret the parameters to gain insight into the process that generates the data. You can also compute nonparametric models by using the EFFECT statement to create spline effects.
  4. Quantile regression extends easily to multiple explanatory variables, whereas binning data gets harder as the dimension increases, and you often get bins for which there are no data.

So reach for quantile regression when you want to investigate how quartiles, quintiles, or deciles of the response variable change with covariates. The QUANTREG procedure is easy to run, and the results are superior to ad-hoc methods such as binning the data and connecting the sample quantiles.

Post a Comment

The role of statistics in the top public health achievements of the 20th century

In this International Year of Statistics, I'd like to describe the major role of statistics in public health advances. In our modern society, it is sometimes difficult to recall the huge advances in health and medicine in the 20th century. To name a few: penicillin was discovered in 1928, risk factors for heart attacks and stroke were established in the 1950s, and vaccines were created throughout the latter half of the century to prevent diseases that once killed thousands of children annually.

A few weeks ago, SAS was fortunate to receive a visit from Christy Chuang-Stein, Vice President and Head of Statistical Research and Consulting Center at Pfizer, and a candidate for president of the American Statistical Association. One of Christy's slides mentioned that the Centers for Disease Control and Prevention (CDC) published a list of the "Top 10 Great Public Health Achievements in the 20th Century." The CDC articles are fascinating but lengthy, so let me give you the executive summary and simultaneously emphasize the role of statistics in a few of these achievements:

  1. Routine immunization of children: During the 20th century, researchers developed vaccines that prevent smallpox, measles, polio, and other diseases. The safety and efficacy of these life-saving vaccines were tested by using statistically designed clinical trials and statistical quality control during manufacturing.

    Many of these studies were conducted out of the public's eye, but in 1954 there was a massive public trial to test Jonas Salk’s polio vaccine. More than 1.8 million children participated in a randomized, double-blind trial. This is a statistical design in which subjects are randomly assigned to either the control group or the vaccine group. Neither the doctors nor the parents knew which child received the vaccine instead of a placebo. This famous experiment was a success. Today, pharmaceutical companies run similar statistical studies as they develop drugs for the treatment and management of a wide range of maladies.

  2. Motor-vehicle safety: In 1925, about 18 Americans died for every million miles traveled. By the 1990s, that average mortality rate had dropped to 1.7 deaths per million miles. Engineering (both in vehicles and on the roads) had a large part to do with that decrease, but statistics played a role in identifying key risk factors that contributed to vehicular deaths, including statistics about the use of seat belts, infant restraint systems, booster seats, and statistics about the value of graduated licensing for teenage drivers.

  3. Declines in deaths from heart disease and stroke: Although heart disease and stroke are the first and third leading causes of death in the US, respectively, death rates due to heart disease have decreased 56 percent since the 1950s, and death rates from stroke have decreased by 70 percent. As I described in a previous article about Jerome Cornfield, carefully designed statistical studies such as the Framingham Heart Study established the major risk factors: high cholesterol, high blood pressure, smoking, and dietary factors such as cholesterol, fat, and salt.

  4. Safer foods: Several times a year, the news tells us about recalls of food that are linked to a foodborne disease such as Salmonella or E. coli. Tomatoes, spinach, lettuce, ground beef—these and other food products have recently been in the news as sources of outbreaks that are geographically diverse, yet are eventually traced to a common cause, such as poor sanitation at a single processing facility. Kaiser Fung's book, Numbers Rule Your World, presents a fascinating look at how statistical methods (coordinated through the CDC) are used to detect, investigate, trace, and control outbreaks of foodborne diseases.

  5. Tobacco as a health hazard: The CDC article notes that smoking is known to be the "leading preventable cause of death and disability" in the US. But the health risk of smoking was unknown before epidemiologists and statisticians began analyzing data in the 1940s and 1950s. By 1964 the evidence from thousands of studies convinced the US Surgeon General to conclude that "cigarette smoking is causally related to lung cancer" and to other diseases. The statistics associated with establishing causality led to a lengthy debate between statistician Jerome Cornfield (and colleagues) at the National Institutes of Health and Sir Ronald Fisher, a heavy smoker who was a brilliant statistician but a bully toward those who disagreed with him. In the end, science prevailed over intimidation, and the weight of statistical evidence has led to laws that have decreased the prevalence of smoking in the US, as shown in the following graph:

The other achievements in the top 10 list were workplace safety, control of infectious diseases, healthier mothers and babies, family planning, and fluoridation of drinking water.

After her talk, I asked Chuang-Stein whether the 21st century has produced any advances that compare with those on the CDC list for the 20th century. She replied that the following two achievements are likely to make the list for the 21st century:

  • Personalized medicine: In personalized medicine, an individual's genetic profile and his or her unique biochemistry are used to customize treatment. For example, which medicines are likely to provide the best results with the fewest side effects?

  • Disease modification: If a person is diagnosed early enough, it might be possible to inhibit the disease so that it never debilitates the person. For example, current research on Alzheimer's disease aims to inhibit the progression of the disease. Disease-modifying treatments are available for several diseases that develop slowly, including multiple sclerosis and rheumatoid arthritis

Certainly medicine, chemistry, engineering, and other fields also contributed to these public health successes. Statistics is a collaborative science, and it does not make its contributions in isolation. Cornfield called it "the bedfellow of the sciences." Nevertheless, statistics made these public health advances possible. And that is truly something to celebrate!

Post a Comment

How to generate multiple samples from the multivariate normal distribution in SAS

A SAS customer asks:

How do I use SAS to generate multiple samples of size N from a multivariate normal distribution?

Suppose that you want to simulate k samples (each with N observations) from a multivariate normal distribution with a given mean vector and covariance matrix. Because all of the samples are drawn from the same distribution, one way to generate k samples is to generate a single "mega sample" with kN observations, and then use an index variable to indicate that the first N observations belong to the first sample, the next N observations belong to the second sample, and so forth. This article implements that technique.

I have previously shown how to use the RANDNORMAL function in SAS/IML to simulate multivariate normal data. Now suppose that you want to generate 10 samples, where each sample contains five observations from a trivariate normal distribution. You can generate 5 x 10 = 50 observations as follows:

proc iml;
N = 5;                              /* size of each sample */
NumSamples = 10;                    /* number of samples   */  
 
/* specify population mean and covariance */
Mean = {1, 2, 3};
Cov = {3 2 1, 
       2 4 0,
       1 0 5};
call randseed(4321);               
X = RandNormal(N*NumSamples, Mean, Cov);

Remark: If you intend to implement the rest of your algorithm in the SAS/IML language, you probably do not need to create an index variable. For example, if you want to compute correlation matrices for each sample, you can subset the samples as follows:

/* optional: do you want to analyze the data in SAS/IML? */
Obs1 = 1;                   /* beginning of i_th sample      */
do i = 1 to NumSamples;
   Obs2 = Obs1 + N -1;      /* end of i_th sample            */
   Y = X[Obs1:Obs2, ];      /* Y = i_th sample               */
   Obs1 = Obs2 + 1;         /* prepare for next sample       */
   c = corr(Y);             /* do something with i_th sample */
end;

However, from other things that the SAS customer said, I think she wants the data in a SAS data set so that she can run a SAS procedure on the data. Therefore, construct an ID variable as follows:

ID = colvec(repeat(T(1:NumSamples), 1, N)); /* 1,1,1,...,2,2,2,...,3,3,3,... */

The nested syntax looks complicated, but each function call is simple. The T function creates a column vector with elements {1,2,...,9,10}. The REPEAT function copies the column vector to create a matrix with five columns. The COLVEC function stacks the resulting elements in row-major order. Thus the first five elements of the ID vector are 1, the next five are 2, and so forth, for a total of 50 elements. You can write the data to a SAS data set as follows:

Z = ID || X;
create MVN from Z[c={"ID" "x1" "x2" "x3"}];
append from Z;
close MVN;
quit;

For completeness, the following call to the CORR procedure computes 10 correlation matrices, one for each value of the ID variable. Remember: use the BY statement to carry out an analysis on each sample. You do NOT want to break the data into 10 data sets and use a macro loop to analyze each sample because that approach is not efficient.

proc corr data=MVN noprint out=CorrOut(where=(_TYPE_="CORR"));
   by ID;
   var x1-x3;
run;

So there you have it. To generate k samples from the same distribution, you do not need to call a function k times. Instead, it is often more efficient to make a single call that computes all of the random values in a single call. (In SAS/IML, this assumes that all of the simulated data can fit in memory.) You can then use index subscript or a separate index variable to analyze each of the k samples.

This tip also applies to simulating univariate random data. Generate all the data, then analyze the data by using BY groups.

Post a Comment

Point/Counterpoint: Where should you put ODS SELECT and ODS OUTPUT statements?

ODS statements are global SAS statements. As such, you can put them anywhere in your SAS program. For maximum readability, many SAS programmers agree that most ODS statements should appear outside procedures in "open" SAS code. For example, most programmers agree that the following statements should appear outside of procedures:

ods graphics off;
ods html style=HTMLBlue;
ods listing close;

However, it is less clear whether the ODS SELECT and the ODS OUTPUT statements should appear. Should you put them outside of a procedure boundary, or inside?

I have tried to summarize both sides of the debate by using a Point/Counterpoint approach. The argument for putting ODS statements outside of procedures is summarized by Otis Outside. The argument for putting ODS statements inside procedures is summarized by Ingrid Inside. I thank participants on the SAS-L discussion forum and several colleagues at SAS for sharing their thoughts on this matter. Hopefully Ingrid and Otis represent your views fairly and faithfully.

POINT: ODS statements should appear outside procedures

By: Otis Outside


When you write a procedure statement, put your ODS SELECT and ODS OUTPUT statements outside the procedure, like this:

ods select ParameterEstimates FitStatistics;
ods output ParameterEstimates=PE;
proc reg data=Sashelp.Class;
   model weight = height age;
quit;

In this program, the ODS SELECT statement is used to limit the output of the subsequent procedure call. Only the ParameterEstimates table and the FitStatistics table will be displayed. Furthermore, the ODS OUTPUT statement specifies that the ParameterEstimates table will be written to a SAS data set.

ODS statements—like all global statements—should always appear outside of procedures. The only statements that should appear inside a procedure are the statements that apply to that procedure. (Notice that if you look at the procedure documentation, you will not find the global statements there!) If you put the TITLE statement and all ODS statements before the PROC statement, it is clear that they are global statements that will affect the subsequent procedure call.

Furthermore, in procedures that support an OUTPUT statement, it is potentially confusing to put an ODS OUTPUT statement inside the procedure because it looks so similar to the OUTPUT statement. Eliminate the possible confusion by putting the ODS OUTPUT statement outside the procedure.

Although Rick has described an example in which an ODS statement does not work when it follows an interactive procedure, this argument is nullified if programmers correctly end interactive procedures by using a QUIT statement. Programmers like me should not have to modify a logical programming convention just to protect against the possibility that someone else might not terminate a procedure correctly.

COUNTERPOINT: ODS statements should appear inside procedures

By: Ingrid Inside


When you write a procedure statement, put your ODS SELECT and ODS OUTPUT statements inside the procedure, like this:

proc reg data=Sashelp.Class;
   model weight = height age;
   ods select FitStatistics;
run;
   model weight = height;
   ods select ParameterEstimates;
   ods output ParameterEstimates=PE;
run;
quit;

The REG procedure is an interactive procedure. Every time it encounters a RUN statement, it processes the preceding statements and displays the results. When it terminates with the QUIT statement, any ODS OUTPUT data sets are closed. In the preceding statements, the FitStatistics table is displayed for the first model that is specified. The second MODEL statement changes the model. For the second model, the ParameterEstimates table AND the FitStatistics table are both displayed because the ODS SELECT statement acts cumulatively. However, the ODS OUTPUT statement creates a data set that contains only the parameter estimates for the second model.

You can't get this flexibility if you put the ODS statements outside of the procedure. To obtain the full power of the ODS system and interactive procedures, you should put the ODS statements inside the procedure. And once you start doing it for interactive procedures, it makes sense to maintain that convention for all procedures.

And I don't buy the argument that "it might be confusing" to put ODS statements inside a procedure. I think the opposite is true. An ODS statement doesn't actually do anything. It just sets up a condition in ODS. ODS then waits for the right procedure to come along and consume it. I think it is less confusing to put ODS statements inside the procedure that they affect. That's why I recommend that you use this convention for all procedures, not just the interactive procedures.

Lastly, Rick's example shows how you can get in trouble if you put ODS statements outside of a procedure. A good programmer practices defensive programming. By putting ODS statements inside of procedures, my programs are more robust.

Your turn

What do you say, readers? Do you agree with Otis or with Ingrid? What convention do you use for ODS SELECT and ODS OUTPUT statements? Why? Leave a comment.

Post a Comment

The difference of density estimates: When does it make sense?

I was recently asked how to compute the difference between two density estimates in SAS. The person who asked the question sent me a link to a paper from The Review of Economics and Statistics that contains several examples of this technique (for example, see Figure 3 on p. 16 and Figure 4 on p. 17). The author of the paper uses the technique to demonstrate that the wage of a Mexican citizen influences his decision to emigrate to the US. Briefly, the paper plots a kernel density estimate of the income distribution of Mexicans who emigrated and those who did not, and shows that the incomes of those who decided to emigrate tend to be less than for those who decided to stay.

In SAS, you can form the difference between two density estimates by doing the following:

  1. Use PROC KDE to compute kernel density estimates of the two groups. Use the GRIDL= and GRIDU= options so that the two kernel densities are evaluated on the same grid of points. Use the OUT= option on the UNIVAR statement to write the density estimates to a SAS data set.
  2. Form the difference of the densities by subtracting one from the other.
  3. Use the SGPLOT procedure to plot the difference.

However, just because you can do something doesn't mean that you should do it! After I produce some graphs, I present some questions about the technique.

The difference of densities for subpopulations

For the sake of an example, I will use data in the Sashelp.Cars data set. I will compare the MPG_CITY for cars that are manufactured in the USA with the MPG_CITY for cars that are manufactured in Asia. The cars from the USA tend to get fewer miles per gallon than the cars from Asia, so these data are a convenient proxy for the Mexican incomes that are shown in the paper.

The following DATA step creates two MPG variables. The call to PROC KDE overlays both densities on a single plot and writes the kernel density estimates to the KerOut data set.

/* create example data: MPG_City for cars built in USA vs. Asia */
data cars;
 keep MPG_Asia MPG_USA;
 merge sashelp.cars(where=(origin="Asia") rename=(MPG_City=MPG_Asia))
       sashelp.cars(where=(origin="USA") rename=(MPG_City=MPG_USA));
run;
 
/* optional: use BWM= option to adjust kernel bandwidth */
proc kde data=cars;
univar MPG_Asia MPG_USA / plots=DensityOverlay out=KerOut GridL=5 GridU=65;
run;

The GRIDL= and GRIDU= options are necessary. Without them, the density estimate would not be evaluated at a common set of points. Then you would need to use an interpolation scheme to obtain the KDEs at a common set of locations. I have previously shown how to use SAS/IML functions for linear interpolation of data.

However, because we used the GRIDL= and GRIDU= options, no interpolation is necessary. You can just use the DATA step to merge the two KDEs and to compute the difference:

data Diff(drop=Var Count);
merge KerOut(where=(Var="MPG_Asia") rename=(Density=Density_Asia))
      KerOut(where=(Var="MPG_USA")  rename=(Density=Density_USA));
by Value;
Diff_Density = Density_USA - Density_Asia;
run;
 
/* display difference of group densities */
proc sgplot data=Diff;
   title "Difference between Densitites";
   series x=Value y=Diff_Density;
   refline 0 / axis=y;
   xaxis grid values=(5 to 65 by 5) label="MPG (City)";
   yaxis label="Difference in Density Estimates (USA - Asia)";
run;

You can interpret the graph as follows. Given that a vehicle was made in the USA, its fuel efficiency is generally less than for vehicles that were manufactured in Asia. The "difference plot" shows the extent of the difference in densities for the two subpopulations. This plot is interesting because American cars that get approximately 27 mpg seem to be an exception to the general rule.

Notice that the difference of densities is not itself a density. Rather, the difference plot is a way to visualize when the density of one distribution is greater than for a second distribution.

A critique of this method

I have never seen this "difference in density" technique before, and I have some questions about it. I am not familiar with a theoretical reference, so my criticisms are based on ignorance rather than on a careful study of the method.

One concern I have is that the scale of the "density difference plot" is clearly important, but the method seems to ignore it. Suppose that I have two normal populations, N(0, 1) and N(ε, 1). When I compute the difference in densities, the difference plot will look like the following plot.

The difference plot shows that the N(0, 1) distribution is to the left of the other distribution, but the plot doesn't warn the reader that the two distributions are essentially identical. No matter how small ε is, the difference plot will look similar...except for the scale of the vertical axis. Can the method can be extended to provide confidence limits? I don't know. Perhaps the integral of the absolute value of the difference (or the square of the difference) is an appropriate measure for how close the densities are.

Another concern I have is that kernel density estimation requires choosing a bandwidth parameter. There are various ways to choose a bandwidth when the goal is to estimate the density. However, it is not clear to me that the bandwidth that best estimates the density is also the bandwidth that best estimates the difference between the densities of two subpopulations.

The authors of the paper acknowledge some of these concerns (p. 17 and the bottom of p. 16), and they perform a nonparametric test, such as can be computed by using PROC NPAR1WAY. Nevertheless, I wonder when it makes sense to display the difference between two densities.

How would you analyze data like these? How would you visualize the results?

Post a Comment

Changing the ODS style might change color ramp (and what to do about it)

Did you know that your ODS style might result in changing the color ramp for contour plots and heat maps? For example, the default style in SAS 9.3 is HTMLBlue. Let's create a contour plot in the HTML destination by running an example adapted from the documentation for the RSREG procedure:

/* From RSREG Getting Started documentation */ 
data smell;
   input Odor T R H @@;
   datalines;
 66 40 .3 4     39 120 .3 4     43 40 .7 4     49 120 .7  4
 58 40 .5 2     17 120 .5 2     -5 40 .5 6    -40 120 .5  6
 65 80 .3 2      7  80 .7 2     43 80 .3 6    -22  80 .7  6
-31 80 .5 4    -35  80 .5 4    -26 80 .5 4
;
 
ods graphics on;
ods html style=htmlblue;
proc rsreg data=smell plots=surface;
   model Odor = T R;
   ods select Contour;
run;

The procedure creates a contour plot of the predicted response surface. The background of the plot is colored according to the standard error of the response. Notice from the color ramp on the right side that light colors represent small standard errors, whereas dark colors represent larger errors.

Sometimes I use the ODS journal style when I am writing a paper or book that will be published in grayscale rather than color. Let's create the exact same contour plot by using the journal style:

ods html style=Journal;
proc rsreg data=smell plots=surface;
   model Odor = T R;
   ods select Contour;
run;

The plot looks very different, even though it conveys exactly the same information. In this plot, the color ramp on the right side of the plot has opposite "polarity": dark colors represent small standard errors, whereas light colors represent larger errors.

Typically, a report uses a single style. Consequently, your color ramps will be consistent throughout the report. However, be careful if you publish an article that displays the procedure syntax and asserts that "dark regions represent areas with large standard errors." Those words might not be true if your reader is running a different ODS style than you are.

How to reverse a color ramp in SAS ODS graphics

Incidentally, if you like everything about an ODS style except for the color ramp that is used for contour plots, you can use PROC TEMPLATE to define a new template. The new template can inherit from a standard template, but redefine the color ramp.

For example, the following statements define a style called JournalReverse that inherits from the Journal style. In the JournalReverse style, the 'start' and 'end' colors of the two-color color ramp are chosen so that light colors correspond to low values and dark colors correspond to large values. The result is a grayscale image that looks similar to the HTMLBlue image at the beginning of this article.

proc template;
define style styles.JournalReverse;
   parent=styles.Journal;
   class graphColors from graphColors /
   'gramp2cend' = cx5F5F5F
   'gramp2cstart' = cxF0F0F0;
   end;
run;
 
ods html style=JournalReverse;
proc rsreg data=smell plots=surface;
   model Odor = T R;
   ods select Contour;
run;

For more ways to modify existing ODS styles, or to learn how to write your own style, see the "Style" chapter in the SAS/STAT User's Guide

Post a Comment

How to compute the distance between observations in SAS

In statistics, distances between observations are used to form clusters, to identify outliers, and to estimate distributions. Distances are used in spatial statistics and in other application areas.

There are many ways to define the distance between observations. I have previously written an article that explains Mahalanobis distance, which is used often in multivariate analysis, and I have showed how to compute the Mahalanobis distance in SAS. Today's article is simpler: how do you compute the usual Euclidean distance in SAS?

Recall that the squared Euclidean distance between the point p = (p1, p2, ..., pn) and the point q = (q1, q2, ..., qn) is the sum of the squares of the differences between the components: Dist2(p,q) = Σi (piqi)2. The Euclidean distance is then the square root of Dist2(p,q). This article shows three ways to compute the Euclidean distance in SAS:

  1. By using the DISTANCE procedure in SAS/STAT software.
  2. By using the DISTANCE function in SAS/IML software.
  3. By writing your own SAS/IML function.

The following DATA step creates 27 observations that are arranged on an integer lattice in three dimensions. Each row of the data set contains an observation in three variables. The 27 points are (0,0,0), (0,0,1), (0,0,2), (0,1,0), ..., (2,2,2).

data Obs;
do x=0 to 2;
   do y=0 to 2;
      do z = 0 to 2;
         output;
      end;
   end;
end;
run;

Compute distance by using SAS/STAT software

PROC DISTANCE can compute many kinds of distance, and can also standardize the data variables, which is useful when your variable represent different quantities (such as height, weight, and age). In the simple case of Euclidean distance without any standardization, specify the METHOD=EUCLID option and the NOSTD option on the PROC DISTANCE statement, as follows:

proc distance data=Obs out=Dist method=Euclid nostd;
   var interval(x y z);
run;
 
proc print data=Dist(obs=4);
   format Dist: 8.6;
   var Dist1-Dist4;
run;

The output data set has 27 variables and 27 observations. The preceding table shows the first four observations and the first four variables. You can see that the output data set is the lower-triangular portion of the distance matrix. The ith row gives the distance between the ith observation and the jth observation for ji. For example, the distance between the fourth observation (0,1,0) and the second observation (0,0,1) is sqrt(02 + 12 + 12)= sqrt(2) = 1.414.

If you prefer to output the full, dense, symmetric matrix of distances, use the SHAPE=SQUARE option on the PROC DISTANCE statement.

Computing distance in SAS/IML Software

In SAS/IML software, you can use the DISTANCE function in SAS/IML to compute a variety of distance matrices. The DISTANCE function was introduced in SAS/IML 12.1 (SAS 9.3M2).

By default, the DISTANCE function computes the Euclidean distance, and the output is always a square matrix. Therefore, the following statements compute the Euclidean pairwise distances between the 27 points in the Obs data set:

proc iml;
use Obs;
read all var _NUM_ into X;
close Obs;
 
D = distance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output shows that the values in the upper-left portion of the distance matrix are the same as were computed by PROC DISTANCE.

A SAS/IML Module for computing Euclidean distance

Sometimes you do not want to compute the complete matrix of pairwise distances between observations. Sometimes you only need the distances between observations that belong to different groups. For example, you might have one set of locations that represent warehouses and another set that represents the locations of retail stores. You might be interested in computing the distances from each store to the warehouses, so that you can efficiently ship goods to each store from the closest warehouse.

The following SAS/IML module computes the distances between observations in the matrix X and the matrix Y. The (i,j)th element of the result is the distance between the ith row of X and the jth row of Y.

/* compute Euclidean distance between points in x and points in y.
   x is a n x d matrix, where each row is a point in d dimensions.
   y is a m x d matrix.
   The function returns the n x q matrix of distances, D, such that
   D[i,j] is the distance between x[i,] and y[j,]. */
start PairwiseDist(x, y);
   if ncol(x)^=ncol(y) then return (.);       /* different dimensions */
   n = nrow(x);  m = nrow(y);
   idx = T(repeat(1:n, m));                   /* index matrix for x   */
   jdx = shape(repeat(1:m, n), n);            /* index matrix for y   */
   diff = X[idx,] - Y[jdx,];
   return( shape( sqrt(diff[,##]), n ) );     /* sqrt(sum of squares) */
finish;
 
p = { 1  0, 
      1  1,
     -1 -1};
q = { 0  0,
     -1  0};
PD = PairwiseDist(p,q); 
print PD;

If you are still running SAS/IML 9.3 or earlier, you can use the PairwiseDist function to define your own function that computes the Euclidean distance between rows of a matrix:

/* compute Euclidean distance between points in x.
   x is a p x d matrix, where each row is a point in d dimensions.
   Use the DISTANCE function in SAS/IML 12.1 and later releases. */
start EuclideanDistance(x);  
   y=x;
   return( PairwiseDist(x,y) );
finish;
 
D = EuclideanDistance(X);
print (D[1:4, 1:4])[format=8.6 c=("Dist1":"Dist4")];

The output is the same as for the built-in DISTANCE function, and is not printed.

Post a Comment

How to plot a discontinuous function

It is easy to use the SGPLOT procedure in SAS to plot the graph of a well-behaved continuous function: just create a data set of the (x,y) values on some domain and use the SERIES statement to connect the points. However, to plot the graph of a discontinuous function correctly requires more thought.

In statistics, discontinuous functions arise with moderate frequency. Empirical cumulative distribution functions are discontinuous, as are many bounded probability density functions. A simple example is the (continuous) uniform density function, which is defined as 1 on the interval [0, 1], and 0 outside of that interval.

If you want to graph the standard uniform density function, you might attempt the following:

data UniformPDF_Orig;
do x = -1 to 2 by 0.05;
   y = (x>=0 & x<=1);
   output;
end;
run;
 
title "Discontinuous Function: First Attempt";
proc sgplot data=UniformPDF_Orig;
   series x=x y=y;
   refline 0 1 / axis=x lineattrs=(pattern=dash);
   yaxis min=-0.1 max=1.1;
run;

Although the graph shows the basic idea of the uniform density, it is not accurate because the SERIES statement just "connects the dots," and so the graph displays a line segment with a large slope on the interval [–0.05, 0], and a line segment with a large negative slope on the interval [0.95, 1]. If you reduce the interval between consecutive points (for example, by using 0.01 as the step size in the DO loop), you can reduce the visual impact of the error, but it is still there.

If you know the points of discontinuity, then you can make a better graph by generating points on intervals for which the function is continuous. You can enumerate each interval and then use the GROUP= option on the SERIES statement to plot the graph on each interval, as follows:

data UniformPDF;
domain = 1;
do x = -1 to 0 by 0.05;
   y = 0; output;
end;
domain = 2;
do x = 0 to 1 by 0.05;
   y = 1; output;
end;
domain = 3;
do x = 1 to 2 by 0.05;
   y = 0; output;
end;
run;
 
title "Discontinuous Function: Better Version";
proc sgplot data=UniformPDF noautolegend;
   series x=x y=y / group=domain lineattrs=GraphFit;
   refline 0 1 / axis=x lineattrs=(pattern=dash);
   yaxis min=-0.1 max=1.1;
run;

The GROUP= option tells the SGPLOT procedure to plot three distinct curves, and not to connect them. The resulting graph is more accurate.

Thanks to Warren Kuhfeld for suggesting this topic.

Post a Comment

Got Matrix? Reach for the SAS/IML language

Someone recently asked a question on the SAS Support Communities about estimating parameters in ridge regression. I answered the question by pointing to a matrix formula in the SAS documentation. One of the advantages of the SAS/IML language is that you can implement matrix formulas in a natural way. The SAS/IML expressions can often be written almost verbatim from the formula. This article uses a ridge regression formula from the PROC REG documentation to illustrate this feature.

When to use ridge regression?

Ridge regression is a variant to least squares regression that is sometimes used when several explanatory variables are highly correlated. The "usual" ordinary least squares (OLS) regression produces unbiased estimates for the regression coefficients (in fact, the Best Linear Unbiased Estimates). However, when the explanatory variables are correlated, the OLS parameter estimates have large variance. It might be desirable to use a different regression technique, such as ridge regression, in order to obtain parameter estimates that have smaller variance. The trade-off is that the estimates for the ridge regression method are biased.

If X is the centered and scaled data matrix, then the crossproduct matrix X`X is nearly singular when columns of X are highly correlated. Mathematically, ridge regression adds a multiple (called the ridge parameter, k) of the identity matrix to the X`X matrix. The nonsingular matrix (X`X + kI) is then used to compute the parameter estimates.

Each value of k results in a different set of parameter estimates. There have been many papers written about how to choose the best value of k. In this article, I merely want to show how to compute the parameters for ridge regression when the ridge parameter is given.

How to compute ridge regression in SAS

In SAS software, you can compute ridge regression by using the REG procedure. The following example from the PROC REG documentation is used to illustrate ridge regression. The RIDGE= option specifies the value(s) of the ridge parameter, k. The OUTEST= option is used to create an output data set that contains the parameter estimates for each value of k.

/* Ridge regression example from PROC REG documentation */
data acetyl;
input x1-x4 @@;
x1x2 = x1 * x2;
x1x1 = x1 * x1;
datalines;
1300  7.5 0.012 49   1300  9   0.012  50.2 1300 11 0.0115 50.5
1300 13.5 0.013 48.5 1300 17   0.0135 47.5 1300 23 0.012  44.5
1200  5.3 0.04  28   1200  7.5 0.038  31.5 1200 11 0.032  34.5
1200 13.5 0.026 35   1200 17   0.034  38   1200 23 0.041  38.5
1100  5.3 0.084 15   1100  7.5 0.098  17   1100 11 0.092  20.5
1100 17   0.086 29.5
;
 
proc reg data=acetyl outest=b ridge=0.02 noprint;
   model x4=x1 x2 x3 x1x2 x1x1;
run;
 
proc print data=b(where=(_TYPE_="RIDGE")) noobs;
   var _RIDGE_ Intercept x1 x2 x3 x1x2 x1x1;
run;

The parameter estimates for the ridge regression are shown for the ridge parameter k = 0.02.

Implementing a matrix formula for ridge regression by using SAS/IML software

The question that was asked on the SAS Discussion Forum was about where to find the matrix formula for estimating the ridge regression coefficients. The documentation for the REG procedure includes a section that provides the formula that PROC REG uses to compute the ridge regression coefficients. The accompanying text says:

Let X be the matrix of the independent variables after centering [and scaling] the data, and let Y be a vector corresponding to the [centered] dependent variable. Let D be a diagonal matrix with diagonal elements as in X`X. The ridge regression estimate corresponding to the ridge constant k can be computed as D-1/2(Z`Z + kI)-1Z`Y.

The terms in brackets do not appear in the original documentation, but I included them for clarity. Since this is a matrix formula, let's use the SAS/IML language to implement the formula. The following SAS/IML program uses the formula to compute the ridged parameter estimates:

/* Ridge regression coefficients computed in SAS/IML */
proc iml;
use acetyl;        
read ALL var {x1 x2 x3 x1x2 x1x1} into X; 
read all var {x4} into y;
close acetyl;        
 
/* ridge regression */
xBar = mean(X);       /* save row vector of means */
yBar = mean(y);       /* save mean of Y */
X = X - xBar;         /* center X and y; do not add intercept */
y = y - yBar;                      
D = vecdiag(X`*X);
Z = X / sqrt(D`);     /* Z`Z = corr(X) */
k = 0.02;             /* choose a ridge parameter */
b =  (1/sqrt(D)) # inv(Z`*Z + k*I(ncol(X))) * (Z`*y); /* formula in REG doc */
print (b`)[colname={"x1" "x2" "x3" "x1x2" "x1x1"}];

So that the formula and the SAS/IML statements match each other, I have written the computation in a "natural" way, rather than in a more efficient way. See the article "Do you really need to compute that matrix inverse?" In any case, you can see that the parameter estimates that are compute from the formula agree with the estimates that are computed by PROC REG.

However, notice that the PROC IML computations do not include an intercept term. This is because the independent and dependent variables were all centered, so the intercept in these coordinates is exactly zero. To obtain the intercept in the original (uncentered) coordinates, you can use a simple formula that recovers the intercept estimate:

   /* The ridge regression uses centered x and y. Recover the intercept:
      y-yBar = b1(x1-x1Bar) + ... + bn(xn-xnBar)
   so      y = yBar-Sum(bi*xibar) + b1 x1 + ... + bn xn
   Consequently, b0 = yBar-Sum(bi*xibar). */
   b0 = ybar - xbar * b;
   print b0[label="Intercept"];

Got a matrix formula? Use SAS/IML software

As this article shows, the SAS/IML language enables you to implement matrix formulas in a natural way. Although this article implements a formula that is already built into a SAS procedure, the same ideas apply for formulas and algorithms that are not available in any SAS procedure.

So the next time that you need to implement a matrix formula that appears in a textbook, in documentation, or in a journal article, reach for the SAS/IML language!

Post a Comment

Jerome Cornfield: The statistician who established risk factors for lung cancer and heart disease

One purpose of this International Year of Statistics is to spread the word that the field of statistics benefits society. As part of the International Year, many organizations, including SAS and the American Statistical Association (ASA), are turning to history to illustrate how statistics is vital to the health and welfare of the world.

Recently, Stephen Stigler wrote an article in the ASA membership magazine in which he singled out 20 ASA members who were "influential in bringing [the ASA] to this point in our history." Stigler readily admits that "many excellent people have been omitted" from the list. I would like to highlight one of those who were omitted. This is the biography of Jerome Cornfield, who not only served as a president of the ASA, but also made fundamental contributions to the fields of statistics, medical research, and epidemiology.

Jerome Cornfield: The early years

Jerome Cornfield was born the son of Russian Jewish immigrants in 1912. Like John Sall, co-founder of SAS, he got a degree in history. He joined the US Bureau of Labor Statistics during the Depression (1935) and learned statistics by taking courses at the US Department of Agriculture from 1936–1938 (Greenhouse, p. 3). During this time he used probability sampling methods (which were cutting-edge research in the 1930s) to determine the number of unemployed in the US. He was instrumental in the design of samples used by the federal government, and contributed to the revision of the Consumer Price Index (1938-1940). In 1947 he joined a group that provided consultation to the fledgling National Institutes of Health (NIH) (Greenhouse, p. 4).

The link between cigarettes and lung cancer

At the NIH, he became interested in a vexing problem: Why were so many people dying of lung cancer in 1950, when 50 years earlier there were so few cases reported? Was it better diagnosis? More factory pollution? A population that was living longer than ever?

Some retrospective case-control studies by Richard Doll and Bradford Hill in the UK had linked lung cancer to cigarette smoking. But Cornfield wanted a stronger conclusion. He wanted to know whether the correlation was actually a cause-and-effect relationship. He also wanted to use statistics to assess the risk that an individual who smokes would develop lung cancer. By using Bayes' rule, Cornfield was able to combine Doll and Hill's results (namely, the estimated probability that someone was a smoker, given that they had lung cancer) with NIH data to answer the inverse question: the probability that someone would develop lung cancer, given that he was a smoker. The result: smokers are many times more likely to develop lung cancer than nonsmokers.

Cornfield's work "stunned research epidemiologists" and his techniques "made much of modern epidemiology possible" (McGrayne, p. 111). Although his work provoked the ire of the volatile and influential Sir Ronald Fisher (Fisher was a smoker, a paid consultant to the tobacco industry, and a fervent anti-Bayesian), Cornfield successfully defended his methods and results. In 1964, when the US Surgeon General warned "that cigarette smoking is causally related to lung cancer" (McGrayne, p. 113), Cornfield's work was among the evidence that was cited.

The risk factors for heart disease

Cornfield went on to apply statistics to many other controversial and important problems, such as the safety of the polio vaccine, and the safety and efficacy of drugs and food additives (Greenfield, p. 4). Among his achievements was contributing to the design of the Framingham Heart Study, a longitudinal study of the health of residents of Framingham, Massachusetts. Between 1948–1958, 92 out of 1,329 Framingham males experienced some form of a cardiovascular event, such as a heart attack or stroke. From these results, Cornfield was able to use statistics to identify age, cholesterol, cigarette smoking, heart abnormalities, and blood pressure as major risk factors that contribute to cardiovascular disease (McGrayne, p. 115). The identification of these risk factors is responsible for one of the most important public health achievements of the 20th century: the dramatic decline of mortality due to cardiovascular disease.

Contributions to the ASA

Besides serving as the ASA president in 1974, he was the president or vice-president of numerous other professional societies. He served as an editor of six scholarly journals, including the influential Journal of the American Statistical Association. His presidential address to the ASA is an amusing essay entitled "A Statistician's Apology," in which he asks "why should [a statistician] of spirit, of ambition, of high intellectual standards, take any pride or receive any real stimulation and satisfaction from serving an auxiliary role on someone else's problem?" (Cornfield, p. 9) The answer, he asserts after given examples from his own career, is that there is substantial "stimulation and satisfaction" from interacting and collaborating with scientists in other fields.

Furthermore, he continues, it is only by working on practical problems that statistics advances. "Application requires understanding, and the search for understanding often leads to, and cannot be distinguished from, research. The true Joy is to see the breadth of application and breadth of understanding grow together" (Cornfield, p. 11).

In this International Year of Statistics, it is important to remember the contributions of prolific researchers such as Jerome Cornfield. Yet, Cornfield's true legacy is that today's statisticians continue to use the methods that he introduced to understand the connection between risk factors and disease. Thanks to Cornfield and thousands of other statisticians, today's world is safer and healthier than ever before.

References

Cornfield, Jerome (1975). "A Statistician's Apology," Journal of the American Statistical Association, 70(349), pp. 7–14.

Greenhouse, Samuel W. (1982). "A Tribute," Biometrics, Proceedings of "Current Topics in Biostatistics and Epidemiology." A Memorial Symposium in Honor of Jerome Cornfield, 38, pp. 3–6

McGrayne, Sharon B. (2011). The Theory That Would Not Die, Yale University Press, Chapter 8, pp. 108–118.

Post a Comment