Jerome Cornfield: The statistician who established risk factors for lung cancer and heart disease

One purpose of this International Year of Statistics is to spread the word that the field of statistics benefits society. As part of the International Year, many organizations, including SAS and the American Statistical Association (ASA), are turning to history to illustrate how statistics is vital to the health and welfare of the world.

Recently, Stephen Stigler wrote an article in the ASA membership magazine in which he singled out 20 ASA members who were "influential in bringing [the ASA] to this point in our history." Stigler readily admits that "many excellent people have been omitted" from the list. I would like to highlight one of those who were omitted. This is the biography of Jerome Cornfield, who not only served as a president of the ASA, but also made fundamental contributions to the fields of statistics, medical research, and epidemiology.

Jerome Cornfield: The early years

Jerome Cornfield was born the son of Russian Jewish immigrants in 1912. Like John Sall, co-founder of SAS, he got a degree in history. He joined the US Bureau of Labor Statistics during the Depression (1935) and learned statistics by taking courses at the US Department of Agriculture from 1936–1938 (Greenhouse, p. 3). During this time he used probability sampling methods (which were cutting-edge research in the 1930s) to determine the number of unemployed in the US. He was instrumental in the design of samples used by the federal government, and contributed to the revision of the Consumer Price Index (1938-1940). In 1947 he joined a group that provided consultation to the fledgling National Institutes of Health (NIH) (Greenhouse, p. 4).

The link between cigarettes and lung cancer

At the NIH, he became interested in a vexing problem: Why were so many people dying of lung cancer in 1950, when 50 years earlier there were so few cases reported? Was it better diagnosis? More factory pollution? A population that was living longer than ever?

Some retrospective case-control studies by Richard Doll and Bradford Hill in the UK had linked lung cancer to cigarette smoking. But Cornfield wanted a stronger conclusion. He wanted to know whether the correlation was actually a cause-and-effect relationship. He also wanted to use statistics to assess the risk that an individual who smokes would develop lung cancer. By using Bayes' rule, Cornfield was able to combine Doll and Hill's results (namely, the estimated probability that someone was a smoker, given that they had lung cancer) with NIH data to answer the inverse question: the probability that someone would develop lung cancer, given that he was a smoker. The result: smokers are many times more likely to develop lung cancer than nonsmokers.

Cornfield's work "stunned research epidemiologists" and his techniques "made much of modern epidemiology possible" (McGrayne, p. 111). Although his work provoked the ire of the volatile and influential Sir Ronald Fisher (Fisher was a smoker, a paid consultant to the tobacco industry, and a fervent anti-Bayesian), Cornfield successfully defended his methods and results. In 1964, when the US Surgeon General warned "that cigarette smoking is causally related to lung cancer" (McGrayne, p. 113), Cornfield's work was among the evidence that was cited.

The risk factors for heart disease

Cornfield went on to apply statistics to many other controversial and important problems, such as the safety of the polio vaccine, and the safety and efficacy of drugs and food additives (Greenfield, p. 4). Among his achievements was contributing to the design of the Framingham Heart Study, a longitudinal study of the health of residents of Framingham, Massachusetts. Between 1948–1958, 92 out of 1,329 Framingham males experienced some form of a cardiovascular event, such as a heart attack or stroke. From these results, Cornfield was able to use statistics to identify age, cholesterol, cigarette smoking, heart abnormalities, and blood pressure as major risk factors that contribute to cardiovascular disease (McGrayne, p. 115). The identification of these risk factors is responsible for one of the most important public health achievements of the 20th century: the dramatic decline of mortality due to cardiovascular disease.

Contributions to the ASA

Besides serving as the ASA president in 1974, he was the president or vice-president of numerous other professional societies. He served as an editor of six scholarly journals, including the influential Journal of the American Statistical Association. His presidential address to the ASA is an amusing essay entitled "A Statistician's Apology," in which he asks "why should [a statistician] of spirit, of ambition, of high intellectual standards, take any pride or receive any real stimulation and satisfaction from serving an auxiliary role on someone else's problem?" (Cornfield, p. 9) The answer, he asserts after given examples from his own career, is that there is substantial "stimulation and satisfaction" from interacting and collaborating with scientists in other fields.

Furthermore, he continues, it is only by working on practical problems that statistics advances. "Application requires understanding, and the search for understanding often leads to, and cannot be distinguished from, research. The true Joy is to see the breadth of application and breadth of understanding grow together" (Cornfield, p. 11).

In this International Year of Statistics, it is important to remember the contributions of prolific researchers such as Jerome Cornfield. Yet, Cornfield's true legacy is that today's statisticians continue to use the methods that he introduced to understand the connection between risk factors and disease. Thanks to Cornfield and thousands of other statisticians, today's world is safer and healthier than ever before.

References

Cornfield, Jerome (1975). "A Statistician's Apology," Journal of the American Statistical Association, 70(349), pp. 7–14.

Greenhouse, Samuel W. (1982). "A Tribute," Biometrics, Proceedings of "Current Topics in Biostatistics and Epidemiology." A Memorial Symposium in Honor of Jerome Cornfield, 38, pp. 3–6

McGrayne, Sharon B. (2011). The Theory That Would Not Die, Yale University Press, Chapter 8, pp. 108–118.

Post a Comment

The case of spilled coffee and the regression intercept

Argh! I've just spilled coffee on output that shows the least squares coefficients for a regression model that I was investigating. Now the parameter estimate for the intercept is completely obscured, although I can still see the parameter estimates for the coefficients of the continuous explanatory variable. What can I do? Do I need to rerun the regression in order to recover the estimate for the intercept term?

No, it turns out that I do not. If I know (or can compute) the mean values of the response variable and the explanatory variables, then that is enough information to recreate the intercept estimate.

A formula for the intercept of a regression model

In high school or college, you might have learned a simple formula for computing the intercept for a one-variable regression model. If my is the mean of the response variable, mx is the mean of the explanatory variable, and b is the slope of the least square regression line, then the intercept is computed as b0 = myb mx.

For multivariate regression with several continuous variables, the formula is similar: b0 = my − Σibi mi, where mi is the mean of the ith explanatory variable. In vector notation, if b is the vector of coefficients for the explanatory variables and mx is the corresponding vector of means, then you can use the inner product operator to compute the intercept: b0 = myb*mx.

Math saves the day

Let's see how this works on my coffee-stained example. The original analysis was created by using this regression model in PROC REG:

proc reg data=sashelp.cars;
   model Mpg_city = Weight Cylinders EngineSize;
   ods select ParameterEstimates;
quit;

Rather than rerun the regression, I'll use the SAS/IML language to compute the means of the relevant variables, and use the inner product formula to obtain the estimate for the intercept term. In practice, you might also need to take care of two additional details:

  • The REG procedure excludes observations for which any explanatory variable is missing. If your data contain missing values, remove the rows of your data that have missing values. One way to do it is to use the COUNTMISS function in the SAS/IML language to exclude the rows with missing values.
  • The ParameterEstimates table displays the parameter estimates in a format that aligns the decimal points, but does not necessarily represent the full precision of the coefficients. Because I want the following PROC IML statements to give exactly the same estimate for the intercept term as reported by PROC REG, I used additional digits of precision for the coefficient of the WEIGHT variable.

proc iml;
use sashelp.cars;
read all var {Weight Cylinders EngineSize} into X;  /* read explanatory vars */
read all var {Mpg_city} into Y;                     /* read response var */
close sashelp.cars;
 
/* include only complete cases (exclude missing) */
idx = loc(countmiss(X, "row")=0);
X = X[idx, ];
Y = Y[idx, ];
 
xBar = mean(X);                         /* row vector of means for X */
yBar = mean(Y);                         /* mean of Y */
b = {-.003165747, -0.55569, -0.93885};  /* regression coefficients for X */
b0 = yBar - xBar * b;                   /* recover intercept term */
print b0;

The output shows that the intercept term for this multivariable regression is about 37.6. If you run the REG procedure, you obtain the same estimate.

Why should you care?

Okay, you might have guessed by now that I didn't really spill coffee on a piece of paper, so why did I go through this exercise? It is faster and easier to simply rerun PROC REG. Why go to the trouble of computing the interept estimate in terms of the means of the variables?

The reason is that some regression problems are easier to solve if you not only center the explanatory variables but also center the response variable. The centering operation results in a solution that passes through (0,0). In other words, the y-intercept term is zero. However, by using the trick in this article, you can "recover" the intercept term for the uncentered data by using the simple linear operation that I've described here. In a future article, I'll apply this trick to solving a problem that uses ridge regression. That article won't feature any spilled coffee, but the trick will be useful anyway.

Post a Comment

Construct normal data from summary statistics

Last week there was an interesting question posted to the "Stat-Math Statistics" group on LinkedIn. The original question was a little confusing, so I'll state it in a more general form:

A population is normally distributed with a known mean and standard deviation. A sample of size N is drawn from the population. Each element of the sample is rounded to the nearest integer. The challenge: Construct a sample whose sample mean and sample standard deviation are as close as possible to the population values.

And here is my paraphrase of the actual problem that was posed:

The time required for Sally to walk home is normally distributed with mean 18 minutes and standard deviation 2 minutes. She walks home on 70 days and records her time to the nearest minute. What set of 70 values result in sample statistics that are as close as possible to the population parameters?

If you would like to try to solve the problem on your own, stop reading here. Spoilers ahead!


My approach to this problem is to use the normal density to estimate the number of observations for each value of the integer times 12, 13, ..., 24. Why those values? Because we are told that the mean is 18, and I assume that all the data will be within three standard deviations of the mean: [12, 24] = [18 - 3*2, 18 + 3*2].

In SAS, the PDF function computes the normal density. If you multiply the normal density by the number of observations, you obtain an estimate for the expected number of observation in each unit interval. Of course, this estimate will not be an integer, so use the ROUND function to obtain integers, as follows:

data WalkTimes;
N = 70; Mean = 18; StdDev = 2;
keep t freq;
do t = 12 to 24;
   pdf = N*pdf( "normal", t, Mean, StdDev);  /* approximate expected number */
   freq = round(pdf);                        /* round to integer */
   output;
end;
run;

It is not clear that this approach will always produce N observations, but it does for this symmetric distribution. What does the distribution look like and what are the sample moments? A quick call to PROC UNIVARIATE answers these questions:

proc univariate data=WalkTimes;
   freq Freq;
   var t;
   histogram t / normal midpoints=(12 to 24) vscale=count barlabel=count;
   ods select Moments Histogram;
run;

The histogram plots the counts in each interval, displays the empirical frequencies, and overlays the normal curve for the population. There is close agreement between the data and the population. (The procedure also produces goodness-of-fit tests, which I do not show here. None of the tests reject the null hypothesis that the data are from an N(18, 2) population.)

The Moments table shows that there are 70 observations, that the mean is exactly 18, and that the standard deviation is very close to 2. Furthermore, the skewness is exactly zero, which means that the data distribution is symmetric. The small kurtosis value indicates that the tails of the distribution are close zero, which is what you would expect for a normal sample.

I obtained the same data distribution as the person who posed the problem, so presumably he used a similar approach. Other people tried to simulate the problem by generating random numbers from N(18,2) and rounding them to integers.

What approach would you use? Can you come up with a set of 70 integers that do a better job of approximating the N(18, 2) distribution?

Post a Comment

SAS/IML Posters and Presentations at SAS Global Forum 2013

There is something for everyone at SAS Global Forum 2013. I like to attend presentations in the Statistics and Data Analysis track and talk with SAS customers in the SAS Support and Demo Area. But one activity that I enjoy the most is to stroll through the poster area and to chat informally with the poster presenters. In contrast to the formal lectures and presentations, posters promote a two-way dialog between the presenter and the listener. If a poster interests me, I might stay and talk for 20 minutes. If a poster does not interest me, I can move on to the next poster.

This year, several posters feature SAS/IML computations, so I am keen to see those and to talk with the presenters. This year the presenters will be at their posters on Monday, April 29, 2013, 2:00 pm–3:30 pm. Here are a few posters whose abstracts mention the SAS/IML language:

The poster on how to "Optimize SAS/IML [for] Simulation" interests me because I am currently putting the finishing touches on my newest book, Simulating Data with SAS. There will be an advanced copy of my book at the SAS Press booth, in case you want to browse it.

Of course, I am also excited to be giving a free Hands-On Workshop, "Getting Started with the SAS/IML Language," on Tuesday, April 30, 2013, at 1:30 pm–3:10 pm (Room 2024 in the Moscone West Convention Center).

I am also interested in the paper "Using SAS to Measure Airport Connectivity: An Analysis of Airport Centrality in the US Network with SAS/IML Studio," which will be presented on Wednesday, May 1, 2013, at 9:00 am–9:50 am. Because SAS/IML Studio supports dynamically linked statistical graphs, I think it will be an interesting talk and demo. Also, I've done an analysis of US airports using SAS/IML Studio, so it will be fun to see a related analysis.

What talks are you looking forward to at SAS Global Forum 2013? I hope to see you there!

Post a Comment

IMLPlus documentation is now available online

I am pleased to announce that the documentation for the IMLPlus language is now available online. Previously, this resource was available only from within the SAS/IML Studio application. This documentation can now be accessed by anyone, regardless of whether they have installed SAS/IML Studio.

As I have described previously, IMLPlus is an enhanced version of the SAS/IML programming language that includes the ability to create and manipulate dynamically linked statistical graphs. You can use the IMLPlus language from within SAS/IML Studio, which is a free application that is distributed as part of the SAS/IML product.

I have been writing this blog for 2.5 years, but I have written only a few articles about IMLPlus, in part because I was unable to link to the IMLPlus language documentation. Now that there is an online version of the IMLPlus class reference, expect an occasional article that demonstrates features of this rich and powerful extension to the SAS/IML computational language.

If you would like to explore the IMLPlus programming language, first check whether SAS/IML Studio is installed on your Windows PC. To get started with IMLPlus, you can work through the examples in SAS/IML Studio for SAS/STAT Programmers. If you have my book, Statistical Programming with SAS/IML Software, I discuss IMLPlus programming techniques in Chapters 5–10.

Post a Comment

How to use PROC SGPLOT to display the slope and intercept of a regression line

A SAS user asked an interesting question on the SAS/GRAPH and ODS Graphics Support Forum. The question is: Does PROC SGPLOT support a way to display the slope of the regression line that is computed by the REG statement? Recall that the REG statement in PROC SGPLOT fits and displays a line through points in a scatter plot.

In SAS 9.3, you cannot obtain this information directly from PROC SGPLOT. Instead, you need to use PROC REG to compute this information. You can use the following steps to create a plot that displays the parameter estimates:

  1. Use PROC REG to compute the parameter estimates (slope and intercept). Save this information to a SAS data set.
  2. Use a DATA step to create macro variables that contain the parameter estimates.
  3. Use the INSET statement in PROC SGPLOT to add this information to the fitted scatter plot\.

Step 1: Save the parameter estimates

You can use the OUTEST= option or the ODS OUPUT statements to save the parameter estimates to a SAS data set. In the following example, the ODS OUTPUT statement saves the ParameterEstimates table to the PE data set:

ods graphics off;
proc reg data=sashelp.class;
   model weight = height;
   ods output ParameterEstimates=PE;
run;

Step 2: Create macro variables

In the PE data set, the ESTIMATE variable contains the parameter estimates. The first row contains the estimate for the intercept term; the second row contains the estimate for the slope. The following DATA step saves these into macro variables:

data _null_;
   set PE;
   if _n_ = 1 then call symput('Int', put(estimate, BEST6.));    
   else            call symput('Slope', put(estimate, BEST6.));  
run;

Step 3: Use the INSET Statement to display the parameter estimates

You can now create the plot by using PROC SGPLOT. Use the INSET statement to display the parameter estimates in a text box:

proc sgplot data=sashelp.class noautolegend;
   title "Regression Line with Slope and Intercept";
   reg y=weight x=height;
   inset "Intercept = &Int" "Slope = &Slope" / 
         border title="Parameter Estimates" position=topleft;
run;

Of course, you can use a similar strategy to display any other relevant statistics on the scatter plot. There is an example in the SAS/STAT User's Guide that shows other fit statistics, as well as how to use Greek letters and superscripts in the inset text. You can also display a formula that shows the equation of a displayed line.

Post a Comment

Initializing vectors by using repetition factors

The SAS/IML language has a curious syntax that enables you to specify a "repetition factor" when you initialize a vector of literal values. Essentially, the language enables you to specify the frequency of an element. For example, suppose you want to define the following vector:

proc iml;
x = {1 2 2 2 2 3 3 3 3 3 4 4 5 5 5 5 5 5};

The vector has one 1, followed by four 2s, followed by five 3s, two 4s, and six 5s. An alternative syntax is to specify the "repetition factor" for each element by using a positive integer enclosed in brackets, like so:

x = {1 [4]2 [5]3 [2]4 [6]5};

You can think of the repetition factor as the frequency or number of occurrences of the value that follows the closing bracket.

Admittedly, this simple example does not save a lot of typing, but if a value is repeated tens or hundreds of times, this syntax not only saves typing, but also is less prone to error and is clearer to read. For example, repetition factors make it easy to specify the genders of 100 subjects:

gender = {[42]"Female" [58]"Male"};

I find this syntax interesting because I am not aware of many other languages that support repetition factors like this. FORTRAN has repetition factors for the FORMAT and the DATA statement. This syntax is supported in the SAS DATA step, which obviously preceded and inspired the SAS/IML syntax:

data A;
array x(18) (1 4*2 5*3 2*4 6*5);
run;
Furthermore, the SAS/SCL language had repetition factors for defining arrays. Does anyone know of other languages that support a similar syntax for initializing arrays?

Post a Comment

What happens if you misspecify the parameters for the "Table" distribution?

I have previously written about how to use the "table" distribution to generate random values from a discrete probability distribution. For example, if there are 50 black marbles, 20 red marbles, and 30 white marbles in a box, the following SAS/IML program simulates random draws (with replacement) of 1,000 marbles:

proc iml;
call randseed(1234);           /* set random number seed             */
x = j(1000, 1);                /* allocate space for 1,000 draws     */
p = {0.5 0.2 0.3};             /* probabilities of each category     */
call randgen(x, "Table", p);   /* fill vector with values 1, 2, or 3 */
 
/* compute and print empirical distribution of values */
call tabulate(category, freq, x);
print (freq/sum(freq ))[c={"Black" "Red" "White"}];

The observed percentages of each category are close to their expected values. About 50% of the draws resulted in a black marble, about 20% resulted in red, and about 30% resulted in white.

It is worth knowing what happens if you specify a set of probabilities that do not sum to 1. If you specify k outcomes whose probabilities sum to T < 1, then a "new outcome" is created. The RANDGEN routine will return the value (k+1) with probability 1-T. In other words, the random sample that is generated by the previous RANDGEN call can also be generated by using the following definition for the probability vector:

p = {0.5 0.2};                 /* prob of third category is 0.3 */

In a similar way, probabilities that sum to more than 1 are truncated. For example, suppose that you specify the following array:

p = {0.5 0.2 0.4};             /* prob of third category is still 0.3 */

The sum of the elements exceeds 1. SAS software will truncate the third element so that the value 3 is returned with probability 0.3, not 0.4 as was specified. Some programmers might expect each element to be divided by the sum of the elements (1.1 in this example), but that is not what happens. You can force that behavior by defining
p = p / sum(p).

The RAND function in the SAS DATA step behaves in the same manner.

Post a Comment

Empty subscripts: How to fill rows and columns of a matrix

Suppose that you have a SAS/IML matrix and you want to set each element of a submatrix to zero (or any other value). There is a simple syntax that accomplishes this task.

If you subscript a matrix and do not specify a row, it means "use all rows." So, for example, you can assign zeros to the third column of a matrix by using the following syntax:

proc iml;
x = j(3, 5, 1);   /* create 3 x 5 matrix of ones             */
x[ , 3] = 0;      /* assign zeros to third column (all rows) */
print x;

In the same way, if you do not specify a column, it means "use all columns." So, for example, you can assign zeros to the second row of a matrix by using the following syntax:

x[2, ] = 0;       /* assign zeros to second row (all columns) */

Of course, you can extend these examples by assigning nonconstant vectors to rows or columns:

x[3, ] = 1:5;     /* assign {1 2 3 4 5} to third row */

But here's a cool tip that many SAS/IML programmers do not know: you can fill the entire matrix by skipping the rows and the columns! For example, to assign zeros to entire matrix (regardless of dimensions), use the following syntax:

x[ , ] = 0;       /* assign zero to all elements */
print x;

I don't see this syntax used very often. It is never used on the right-hand side of an assignment because the expression t = x is easier and clearer than t = x[ , ]. However, when assigning values into a pre-allocated matrix, it is an efficient alternative to x = j(nrow(x), ncol(x), 0) or x[1:nrow(x), 1:ncol(x)] = 0. The "double empty" technique also preserves any matrix attributes that you might have assigned by using the MATTRIB statement, whereas using the J function deletes the old matrix and creates a new one.

Do you have a favorite SAS/IML syntax that is tricky or seldom-used? Share it in a comment!

Post a Comment

How to access any program or data set that appears in the SAS/STAT documentation

If you are like me, you've experienced the following frustration. You are reading the SAS/STAT documentation, trying to understand some procedure or option, when you find an example that is very similar to what you need. "Great," you think, "this example will help me understand how the SAS procedure works!" And then you see it. Instead of a complete DATA step that you can cut and paste into SAS, the example shows only the first few observations, followed by

   ... more lines ...

Argh! The example is using a large data set, and not all observations are shown! If you have not experienced this situation, here's a link to an example from the SAS/STAT documentation that uses the "... more lines ..." trick.

I've always found this situation annoying. As I wrote in a previous article, you can access all sample programs from within the SAS Windowing Environment, but I've always wished that the sample data were available online.

Sample libraries are now available online

My wish is granted! All of the sample programs and data for the SAS/STAT products are now online. The samples for the SAS/ETS product are likewise available, and the recent issue of SAS Statistics and Operations Research News says that "the samples for other products [will be] available soon." Awesome news!

Access sample data directly from the Web

In addition, Warren Kuhfeld wrote some DATA step code that shows how you can use SAS to read the data and program directly from the Web site. This is especially useful for posting questions on a SAS discussion forum or sending an example by email.

If you follow the link, there are several examples of how to read the entire sample, or just read the data. The last example shows how to "extract the initial comment block, title, and DATA step and submit them to SAS." If you like the idea of reading data directly from the Web, Chris Hemedinger recently showed how to read data from a public URL by using PROC HTTP.

So next time you want to run a SAS example and the documentation includes the dreaded statement "... more lines ...", don't fret. The samples are now available online. And you don't even have to bookmark a special URL. Just go to the main page for the 12.1 documentation and click on the link that says "Example Programs."

Post a Comment