Estimate a proportion and a confidence interval in SAS

A SAS programmer wanted to estimate a proportion and a confidence interval (CI), but didn't know which SAS procedure to call. He knew a formula for the CI from an elementary statistics textbook. If x is the observed count of events in a random sample of size n, then the textbook states the following:

  • The estimate for the proportion is the statistic \(\hat{p} = x/n\),
  • If the distribution of the statistic is normal, then a 95% confidence interval for the proportion is \( \hat{p} \pm z \sqrt { \hat{p} (1 - \hat{p} ) / n } \)
    where z is the 0.975th quantile of the standard normal distribution. The square-root quantity is the standard error of the statistic.

This formula defines the Wald confidence interval. It assumes that the sampling distribution of the statistic is normal. In general, you can obtain a (1-α)100% confidence interval by specifying the z-value in the formula as the (1-α/2)th quantile of the standard normal distribution. Because a proportion must lie in [0, 1], you might need to truncate the interval so that it is always a subset of [0, 1].
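As a small illustration of how the z-value depends on α, the following DATA step is a sketch that evaluates the (1-α/2)th quantile of the standard normal distribution for a few common confidence levels. The data set name (ZValues), the variable names, and the particular α values are chosen only for this example.

```sas
/* Sketch: z-values for a few confidence levels.
   The names ZValues, alpha, ConfLevel, and z are illustrative. */
data ZValues;
   do alpha = 0.10, 0.05, 0.01;
      ConfLevel = 100*(1 - alpha);           /* confidence level, in percent */
      z = quantile("Normal", 1 - alpha/2);   /* (1-alpha/2)th quantile of N(0,1) */
      output;
   end;
run;

proc print data=ZValues noobs;
run;
```

For α = 0.05, the quantile is approximately 1.96, which is the value used for a 95% confidence interval.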

In SAS, this problem is called an estimate of a binomial proportion, and PROC FREQ is a SAS procedure that can provide the estimate. To use PROC FREQ, you must construct a data set that has the counts of both events and non-events.

An example proportion

Let's use a typical example. A researcher wants to estimate the proportion of families in New York City who live in a rental unit. The researcher obtains a random sample of n = 500 housed families, and discovers that x=340 families rent (rather than own) their unit. In a "Stat 101" course, you learn the following:

  • A sample estimate of the proportion of families that rent is \(\hat{p} = x/n = 0.68\).
  • For this problem, the standard error of the sample proportion is \(\sqrt { \hat{p} (1 - \hat{p} ) / n } = \sqrt { (0.68)(0.32)/500 } \approx 0.0209\). For a 95% confidence interval, z=1.96, so \(z \sqrt { \hat{p} (1 - \hat{p} ) / n } \approx 0.04\). Therefore, the 95% Wald CI is approximately [0.64, 0.72].
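If you want to check the textbook arithmetic, the following DATA step is a minimal sketch of the hand computation. It uses the same values (x=340, n=500) and applies the Wald formula directly. The data set name (WaldByHand) and the variable names are assumptions made for this example; the MAX and MIN calls truncate the interval to [0, 1].

```sas
/* Sketch: Wald CI computed directly from the textbook formula.
   The names WaldByHand, pHat, SE, Lower, and Upper are illustrative. */
data WaldByHand;
   x = 340;  n = 500;                      /* events and sample size */
   alpha = 0.05;                           /* for a 95% CI */
   pHat = x / n;                           /* sample proportion */
   SE = sqrt(pHat*(1 - pHat) / n);         /* standard error of pHat */
   z = quantile("Normal", 1 - alpha/2);    /* approximately 1.96 */
   Lower = max(0, pHat - z*SE);            /* truncate to [0, 1] */
   Upper = min(1, pHat + z*SE);
run;

proc print data=WaldByHand noobs;
   var pHat SE Lower Upper;
run;
```

The printed values should agree with the hand computation: a proportion of 0.68, a standard error of about 0.0209, and an interval of approximately [0.64, 0.72].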

Create data for the binomial proportion

The challenge for the SAS programmer is threefold: What procedure do you use? How do you create the input data set? And how do you ask for a Wald confidence interval?

As mentioned earlier, PROC FREQ can estimate the proportion when you correctly specify the input data. PROC FREQ is not designed to use the counts of the number of events and the sample size. Instead, it uses the count for the events (x) and the non-events (n – x). The following DATA step reads one observation (the number of events and the sample size) and outputs TWO observations. In the output data set, Response is a binary variable with values "Yes" and "No," and Count is a variable that specifies the frequency of each response. In PROC FREQ, you can use the WEIGHT statement to specify the counts for each level of the response variable:

/* estimate of proportion and CI for proportion: 
   n = 500 number of families in NYC in random sample
   x = 340 number that rent housing
*/
data Have;
input x n;   /* read in events and trials */
/* convert input obs to TWO output obs that record the Response (Yes/No) and Count */
Response = "Yes"; Count = x;   output;  
Response = "No "; Count = n-x; output;
datalines;
340 500
;
 
proc freq data=Have;
   tables Response / nocum binomial(level='Yes' CL=Wald);
   weight Count / zeros;
run;

The output from PROC FREQ contains three tables:

  • The first table shows the sample counts for each level of the response. "No" indicates that the family does not rent (a non-event), whereas "Yes" indicates that the family rents (an event).
  • The second table shows that the event was observed for 0.68 of the sample, so that is the estimate of the binomial proportion, \(\hat{p}\). The row labelled "ASE" is the Asymptotic Standard Error. As shown earlier, this quantity equals 0.0209.
  • The third table shows an estimate for a 95% confidence interval, which is approximately [0.64, 0.72], as shown previously.

Additional PROC FREQ options

In the previous section, I said that the third table shows an estimate for the CI, rather than the estimate. The Wald interval is just one of several ways to estimate a confidence interval for the population proportion. Although the Wald interval is shown in almost every statistics textbook, it is not the best one to use. I prefer to use the Wilson interval, which you can obtain by specifying the CL=WILSON suboption in the BINOMIAL option on the TABLES statement.
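For example, the following call is a sketch that requests the Wilson interval. It assumes that the Have data set from the previous section exists and changes only the CL= suboption.

```sas
/* Sketch: request the Wilson CI instead of the Wald CI.
   Assumes the Have data set (Response, Count) created earlier. */
proc freq data=Have;
   tables Response / nocum binomial(level='Yes' CL=Wilson);
   weight Count / zeros;
run;
```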

You can also choose to run a hypothesis test for a specified proportion. For example, suppose you read an article that claims that 69% of New York City families rent. Does the sample support this assertion? You can specify the P=0.69 suboption in the BINOMIAL option to get a fourth table that shows the statistics for the null hypothesis H0: p=0.69. If you do not specify the P= suboption, you will get a table for the hypothesis test for H0: p=0.5.
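To see what such a test computes, the following DATA step is a sketch of the usual one-sample z test for a proportion, computed by hand. It assumes that the standard error is evaluated under the null hypothesis (p0 = 0.69), which I believe is the convention for the asymptotic test, but compare the result with the PROC FREQ output rather than relying on this sketch. The data set and variable names are chosen only for this example.

```sas
/* Sketch: one-sample z test for H0: p = 0.69, computed by hand.
   Assumption: the standard error is evaluated at the null value p0.
   The names TestByHand, SE0, z, and pValue2 are illustrative. */
data TestByHand;
   x = 340;  n = 500;                 /* events and sample size */
   p0 = 0.69;                         /* hypothesized proportion */
   pHat = x / n;                      /* sample proportion */
   SE0 = sqrt(p0*(1 - p0) / n);       /* standard error under H0 */
   z = (pHat - p0) / SE0;             /* test statistic */
   pValue2 = 2*(1 - cdf("Normal", abs(z)));   /* two-sided p-value */
run;

proc print data=TestByHand noobs;
   var pHat SE0 z pValue2;
run;
```

Because the sample proportion (0.68) is close to the hypothesized value, the test statistic is small in magnitude, so these data do not provide evidence against the claim.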

You can combine multiple suboptions into one BINOMIAL option. For example:
binomial(level='Yes' CL=Wald CL=Wilson P=0.69)
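In context, the complete procedure call might look like the following sketch. It assumes the Have data set from the earlier section and places the combined BINOMIAL option on the TABLES statement.

```sas
/* Sketch: Wald and Wilson CIs plus a test of H0: p = 0.69 in one call.
   Assumes the Have data set (Response, Count) created earlier. */
proc freq data=Have;
   tables Response / nocum binomial(level='Yes' CL=Wald CL=Wilson P=0.69);
   weight Count / zeros;
run;
```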

Summary

New programmers face three challenges when they use a language such as SAS to solve an elementary statistics problem. What procedure do you use? How do you create the input data set? And how do you specify options to obtain additional information such as confidence intervals or hypothesis tests? This article answers these questions for the simple problem of estimating a binomial proportion. You can use PROC FREQ and the BINOMIAL option to obtain the estimates. You need to create an input data set that specifies the number of events and non-events, as opposed to the number of events and the sample size.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
