Exact tests in PROC FREQ: What, when, and how

Did you know that the FREQ procedure in SAS can compute exact p-values for more than 20 statistical tests and statistics that are associated with contingency table? Mamma mia! That's a veritable smorgasbord of options!

Some of the tests are specifically for one-way tables or 2 x 2 tables, but many apply to two-way tables of any dimension. You can specify exact p-values by using the EXACT statement in PROC FREQ.

The "Details" section of the PROC FREQ documentation lists all of the tests and statistics that support exact p-values or exact confidence intervals. With so many tests and options, it can be challenging to determine which tests are available for which kinds of tables. To better understand the choices, I classified the options according to the dimensions of table: one-way tables, 2 x 2 tables, general two-way tables, and stratified 2 x 2 tables. In the following lists, each description mentions the name of the option on the EXACT statement that specifies the computation.

One-way tables

For one-way tables, PROC FREQ provides exact p-values for the following tests:

  • Binomial proportion tests (BINOMIAL). The same option also provides exact (Clopper-Pearson) confidence limits for binomial proportions.
  • Chi-square goodness-of-fit test (PCHI and LRCHI options). You can tests a null hypothesis of equal proportions, or you can specify the proportions.

Two-way tables

For two-way tables, PROC FREQ provides exact p-values for the following tests:

  • Chi-square tests, including the Pearson chi-square test (PCHI), likelihood ratio chi-square test (LRCHI), and Mantel-Haenszel chi-square test (MHCHI), or use the CHISQ option to get all three.
  • Fisher’s exact test (FISHER).
  • Jonckheere-Terpstra test (JT), which assumes that a column represents an ordinal response variables.
  • Cochran-Armitage test for trend (TREND).

There are also exact p-values for the following statistics:

  • Kendall’s tau-b and Stuart’s tau-c statistics (TAUB and TAUC).
  • Somers’s D statistics (SMDCR and SMDRC).
  • Pearson and Spearman correlation coefficients (PCORR and SCORR).
  • Simple and weighted kappa coefficients (KAPPA and WTKAPPA).

2 x 2 tables

In addition to the tests for general two-way tables, for 2 x 2 tables PROC FREQ provides exact p-values for the following tests:

  • McNemar’s exact test for matched pairs (MCNEM).
  • Exact confidence limits for the odds ratio (OR).
  • Unconditional exact test for the proportion difference (BARNARD).
  • Exact unconditional confidence limits for the proportion difference (RISKDIFF) and for the relative risk (RELRISK).

Stratified 2 x 2 tables

For stratified 2 x 2 tables, PROC FREQ provides the following:

  • Zelen’s exact test for equal odds ratios (EQOR).
  • An exact test for the common odds ratio, which also provides exact confidence limits for the common odds ratio (COMOR).

When were these tests first available in SAS?

The previous sections contain a lot of information. Because not everyone is running SAS 9.4, I was curious as to when some of these tests were introduced. I was able to compile the following (non-exhaustive) list by trolling through various "What's New" documents:

  • SAS Version 6: Fisher's exact test was introduced.
  • SAS Version 8: The EXACT statement was introduced in SAS version 8. It contained exact p-values for binomial proportions in one-way tables, the many chi-square tests, and Fisher's exact test.
  • SAS 9.1: Exact confidence limits for the common odds ratio and related tests.
  • SAS 9.2: Exact unconditional confidence limits for the proportion (risk) difference and Zelen’s exact test for equal odds ratios.
  • SAS 9.3: Exact unconditional confidence limits for the relative risk.
  • SAS 9.4: The MIDP option in the EXACT statement produces exact mid p-values for several tests.

How does PROC FREQ compute exact p-values?

I've discussed what exact computations are available and when some of them became available, but what about how the computation works?

Because tables are integer-valued, you can theoretically enumerate all tables that satisfy a particular null hypothesis. For example, in a recent article about 2 x 2 tables, I was able to write down the nine possible tables that satisfied a particular constraint and to assign probabilities for each distinct table. Consequently, if you compute a statistic for each table, you can exactly calculate the probability that the statistic is more extreme than a particular observed value. That probability is the exact p-value. (For a more precise definition, see the "Exact Statistics" section of the documentation.)

In practice, PROC FREQ does not directly enumerate all of the possible tables that satisfy the null hypothesis for each test. For some of the exact tests for a general two-way table, it uses a network algorithm of Mehta and Patel (1983), which is more efficient than direct enumeration. The algorithm enables the procedure to compute exact p-values for many small and moderately sized tables.

Post a Comment

Simulate contingency tables with fixed row and column sums in SAS

How do you simulate a contingency table that has a specified row and column sum? Last week I showed how to simulate a random 2 x 2 contingency table when the marginal frequencies are specified. This article generalizes to random r x c frequency tables (also called cross-tabulations) that have the same marginal row and column sums as an observed table of counts.

For example, consider the following table:


The column sums are c = {18 6 7} and the row sums are r = {17, 8, 6}. There are many other tables that have the same row sums and column sums. For example, the following three tables satisfy those conditions:


In a previous blog post, I presented a standard algorithm (see Gentle, 2003, p. 206) that can generate an r x c table one column at a time. The first column is a random draw from the multivariate hypergeometric distribution with a total number of elements equal to the observed count for that column. After generating the first column, you subtract the counts from the marginal row counts. You then iteratively apply the same algorithm to the second column, and so forth.

The hypergeometric algorithm requires some special handling for certain degenerate situations. In contrast, a simpler algorithm from the algorithm of Agresti, Wackerly, and Boyett (1979) is mathematically equivalent but easier to implement. You can download a SAS/IML program that implements both these algorithms.

Assume the columns sums are given by the vector c = {c1, c2, ..., cm} and the row sums are given by r = {r1, r2, ..., rn}. Think about putting balls of m different colors into an urn. You put in c1 balls of the first color, c2 balls of the second color, and so forth. Mix up the balls. Now pull out r1 balls, count how many are of each color, and write those numbers in the first row of a table. Then pull out r2 balls, record the color counts in the second row of the table, and so forth.

The following SAS/IML module implements the algorithm in about 15 lines of code. For your convenience, the code includes the definition of TabulateLevels function, which counts the balls of every color, even if some colors have zero counts:

proc iml;
/* Output levels and frequencies for categories in x, including all reference levels. 
   http://blogs.sas.com/content/iml/2015/10/07/tabulate-counts-ref.html */
start TabulateLevels(OutLevels, OutFreq, x, refSet);
   call tabulate(levels, freq, x);        /* compute the observed frequencies */
   OutLevels = union(refSet, levels);     /* union of data and reference set */ 
   idx = loc(element(OutLevels, levels)); /* find the observed categories */
   OutFreq = j(1, ncol(OutLevels), 0);    /* set all frequencies to 0 */
   OutFreq[idx] = freq;                   /* overwrite observed frequencies */
/* Algorithm from Agresti, Wackerly, and Boyett (1979), Psychometrika */
start RandContingency(_c, _r);
   c = rowvec(_c);   m = ncol(c);
   r = colvec(_r);   n = nrow(r);
   tbl = j(n, m, 0);
   refSet = 1:m;
   vals = repeat(refSet, c);    /* vector replication: http://blogs.sas.com/content/iml/2014/07/07/expand-frequency-data.html */
   perm = sample(vals, ,"WOR"); /* n=number of elements of vals */
   e = cusum(r);                /* ending indices */
   s = 1 // (1+e[1:n-1]);       /* starting indices */
   do j = 1 to n;
      v = perm[ s[j]:e[j] ];    /* pull out r1, r2, etc */
      call TabulateLevels(lev, freq, v, refSet);
      tbl[j, ] = freq;          /* put counts in jth row */
   return( tbl );

Every time you call the RandContingency function it generates a random table with the specified row and column sums. The following statements define a 3 x 3 table and generate three random tables, which appear at the beginning of this article. Each random table is a draw from the null distribution that assumes independence of rows and columns:

call randseed(12345);
Table = {10 1 6,
          3 5 0,
          5 0 1};
c = Table[+, ];
r = Table[ ,+];
do i = 1 to 3;
   RandTbl = RandContingency(c, r);
   print RandTbl;

The example is small, but the algorithm can handle larger tables, both in terms of the dimension of the table and in terms of the sample size. For example, the following example defines row and column sums for a 500 x 20 table. The sample size for this table is 30,000.

/* try a big 500 x 20 table */
r = j(500,  1,   60);
c = j(  1, 20, 1500);
tbl = RandContingency(c, r);
ods graphics / width=600px height=2000px;
call heatmapcont(tbl);

The table is too big to print in this article, but you can create a heat map to that shows the distribution of values. The adjacent image shows the first 40 rows of the heat map. The cells in the table vary between 0 (white) and 11 (dark blue), with about half of the cells having values 3 or 4. For tables of this size, you can create thousands of tables in a few seconds. For smaller tables, you can simulate ten thousands tables in less than a second.

In a future blog post I will show how to use the RandContingency function to simulate random tables and carry out Monte Carlo estimation of exact p-values.

Post a Comment

Create a correlation matrix from the upper triangular elements

A recent question posted on a discussion forum discussed storing the strictly upper-triangular portion of a correlation matrix. Suppose that you have a correlation matrix like the following:

proc iml;
corr = {1.0  0.6  0.5  0.4,
        0.6  1.0  0.3  0.2,
        0.5  0.3  1.0  0.1,
        0.4  0.2  0.1  1.0};

Every correlation matrix is symmetric and has a unit diagonal. Consequently, although this 4 x 4 matrix has 16 elements, only six elements convey any information. In general, an n x n matrix has only n(n–1)/2 informative elements. It seems logical, therefore, that for large matrices you might want to store only the strictly upper portion of a correlation matrix.

If the correlation matrix is stored in a data set, you can use the DATA step and arrays to extract only the strictly upper-triangular correlations. In the SAS/IML language, you can use the ROW and COL functions to extract the upper triangular portion of the matrix into a vector, as follows:

r = row(corr);
c = col(corr);
upperTri = loc(r < c); /* upper tri indices in row major order */
v = corr[upperTri];    /* vector contains n*(n-1)/2 upper triangular corr */
print v;

To reconstruct the correlation matrix from the vector is a little challenging. The main problem is to figure out the dimension of the correlation matrix by using the number of elements in the vector v.

Let k be number of elements in the vector v. Then k = n(n–1)/2 elements for some value of n. Rearranging the equation gives n2 - n - 2k = 0, and by the quadratic formula this equation has the positive solution n = (1 + sqrt(1 + 8k) ) / 2. For example, k=6 for the present example, from which we deduce that n = 4.

After you have discovered the value of n, it is easy allocate a matrix, copy the correlations into the upper triangular portion, make the matrix symmetric, and assign the unit diagonal, as follows:

k = nrow(v);
n = (sqrt(1 + 8*k) + 1)/2;   /* dimension of full matrix */
A = J(n,n,0);                /* allocate zero matrix */
A[upperTri] = v;             /* copy correlations */
A = A + A`;                  /* make symmetric */
A[loc(r = c)] = 1;           /* put 1 on diagonal */

If you use this operation frequently, you can create modules that encapsulate the process of extracting and restoring correlation matrices.

Do you like to solve tricky little problems? Do you enjoy spending a few minutes each day learning about SAS software and sharing your expertise with other? If so, you might enjoy participating in the SAS Support Communities. If you have written a paper about how to do something non-trivial in SAS, consider posting it to the SAS/IML File Exchange.

Post a Comment

Models and simulation for 2x2 contingency tables

When modeling and simulating data, it is important to be able to articulate the real-life statistical process that generates the data.

Suppose a friend says to you, "I want to simulate two random correlated variables, X and Y." Usually this means that he wants data generated from a multivariate distribution, such as the bivariate normal distribution. However, an alternative scenario is to generate X by according to some univariate distribution and generate Y according to some regression model.

In both cases, the variables are related. However, in the first case the pairs are generated from a joint bivariate distribution. In the second case, Y is assumed to be a response variable that depends on X. In this case, you are generating Y conditionally based on the value of X. Usually asking your friend about the sampling framework of the real data enables you to determine which model and simulation is more appropriate.

Similarly, you can make various assumptions about the data in a contingency table (sometimes called a frequency table or a cross-tabulation), which lead to different models. The fundamental question is "how were the data generated or collected?" The answer to that question should become the basis for modeling and simulating the data.

The following are some common assumptions about the sampling framework for data in contingency tables (Stokes, Davis, and Koch, 2012, Chapter 2):

  1. Assume that the sample size is fixed, but the rows and column sums are not. For example, the Wikipedia article on contingency tables includes a table that shows the gender and handedness (right handed or left handed) of 100 random people. If you choose a new sample of 100 people, it is unlikely that the row or column sums will be the same. You can use the multinomial distribution to simulate these tables.

  2. Assume that the rows sums are fixed but the column sums are not. A modification of the previous survey design is to interview 50 men and 50 women and ask each person whether they are right or left handed. This design fixes the number of men and women, and any simulation of this data must respect the design. You can use the binomial distribution to simulate these tables row by row.

  3. Assume that the row and column sums are fixed. This assumption is often used when you want to evaluate a statistical hypothesis on the data. For example, Fisher's exact test computes an exact p-value by considering all tables that have the same row and column sums as the observed table, and then counts the proportion of tables whose test statistic is equal to or more extreme than the observed value. You can use the hypergeometric distribution to simulate random tables 2 x 2 tables with specified marginal distributions.

These techniques generalize to larger tables. For the first assumption (fixed sample size) you can use the multinomial distribution to generate each cell count. For the second assumption (fixed row sums), there are a multitude of ways to model a response variable, including log-linear models. For details, see Friendly (2000, Chapter 7) or Stokes, Davis, and Koch (2012).

The third assumption (fixed row and column marginal sums) is tractable for a 2 x 2 table, but generating all tables becomes infeasible for large tables. However, if you can simulate random tables that have a specified marginal sums, then you can obtain a Monte Carlo estimate for the p-value of the test. Let's see how to simulate such tables.

The PROC FREQ documentation contains a 2 x 2 table for 23 patients in a case-control study relating dietary fat and heart disease. The column sums are {13 10} and the row sums are {15 8}. The following statements simulate three tables that have the same row and column sums:

proc iml;
/* first arg is row vector of sums of cols; second arg contains sums of rows */
start RandContingency2x2(_c, _r);
   c = rowvec(_c);
   r = colvec(_r);
   if ncol(c)^=2 | nrow(r)^=2 then 
      abort "This function applies only to 2x2 tables";
   tbl = j(2, 2, 0);           /* allocate 2x2 table */
   call randgen(a, "Hyper", sum(r), r[1], c[1]);  
   tbl[1,1] = a;               /* assign random value to first element */
   tbl[2,1] = c[1] - a;        /* constraint on column sum */
   tbl[ ,2] = r - tbl[ ,1];    /* constraints on row sums */
   return( tbl );
call randseed(1);
/* marginals from PROC FREQ example */
c = {13 10};                  /* sum of columns */
r = {15, 8};                  /* sum of rows */
do i = 1 to 3;
   tbl = RandContingency2x2(c, r);
   print tbl;

The output shows three of the nine unique tables that satisfy the specified marginal sums. You can use the PDF function for the hypergeometric distribution to show that the first table is generated with probability 0.035, the second table has the probability 0.1575, and the third appears with probability 0.315. The following statement computes the probabilities for all 2 x 2 tables where the upper left cell has the specified values:

/* prob of 2x2 table with row sums (r) and col sums (c) where A[1,1]=5,6,...13 */
prob = pdf("Hyper", 5:13, sum(r), r[1], c[1]);
print prob[c=("5":"13") F=6.4];

In a future blog post, I will discuss how to generalize this simulation to larger tables that have r rows and c columns.

Post a Comment

Create a surface plot in SAS


This article shows how to visualize a surface in SAS. You can use the SURFACEPLOTPARM statement in the Graph Template Language (GTL) to create a surface plot. But don't worry, you don't need to know anything about GTL: just copy the code in this article and replace the names of the data set and variables.

In some situations, you don't have to write any code because some SAS procedures create surface plots automatically. For example, the RSREG procedures produces surface plots that are quadratic regression surfaces. The KDE procedure produces surfaces for bivariate density estimates.

An alternative to plotting a surface is to use a contour plot. For many situations a contour plot is ideal, and you never need to worry that a ridge in the foreground of a surface will hide details of a valley behind the ridge. Many SAS procedures (GLM, KRIGE, PLM,...) create contour plots automatically when it is appropriate.

Creating a surface plot in SAS with ODS graphics: The template

Use the GTL to create a surface plot. This section describes how to define a graph template for a surface plot. The next section shows how to display the plot by using the SGRENDER procedure.

The following template is adapted from the documentation for the SURFACEPLOTPARM statement, which explains various options for visualizing the surface:

proc template;                        /* surface plot with continuous color ramp */
define statgraph SurfaceTmplt;
dynamic _X _Y _Z _Title;              /* dynamic variables */
 entrytitle _Title;                   /* specify title at run time (optional) */
  layout overlay3d;
    surfaceplotparm x=_X y=_Y z=_Z /  /* specify variables at run time */
       colormodel=threecolorramp      /* or =twocolorramp */
       colorresponse=_Z;              /* prior to 9.4m2, use SURFACECOLORGRADIENT= */
    continuouslegend "surface";

The template describes the layout for a surface plot. Instead of hard-coding the names of variables, the template uses dynamic variables so that you can reuse the template for many different data sets. The dynamic variables (and the title) are assigned values when you use the SGRENDER procedure to render the graph.

You only need to run the previous statements once. The statements create a template called SurfaceTmplt. Whenever you want to create a surface, simply use PROC SGRENDER, as shown in the next section.

Creating a surface plot in SAS: The rendering

To use the template, you need to have values for the surface on a grid of (x,y) locations. The following DATA step generates values for a cubic function of two variables that I previously used to construct heat maps. The surface plot of this function is shown at the top of this article.

/* sample data: a cubic function of two variables */
%let Step = 0.04;
data A;
do x = -1 to 1 by &Step;
   do y = -1 to 1 by &Step;
      z = x**3 - y**2 - x + 0.5;
proc sgrender data=A template=SurfaceTmplt; 
   dynamic _X='X' _Y='Y' _Z='Z' _Title="Cubic Surface";

For this surface, the COLORMODEL=threecolorramp option was used to color the surface according to a three-color spectrum of colors. For other surfaces, a two-color spectrum (COLORMODEL=twocolorramp) might be more appropriate.

Most of the time you have data that is already arranged on a grid of (x, y) values. If you have irregularly spaced data, you can use your favorite regression procedure to fit a surface to the data, then use the PLM procedure to score the model on a regular grid of points. You can download a SAS program that analyzes irregularly spaced data and scores a regression model on a regular grid so that the surface can be visualized. Alternatively, if you have SAS/GRAPH software, you can use the G3GRID procedure to interpolate the values onto a regular grid.

Post a Comment

Tabulate counts when there are unobserved categories

Suppose that you are tabulating the eye colors of students in a small class (following Friendly, 1992). Depending upon the ethnic groups of these students, you might not observe any green-eyed students. How do you put a 0 into the table that summarizes the number of students who have each eye color?

If you are using PROC FREQ to tabulate the counts, I have previously shown how to use the DATA step to create the set of all possible categories and run PROC FREQ on the resulting data. The trick is to use the ZEROS option on the WEIGHT statement to include categories with zero counts.

In the SAS/IML language you can do something similar to enable the TABULATE subroutine to report 0 counts for unobserved categories. However, I don't like the previous way that I implemented the idea, so I want to try again. The following SAS/IML module uses the LOC-ELEMENT technique to accomplish the task, but wraps the functionality with a simpler syntax that is similar to the syntax for the TABULATE subroutine. The only difference is a fourth argument that specifies the "reference set", which is the set of all possible categories.

proc iml;
/* output levels and frequencies for categories in x, including all 
   levels in the reference set */
start TabulateLevels(OutLevels, OutFreq, x, refSet);
   call tabulate(levels, freq, x);        /* compute the observed frequencies */
   OutLevels = union(refSet, levels);     /* union of data and reference set */ 
   idx = loc(element(OutLevels, levels)); /* find the observed categories */
   OutFreq = j(1, ncol(OutLevels), 0);    /* set all frequencies to 0 */
   OutFreq[idx] = freq;                   /* overwrite observed frequencies */
/* count faces for 10 rolls of six-sided die, using the reference set 1:6 */
rolls = {4 6 4 6 5 2 4 3 3 6};           /* only 5 faces observed */
call TabulateLevels(Levels, Freq, rolls, 1:6);
print Freq[c=(char(Levels))];

The table shows the output for a set of 10 rolls of a six-sided die. The "1" face was not observed during the experiment, but we want to report a count of 0 for that category.

I like this module better because its syntax is similar to the TABULATE subroutine. The name even includes "Tabulate", which will make the function easier to find when I need to use it.

Post a Comment

How to generate random integers in SAS

I was recently talking with some SAS customers and someone started talking about generating random numbers. I was asked "Why can't SAS create an easy way to generate random numbers? Excel has a simple way to generate random numbers between 1 and 100, and I use it all the time."

The speaker was talking about generating random integers from a discrete uniform distribution, where the numbers range between a specified minimum and maximum value. I have previously written about how to generate random numbers in SAS, but the section about random integers is buried in the middle. So let's simplify the process and see whether we can get SAS to generate random integers "as easily as Excel does."

A macro with a simple syntax

I looked up the Excel function that generates random integers. The syntax is RANDBETWEEN(min, max). For example, if you want a random integer between 1 and 10, the syntax is RANDBETWEEN(1, 10).

In SAS, random numbers are generated by using the RAND function, which is a sophisticated function that can sample from complex statistical distributions. However, the following SAS macro hides the complexity of the RAND function and creates a simple statement that that has the same functionality as the Excel function:

/* SAS macro that duplicates the Excel RANDBETWEEN function */
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))

After you've defined the macro, you can just pretend that it is a function and use it in a DATA step. The following DATA step generates 100 random numbers between 1 and 10:

data RandInt;
do i = 1 to 100;
   x = %RandBetween(1, 10);
proc print data=RandInt(obs=6);  var x;  run;

The adjacent table shows the first six random integers. Every time you run the program you will get a different sequence of random integers. Use the STREAMINIT subroutine if you want a reproducible stream of integers.

A SAS/IML function for random integers

The previous macro call should appeal to DATA step programmers. For the SAS/IML programmers, you can define a function that takes an additional argument and returns a vector of random integers:

proc iml;
/* Generate n random integers drawn uniformly at random from 
   integers in [min, max] where min and max are integers. */
start RandBetween(n, min, max);
   u = j(n, 1);
   call randgen(u, "Uniform", min, max+1);
   return( floor(u) );
/* Test the function. Generate 10,000 values and tabulate the result. */
call randseed(1234);
x = RandBetween(10000, 1, 10);
call tabulate(val, freq, x);
print freq[c=(char(val))];

The output shows that each integer 1 through 10 appeared approximately 1,000 times in the simulated data.

In summary, you can define a SAS macro or a SAS/IML function that hides the complexity of the RAND function. The result is a syntax that is easy to use and duplicates the functionality of the Excel RANDBETWEEN function.

Post a Comment

Balls and urns Part 2: Multi-colored balls

In a previous post I described how to simulate random samples from an urn that contains colored balls. The previous article described the case where the balls can be either of two colors. In that csae, all the distributions are univariate. In this article I examine the case where the urn contains c > 2 colors, which leads to multivariate distributions.

Let K1, K2, ..., Kc be the number of balls for each of the c colors. Let N = Σ Ki be the total number of balls.

The table distribution

The simplest experiment is to reach into the urn and pull out a single ball. The probability of choosing a ball of color i is pi = Ki / N. In SAS you can use the table distribution to specify the probabilities of selecting each integer in the range [1, c]. Some people call the table distribution a categorical distribution or a generalized Bernoulli distribution.

You can use the RAND function in the DATA step to simulate from a categorical distribution. You can also use the RANDGEN function in SAS/IML software, or you can use the SAMPLE function.

The multinomial distribution

If you repeat the experiment n times (replacing the selected ball after recording its value), you obtain a c-tuple of counts. Let ki be the number of times that a ball of color i was selected. Then the result of n draws with replacement is the vector (k1, k2, ..., kc) where Σ ki = n.

By definition, this c-tuple is a single draw from the multinomial distribution. In SAS, you can use the RANDMULTINOMIAL function in SAS/IML to simulate from the multinomial distribution.

The multivariate hypergeometric distribution

When sampling with replacement, the probability of choosing a ball of color i never changes. However, you can also sample without replacement. Without replacement, the probability of choosing a given color changes from draw to draw.

The distribution of the number of colored balls when you sample without replacement is called the multivariate hypergeometric distribution. SAS does not have a built-in function for simulating data from a multivariate hypergeometric distribution, but the following programs shows that you can iteratively use the univariate hypergeometric distribution to construct a multivariate function.

A standard technique for drawing a sample from multiple categories is to reduce the problem to a series of two-category problems by combining colors (Gentle, 2003, p. 206). You can obtain a random number of balls drawn from the first color category by using the hypergeometric distribution with K1 balls of the first color and N - K1 balls in the "other" category. Suppose that k1 balls of the first color are selected. For the second step, there are N - K1 balls of the other colors in the urn and n - k1 balls left to draw. Repeat this process with the second color category. Use the hypergeometric distribution to draw k2 balls of the second color. That leaves (N - K1 - K2) balls of the remaining colors in the urn and (n - k1 - k1) balls left to draw. Continue this process to obtain counts for all the categories.

The following SAS/IML function can return multiple draws from the multivariate hypergeometric distribution. The function is efficient because it exploits the fact that you can pass vectors of parameters to the RANDGEN subroutine. The RandMVHyper function uses a vector of N's (ItemsLeft) and a vector of n's (nDraw) to generate an arbitrary number of independent draws from the multivariate hypergeometric distribution. There is no loop over the number of replications. The only loop is over the number of colors, which is necessary because that loop updates the selection probabilities. (If you study the code, you might want to mentally set nRep = 1 so that the vectors of parameters become scalar quantities.)

proc iml;
/* There are K[i] balls of color i in an urn.
   Draw N balls without replacement and report the number of 
   each color. 
   N[1] = sample size
   N[2] = number of replications (optional. Default is 1) 
   K = vector that gives the number balls of each color.
       The total number of balls is sum(K). 
start RandMVHyper(N, _K);
   if nrow(N)*ncol(N)>1 then nRep = N[2];
   else nRep = 1;
   K = rowvec(_K);               /* K[i] is number of items for category i */
   nDraw = j(nRep, 1, N[1]);     /* run nRep sims at once */
   ItemsLeft = j(nRep, 1, sum(K));
   x = j(nRep, ncol(K), 0);     /* each row is draw from MV hyper */
   h = j(nRep, 1, 0);           /* vec for hypergeometric values  */
   do i = 1 to ncol(K)-1;
      Kvec = j(nRep, 1, K[i]);
      call randgen(h, "Hyper", ItemsLeft, Kvec, nDraw);
      x[,i] = h;
      ItemsLeft = ItemsLeft - K[i];         /* update parameters */
      nDraw = nDraw - h;
   x[,ncol(K)] = nDraw;
   return (x);
call randseed(1234);
/* TEST:      nDraws nRep     K1 K2 K3 */
y = RandMVHyper({100 1000}, {100 40 60});
print (y[1:5,])[c={"black" "white" "red"} L="Draws w/o Replacement"];
mean = mean(y);
print mean;

The test code requests 1,000 replicates. Each replicate is a random vector from the multivariate hypergeometric distribution, namely the result of drawing 100 balls without replacement from an urn that contains 100 black balls, 40 white balls, and 60 red balls. The table shows the first five results. For example, the fourth experiment resulted in 53 black, 24 white, and 23 red balls. Each row sums to 100, which is the number of balls drawn for each experiment. The average counts from the 1,000 experiments were approximately 50 black ball, 20 white balls, and 30 red balls. These are the expected values.

The RandMVHyper function handles most cases, but it can break down when the sample size is small and when some categories (colors) have small marginal sums. You can download a more robust version of the function that handles these edge cases.

In summary:

  • Use the table (generalized Bernoulli) distribution to simulate drawing a colored ball from an urn that contains balls of different colors.
  • Use the multinomial distribution to generate the counts when you sample n balls with replacement. In other words, you repeat n independent and identical generalized Bernoulli trials and tabulate the number of balls that were selected for each color.
  • Use the multivariate hypergeometric distribution to generate the counts when you sample n balls without replacement. This article contains a SAS/IML function that samples from the multivariate hypergeometric distribution.
Post a Comment

Balls and urns: Discrete probability functions in SAS

If not for probability theory, urns would appear only in funeral homes and anthologies of British poetry.

But in probability and statistics, urns are ever present and contain colored balls. The removal and inspection of colored balls from an urn is a classic way to demonstrate probability, sampling, variation, and other elementary concepts in probability and statistics.

This article describes some ball-and-urn experiments, and how you can use SAS software to simulate drawing balls from an urn under various assumptions. This article describes the case where the urn contains exactly two colors, often chosen to be black and white. A subsequent article will describe the case where the urn contains more than two colors.

Let K0 be the number of black balls, K1 be the number of white balls, and N be the total number of balls in the urn.

Bernoulli Trials

The simplest experiment is to reach into the urn and pull out a single ball. Although there are two kinds of balls, the act of drawing one ball depends only on one parameter, namely the probability of choosing a white ball, which is p = K1 / N. (Similarly, the ball will be black with probability 1–p.) This is called a Bernoulli trial. In SAS the Bernoulli distribution is supported by the four essential functions for probability and sampling: the PDF, CDF, QUANTILE, and RAND functions.

For example, the following DATA step repeats 100 independent Bernoulli trials and plots the results for an urn in which 75% of the balls are white. "Independence" means that after each trial, the selected ball is returned to the urn, and the urn is thoroughly mixed. The only parameter for the Bernoulli distribution is the probability of selecting a white ball:

data Bern;
call streaminit(54321);
do i = 1 to 100;
   color = rand("Bernoulli", 0.75);  /* P(color=1) = 0.75 */
proc freq data=Bern;
tables color / nocum plots=FreqPlot;

Binomial Trials

You don't actually need to simulate 100 Bernoulli trials and call PROC FREQ to obtain the (random) number of white balls that were drawn. The number of white balls that are selected when you draw n balls with replacement from an urn can be obtained from the binomial distribution. The binomial distribution saves you computational time because you can simulate one binomial trial instead of 100 Bernoulli trials. The binomial distribution has two parameters: the size of the sample that you are drawing (n=100) and the probability of the event of interest (p = 0.75).

The following DATA step simulates five binomial trials. Each binomial trial represents 100 independent Bernoulli draws:

data Binom;
call streaminit(54321);
do i = 1 to 5;
   numWhite = rand("Binomial", 100, 0.75);  /* 100 draws; P(color=1) = 0.75 */
proc print; var numWhite; run;

Hypergeometric Trials

When sampling with replacement, the probability of choosing a white ball never changes. However, you can also sample without replacement. You start with N balls in the urn, of which K1 are white. If you choose a white ball and remove it, the probability of choosing a white ball on the next draw is decreased. If you choose a black ball, the probability of choosing white increases. In addition, K1 is an upper bound for the maximum number of white balls that can ever be drawn.

In spite of the changing probability, the distribution of the number of white balls when you sample without replacement is well understood. It is called the hypergeometric distribution. Like the binomial distribution, using the hypergeometric distribution saves time because you don't need to use the Bernoulli distribution (with a changing probability) to simulate drawing 100 balls without replacement.

The following DATA step uses the hypergeometric distribution to draw a sample of 100 balls from an urn that contains 120 balls, 90 of which are white. For this distribution it is necessary to specify three parameters: the total number of balls (N), and the initial number of white balls (K1), and the sample size (n).

data Hyper;
call streaminit(54321);
do i = 1 to 5;
   /* sample without replacement from urn with 120 ball of which 90 are white */
   numWhite = rand("Hypergeometric", 120, 90, 100); /* N=120, K1=90; 100 draws */
proc print; var numWhite; run;

The values from the hypergeometric distribution have a smaller variance then the values from the binomial distribution. This is typical when the number of draws is close to the number of balls in the urn. When the number of balls is much greater than the number of draws, the hypergeometric distributions is similar to the binomial distribution.

In summary:

  • Use the Bernoulli distribution to analyze the probability of pulling a colored ball from an urn with two colors.
  • The binomial distribution model the distribution of the number of white balls when you sample n balls with replacement. In other words, you repeat n independent and identical Bernoulli trials.
  • The hypergeometric distribution models the distribution of the number of white balls when you sample n balls without replacement.

Other related discrete probability distributions

There are other discrete probability functions that can model ball-and-urn experiments in which the question is "how many trials do you need until" some event occurs:

  • The geometric distribution describes the number of trials that are needed to obtain one white ball. SAS supports the geometric distribution in the RAND function and in probability functions.
  • The negative binomial distribution is the distribution of the number of black balls before a specified number of white balls are drawn in a sampling with replacement experiment. SAS supports the negative binomial distribution as well
  • A lesser-known distribution is the negative hypergeometric distribution, which is the distribution of black balls when you sample without replacement. SAS does not support this distribution directly, but the hypergeometric distribution is useful for implementing it.
Post a Comment

Ten "one-liners" that create test matrices for statistical programmers

You've had a long day. You've implemented a custom algorithm in the SAS/IML language. But before you go home, you want to generate some matrices and test your program.

If you are like me, you prefer a short statement—one line would be best. However, you also want the flexibility to create large matrices to test the performance of the algorithm. And you'd like the matrices to be simple so that you can determine if the algorithm is working and debug it if necessary.

Last week's article on how to generate a large correlation matrix caused me to think about other quick and easy ways to generate matrices of an arbitrary size. Over the years I've developed favorite "test matrices" that I use to test algorithms, to write examples for my blog, or to answer questions on a discussion forum.

Here are ten one-line statements that generate numerical and character matrices of an arbitrary size. Most matrices have n rows and p columns. However, some functions generate square matrices, which are n x n.

  1. Constant matrix: The J function enables you to create a matrix with constant values. The syntax is
    const = j(n, p, value);
  2. Constant by row or column: The ROW and COL functions enable you to create a matrix where the ith row or column has the value i. The syntax is
    rows = row(j(n, p));
  3. Sequential and nonrepeating: The SHAPE function in conjunction with the index creation operator (:) enables you to create a matrix that contains sequential values. (You can also use the SHAPECOL function.) The syntax is
    seq = shape(1:n*p, n);
  4. Periodic: The MOD function returns the remainder when a number is divided by some value. When combined with the previous example, you obtain matrices for which the values repeat in a periodic manner. The syntax is
    mod = mod(shape(1:n*p, n), value);
  5. Symmetric matrices: The SQRSYM and SQRVECH functions enable you to create square symmetric matrices from a vector of values. The syntax is
    sym = sqrsym(1:n*(n+1)/2);
  6. Diagonally banded: The TOEPLITZ function enables you to create square banded matrices. The syntax is
    band = toeplitz(n:1);
  7. Magic squares: The MAGIC function (which requires SAS/IML 12.3) creates square matrices for which the sums of the rows, columns, and diagonals are equal. The syntax is
    magic = magic(n);
  8. Random integers: The SAMPLE function enables you to generate a random sample from a finite set. The syntax is
    samp = sample(1:value, p//n);
  9. Random values from a standard distribution: The RANDFUN function enables you to generate a matrix of random values from a standard distribution. The syntax is
    rand = randfun(n//p, "Normal");
  10. Character matrices: Many of the previous matrices can be modified to create character matrices. For any integer matrix, you can apply the MOD function to create integers in the range 1–26 that can be mapped to letters. The index creation operator (:) supports letters, so you can create a vector of all uppercase letters by using the syntax "A":"Z". Also, the SAMPLE function can create a random sample directly from any set of character values. The syntax is
    sampC = sample("A":"Z", p//n);

The following SAS/IML program generates examples of each of these matrices. You can change the values of n and p to create matrices of any size.

proc iml;
call randseed(12345);
n = 4;
p = 6;
value = 3;
/* constant and sequential matrices */
const = j(n, p, value);
rows = row(j(n, p));
cols = col(j(n,p));
seq = shape(1:n*p, n);
seq2 = shapecol(1:n*p, n);
mod = mod(shape(1:n*p, n), value);
/* square matrices */
symU = sqrvech(1:n*(n+1)/2);
symL = sqrsym(1:n*(n+1)/2);
band = toeplitz(n:1);
magic = magic(n);
/* random matrices */
samp = sample(1:value, p//n);
rand = randfun(n//p, "Normal");
/* character matrices */
letters = "A":"Z";
sampC = sample(letters, p//n);
idx = 1 + mod(0:n*p-1, ncol(letters));  /* values 1-26 */
modC = shape(letters[idx],  n);
print const, rows, cols, seq, seq2, mod,
      symU, symL, band, magic, samp, rand, 
      sampC, modC;

Do you have a favorite matrix that you use to test your programs? Leave a comment.

Post a Comment