
I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was accompanied by pseudocode. (In the box to the right; click to enlarge.)

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables `x`, `r`, and `p` are vectors and the variable `A` is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   https://en.wikipedia.org/wiki/Conjugate_gradient_method
   Solve the linear system A*x = b, where A is a symmetric positive
   definite matrix. The algorithm converges in at most n iterations,
   where n is the dimension of A. This function requires an initial
   guess, x0, which can be the zero vector. The function does not
   verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX = x1 || x2 || ... || xn whose columns
   contain the iterative path from x0 to the approximate solution vector xn. */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
   x = x0;                      /* initial guess */
   r = b - A*x0;                /* residual */
   p = r;                       /* initial direction */
   mX = j(nrow(A), nrow(A), .); /* optional: each column is result of an iteration */
   do k = 1 to ncol(mX) until ( done );
      rpr = r`*r;
      Ap = A*p;                 /* store partial results */
      alpha = rpr / (p`*Ap);    /* step size */
      x = x + alpha*p;
      mX[,k] = x;               /* remember new guess */
      r = r - alpha*Ap;         /* new residual */
      done = (sqrt(rpr) < Tolerance); /* stop if ||r|| < Tol */
      beta = (r`*r) / rpr;      /* coefficient for new direction */
      p = r + beta*p;           /* new direction */
   end;
   return ( mX );  /* return the entire iteration history, not just x */
finish;

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving large sparse systems, because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, x3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,      /* SPD Toeplitz matrix */
     4 5 4 3 2,
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2}; /* right-hand side */
n = ncol(A);
x0 = j(n,1,0);       /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];       /* for this problem, solution is in last column */
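If you want to experiment outside of SAS, the pseudocode translates just as readily into pure Python. The following sketch (my own illustrative translation, not part of the original program) solves the same Toeplitz system and recovers the solution stated above:

```python
import math

def mat_vec(A, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conj_grad(A, b, x0, tol=1e-6):
    """Conjugate gradient for a symmetric positive definite system A*x = b."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, mat_vec(A, x))]   # residual
    p = list(r)                                         # initial direction
    for _ in range(len(b)):                             # at most n iterations
        rr = dot(r, r)
        if math.sqrt(rr) < tol:                         # stop if ||r|| < tol
            break
        Ap = mat_vec(A, p)
        alpha = rr / dot(p, Ap)                         # step size
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # new guess
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]  # new residual
        beta = dot(r, r) / rr                           # coefficient for new direction
        p = [ri + beta * pi for ri, pi in zip(r, p)]    # new direction
    return x

A = [[5, 4, 3, 2, 1],
     [4, 5, 4, 3, 2],
     [3, 4, 5, 4, 3],
     [2, 3, 4, 5, 4],
     [1, 2, 3, 4, 5]]
b = [1, 3, 5, 4, 2]
x = conj_grad(A, b, [0.0] * 5)
```

Because the matrix is 5 x 5, the loop runs at most five times, and the result agrees with the solution (-0.75, 0, 1.5, 0.5, -0.75) to within the tolerance.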

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] );   /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it is impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.


These types of problems are specific examples of a single abstract problem, as follows:

- From a set of p values, generate all subsets that contain k < p elements. Call the subsets Y_{1}, Y_{2}, ..., Y_{t}, where t equals "p choose k". (In SAS, use the COMB function to compute the number of combinations: `t = comb(p,k)`.)
- For each subset, evaluate some function on the subset. Call the values z_{1}, z_{2}, ..., z_{t}.
- Return some statistic of the z_{i}. Often the statistic is a maximum or minimum, but it could also be a mean, variance, or percentile.

This is an "exhaustive" method that explicitly generates all subsets, so clearly this technique is impractical for large values of p. The examples that I've seen on discussion forums often use p ≤ 10 and small values of k (often 2, 3, or 4). For parameters in this range, an exhaustive solution is feasible.

This general problem includes "leave-one-out" or jackknife estimates as a special case (k = p – 1), so clearly this formulation is both general and powerful.
This formulation also includes the knapsack problem in discrete optimization. In the knapsack problem, you have p items and a knapsack that can hold k items. You want to choose the items so that the knapsack holds as much value as possible. The knapsack problem maximizes the *sum* of the values whereas the general problem in this article can handle nonlinear functions of the values.
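The abstract recipe above fits in a few lines of code. Here is a hedged Python sketch (the data values are hypothetical; the article's examples use SAS) that generates all k-subsets of a row, evaluates a function (a product) on each, and returns the maximum:

```python
from itertools import combinations
from math import prod

# Hypothetical row of data values for illustration
row = [3, -1, 4, 2, -2]
p, k = len(row), 3

# Step 1 & 2: generate all k-subsets and evaluate the function on each
products = [prod(c) for c in combinations(row, k)]

# Step 3: return a statistic of the values (here, the maximum)
best = max(products)
```

There are "5 choose 3" = 10 subsets, and the maximum three-value product for this row is 3*4*2 = 24. Because the method enumerates every subset, the cost grows combinatorially, which is why the exhaustive approach is limited to small p.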

You can use the following DATA step to simulate integer data with a specified number of columns and rows. I use the relatively new "Integer" distribution to generate uniformly distributed integers in the range [-3, 9].

%let p = 5;      /* number of variables */
%let NObs = 6;   /* number of observations */
data have(drop=i j);
   call streaminit(123);
   array x[&p];
   do i = 1 to &NObs;
      do j = 1 to dim(x);
         x[j] = rand("Integer", -3, 9);   /* SAS 9.4M4 */
      end;
      output;
   end;
run;

proc print data=have; run;

For p = 5 and k = 3, the problem is: "For each observation of the 5 variables, find the largest product among any 3 values." In the SAS/IML language, you can solve problems like this by using the ALLCOMB function to generate all combinations of size k from the index set {1,2,...,p}. These values are indices that you can use to reference each combination of values. You can evaluate your function on each combination and then compute the max, min, mean, etc. For example, the following SAS/IML statements generate all combinations of 3 values from the set {1, 2, 3, 4, 5}:

proc iml;
p = 5;  k = 3;
c = allcomb(p, k);   /* combinations of p items taken k at a time */
print c;

A cool feature of the SAS/IML language is that you can use these values as column subscripts! In particular, the expression `X[i, c]` generates all 3-fold combinations of values in the i_th row. You can then use the SHAPE function to reshape the values into a matrix that has 3 columns, as follows:

/* Example: find all combinations of elements in the first row */
varNames = "x1":"x5";
use have;  read all var varNames into X;  close;
Y = X[1, c];              /* all combinations of columns for 1st row */
M = shape(Y, nrow(c), k); /* reshape so each row has k elements */
prod = M[, #];            /* product of elements across columns */
print M prod;

Notice that each row of the matrix M contains k = 3 elements of Y. There are "5 choose 3" = 10 possible ways to choose 3 items from a set of 5, so the M matrix has 10 rows. Notice that you can use a subscript reduction operator (#) to compute the product of elements for each combination of elements. The maximum three-value product for the first row of data is 24.

The following loop performs this computation for each observation. The result is a vector that contains the maximum three-value product of each row. The original data and the results are then displayed side by side:

/* for each row of X1-X5, find maximum product of three elements */
result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];              /* get i_th row and all combinations of columns */
   M = shape(Y, nrow(c), k); /* reshape so each row has k elements */
   result[i] = max( M[,#] ); /* max product among the combinations */
end;
print X[colname=varNames] result[L="maxMul"];

Of course, if the computation for each observation is more complicated than in this example, you can define a function that computes the result and then call the module like this: `result[i] = MyFunc(M);`

You can perform a similar computation in the DATA step, but it requires more loops. You can use the ALLCOMBI subroutine (or the LEXCOMBI function) to generate all k-fold combinations of the indices {1, 2, ..., p}. Call the ALLCOMBI subroutine inside a loop from 1 to NCOMB(p, k). Inside the loop, you can evaluate the objective function on each combination of data values. Many DATA step functions such as MAX, MIN, SMALLEST, and LARGEST accept arrays of variables, so you probably want to store the variables and the indices in arrays. The following DATA step contains comments that describe each step of the program:

%let p = 5;
%let k = 3;
%let NChooseK = %sysfunc(comb(&p,&k));   /* N choose k */
data Want(keep=x1-x&p maxProd);
   set have;
   array x[&p] x1-x&p;     /* array of data */
   array c[&k];            /* array of indices */
   array r[&NChooseK];     /* array of results for each combination */
   ncomb = comb(&p, &k);   /* number of combinations */
   do i=1 to &k; c[i]=0; end;  /* zero the indices before first call to ALLCOMBI */
   do j = 1 to ncomb;
      call allcombi(&p, &k, of c[*]);  /* generate j_th combination of indices */
      /* evaluate function of the array {x[c[1]], x[c[2]], ..., x[c[k]]} */
      r[j] = 1;                        /* initialize product to 1 */
      do i=1 to &k;
         r[j] = r[j] * x[c[i]];        /* product of j_th combination */
      end;
   end;
   maxProd = max(of r[*]);   /* max of products */
   output;
run;

proc print data=Want; run;

The DATA step uses an array (R) of values to store the result of the function evaluated on each subset. For a MAX or MIN computation, this array is not necessary because you can keep track of the current MAX or MIN inside the loop over combinations. However, for more general problems (for example, find the median value), an array might be necessary.

In summary, this article shows how to solve a general class of problems. The general problem generates all subsets of size k from a set of size p. For each subset, you evaluate a function and produce a statistic. From among the "p choose k" statistics, you then choose the max, min, or some other measure. This article shows how to solve these problems efficiently in the SAS/IML language or in the SAS DATA step. Because this is a "brute force" technique, it is limited to small values of p. I suggest p ≤ 25.

The post Compute with combinations: Maximize a function over combinations of variables appeared first on The DO Loop.


The analysis is easy. Suppose that a song (or any text source) contains N words. Define the *repetition matrix* to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. Now visualize the matrix by using a heat map: Black indicates cells where the matrix is 1 and white for 0. A SAS program that performs this analysis is available at the end of this article.

To illustrate this algorithm, consider the nursery rhyme, "Row, Row, Row Your Boat":

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream.

There are 18 words in this song. Words 1–3 are repeated, as are words 10–13. You can use a SAS DATA step to read the words of the song into a variable and use other SAS functions to strip out any punctuation. You can then use SAS/IML software to construct and visualize the repetition matrix. The details are shown at the end of this article.

The repetition matrix for the song "Row, Row, Row Your Boat" is shown to the right. For this example I could have put the actual words along the side and bottom of the matrix, but that is not feasible for songs that have hundreds of words. Instead, the matrix has a numerical axis where the number indicates the position of each word in the song.

Every repetition matrix has 1s on the diagonal. In this song, the words "row" and "merrily" are repeated. Consequently, there is a 3 x 3 block of 1s at the top left and a 4 x 4 block of 1s in the middle of the matrix. (Click to enlarge.)

As mentioned, this song has very little repetition. One way to quantify the amount of repetition is to compute the proportion of 1s in the upper triangular portion of the repetition matrix. The upper triangular portion of an N x N matrix has N(N–1)/2 elements. For this song, N=18, so there are 153 cells and 9 of them are 1s. Therefore the "repetition score" is 9 / 153 = 0.059.
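The computation in the previous paragraph is easy to check for yourself. Here is a hedged Python sketch (my own illustrative version; the article's implementation is in SAS/IML, shown later) that builds the repetition matrix for the nursery rhyme and computes the repetition score:

```python
# Words of "Row, Row, Row Your Boat", lowercased with punctuation removed,
# as the article does with LOWCASE and COMPRESS in SAS
lyrics = """row row row your boat
gently down the stream
merrily merrily merrily merrily
life is but a dream"""
words = lyrics.split()
n = len(words)

# Repetition matrix: M[i][j] = 1 when word i equals word j
M = [[int(words[i] == words[j]) for j in range(n)] for i in range(n)]

# Repetition score: proportion of 1s in the strict upper triangle,
# which has n*(n-1)/2 cells
ones = sum(M[i][j] for i in range(n) for j in range(i + 1, n))
score = ones / (n * (n - 1) / 2)
```

For this song, n = 18, the upper triangle has 153 cells, 9 of them are 1s, and the score is 9/153 ≈ 0.059, as stated above.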

I wrote a SAS/IML function that creates and visualizes the repetition matrix and returns the repetition score. In order to visualize songs that might have hundreds of words, I suppress the outlines (the grid) in the heat map. To illustrate the output of the function, the following image visualizes the words of the song "Here We Go Round the Mulberry Bush":

Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.

The repetition score for this song is 0.087. You can see diagonal "stripes" that correspond to the repeating phrases "here we go round" and "the mulberry bush". In fact, if you study only the first seven rows, you can "see" almost the entire structure of the song. The first seven words contain all lyrics except for four words ("so", "early", "in", "morning").

Let's visualize the repetitions in the lyrics of several classic songs.

When I saw Morris's examples, the first song I wanted to visualize was "Hey Jude" by the Beatles. Not only does the title phrase repeat throughout the song, but the final chorus ("Nah nah nah nah nah nah, nah nah nah, hey Jude") repeats more than a dozen times. This results in a very dense block in the lower right corner of the repetition matrix and a very high repetition score of 0.183. The following image visualizes "Hey Jude":

The second song that I wanted to visualize was "Love Shack" by The B-52s. In addition to a title that repeats almost 40 times, the song contains a sequence near the end in which the phrase "Bang bang bang on the door baby" is alternated with various interjections. The following visualization of the repetition matrix indicates that there is a lot of variation interspersed with regular repetition. The repetition score is 0.035.

Lastly, I wanted to visualize the song "Call Me" by Blondie. This classic song has only 241 words, yet the title is repeated 41 times! In other words, about 1/3 of the song consists of those two words! Furthermore, there is a bridge in the middle of the song in which the phrase "oh oh oh oh oh" is alternated with other phrases (some in Italian and French) that appear only once in the song. The repetition score is 0.077. The song is visualized below:

If you think this is a fun topic, you can construct these images yourself by using SAS. If you discover a song that has an interesting repetition matrix, post a comment!

Here's the basic idea of how to construct and visualize a repetition matrix. First, use the DATA step to read each word, use the COMPRESS function to remove any punctuation, and standardize the input by transforming all words to lowercase:

data Lyrics;
   length word $20;
   input word @@;
   word = lowcase( compress(word, ,'ps') );  /* remove punctuation and spaces */
   datalines;
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
;

In SAS/IML software you can use the ELEMENT function to find the locations in the i_th row that have the value 1. After you construct a repetition matrix, you can use the HEATMAPDISC subroutine to display it. For example, the following SAS/IML program reads the words of the song into a vector and visualizes the repetition matrix. It also returns the repetition score, which is the proportion of 1s in the upper triangular portion of the matrix.

ods graphics / width=500 height=500;  /* for 9.4M5 you might need NXYBINSMAX=1000000 */
proc iml;
/* define a function that creates and visualizes the repetition matrix */
start VizLyrics(DSName, Title);
   use (DSName); read all var _CHAR_ into Word; close;
   N = nrow(Word);
   M = j(N,N,0);                      /* allocate N x N matrix */
   do i = 1 to N;
      M[,i] = element(Word, Word[i]); /* construct i_th column */
   end;
   run heatmapdisc(M) title=Title colorramp={white black}
       displayoutlines=0 showlegend=0;
   /* compute the proportion of 1s in the upper triangular portion of the matrix */
   upperIdx = loc(col(M)>row(M));
   return ( M[upperIdx][:] );         /* proportion of words that are repeated */
finish;

score = VizLyrics("Lyrics", "Here We Go Round the Mulberry Bush");
print score;

If you want to reproduce the images in this post, you can download the SAS program for this article. In addition, the program creates repetition matrices for "We Didn't Start the Fire" (Billy Joel) and a portion of Martin Luther King Jr.'s "I Have a Dream" speech. You can modify the program and enter lyrics for your favorite songs.

The post Visualize repetition in song lyrics appeared first on The DO Loop.


Pi is a mathematical constant that never changes. Pi is the same value today as it was in ancient Babylon and Greece. The timeless constancy of pi is a comforting presence in a world of rapid change.

But even though the value of pi does not change, our *knowledge about pi* does change and grow.
I was reminded of this recently when I opened my worn copy of
the *Handbook of Mathematical Functions* (more commonly known as "Abramowitz and Stegun," the names of its editors). When the 1,046-page *Handbook* was published in 1964, it was the premier reference volume for applied mathematicians and mathematical scientists.
Interestingly, pi is not even listed in the index! It does appear on p. 3 under "Mathematical Constants," which gives a 25-digit approximation of many mathematical constants such as pi, e, and sqrt(2).

Fast forward to the age of the internet.
In 2010, the *Handbook* was transformed into an expanded online, searchable, interactive web site. The new *Handbook* is called The NIST Digital Library of Mathematical Functions.
This is very exciting because the *Handbook* is now available (for free!) to everyone!

If you search for pi in the online Digital Library, you find that the editors chose to define pi as the value of the integral π = 4 ∫_{0}^{1} dt/(1 + t^{2}).

This seems to be a strange way to define pi. Pi is the ratio of the circumference and diameter of a circle, and upon first glance that formula doesn't seem related to a circle.
A more geometric choice would be an integrand such as
sqrt(1 – t^{2}), which connects pi to the area under the unit circle.

Of course, the integral in the Digital Library *is* equal to pi, but it is not obvious.
You might recall from calculus that the antiderivative of 1/(1+t^{2}) is arctan(t). Therefore the expression is just a complicated way to write 4 arctan(1). Ah! This makes more sense because arctan(1) is equal to π/4. In fact, before SAS introduced the CONSTANT function, SAS programmers used to define pi by using the computation `pi = 4*ATAN(1)`. Nevertheless, I think expressing arctan(1) as an integral is unnecessarily obtuse.
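You can verify both routes numerically. The following sketch (mine, for illustration) approximates the Digital Library's integral with a midpoint rule and compares it to the arctangent formula that SAS programmers used before the CONSTANT function existed:

```python
import math

# Midpoint-rule approximation of 4 * integral from 0 to 1 of dt/(1+t^2),
# which equals pi
n = 100_000
h = 1.0 / n
approx = 4 * h * sum(1.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(n))

# The antiderivative route: the integral is 4*arctan(1), the classic
# pre-CONSTANT definition of pi
closed_form = 4 * math.atan(1)
```

The midpoint rule with 100,000 subintervals agrees with π to about ten decimal digits, and 4*arctan(1) agrees to machine precision.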

I am not enamored with the editors' choice of an integral to define pi, but if I were to use that integrand to define pi, I would use a variation that has applications in probability and statistics. Statisticians sometimes use the Cauchy distribution, which is a fat-tailed distribution that has the interesting mathematical property that the distribution has no mean (expected) value! (Mathematicians say that "the first moment does not exist.") Researchers in robust statistical methods sometimes use Cauchy-distributed errors to generate extreme outliers in simulated data.

The Cauchy probability density function (PDF) is 1/(π (1 + t^{2})), which means that
the integral of the PDF on the interval (-∞, ∞) is 1. Equivalently, the integral of
1/(1+t^{2}) on the interval (-∞, ∞) is π:

This definition of pi seems more natural than the integral on [0, 1]. I could make other suggestions (such as the integral of arccos on [-1, 1]), but I think I'll stop here.

The purpose of this post is to celebrate pi, which is so ubiquitous and important that it can be defined in numerous ways.
A secondary purpose is to highlight the availability of the
NIST Digital Library of Mathematical Functions, which is an online successor of the venerable *Handbook of Mathematical Functions*. I am thrilled with the availability of this amazing resource, regardless of how they define pi!

To complete this Pi Day post, I leave you with a pi-ku. A pi-ku is like a haiku, except that the number of syllables in each line follows the digits in the decimal expansion of pi. A common structure for a pi-ku is 3-1-4. The following pi-ku celebrates the new Digital Library:

Handbook of

Math

Functions? Online!

The post Pi, special functions, and distributions appeared first on The DO Loop.


This question was asked by a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data. In particular, the programmer said, "we have the 50th and 90th percentile" of the data and "want to find the parameters for the gamma distribution [that fit] our data."

This is an interesting question. Recall that the method of moments uses sample *moments* (mean, variance, skewness,...) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution's moments equal to the sample moments, and solve for the parameter values for which the equation is true.

In a similar way, you can fit a distribution by matching quantiles: Equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called *quantile-matching estimation* (QME). Because the quantiles involve the cumulative distribution function (CDF), the equations do not usually have a closed-form solution and must be solved numerically.

To answer the programmer's question, suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β). To use quantile-matching estimation, set F(4; α, β) = 0.5 and F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.
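The gamma CDF does not invert in closed form, which is why the next section solves the equations numerically with PROC MODEL. To see the quantile-matching idea in miniature, here is an illustrative Python sketch (mine, not from the article) that matches the same two quantiles to a Weibull distribution, chosen only because its CDF *does* yield a closed-form solution:

```python
import math

# Match F(4) = 0.5 and F(8) = 0.9 for a Weibull CDF F(x) = 1 - exp(-(x/lam)**k).
# Taking log(-log(1-p)) = k*log(x) - k*log(lam) gives two equations that are
# linear in k and k*log(lam), so the parameters solve in closed form.
p1, x1 = 0.5, 4.0
p2, x2 = 0.9, 8.0

g1 = math.log(-math.log(1 - p1))
g2 = math.log(-math.log(1 - p2))
k = (g2 - g1) / (math.log(x2) - math.log(x1))   # shape parameter
lam = x1 / (-math.log(1 - p1)) ** (1 / k)       # scale parameter

def weibull_cdf(x):
    return 1 - math.exp(-((x / lam) ** k))
```

By construction, the fitted CDF passes exactly through both quantiles: weibull_cdf(4) = 0.5 and weibull_cdf(8) = 0.9. For the gamma distribution the same matching conditions must be solved iteratively, as shown next.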

I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:

data initial;
   alpha=1; beta=1;   /* initial guess for finding root */
   p1=0.5; X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
   p2=0.9; X2 = 8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;

proc model data=initial;
   eq.one = cdf("Gamma", X1, alpha, beta) - p1;  /* find root of eq1 */
   eq.two = cdf("Gamma", X2, alpha, beta) - p2;  /* and eq2 */
   solve alpha beta / solveprint out=solved outpredict;
run; quit;

proc print data=solved noobs;
   var alpha beta;
run;

The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:

/* Graph the CDF function to verify that the solution makes sense */
data Check;
   set solved;   /* estimates of (alpha, beta) from solving eqns */
   do x = 0 to 12 by 0.2;
      CDF = cdf("gamma", x, alpha, beta);
      output;
   end;
run;

title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 X=4 / dropto=both;   /* first percentile */
   dropline y=0.9 X=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;

The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are *m* sample quantiles,
the least squares solution is the set of parameters that minimizes the sum of squares Σ_{i=1}^{m} (p_{i} – *F*(x_{i}; α, β))^{2}.

For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:

data SampleQntls;
   input p q;   /* p is cumul probability; q is p_th sample quantile */
   datalines;
0.1  1.48
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99
;

/* least squares fit of parameters */
proc nlin data=SampleQntls            /* sometimes the NOHALVE option is useful */
          outest=PE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;

proc print data=PE noobs;
   var alpha beta;
run;

The solution indicates the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph the quantile-quantile plot of the observed and fitted quantiles.

For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance.
According to Stuart and Ord (*Kendall's Advanced Theory of Statistics*, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size *n* is σ^{2} = p(1-p) / (n *f*(ξ_{p})^{2}), where
ξ_{p} is the p_th quantile of the distribution and *f* is the probability density function.
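The formula is easy to evaluate for any distribution whose quantile function and density you can compute. As a hedged illustration (the article applies the formula to the fitted gamma distribution; I use the standard normal here only because the Python standard library exposes its `pdf` and `inv_cdf`), the following sketch computes the variance-based weights for the five sample probabilities used above:

```python
from statistics import NormalDist

# Variance of the p_th sample quantile: p*(1-p) / (n * f(xi_p)**2), where
# xi_p is the p_th quantile and f is the density. Illustrated for N(0,1).
nd = NormalDist()
n = 80   # sample size, matching the N used in the PROC NLIN step below

def quantile_variance(p):
    xi = nd.inv_cdf(p)   # p_th quantile of the distribution
    f = nd.pdf(xi)       # density at that quantile
    return p * (1 - p) / (n * f ** 2)

# weight each sample quantile by the inverse of its variance
weights = {p: 1 / quantile_variance(p) for p in (0.1, 0.25, 0.5, 0.75, 0.9)}
```

As the text predicts, the tail quantiles (p = 0.1 and p = 0.9) have the largest variance and therefore receive the smallest weights, while the median receives the largest weight.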

In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_. The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.

/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE
          outest=WPE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                                  /* sample size */
   xi = quantile("gamma", p, alpha, beta);  /* quantile of distrib */
   f = pdf("Gamma", xi, alpha, beta);       /* density at quantile */
   variance = p*(1-p) / (N * f**2);         /* variance of sample quantile */
   _weight_ = 1 / variance;                 /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;

The parameter estimates for the weighted analysis are slightly different than for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.

In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.

The post Fit a distribution from quantiles appeared first on The DO Loop.

The post The probability of a saddle point in a matrix appeared first on The DO Loop.

- Locate a saddle point in a matrix.
- Use simulation to estimate the probability of finding a saddle point in a random matrix.
- Use simulation to explore the distribution of the location of a saddle point in a random matrix.

You might remember from multivariable calculus that a critical point (x0, y0) is a *saddle point* of a function *f* if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is *f*(x,y) = x^{2} - y^{2} at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.

The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.

Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.

- For the matrix M1, the (3, 1) cell is a saddle point because 7 is the smallest value in row 3 and the largest value in column 1.
- For the matrix M2, the (2, 2) cell is a saddle point because 5 is the smallest value in row 2 and the largest value in column 2.
- The matrix M3 does not have a saddle point. No cell is simultaneously the smallest value in its row and the largest value in its column.
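The row-minimum/column-maximum check is easy to carry out in any language. For readers who want to verify the three examples outside of SAS, here is a short Python sketch (the function name `saddle_point` is mine, not part of the original post):

```python
def saddle_point(M):
    """Return (i, j) if M[i][j] is both the minimum of row i and the
    maximum of column j; return None if no saddle point exists."""
    for i, row in enumerate(M):
        j = row.index(min(row))                 # column of the row minimum
        if row[j] == max(r[j] for r in M):      # is it also the column maximum?
            return (i + 1, j + 1)               # 1-based, as in the article
    return None

M1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
M2 = [[9, 1, 2], [8, 5, 7], [3, 4, 6]]
M3 = [[8, 1, 9], [7, 2, 6], [3, 4, 5]]
print(saddle_point(M1))  # (3, 1): 7 is the min of row 3 and the max of column 1
print(saddle_point(M2))  # (2, 2)
print(saddle_point(M3))  # None
```

Because the entries are distinct, each row has a unique minimum, and a matrix has at most one saddle point, so returning the first hit is safe.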

In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the "index of the minimum operator" (**>:<**) to find the elements that are the minimum for each row and the "index of the maximum operator" (**<:>**) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function to find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)

A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.

proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ ,>:<];                    /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */
   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* location as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to index in [1, np] */
   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then return ( xsect );
   else return ( . );
finish;

M1 = {1 2 3, 4 5 6, 7 8 9};  idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};  idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};  idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;

The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order. The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.

A matrix contains either zero or one saddle point. There is a theorem (Thorp, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from *any* continuous probability distribution.

Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point are invariant under any monotone increasing transformation because such a transformation does not change the relative order of elements. In particular, if the elements are drawn from a distribution with CDF F, the existence and location of the saddle point are unchanged when you apply F itself, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the numbers by their ranks, which converts the random matrix into a matrix that contains the consecutive integers 1, 2, ..., np. Thus the previous integer matrices represent the ranks of a random matrix whose elements are drawn from any continuous distribution.

For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np]. You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:

A = allperm(9);            /* all permutations of 1:9 */
saddle = j(nrow(A), 1);    /* allocate vector for results */
do i = 1 to nrow(A);       /* for each permutation (row) ... */
   M = shape(A[i,], 3);    /* reshape row into 3x3 matrix */
   idx = LocSaddlePt( M ); /* find location of saddle */
   saddle[i] = (idx > 0);  /* 0/1 indicator variable */
end;
prob = saddle[:];          /* (sum of 1s) / (number of matrices) */
print prob;

As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.
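If you want to double-check the enumeration outside of SAS/IML, the computation takes only a few lines in any language. Here is a Python sketch (the function names are mine) that enumerates all rank matrices and compares the result with the formula:

```python
from itertools import permutations
from math import factorial

def has_saddle(M):
    """True if some entry is both the minimum of its row and the maximum of its column."""
    for row in M:
        j = row.index(min(row))             # column of the row minimum
        if row[j] == max(r[j] for r in M):  # is it also the column maximum?
            return True
    return False

def exact_saddle_prob(n, p):
    """Probability of a saddle point, by enumerating all (n*p)! rank matrices."""
    count = 0
    for perm in permutations(range(1, n*p + 1)):
        M = [list(perm[r*p:(r+1)*p]) for r in range(n)]
        count += has_saddle(M)
    return count / factorial(n * p)

print(exact_saddle_prob(3, 3))          # 0.3 by enumeration
print(factorial(3)**2 / factorial(5))   # 0.3 by the formula (3! 3!)/5!
```

The full enumeration is feasible only for tiny matrices; 9! = 362,880 permutations already takes a few seconds in an interpreted language.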

Of course, when *n* and *p* are larger it becomes infeasible to completely enumerate all n x p matrices. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion of matrices that contain a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that has a saddle point. This is a Monte Carlo estimate of the true probability.

NSim = 1e6;
n = 4;  p = 4;               /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");

saddleLoc = j(NSim, 1, .);   /* store location of the saddle point */
saddle = j(NSim, 1);         /* binary indicator variable */
do i = 1 to NSim;            /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0 );  /* 0/1 binary variable */
end;
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;

The output shows that the estimated probability agrees with the exact probability to four decimal places.

The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by transpositions of rows and columns. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times the probability that a saddle point exists.

In the preceding SAS/IML simulation, the `saddleLoc` variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:

call tabulate(location, freq, saddleLoc);  /* frequencies in cells 1-16 */
counts = shape(freq, n, p);                /* shape into matrix */
EstProbLoc = counts / NSim;                /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];

The output indicates that each cell has a probability that is estimated by 0.007 (the "James Bond" probability!). Theoretically, you can prove that the probability in each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).
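As a quick numeric check of Hofri's result, you can compare the complete beta function B(n, p) = Γ(n)Γ(p)/Γ(n+p) with the per-cell probability, which is the saddle-point probability divided by the number of cells. A few lines of Python confirm the agreement for n = p = 4:

```python
from math import gamma, factorial

def beta_fn(a, b):
    """Complete beta function B(a, b) = Gamma(a)*Gamma(b)/Gamma(a+b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

n = p = 4
prob_saddle = factorial(n) * factorial(p) / factorial(n + p - 1)  # P(saddle exists)
per_cell = prob_saddle / (n * p)                                  # uniform over the np cells
print(beta_fn(n, p), per_cell)   # both are 0.0071428...
```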

In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, and the complete beta function, and got to write some fun SAS/IML programs. For more information about the probability of a saddle point in random matrices, see "The probability that a matrix has a saddle point" (Thorp, 1979) and "On the distribution of a saddle point value in a random matrix" (Hofri, 2006).

The post The probability of a saddle point in a matrix appeared first on The DO Loop.


This article shows how to use SAS to solve a system of nonlinear equations. When there are *n* unknowns and *n* equations, this problem is equivalent to finding a multivariate root of a vector-valued function **F**(**x**) = **0** because you can always write the system as

f_{1}(x1, x2, ..., xn) = 0

f_{2}(x1, x2, ..., xn) = 0

. . .

f_{n}(x1, x2, ..., xn) = 0

Here
the f_{i} are the nonlinear component functions, **F** is the vector (f_{1}, f_{2}, ..., f_{n}), and **x** is the vector (x1, x2, ..., xn).

In two dimensions, the solution can be visualized as the intersection of two planar curves.
An example for *n* = 2 is shown at the right. The two curves meet at the solution (x, y) = (1, 2).

There are several ways to solve a system of nonlinear equations in SAS, including:

- In SAS/IML software, you can use the NLPLM or NLPHQN methods to solve the corresponding least-squares problem. Namely, find the value of **x** that minimizes ||**F**(**x**)||.
- In SAS/ETS software, you can use the SOLVE statement in PROC MODEL to solve the system.
- In SAS/STAT software, you can use the NLIN procedure to solve the system.
- In SAS/OR software, you can use PROC OPTMODEL to solve the system.

When *n* = 1, the problem is one-dimensional. You can use the FROOT function in SAS/IML software to find the root of a one-dimensional function. You can also use the SOLVE function in conjunction with PROC FCMP.

This article shows how to find a root for the following system of three equations:

f_{1}(x, y, z) = log(x) + exp(-x*y) - exp(-2)

f_{2}(x, y, z) = exp(x) - sqrt(z)/x - exp(1) + 2

f_{3}(x, y, z) = x + y - y*z + 5

You can verify that the value (x, y, z)=(1, 2, 4) is an exact root of this system.
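Before turning to the SAS procedures, you can confirm the root numerically. The following Python sketch (helper names are mine; it uses a plain Newton iteration with a finite-difference Jacobian, which is *not* the algorithm that the SAS procedures below use) verifies that (1, 2, 4) zeros all three components and recovers the root from a nearby starting guess:

```python
from math import exp, log, sqrt

def F(v):
    """The three equations; F(v) = 0 at the root."""
    x, y, z = v
    return [log(x) + exp(-x*y) - exp(-2),
            exp(x) - sqrt(z)/x - exp(1) + 2,
            x + y - y*z + 5]

def newton(F, v0, tol=1e-10, maxit=50, h=1e-7):
    """Newton iteration with a forward-difference Jacobian and Gaussian elimination."""
    v = list(v0)
    n = len(v)
    for _ in range(maxit):
        f = F(v)
        if max(abs(c) for c in f) < tol:
            break
        J = [[0.0] * n for _ in range(n)]       # J[i][j] = dF_i / dv_j
        for j in range(n):
            vp = list(v)
            vp[j] += h
            fp = F(vp)
            for i in range(n):
                J[i][j] = (fp[i] - f[i]) / h
        A = [J[i] + [-f[i]] for i in range(n)]  # augmented system J*step = -f
        for c in range(n):                      # elimination with partial pivoting
            piv = max(range(c, n), key=lambda r: abs(A[r][c]))
            A[c], A[piv] = A[piv], A[c]
            for r in range(c + 1, n):
                m = A[r][c] / A[c][c]
                for k in range(c, n + 1):
                    A[r][k] -= m * A[c][k]
        step = [0.0] * n
        for i in range(n - 1, -1, -1):          # back substitution
            s = sum(A[i][k] * step[k] for k in range(i + 1, n))
            step[i] = (A[i][n] - s) / A[i][i]
        v = [v[i] + step[i] for i in range(n)]
    return v

print(F([1, 2, 4]))                 # each component is 0 (up to rounding)
root = newton(F, [1.2, 1.8, 3.5])   # start near the root
print(root)                         # converges to (1, 2, 4)
```

A bare Newton iteration converges only from a good starting point; the SAS routines discussed below wrap the iteration in more robust globalization strategies.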

You can use the NLPLM or NLPHQN methods in SAS/IML to solve nonlinear equations. You need to define a function that returns the value of the function as a row vector. This is very important: **the function must return a row vector!** If the domain of any component of the function is restricted (for example, because of LOG or SQRT functions), you can define a linear constraint matrix.
You then supply an initial guess and call the NLPLM routine to solve the least-squares problem
that minimizes 1/2 (f_{1}^{2} + ... + f_{n}^{2}). Obviously the minimum occurs when each component is zero, that is, when (x,y,z) is a root of the vector-valued function. You can solve for the root as follows:

proc iml;
start Fun(var);
   x = var[1];  y = var[2];  z = var[3];
   f = j(1, 3, .);        /* return a ROW VECTOR */
   f[1] = log(x) + exp(-x*y) - exp(-2);
   f[2] = exp(x) - sqrt(z)/x - exp(1) + 2;
   f[3] = x + y - y*z + 5;
   return (f);
finish;

/*     x[1] x[2] x[3] constraints. Lower bounds in 1st row; upper bounds in 2nd row */
con = {1e-6  .  1e-6,    /* x[1] > 0 and x[3] > 0; no bounds on y */
         .   .    .};
x0 = {1 1 1};            /* initial guess */
optn = {3                /* solve least-squares problem that has 3 components */
        1};              /* amount of printing */
call nlphqn(rc, Soln, "Fun", x0, optn) blc=con;   /* or use NLPLM */
print Soln;
quit;

The NLPHQN routine converges to the solution (1, 2, 4). Notice that the first element of the `optn` vector must contain *n*, the number of equations in the system.

If you have access to SAS/ETS software, PROC MODEL provides a way to solve simultaneous equations. You first create a SAS data set that contains an initial guess for the solution. You then define the equations in PROC MODEL and use the SOLVE statement to solve the system, as follows:

data InitialGuess;
   x=1; y=1; z=1;   /* initial guess for Newton's method */
run;

proc model data=InitialGuess;
   bounds 0 < x z;
   eq.one   = log(x) + exp(-x*y) - exp(-2);
   eq.two   = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq.three = x + y - y*z + 5;
   solve x y z / solveprint out=solved outpredict;
run; quit;

title "Solution from PROC MODEL in SAS/ETS";
proc print data=solved noobs;
   var x y z;
run;

A nice feature of PROC MODEL is that it automatically generates symbolic derivatives and uses them in the solution of the simultaneous equations. If you want to use derivatives in PROC IML, you must specify them yourself. Otherwise, the NLP routines use numerical finite-difference approximations.

You can solve a system of equations by using only SAS/STAT software, but you need to know a trick. My colleague who supports PROC NLIN says he has "seen this trick before" but does not know who first thought of it. I saw it in a 2000 paper by Nam, Cho, and Shim (in Korean).

Because PROC NLIN is designed to solve regression problems, you need to recast the problem in terms of a response variable, explanatory variables, and parameters. Recall that ordinary least squares regression enables you to solve a linear system such as

0 = C1*v1 + C2*v2 + C3*v3

where the left-hand side is a response vector (the zero vector), the C_i are regression coefficients, and the v_i are explanatory variables. (You need three or more observations to solve this regression problem.) PROC NLIN enables you to solve more complex regression problems. In particular, the coefficients can be nonlinear functions of parameters. For example, if the parameters are (x,y,z), you can solve the following system:

0 = C1(x,y,z)*v1 + C2(x,y,z)*v2 + C3(x,y,z)*v3.

To solve this nonlinear system of equations, you can choose the explanatory variables to be coordinate basis functions: v1=(1,0,0), v2=(0,1,0), and v3=(0,0,1). These three observations define three equations for three unknown parameters. In general, if you have *n* equations in *n* unknowns, you can specify *n* coordinate basis functions.

To accommodate an arbitrary number of equations, the following DATA step generates *n* basis vectors, where *n* is given by the value of the macro variable `numEqns`. The BasisVectors data set contains a column of zeros (the LHS variable):

%let numEqns = 3;
data BasisVectors;
   LHS = 0;
   array v[&numEqns];
   do i = 1 to dim(v);
      do j = 1 to dim(v);
         v[j] = (i=j);   /* 1 when i=j; 0 otherwise */
      end;
      output;
   end;
   drop i j;
run;

title "Solution from PROC NLIN in SAS/STAT";
proc nlin data=BasisVectors;
   parms x 1 y 1 z 1;   /* initial guess */
   bounds 0 < x z;      /* linear constraints */
   eq1 = log(x) + exp(-x*y) - exp(-2);
   eq2 = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq3 = x + y - y*z + 5;
   model LHS = eq1*v1 + eq2*v2 + eq3*v3;
   ods select EstSummary ParameterEstimates;
run;

The problem contains three parameters and the data contains three observations. Consequently, the standard errors and confidence intervals are not meaningful. The parameter estimates are the solution to the nonlinear simultaneous equations.

With PROC OPTMODEL in SAS/OR software, you can express the system in a natural syntax. You can either minimize the objective function `F = 0.5 * (f1**2 + f2**2 + f3**2)` or solve the system directly by specifying constraints but not an objective function, as follows:

title "Solution from PROC OPTMODEL in SAS/OR";
proc optmodel;
   var x init 1, y init 1, z init 1;
   /* -or- var x >= 1e-6 init 1, y init 1, z >= 0 init 1;  to specify bounds */
   con c1: log(x) + exp(-x*y) = exp(-2);
   con c2: exp(x) - sqrt(z)/x = exp(1) - 2;
   con c3: x + y - y*z = -5;
   solve noobjective;
   print x y z;
quit;

The solution is (x,y,z)=(1,2,4) and is not shown.

In summary, there are multiple ways to solve systems of nonlinear equations in SAS. My favorite ways are the NLPHQN function in SAS/IML and the SOLVE statement in PROC MODEL in SAS/ETS. However, you can also use PROC NLIN in SAS/STAT software or PROC OPTMODEL in SAS/OR. When you need to solve a system of simultaneous nonlinear equations in SAS, you can choose whichever method is most convenient for you.

The post Solve a system of nonlinear equations with SAS appeared first on The DO Loop.


This article gives several examples of using the
`FIRST.`*variable* and `LAST.`*variable* indicator
variables for BY-group analysis in the SAS DATA step.
The first example shows how to compute counts and cumulative amounts for each BY group.
The second example shows how to compute the time between the first and last visit of a patient to a clinic, as well as the change in a measured quantity between the first and last visit.
BY-group processing in the DATA step is a fundamental operation that belongs in every SAS programmer's tool box.

The first example uses data from the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. The data are distributed with SAS.
The following DATA step extracts the `Smoking_Status` and `Weight` variables and sorts the data by the `Smoking_Status` variable:

proc sort data=Sashelp.Heart(keep=Smoking_Status Weight) out=Heart;
   by Smoking_Status;
run;

Because the data are sorted by the `Smoking_Status` variable, you can use the `FIRST.Smoking_Status` and
`LAST.Smoking_Status` temporary variables to count the number of observations in each level of the Smoking_Status variable. (PROC FREQ computes the same information, but does not require sorted data.)
When you use the BY Smoking_Status statement, the DATA step automatically creates the
`FIRST.Smoking_Status` and
`LAST.Smoking_Status` indicator variables. As its name implies,
the `FIRST.Smoking_Status` variable has the value 1 for the first observation in each BY group and the value 0 otherwise.
(More correctly, the value is 1 for the first record and for records for which the Smoking_Status variable is different than it was for the previous record.)
Similarly, the
`LAST.Smoking_Status` indicator variable has the value 1 for the last observation in each BY group and 0 otherwise.

The following DATA step defines a variable named `Count` and initializes `Count=0` at the beginning of each BY group.
For every observation in the BY group, the `Count` variable is incremented by 1. When the last record in each BY group is read, that record is written to the Count data set.

data Count;
   set Heart;                /* data are sorted by Smoking_Status */
   BY Smoking_Status;        /* automatically creates indicator vars */
   if FIRST.Smoking_Status then
      Count = 0;             /* initialize Count at beginning of each BY group */
   Count + 1;                /* increment Count for each record */
   if LAST.Smoking_Status;   /* output only the last record of each BY group */
run;

proc print data=Count noobs;
   format Count comma10.;
   var Smoking_Status Count;
run;

The same technique enables you to accumulate values of a variable within a group. For example, you can accumulate the total weight of all patients in each smoking group by using the following statements:

if FIRST.Smoking_Status then cumWt = 0;
cumWt + Weight;

This same technique can be used to accumulate revenue from various sources, such as departments, stores, or regions.

Another common use of the `FIRST.`*variable* and `LAST.`*variable* indicator variables is to determine the length of time between a patient's first visit and his last visit. Consider the following DATA step, which defines the dates and weights for four male patients who visited a clinic as part of a weight-loss program:

data Patients;
   informat Date date7.;
   format Date date7. PatientID Z4.;
   input PatientID Date Weight @@;
datalines;
1021 04Jan16 302  1042 06Jan16 285  1053 07Jan16 325  1063 11Jan16 291
1053 01Feb16 299  1021 01Feb16 288  1063 09Feb16 283  1042 16Feb16 279
1021 07Mar16 280  1063 09Mar16 272  1042 28Mar16 272  1021 04Apr16 273
1063 20Apr16 270  1053 28Apr16 289  1053 13May16 295  1063 31May16 269
;

For these data, you can sort by the patient ID and by the date of visit. After sorting, the first record for each patient contains the first visit to the clinic and the last record contains the last visit. You can subtract the patient's weight for these dates to determine how much the patient gained or lost during the trial. You can also use the INTCK function to compute the elapsed time between visits. If you want to measure time in days, you can simply subtract the dates, but the INTCK function enables you to compute duration in terms of years, months, weeks, and other time units.

proc sort data=Patients;
   by PatientID Date;
run;

data weightLoss;
   set Patients;
   BY PatientID;
   retain startDate startWeight;   /* RETAIN the starting values */
   if FIRST.PatientID then do;     /* remember the initial values */
      startDate = Date;
      startWeight = Weight;
   end;
   if LAST.PatientID then do;
      endDate = Date;
      endWeight = Weight;
      elapsedDays = intck('day', startDate, endDate);  /* elapsed time (in days) */
      weightLoss = startWeight - endWeight;            /* weight loss */
      AvgWeightLoss = weightLoss / elapsedDays;        /* average weight loss per day */
      output;                      /* output only the last record in each group */
   end;
run;

proc print noobs;
   var PatientID elapsedDays startWeight endWeight weightLoss AvgWeightLoss;
run;

The output data set summarizes each patient's activities at the clinic, including his average weight loss and the duration of his treatment.
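The FIRST./LAST. logic is not unique to the DATA step. As a cross-check, here is a Python sketch (stdlib only) that groups the same patient records and reproduces the elapsed-days and weight-loss computations:

```python
from datetime import date
from itertools import groupby

# (PatientID, visit date, weight): the records from the DATA step above
visits = [
    (1021, date(2016, 1, 4), 302), (1021, date(2016, 2, 1), 288),
    (1021, date(2016, 3, 7), 280), (1021, date(2016, 4, 4), 273),
    (1042, date(2016, 1, 6), 285), (1042, date(2016, 2, 16), 279),
    (1042, date(2016, 3, 28), 272),
    (1053, date(2016, 1, 7), 325), (1053, date(2016, 2, 1), 299),
    (1053, date(2016, 4, 28), 289), (1053, date(2016, 5, 13), 295),
    (1063, date(2016, 1, 11), 291), (1063, date(2016, 2, 9), 283),
    (1063, date(2016, 3, 9), 272), (1063, date(2016, 4, 20), 270),
    (1063, date(2016, 5, 31), 269),
]

summary = {}
for pid, recs in groupby(sorted(visits), key=lambda r: r[0]):  # "BY PatientID" analogue
    recs = list(recs)
    first, last = recs[0], recs[-1]      # FIRST./LAST. analogue
    elapsed = (last[1] - first[1]).days  # elapsed time (in days)
    loss = first[2] - last[2]            # weight loss
    summary[pid] = (elapsed, loss)

print(summary[1021])   # (91, 29): 91 days elapsed, 29 pounds lost
```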

Some programmers think that the
`FIRST.`*variable* and `LAST.`*variable* indicator variables require that the data be sorted, but that is not true. The temporary variables are created whenever you use a BY statement in a DATA step. You can use the NOTSORTED option on the BY statement to process records regardless of the sort order.

In summary, the BY statement in the DATA step automatically creates two indicator variables. You can use the variables to determine the first and last record in each BY group.
Typically the `FIRST.`*variable* indicator is used to initialize summary statistics and to remember the initial values of measurements.
The `LAST.`*variable* indicator is used to output the result of the computations, which often includes simple descriptive statistics such as a sum, difference, maximum, minimum, or average values.

BY-group processing in the DATA step is a common topic that is presented at SAS conferences. Some authors use FIRST.BY and LAST.BY as the name of the indicator variables. For further reading, I recommend the paper "The Power of the BY Statement" (Choate and Dunn, 2007). SAS also provides several samples about BY-group processing in the SAS DATA step, including the following:

- Carry non-missing values down a BY-Group
- Use BY groups to transpose data from long to wide
- Select a specified number of observations from the top of each BY-Group

The post How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS appeared first on The DO Loop.


Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median is $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm, which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See
   https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html */
%let N = 1e7;
data Have;
   call streaminit(12345);
   do i = 1 to &N;
      d = rand("Bernoulli", 0.6);
      if d = 0 then x = rand("Exponential", 0.5);
      else          x = rand("Normal", 10, 2);
      output;
   end;
   keep x;
run;

/* quantile estimate */
proc means data=Have N Median;
   var x;
run;

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence interval) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT
                  out=SubSample
                  method=urs samprate=0.1;   /* 10% sampling with replacement */
run;

title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;
run;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is a 90% chance that we discarded the middle data value! Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

- Choose a random sample (with replacement) of size k = N^{3/4} from the data. Call the sample R.
- Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
- (Optional) Verify that the interval [L, U] contains the median value of the original data.
- Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
- Return the median of C.

The idea is to use the random sample *only* to find an interval that probably contains the median. The width of the interval depends on the size of the data. If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51st percentiles.
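The algorithm is language-agnostic. Here is a Python sketch (stdlib only; the function name is mine) that parallels the SAS/IML implementation that follows:

```python
import random
from math import ceil, floor, sqrt

def approx_median(x, seed=12345):
    """Monte Carlo median, following the steps above: sample k = N^(3/4) values
    with replacement, form sentinels [L, U] from the ranks k/2 +/- sqrt(N), and
    return the median of the data that fall strictly between the sentinels."""
    N = len(x)
    k = ceil(N ** 0.75)                         # smaller sample size
    R = sorted(random.Random(seed).choices(x, k=k))  # sample with replacement
    L = R[floor(k/2 - sqrt(N))]                 # lower sentinel
    U = R[ceil(k/2 + sqrt(N))]                  # upper sentinel
    if sum(v > L for v in x) <= N/2 or sum(v < U for v in x) <= N/2:
        return None                             # sentinels missed the median (rare)
    C = sorted(v for v in x if L < v < U)       # central portion of the data
    return C[len(C) // 2]                       # median of the central subset

data = list(range(1, 10001))   # 1, 2, ..., 10000; the median is 5000.5
print(approx_median(data))     # close to 5000.5
```

For a quick sketch, the middle element of C is returned rather than averaging the two central values; for large data the difference is negligible.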

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements in x, chosen at random with replacement.
   Choose quantiles of R to form an interval [L,U].
   Return the median of data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);              /* smaller sample size */
   R = T(sample(x, k));            /* bootstrap sample of size k (with replacement) */
   call sort(R);                   /* sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ];  /* lower sentinel */
   U = R[ ceil(k/2 + sqrt(N)) ];   /* upper sentinel */
   UTail = (x > L);                /* indicator: values above the lower sentinel */
   LTail = (x < U);                /* indicator: values below the upper sentinel */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)];   /* extract central portion of data */
      return ( median(C) );
   end;
   else do;
      print "Median not between sentinels!";
      return (.);
   end;
finish;

/* run the algorithm on example data */
use Have;  read all var {"x"};  close;

call randseed(456);
t1 = time();
approxMed = MedianRand(x);   /* compute approximate median; time it */
tApproxMed = time() - t1;

t0 = time();
med = median(x);             /* compute traditional estimate; time it */
tMedian = time() - t0;

results = (approxMed || med) // (tApproxMed || tMedian);
print results[colname={"Approx" "Traditional"} rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.

The post A Monte Carlo algorithm to estimate a median appeared first on The DO Loop.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples, such as the following:

- How to compute quantiles for the folded normal distribution
- How to compute quantiles for the contaminated normal distribution.
- How to compute quantiles for discrete probability distributions

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(*x*)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(*x*)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(*x*) = ∫ f(*t*) *dt*, where the lower limit of the integral is –∞ and the upper limit is *x*.

Whether the CDF is defined analytically or through numerical integration, the quantile for *p* is found implicitly as the solution to the equation F(*x*) = *p*, where *p* is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(*x*; *p*) = F(*x*) – *p* so that the quantile is the root of the equation G(*x*; *p*) = 0.
For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

SAS/IML software provides the QUAD subroutine for numerical integration and the FROOT function for finding roots. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:

F(x) = 0.4 F_{exp}(x; 0.5) + 0.6 Φ(x; 10, 2),

where
F_{exp}(x; 0.5) is the exponential CDF with scale parameter 0.5, and
Φ(x; 10, 2) is the normal CDF with mean 10 and standard deviation 2.
In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

```
/* program to numerically find quantiles for a custom distribution */
proc iml;

/* Define the cumulative distribution function here. */
start CustomCDF(x);
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;
finish;
```

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].

The following program finds arbitrary quantiles for whichever CDF is evaluated by the `CustomCDF` function. To find quantiles for a different function, you can modify the `CustomCDF` and change the interval on which to find the quantiles. You do not need to modify the `RootFunc` or `CustomQuantile` functions.

```
/* Express CDF(x)=p as the root for the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;   /* quantile for p is root of CDF(x)-p */
finish;

/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);            /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);           /* for each element of p... */
      gProb = p[i];                       /* set global variable */
      q[i] = froot("RootFunc", Interval); /* find root (quantile) */
   end;
   return q;
finish;

/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};   /* Q1, Q2, Q3 */
intvl = {0 16};            /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
```
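As a cross-check in another language, the following Python sketch (standard library only) applies the same root-finding idea to the mixture distribution. The names are illustrative: the closed-form erf-based normal CDF replaces the call to the CDF function, and simple bisection stands in for the FROOT function. Bisection is safe here because the CDF is monotonically increasing:

```python
import math

def custom_cdf(x):
    """Mixture CDF: 0.4*Exponential(scale=0.5) + 0.6*Normal(mean=10, sd=2)."""
    F_exp = 1 - math.exp(-x / 0.5) if x >= 0 else 0.0
    F_norm = 0.5 * (1 + math.erf((x - 10) / (2 * math.sqrt(2))))
    return 0.4 * F_exp + 0.6 * F_norm

def quantile(p, a, b, tol=1e-10):
    """Solve CDF(x) = p by bisection on [a, b]; bisection stands in
    for FROOT. Assumes CDF(a) <= p <= CDF(b) and a monotone CDF."""
    while b - a > tol:
        m = 0.5 * (a + b)
        if custom_cdf(m) < p:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

quartiles = [quantile(p, 0, 16) for p in (0.25, 0.5, 0.75)]
print(quartiles)   # roughly [0.49, 8.07, 10.42]
```

The computed quartiles are consistent with the approximate values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5 that were read off the graph of the CDF.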

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve the equation CDF(*x*) – *p* = 0 for a given probability, *p*. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions, provided that the density is not tiny or zero at the value of the quantile. (Because the quantile function satisfies Q′(*p*) = 1/f(Q(*p*)), the root is ill-conditioned wherever the density f is near zero.) Although this example is illustrated in SAS, the same method will work in other software.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

]]>