Create an ogive in SAS

My son is taking an AP Statistics course in high school this year. AP Statistics is one of the fastest-growing AP courses, so I welcomed the chance to see the topics and techniques in the course. Last week I was pleased to see that they teach data exploration techniques, such as using an ogive (rhymes with "slow jive") to approximate the cumulative distribution of a set of univariate data. This article shows how you can create an ogive by using SAS procedures.


What is an ogive?

An ogive is also called a cumulative histogram. You can create an ogive from a histogram by accumulating the frequencies (or relative frequencies) in each histogram bin. The height of an ogive curve at x is found by summing the heights of the histogram bins to the left of x.
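The accumulation is easy to see in a small numerical sketch. Here is a Python illustration (outside SAS, with made-up bin edges and relative frequencies) of how the ogive heights arise from the histogram bar heights:

```python
# Build an ogive (cumulative histogram) from binned data.
# Hypothetical bins: the edges and the relative frequency of each bar.
edges = [10, 15, 20, 25, 30, 35]           # bin endpoints
rel_freq = [0.10, 0.45, 0.30, 0.10, 0.05]  # height of each histogram bar

# The ogive height at each RIGHT endpoint is the running sum of the
# bar heights to the left; the height at the leftmost endpoint is 0.
ogive = [0.0]
for f in rel_freq:
    ogive.append(ogive[-1] + f)

print(list(zip(edges, ogive)))
```

The last ogive value is 1 because the relative frequencies sum to 1.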

A histogram estimates the density of a distribution; the ogive estimates the cumulative distribution. Both are easy to construct by hand. Both are coarse estimates that depend on your choice of bin width and anchor position.

Ogives are not used much by professional statisticians because modern computers make it easy to compute and visualize the exact empirical cumulative distribution function (ECDF). However, if you are a student learning to analyze data by hand, an ogive is an easy way to approximate the ECDF from binned data. They are also important if you do not have access to the original data, but you have a histogram that appeared in a published report. (See "How to approximate a distribution from published quantiles.")

Create an ogive from a histogram

To demonstrate the construction of an ogive, let's consider the distribution of the MPG_CITY variable in the Sashelp.Cars data set. This variable contains the reported fuel efficiency (in miles per gallon) for 428 vehicle models. The following call to PROC UNIVARIATE in Base SAS uses the OUTHIST= option in the HISTOGRAM statement to create a data set that contains the frequencies and relative frequencies of each bin. By default, the frequencies are reported for the midpoints of the intervals. To create an ogive you need the endpoints of each bin, so use the ENDPOINTS option as follows:

proc univariate data=Sashelp.Cars;
   var mpg_city;
   histogram mpg_city / grid vscale=proportion ENDPOINTS OUTHIST=OutHist;
   /* cdfplot mpg_city / vscale=proportion; */ /* optional: create an ECDF plot */
run;
Distribution of miles per gallon in 428 vehicles

The histogram shows that most vehicles get between 15 and 25 mpg in the city. The distribution is skewed to the right, with a few vehicles getting as much as 59 or 60 mpg. A few gas-guzzling vehicles get less than 15 mpg.

You can construct an ogive from the relative frequencies in the 11 histogram bins. The height of the ogive at x=10 (the leftmost endpoint in the histogram) is zero. The height at x=15 is the height of the first bar. The height at x=20 is the sum of the heights of the first two histogram bars, and so on.

Create an ogive from the output of PROC UNIVARIATE

Each row in the OutHist data set contains a left-hand endpoint and the relative frequency (height) of the bar. However, to construct an ogive, you need to associate the bar height with the right-hand endpoints. This is because at the left-hand endpoint none of the density for the bin has accumulated, and for the right-hand endpoint all of the density has accumulated.

Consequently, to construct an ogive from the OUTHIST= data set, you can do the following:

  • Associate zero with the leftmost endpoint of the bins.
  • Adjust the counts and proportions in the OutHist data so that they are associated with the right-hand endpoint of each bin. You can use the LAG function to do this.
  • Accumulate the relative frequencies in each bin to form the cumulative frequencies.
  • Add a new observation to the OutHist data that contains the rightmost endpoint of the bins.
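The four steps can be sketched in Python (for illustration only; the made-up lists below stand in for the _MinPt_ and _OBSPCT_ variables in the OutHist data set):

```python
# Hypothetical OUTHIST-style data: left endpoints and bar percentages.
min_pt = [10, 15, 20, 25, 30]            # left endpoint of each bin (_MinPt_)
obs_pct = [12.0, 40.0, 28.0, 15.0, 5.0]  # percent of obs in each bin (_OBSPCT_)

bin_width = min_pt[1] - min_pt[0]

# Shift each percentage to the bin's RIGHT endpoint (the "lag" step), so
# that the leftmost endpoint is associated with zero accumulated density.
ogive_x = min_pt + [min_pt[-1] + bin_width]  # append rightmost endpoint
prop = [0.0] + [p / 100 for p in obs_pct]    # lagged proportions

# Accumulate the proportions to form the cumulative proportions.
cum_prop = []
total = 0.0
for p in prop:
    total += p
    cum_prop.append(total)

print(list(zip(ogive_x, cum_prop)))   # (endpoint, cumulative proportion)
```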

The following SAS DATA step carries out these adjustments:

data Ogive;
set outhist end=EOF;
ogiveX = _MinPt_;        /* left endpoint of bin */
dx = dif(ogiveX);        /* compute bin width */
prop = lag(_OBSPCT_);    /* move relative frequency to RIGHT endpoint */
if _N_=1 then
   prop = 0;             /* replace missing value by 0 for first obs */
cumProp + prop/100;      /* accumulate proportions */
output;
if EOF then do;          /* append RIGHT endpoint of final bin */
   ogiveX = ogiveX + dx;
   cumProp = 1;
   output;
end;
drop dx _:;              /* drop variables that begin with underscore */
run;

The Ogive data set contains all the information that you need to graph an ogive. The following call to PROC SGPLOT uses a VLINE statement, which treats the endpoints of the bins as discrete values. You could also use the SERIES statement, which treats the endpoints as a continuous variable, but might not put a tick mark at each bin endpoint.

title "Cumulative Distribution of Binned Values (Ogive)";
proc sgplot data=Ogive;
   vline OgiveX / response=cumProp markers;
   /* series x=_minpt_ y=cumProp / markers; */
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
run;
An ogive created from a histogram. Each marker shows the cumulative proportion of the data values that lie below each histogram bin.

You can use the graph to estimate the percentiles of the data. For example:

  • The 20th percentile is approximately 17 because the curve appears to pass through the point (17, 0.20). In other words, about 20% of the vehicles get 17 mpg or less.
  • The 50th percentile is approximately 19 because the curve appears to pass through the point (19, 0.50).
  • The 90th percentile is approximately 27 because the curve appears to pass through the point (27, 0.90). Only 10% of the vehicles have a fuel efficiency greater than 27 mpg.
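Reading a percentile off the ogive is inverse linear interpolation between the bin endpoints. A Python sketch with hypothetical ogive points:

```python
# Hypothetical ogive points: bin endpoints and cumulative proportions.
x = [10, 15, 20, 25, 30, 35, 40]
cum = [0.0, 0.16, 0.55, 0.83, 0.95, 0.99, 1.0]

def percentile_from_ogive(p):
    """Linearly interpolate the ogive to estimate the p-th quantile."""
    for i in range(1, len(cum)):
        if cum[i] >= p:
            # interpolate between (x[i-1], cum[i-1]) and (x[i], cum[i])
            frac = (p - cum[i-1]) / (cum[i] - cum[i-1])
            return x[i-1] + frac * (x[i] - x[i-1])
    return x[-1]

print(percentile_from_ogive(0.50))   # median estimate
```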

Compare an ogive to an empirical cumulative distribution

You might wonder how well the ogive approximates the empirical CDF. The following graph overlays the ogive and the ECDF for this data. You can see that the two curves agree closely at the ogive values, shown by the markers. However, there is some deviation because the ogive assumes a linear accumulation (a uniform distribution) of data values within each histogram bin. Nevertheless, this coarse piecewise linear curve that is based on binned data does a good job of showing the basic shape of the empirical cumulative distribution.

Comparison of an ogive and an empirical cumulative distribution

Simulate data from a generalized Gaussian distribution

Although statisticians often assume normally distributed errors, there are important processes for which the error distribution has a heavy tail. A well-known heavy-tailed distribution is the t distribution, but the t distribution is unsuitable for some applications because it does not have finite moments (mean, variance, ...) for small parameter values. An alternative choice for a heavy-tailed distribution is the generalized Gaussian distribution, which is the topic of this article.

The generalized Gaussian distribution

The generalized Gaussian distribution has a standardized probability density of the form f(x) = B exp( -|Ax|α ), where A(α) and B(α) are known functions of the exponent parameter α > 0. When 0 < α < 2, the generalized Gaussian distribution (GGD) is a heavy-tailed distribution that has finite moments. The distribution has applications in finance and signal processing.

The following SAS statements evaluate the GGD density function for four values of the shape parameter α. The SGPLOT procedure graphs the density functions (adapted from Nardon and Pianca (2009)):

data GenGauss;
mu=0; sigma=1;  /* location=mu and scale=sigma */
do alpha = 0.5, 1, 2, 5;
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) ) / sigma;   /* A(alpha, sigma) */
   B = (alpha/2) / Gamma(1/alpha) * A;                    /* B(alpha, sigma) = normalizing constant */
   do x = -3 to 3 by 0.05;
      pdf = B*exp(-(A*abs(x-mu))**alpha);
      output;
   end;
end;
run;
title "Generalized Gaussian Densities";
title2 "(*ESC*){unicode mu}=0; (*ESC*){unicode sigma}=1"; 
proc sgplot data=GenGauss;
   series x=x y=pdf / group=alpha;
   yaxis grid label="Density of Generalized Gaussian";
   keylegend / across=1 opaque location=inside position=right;
run;
Density functions for the generalized Gaussian distribution, computed in SAS

The density for α=1/2 is sharply peaked and has heavy tails. The distribution for α=1 is the Laplace distribution, which also has a sharp peak at the mean. The distribution for α=2 is the standard normal distribution. For α=5, the distribution is platykurtic: it has a broad flat central region and thin tails.
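As a quick sanity check of the formulas for A(α) and B(α), here is a Python sketch (outside SAS): α=2 should reproduce the standard normal density and α=1 the Laplace density with unit variance:

```python
import math

def ggd_pdf(x, alpha, mu=0.0, sigma=1.0):
    """Density of the generalized Gaussian: f(x) = B*exp(-|A*(x-mu)|**alpha)."""
    A = math.sqrt(math.gamma(3/alpha) / math.gamma(1/alpha)) / sigma
    B = (alpha/2) / math.gamma(1/alpha) * A   # normalizing constant
    return B * math.exp(-(A * abs(x - mu))**alpha)

# alpha=2 reduces to the standard normal density 1/sqrt(2*pi)*exp(-x^2/2)
print(ggd_pdf(0, 2))   # ~ 0.3989
# alpha=1 reduces to the Laplace density with unit variance
print(ggd_pdf(0, 1))   # ~ 0.7071
```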

This article uses only the standardized distribution that has zero mean and unit variance. However, the functions are all written to support a location parameter μ and a scale parameter σ.

Simulate random values from the generalized Gaussian distribution

Nardon and Pianca (2009) describe an algorithm for simulating random variates from the generalized Gaussian distribution: simulate from a gamma distribution, raise that variate to a power, and then randomly multiply by ±1. You can implement the simulation in the SAS DATA step or in the SAS/IML language. The following SAS/IML program defines a function for generating random GGD values:

proc iml;
start Rand_GenGauss(N, mu=0, sigma=1, alpha=0.5);
   n1 =N[1]; n2 = 1;                         /* default dimensions for returned matrix */
   if nrow(N)>1 | ncol(N)>1 then n2 = N[2];  /* support n1 x n2 matrices */
   x = j(n1,n2);  sgn = j(n1,n2);
   /* simulate for mu=0 and sigma=1, then scale and translate */
   A = sqrt( Gamma(3/alpha) / Gamma(1/alpha) );
   b = A##alpha;
   call randgen(x, "Gamma", 1/alpha, 1/b);   /* X ~ Gamma(1/alpha, 1/b) */
   call randgen(sgn, "Bernoulli", 0.5); 
   sgn = -1 + 2*sgn;                         /* random {-1, 1} */
   return mu + sigma * sgn # x##(1/alpha);
finish;
/* try it out: simulate 10,000  random variates */
call randseed(12345);
x = Rand_GenGauss(10000, 0, 1, 1/2);         /* X ~ GGD(mu=0, sigma=1, alpha=1/2) */
title "Generalized Gaussian Distribution";
title2 "mu=0, sigma=1, alpha=1/2";
call histogram(x);
Random sample from the generalized Gaussian distribution with alpha=1/2

For α=1/2, most of the simulated values are near 0, which is the mean and mode. The bulk of the remaining values are in the interval [-5,5]. However, a very small percentage (about 0.5%) are more extreme than these values. For this sample, the range of the simulated data is about [-14 , 11], so you can see that this distribution is useful for simulations in which the magnitude of errors can be more extreme than normal errors.
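The gamma-power-sign algorithm is easy to sketch in other languages as well. Here is an illustrative Python version (not the SAS/IML code above) that checks the sample mean and variance against the standardized values 0 and 1:

```python
import math, random

def rand_gen_gauss(n, mu=0.0, sigma=1.0, alpha=0.5, rng=random):
    """Simulate n draws from the generalized Gaussian distribution:
    draw G ~ Gamma(shape=1/alpha, scale=1/b), raise G to the power 1/alpha,
    and attach a random sign (the Nardon-Pianca construction)."""
    A = math.sqrt(math.gamma(3/alpha) / math.gamma(1/alpha))
    b = A**alpha
    out = []
    for _ in range(n):
        g = rng.gammavariate(1/alpha, 1/b)        # shape, scale
        sign = 1 if rng.random() < 0.5 else -1
        out.append(mu + sigma * sign * g**(1/alpha))
    return out

random.seed(12345)
x = rand_gen_gauss(100000, alpha=0.5)
mean = sum(x)/len(x)
var = sum(v*v for v in x)/len(x) - mean**2
print(round(mean, 3), round(var, 3))   # should be near 0 and 1
```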

The generalized Gaussian distribution with exponent 1/2

The cumulative distribution function for the generalized Gaussian distribution does not have a closed-form solution in terms of elementary functions. However, there are a few values of α for which the equations simplify, including α=1/2. Chapeau-Blondeau and Monir (2002) note that the generalized Gaussian distribution with α=1/2 is especially useful in practice because its quantile function (inverse CDF) can be explicitly expressed in terms of the Lambert W function.

The CDF for the GGD with α=1/2 can be computed separately for x ≤ 0 and for x > 0, as shown below:

/* CDF for generalized Gaussian distribution with alpha=1/2 */
start CDF_GenGaussHalf(x);
   y = sqrt(abs(2*sqrt(30)*x));
   cdf = 0.5*(1+y)#exp(-y);     /* CDF for x <= 0 */
   idx = loc(x>0);              /* if x > 0, reflect: CDF(x) = 1 - CDF(-x) */
   if ncol(idx)>0 then do;
      cdf[idx] = 1 - cdf[idx];
   end;
   return cdf;
finish;
p = CDF_GenGaussHalf(-5);    /* P(X< -5) for X ~ GGD(0,1, alpha=1/2) */
print p;                     /* 0.0025654 */

You can use the CDF function to evaluate the probability that a random GGD observation is less than -5. The answer is 0.0026. By symmetry, the probability that a random variate is outside of [-5,5] is 0.0051, which agrees with the empirical proportion in the random sample in the previous section.
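The two-part CDF formula is simple enough to verify by hand in any language. Here is a Python sketch (for illustration; the SAS/IML version is above):

```python
import math

def cdf_gen_gauss_half(x):
    """CDF of the standardized generalized Gaussian with alpha=1/2.
    For x <= 0:  F(x) = 0.5*(1+y)*exp(-y), where y = sqrt(2*sqrt(30)*|x|).
    For x > 0 use the reflection F(x) = 1 - F(-x)."""
    y = math.sqrt(2 * math.sqrt(30) * abs(x))
    cdf = 0.5 * (1 + y) * math.exp(-y)
    return 1 - cdf if x > 0 else cdf

print(round(cdf_gen_gauss_half(-5), 7))   # ~ 0.0025654
```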

The inverse CDF for the GGD with α=1/2 can be implemented by calling the Lambert W function. (Technically, the lower branch of the Lambert W function.) You can download functions that implement the Lambert W function in SAS/IML. The following SAS/IML statements assume that you have downloaded the Lambert W functions and stored the modules, so that they are available by using the LOAD statement. Just as the CDF was defined in two parts, the inverse CDF is defined in two parts.

/* Use Lambert W function to compute the quantile function
   (inverse CDF) for generalized Gaussian distribution with alpha=1/2 */
load module=(LambertWm);          /* load function for the Lambert W function */
start Quantile_GenGaussHalf(x);
   q = j(nrow(x), ncol(x), 0);
   idx = loc(x<0.5);              /* invert CDF for x < 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * x[idx] / exp(1);
      q[idx] = -1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   if ncol(idx)=nrow(x)*ncol(x) then return q;  /* all found */
   idx = loc(x=0.5);              /* quantile for x=0.5 */
   if ncol(idx)>0 then q[idx] =  0;
   idx = loc(x>0.5);              /* invert CDF for x > 0.5 */
   if ncol(idx)>0 then do;
      y = -2 * (1-x[idx]) / exp(1);
      q[idx] =  1/(2*sqrt(30)) * (1 + LambertWm(y))##2;
   end;
   return q;
finish;
q = Quantile_GenGaussHalf(0.975);  /* find q such that P(X < q) = 0.975 */
print q;                           /* 2.0543476 */

The output (not shown) is q = 2.05, which means that the interval [-2.05, 2.05] contains 95% of the probability for the standardized GGD when α=1/2. Compare this interval with the familiar interval [-1.96, 1.96], which contains 95% of the probability for the standard normal distribution.
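If you do not have the SAS/IML Lambert W modules handy, the lower branch W_{-1} can be computed with a few Newton iterations on w*exp(w) = y. The following Python sketch is an illustration of the quantile formula, not production-quality root finding:

```python
import math

def lambert_wm1(y, iters=50):
    """Lower branch W_{-1} of the Lambert W function for -1/e < y < 0,
    via Newton iteration on w*exp(w) = y (illustrative, not robust)."""
    L1 = math.log(-y)
    w = L1 - math.log(-L1)          # standard asymptotic initial guess
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w*ew - y) / (ew*(1 + w))
    return w

def quantile_gen_gauss_half(p):
    """Quantile of the standardized generalized Gaussian with alpha=1/2."""
    if p == 0.5:
        return 0.0
    tail = p if p < 0.5 else 1 - p
    y = -2 * tail / math.e
    q = (1 + lambert_wm1(y))**2 / (2*math.sqrt(30))
    return -q if p < 0.5 else q

print(round(quantile_gen_gauss_half(0.975), 7))   # ~ 2.0543476
```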


To learn more about the generalized Gaussian distribution, with an emphasis on the parameter value α=1/2, see the following well-written papers, which expand the discussion in this blog post:


The distribution of nearest neighbor distances

Last week I showed how to compute nearest-neighbor distances for a set of numerical observations. Nearest-neighbor distances are used in many statistical computations, including the analysis of spatial point patterns. This article describes how the distribution of nearest-neighbor distances can help you determine whether spatial data are uniformly distributed or whether they show evidence of non-uniformity such as clusters of observations. You can examine the spatial data manually or use a SAS/STAT procedure (PROC SPP) that automates common spatial analyses.

The distribution of nearest neighbor distances

The distribution of the nearest-neighbor distances provides information about the process that generated the spatial data. Let's look at two examples: clustered data for trees in a forest and uniformly distributed (simulated) data.

Scatter plot of tree locations. Clustering is visually evident and can be revealed by computing nearest neighbor distances

The Sashelp.Bei data contains the locations for 3,604 trees in a forest. A scatter plot of the data is shown to the left. The trees appear to be clustered in certain regions, whereas other regions contain no trees. This indicates that the trees are not uniformly distributed throughout the region.

Let's use the NearestNbr module from my previous blog post to compute the nearest neighbor distances for these tree locations. The following program saves the distances to a SAS data set named Dist_Trees. For comparison, the program also generates 3,604 random (uniformly distributed) points in the same rectangular region. The nearest neighbor distances for these simulated points are written to a data set named Dist_Unif.

proc iml;
use Sashelp.Bei where(Trees=1);
   read all var {x y} into T;
close Sashelp.Bei;
/* See */
start NearestNbr(idx, dist, X, k=1);
   N = nrow(X);  p = ncol(X);
   idx = j(N, k, .);   dist = j(N, k, .);
   D = distance(X);
   D[do(1, N*N, N+1)] = .;      /* set diagonal elements to missing */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:N)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;
   end;
finish;

run NearestNbr(nbrIdx, dist, T);
create Dist_Trees var "Dist";  append;  close; /* tree distances */
/* compare with uniform */
call randseed(12345);
N = nrow(T);
/* generate random uniform points in [0,1000] x [0,500] */
X = 1000*randfun(N, "uniform") || 500*randfun(N, "uniform"); 
run NearestNbr(nbrIdx, dist, X);
create Dist_Unif var "Dist";  append;  close; /* random uniform distances */

You can use PROC UNIVARIATE to compare the two distributions of distances. The following DATA step merges the distances and adds an indicator variable that has the value "Uniform" or "Trees." PROC UNIVARIATE creates a graph of the empirical cumulative distribution function for each set of nearest-neighbor distances:

data All;
set Dist_Unif(in=inUnif) Dist_Trees;
if inUnif then  Group = "Uniform";
else            Group = "Trees  ";
run;

proc univariate data=All;
   where dist<=20;   /* excludes 41 trees and 1 random observation */
   class Group;
   cdfplot dist / vscale=proportion odstitle="Compare ECDF for Uniform and Tree Data" overlay;
   ods select cdfplot;
run;
Comparative empirical CDFs of nearest neighbor distances for tree data and uniformly distributed data

The empirical CDFs for the two data sets look quite different. For the tree data, the distribution function rises rapidly, which indicates that many observations have nearest neighbors that are very close. You can see that about 25% of trees have a nearest neighbor within about 1.5 units, and 50% have a nearest neighbor within 3.2 units.

In contrast, the empirical distribution of the uniform data rises less steeply. For the simulated observations, only 5% have a nearest neighbor within 1.5 units. Only 21% have a nearest neighbor within 3.5 units. The graph indicates clumping in the tree data: the typical tree has a neighbor that is closer than would be expected for random uniform locations.

Analysis of spatial point patterns in SAS

In SAS/STAT 13.2 and beyond you can use the new SPP procedure to perform comparisons like this automatically. The SPP procedure supports several statistical tests for "complete spatial randomness," which enable you to compare the observed distribution to the expected distribution for uniformly distributed data.

The SPP procedure supports tests that are based on nearest-neighbor distances, as well as other tests. The example in this article is essentially a visual representation of the "G function test," which you can carry out by using the following statements:

proc spp data=sashelp.bei plots=(G(unpack));
   process trees = (x, y /area=(0,0,1000,500) Event=Trees) / G;
run;
G function test for complete spatial randomness with PROC SPP in SAS. The G function uses nearest neighbor distances

The SPP procedure creates a graph of the "edge-corrected G function." The upper curve is the empirical distribution of nearest neighbor distances for the trees, adjusted for edge effects caused by a finite domain. The lower curve shows the expected distribution for random uniform data of the same size on the same domain. The light-blue band is a 95% confidence envelope, which gives you a feeling for the variation due to random sampling. The envelope is formed by 100 simulations of random uniform data, similar to the simulation in the previous section.

The graph indicates that there is structure in the trees data beyond what you would expect from complete spatial randomness. You can use other features of PROC SPP to model the trees data. For an overview of the spatial analyses that you can perform with PROC SPP, see Mohan and Tobias (2015), "Analyzing Spatial Point Patterns Using the New SPP Procedure".

In summary, nearest-neighbor distances are useful for analyzing properties of spatial point patterns. This article compared tree data to a single instance of random uniform data. By using the SPP procedure, you can run a more complete analysis and obtain graphs and related statistics with minimal effort.


Compute nearest neighbors in SAS

Finding nearest neighbors is an important step in many statistical computations such as local regression, clustering, and the analysis of spatial point patterns. Several SAS procedures find nearest neighbors as part of an analysis, including PROC LOESS, PROC CLUSTER, PROC MODECLUS, and PROC SPP. This article shows how to find nearest neighbors for every observation directly in SAS/IML, which is useful if you are implementing certain algorithms in that language.

Compute the distance between observations

Let's create a sample data set. The following DATA step simulates 100 observations that are uniformly distributed in the unit square [0,1] x [0,1]:

data Unif;
call streaminit(12345);
do i = 1 to 100;
   x = rand("Uniform");   y = rand("Uniform");   output;
end;
run;

I have previously shown how to compute the distance between observations in SAS by using PROC DISTANCE or the DISTANCE function in SAS/IML. The following statements read the data into a SAS/IML matrix and compute the pairwise distances between all observations:

proc iml;
use Unif; read all var {x, y} into X; close;
D = distance(X);              /* N x N distance matrix */

The D matrix is a symmetric 100 x 100 matrix. The value D[i,j] is the Euclidean distance between the ith and jth rows of X. An easy way to look for the nearest neighbor of observation i is to search the ith row for the column that contains smallest distance. Because the diagonal elements of D are all zero, a useful trick is to change the diagonal elements to be missing values. Then the smallest value in each row of D corresponds to the nearest neighbor.

The following statements assign missing values to the diagonal elements of the D matrix. You can then use the SAS/IML subscript reduction operators to find the minimum distance in each row:

diagIdx = do(1, nrow(D)*ncol(D), ncol(D)+1); /* index diagonal elements */
D[diagIdx] = .;                              /* set diagonal elements */
dist = D[ ,><];            /* smallest distance in each row */
nbrIdx = D[ ,>:<];         /* column of smallest distance in each row */
Neighbor = X[nbrIdx, ];    /* coords of closest neighbor */
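The mask-the-diagonal trick is not specific to SAS/IML. Here is a Python sketch with a handful of made-up points, where infinity plays the role of a SAS missing value:

```python
import math

# Hypothetical 2-D points.
pts = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9), (0.5, 0.55)]
n = len(pts)

# Pairwise Euclidean distance matrix with the diagonal "masked"
# (set to infinity so a point is never its own nearest neighbor).
D = [[math.dist(pts[i], pts[j]) if i != j else math.inf
      for j in range(n)] for i in range(n)]

# Nearest neighbor of each point: the column of the smallest entry in each row.
nbr_idx = [min(range(n), key=lambda j: D[i][j]) for i in range(n)]
nbr_dist = [D[i][nbr_idx[i]] for i in range(n)]

print(nbr_idx)   # → [1, 0, 3, 1]
```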

Visualizing nearest neighbors

If you write the nearest neighbors and distances to a SAS data set, you can use the VECTOR statement in PROC SGPLOT to draw a vector that connects each observation to its nearest neighbor. The graph indicates the nearest neighbor for each observation.

Z = X || Neighbor || dist;
create NN_Unif from Z[c={"x" "y" "xc" "yc" "dist"}];
append from Z;
title "Nearest Neighbors for 100 Uniformly Distributed Points";
proc sgplot data=NN_Unif aspect=1;
label dist = "Distance to Nearest Neighbor";
   scatter x=x y=y / colorresponse=dist markerattrs=(symbol=CircleFilled);
   vector x=xc y=yc / xorigin=x yorigin=y transparency=0.5 
                      colorresponse=dist;  /* COLORRESPONSE= requires 9.4m3 for VECTOR */
   xaxis display=(nolabel);
   yaxis display=(nolabel);
run;
Nearest neighbors each of 100 observations

Alternative ways to compute nearest neighbors

If you don't have SAS/IML but still want to compute nearest neighbors, you can use PROC MODECLUS. The NEIGHBOR option on the PROC MODECLUS statement produces a table that gives the observation number (or ID value) of nearest neighbors. For example, the following statements produce the observation numbers for the nearest neighbors:

ods select neighbor;
/* Use K=p to find nearest p-1 neighbors */
proc modeclus data=Unif method=1 k=2 Neighbor;  /* k=2 for nearest nbrs */
   var x y;
run;

To get the coordinates of the nearest neighbor, you can create a variable that contains the observation numbers and then use an ID statement to include that ID variable in the PROC MODECLUS output. You can then look up the coordinates. I omit the details.

Why stop at one? A SAS/IML module for k nearest neighbors

You can use the ideas in the earlier SAS/IML program to write a program that returns the indices (observation numbers) of the k closest neighbors for k ≥ 1. The trick is to replace the smallest distance in each row with a missing value and then repeat the process of finding the smallest value (and column) in each row. The following SAS/IML module implements this computation:

proc iml;
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  X    an (N x p) data matrix
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (N x k) matrix of row numbers. idx[,j] contains
                the row numbers (in X) of the j_th closest neighbors
           dist an (N x k) matrix. dist[,j] contains the distances
                between X and the j_th closest neighbors  */
start NearestNbr(idx, dist, X, k=1);
   N = nrow(X);  p = ncol(X);
   idx = j(N, k, .);            /* j_th col contains obs numbers for j_th closest obs */
   dist = j(N, k, .);           /* j_th col contains distance for j_th closest obs */
   D = distance(X);
   D[do(1, N*N, N+1)] = .;      /* set diagonal elements to missing */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:N)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;
   end;
finish;
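For comparison, the same repeated-masking algorithm can be sketched in Python (illustration only; the SAS/IML module is the version discussed in the text):

```python
import math

def nearest_nbr(pts, k=1):
    """Indices and distances of the k nearest neighbors of each point,
    found by repeatedly masking the current row minimum."""
    n = len(pts)
    D = [[math.dist(pts[i], pts[j]) if i != j else math.inf
          for j in range(n)] for i in range(n)]
    idx = [[None]*k for _ in range(n)]
    dist = [[None]*k for _ in range(n)]
    for j in range(k):
        for i in range(n):
            c = min(range(n), key=lambda m: D[i][m])  # column of row minimum
            idx[i][j], dist[i][j] = c, D[i][c]
            D[i][c] = math.inf       # mask it before the next pass
    return idx, dist

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
idx, dist = nearest_nbr(pts, k=2)
print(idx)   # → [[1, 2], [0, 2], [0, 1], [2, 1]]
```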

You can use this module to compute the k closest neighbors for each observation (row) in a data matrix. For example, the following statements compute the two closest neighbors for each observation. The output shows a few rows of the original data and the coordinates of the closest and next closest observations:

use Unif; read all var {x, y} into X; close;
k = 2;
run NearestNbr(nbrIdx, dist, X, k);
Near1 = X[nbrIdx[,1], ];    /* 1st nearest neighbors */
Near2 = X[nbrIdx[,2], ];    /* 2nd nearest neighbors */
Coordinates of nearest and second nearest neighbors to a set of observations

In summary, this article defines a short module in the SAS/IML language that you can use to compute the k nearest neighbors for a set of N numerical observations. Notice that the computation builds an N x N distance matrix in RAM, so this matrix might consume a lot of memory for large data sets. For example, the distance matrix for a data set with 16,000 observations requires about 1.91 GB of memory. For larger data sets, you might prefer to use PROC DISTANCE or PROC MODECLUS.


Overlay a curve on a bar chart in SAS

One of the strengths of the SGPLOT procedure in SAS is the ease with which you can overlay multiple plots on the same graph. For example, you can easily combine the SCATTER and SERIES statements to add a curve to a scatter plot.

However, if you try to overlay incompatible plot types, you will get an error message that says
ERROR: Attempting to overlay incompatible plot or chart types.
For example, a histogram and a series plot are not compatible in PROC SGPLOT, so you need to use the Graphics Template Language (GTL) to overlay a custom density estimate on a histogram.

A similar limitation exists for bar charts in PROC SGPLOT: you cannot specify the VBAR and SERIES statements in a single call. However, in SAS 9.3 and beyond you can use the VBARPARM statement to overlay a curve and a bar chart.

In SAS 9.4m3 there is yet another VBAR-like statement that enables you to combine bar charts with one or more other plots. The new VBARBASIC and HBARBASIC statements create a bar chart that is compatible with basic plots such as scatter plots, series plots, and box plots. These statements can summarize raw data like the VBAR statement can. In other words, if you use the VBARBASIC and HBARBASIC statements on raw data, the counts will be automatically computed. However, you can also use the statements on pre-summarized data: specify the height of the bars by using the RESPONSE= option.

Overlay a curve on a bar chart in SAS

In most situations it doesn't make sense to overlay a continuous curve on a discrete bar chart, which is why the SG routines have the concept of compatible plot types. However, there is a canonical example in elementary statistics that combines continuous and discrete data: the normal approximation to the binomial distribution.

Overlay normal density curve on a bar chart of binomial probabilities

Recall that if X is the number of successes in n independent trials for which the probability of success is p, then X is binomially distributed: X ~ Binom(n, p). A well-known rule says that if np > 5 and n(1-p) > 5, then the binomial distribution is approximated by a normal distribution with mean np and standard deviation sqrt(np(1-p)).

This rule is often illustrated by overlaying the continuous normal PDF on a bar chart that shows the binomial distribution, as shown to the left. To create this plot, I used the VBARBASIC statement to create the bar chart. Because the VBARBASIC statement creates a "basic plot," you can combine it with another basic plot, such as the line plot created by using a SERIES statement. For fun, I used an INSET statement to overlay a box of parameter values for the graph. The graph shows that the binomial probability at j is approximated by the area under the normal density curve on the interval [j-0.5, j+0.5].
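You can check the area approximation numerically. The following Python sketch (outside SAS, using only standard-library functions) compares the binomial probability at j=6 with the normal area over [5.5, 6.5] for the same n=25 and p=0.25 used below:

```python
import math

def binom_pmf(j, n, p):
    """Binomial probability P(X = j) for X ~ Binom(n, p)."""
    return math.comb(n, j) * p**j * (1-p)**(n-j)

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma) via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 25, 0.25
mu, sigma = n*p, math.sqrt(n*p*(1-p))

j = 6
exact = binom_pmf(j, n, p)                                   # binomial probability
approx = normal_cdf(j+0.5, mu, sigma) - normal_cdf(j-0.5, mu, sigma)  # normal area
print(round(exact, 4), round(approx, 4))
```

The two values agree to about two decimal places, which is the sense in which the normal curve approximates the binomial bars.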

The following SAS statements use the PDF function to evaluate the binomial probabilities and the normal density for the graph. The values for μ and σ are stored in macro variables for later use.

%let p = 0.25;                    /* probability of success */
%let n = 25;                      /* number of trials */
data Binom;
n = &n;  p = &p;  q = 1 - p;
mu = n*p;  sigma = sqrt(n*p*q);   /* parameters for the normal approximation */
Lower = mu-3.5*sigma;             /* evaluate normal density on [Lower, Upper] */
Upper = mu+3.5*sigma;
/* PDF of normal distribution */
do t = Lower to Upper by sigma/20;
   Normal = pdf("Normal", t, mu, sigma);       output;
end;
/* PMF of binomial distribution */
t = .; Normal = .;        /* these variables are not used for the bar chart */
do j = max(0, floor(Lower)) to ceil(Upper);
   Binomial = pdf("Binomial", j, &p, &n);      output;
end;
call symput("mu", strip(mu));      /* store mu and sigma in macro variables */
call symput("sigma", strip(round(sigma,0.01)));
label Binomial="Binomial Probability"  Normal="Normal Density";
keep t Normal j Binomial;
run;

The preceding DATA step evaluates the Binom(25, 0.25) probability for the integers j=0, 1, ..., 14. It evaluates the N(6.25, 2.17) PDF on the interval [-1.3, 13.8]. The following call to PROC SGPLOT uses the VBARBASIC statement to overlay the bar chart and the density curve:

title "Binomial Probability and Normal Approximation";
proc sgplot data=Binom;
   vbarbasic j / response=Binomial barwidth=1;      /* requires SAS 9.4M3 */
   series x=t y=Normal / lineattrs=GraphData2(thickness=2);
   inset "n = &n"  "p = &p"  "q = %sysevalf(1-&p)"
         "(*ESC*){unicode mu} = np = &mu"           /* use Greek letters */
         "(*ESC*){unicode sigma} = sqrt(npq) = &sigma" /
         position=topright border;
   yaxis label="Probability";
   xaxis label="x" integer type=linear;             /* force TYPE=LINEAR */
run;

The TYPE=LINEAR option on the XAXIS statement tells the horizontal axis to use interval tick marks. The BARWIDTH=1 option on the VBARBASIC statement makes the bar chart look more like a histogram by eliminating the gaps between bars. The graph is shown at the top of this section.

Alternative visualization: The needle plot

If you are content to show only the height of the binomial probability mass function (PMF), you can use an alternative visualization. The following graph shows a needle plot (the binomial PMF) overlaid with a normal PDF. This visualization does not require 9.4M3. The SGPLOT statements are the same as before, except the binomial probabilities are represented by using the NEEDLE statement: needle x=j y=Binomial / markers;

Overlay normal density curve on a needle plot of binomial probabilities

Coverage probability of confidence intervals: A simulation approach

This article uses the SAS DATA step and Base SAS procedures to estimate the coverage probability of the confidence interval for the mean of normally distributed data. This discussion is based on Section 5.2 (p. 74–77) of Simulating Data with SAS.

What is a confidence interval?

Recall that a confidence interval (CI) is an interval estimate that contains the population parameter with a specified probability. Because the CI is an estimate, it is computed from a sample. A confidence interval for a parameter is derived by knowing (or approximating) the sampling distribution of a statistic. For symmetric sampling distributions, the CI often has the form m ± w(α, n), where m is an unbiased estimate of the parameter and w(α, n) is a width that depends on the significance level α, the sample size n, and the standard error of the estimate.

Due to sampling variation, the confidence interval for a particular sample might not contain the parameter. A 95% confidence interval means that if you generate a large number of samples and construct the corresponding confidence intervals, then about 95% of the intervals will contain (or "cover") the parameter.

For example, a well-known formula is the confidence interval of the mean. If the population is normally distributed, then a 95% confidence interval for the population mean, computed from a sample of size n, is
[ xbar - tc s / sqrt(n),    xbar + tc s / sqrt(n) ]

  • xbar is the sample mean
  • tc = t1-α/2, n-1 is the critical value of the t statistic with significance α and n-1 degrees of freedom
  • s / sqrt(n) is the standard error of the mean, where s is the sample standard deviation.
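To make the formula concrete, here is a small Python sketch that computes the interval. (SAS computes the t critical value for you; to keep this sketch dependency-free, the value t0.975,8 ≈ 2.306 for n = 9 is hard-coded from a standard t table, and the sample values are made up for illustration.)

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """CI for the mean: xbar +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample standard deviation
    w = t_crit * s / math.sqrt(n)       # half-width: t_crit * standard error
    return (xbar - w, xbar + w)

sample = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 4.0, 5.1, 4.6]    # n = 9
lo, hi = t_confidence_interval(sample, 2.306)   # t_{0.975, 8} ~ 2.306 (table value)
print(lo, hi)
```

The interval is centered at the sample mean, and its half-width shrinks like 1/sqrt(n) as the sample size grows.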

Coverage probability

The preceding discussion leads to the simulation method for estimating the coverage probability of a confidence interval. The simulation method has three steps:

  1. Simulate many samples of size n from the population.
  2. Compute the confidence interval for each sample.
  3. Compute the proportion of samples for which the (known) population parameter is contained in the confidence interval. That proportion is the empirical coverage probability, which estimates the true coverage probability of the CI.

You might wonder why this is necessary. Isn't the coverage probability always (1-α) = 0.95? No, that is only true when the population is normally distributed (which is never true in practice) or the sample sizes are large enough that you can invoke the Central Limit Theorem. Simulation enables you to estimate the coverage probability for small samples when the population is not normal. You can simulate from skewed or heavy-tailed distributions to see how skewness and kurtosis affect the coverage probability. (See Chapter 16 of Simulating Data with SAS.)

The simulation method for estimating coverage probability

Let's use simulation to verify that the formula for a CI of the mean is valid when you draw samples from a standard normal population. The following DATA step simulates 10,000 samples of size n=50:

%let N = 50;                                /* size of each sample  */
%let NumSamples = 10000;                    /* number of samples    */  
/* 1. Simulate samples from N(0,1) */
data Normal(keep=SampleID x);
call streaminit(123);
do SampleID = 1 to &NumSamples;             /* simulation loop      */
   do i = 1 to &N;                          /* N obs in each sample */
      x = rand("Normal");                   /* x ~ N(0,1)           */
      output;
   end;
end;
run;

The second step is to compute the confidence interval for each sample. You can use PROC MEANS to compute the confidence limits. The LCLM= and UCLM= options output the lower and upper endpoints of the confidence interval to a SAS data set. I also output the sample mean for each sample. Notice that the BY statement is an efficient way to analyze all samples in a simulation study.

/* 2. Compute statistics for each sample */
proc means data=Normal noprint;
   by SampleID;
   var x;
   output out=OutStats mean=SampleMean lclm=Lower uclm=Upper;
run;

The third step is to count the proportion of samples for which the confidence interval contains the value of the parameter. For this simulation study, the value of the population mean is 0. The following DATA step creates an indicator variable that has the value 1 if 0 is within the confidence interval for a sample, and 0 otherwise. You can then use PROC FREQ to compute the proportion of intervals that contain the mean. This is the empirical coverage probability. If you want to get fancy, you can even use the BINOMIAL option to compute a confidence interval for the proportion.

/* 3a. How many CIs include parameter? */
data OutStats;  set OutStats;
   label ParamInCI = "Parameter in CI";
   ParamInCI = (Lower<0 & Upper>0);           /* indicator variable */
run;
/* 3b. Nominal coverage probability is 95%. Estimate true coverage. */
proc freq data=OutStats;
   tables ParamInCI / nocum binomial(level='1' p=0.95); 
run;
Empirical coverage probability for confidence interval

The output from PROC FREQ tells you that the empirical coverage (based on 10,000 samples) is 94.66%, which is very close to the theoretical value of 95%. The output from the BINOMIAL option estimates that the true coverage is in the interval [0.9422,0.951], which includes 0.95. Thus the simulation supports the assertion that the standard CI of the mean has 95% coverage when a sample is drawn from a normal population.

Visualizing the simulation study

You can draw a graph that shows how the confidence intervals depend on the random samples. The following graph shows the confidence intervals for 100 samples. The center of each CI is the sample mean.

proc format;                  /* display 0/1 as "No"/"Yes" */
   value YorN 0="No" 1="Yes";
run;
ods graphics / width=6.5in height=4in;
proc sgplot data=OutStats(obs=100);
   format ParamInCI YorN.;
   title  "95% Confidence Intervals for the Mean";
   title2 "Normal Data";
   scatter x=SampleID y=SampleMean / group=ParamInCI markerattrs=(symbol=CircleFilled);
   highlow x=SampleID low=Lower high=Upper / group=ParamInCI legendlabel="95% CI";
   refline 0 / axis=y;
   yaxis display=(nolabel);
run;
Coverage probability of confidence intervals for normal data

The reference line shows the mean of the population. Samples for which the population mean is inside the confidence interval are shown in blue. Samples for which the population mean is not inside the confidence interval are shown in red.

You can see how sample variability affects the confidence intervals. In four random samples (shown in red) the values in the sample are so extreme that the confidence interval does not include the population mean. Thus the estimate of the coverage probability is 96/100 = 96% for these 100 samples. This graph shows why the term "coverage probability" is used: it is the probability that one of the vertical lines in the graph will "cover" the population mean.

The coverage probability for nonnormal data

The previous simulation confirms that the empirical coverage probability of the CI is 95% for normally distributed data. You can use simulation to understand how that probability changes if you sample from nonnormal data. For example, in the DATA step that simulates the samples, replace the call to the RAND function with the following line:

      x = rand("Expo") - 1;                 /* x + 1 ~ Exp(1)       */

You can then rerun the simulation study. This time the samples are drawn from a (shifted) exponential distribution that has mean 0 and unit variance. The skewness for this distribution is 2 and the excess kurtosis is 6. The result from PROC FREQ is that only about 93.5% of the confidence intervals (using the standard formula) cover the true population mean. Consequently, the formula for the CI, which has 95% coverage for normal data, only has about 93.5% coverage for this exponential data.
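You can reproduce this qualitative result outside of SAS. The following Python sketch estimates the coverage for shifted-exponential samples of size 50 by using only the standard library. (The hard-coded critical value t0.975,49 ≈ 2.0096 is an assumption in place of a t-quantile function, and the number of samples is reduced for speed.)

```python
import math
import random
import statistics

random.seed(123)
n, num_samples = 50, 2000
t_crit = 2.0096                      # assumed value of t_{0.975, 49}

covered = 0
for _ in range(num_samples):
    # shifted exponential: mean 0, unit variance, skewness 2, excess kurtosis 6
    x = [random.expovariate(1.0) - 1 for _ in range(n)]
    xbar = statistics.mean(x)
    half = t_crit * statistics.stdev(x) / math.sqrt(n)
    if xbar - half < 0 < xbar + half:    # does the CI cover the true mean, 0?
        covered += 1

coverage = covered / num_samples
print(coverage)       # typically a bit below the nominal 0.95
```

The Monte Carlo estimate lands in the same neighborhood as the SAS result: skewness in the population pulls the coverage below the nominal 95%.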

Coverage probability of (normal) confidence intervals for exponential data

You can create a graph that visualizes the confidence intervals for the exponential data. Again, only the first 100 samples are shown. In this graph, the CIs for nine samples do not contain the population mean, which implies a 91% empirical coverage.

You can also write a SAS/IML program. An example of using SAS/IML to estimate the coverage probability of a confidence interval is posted on the SAS/IML File Exchange.

In summary, you can use simulation to estimate the empirical coverage probability for a confidence interval. In many cases the formula for a CI is based on an assumption about the population distribution, which determines the sampling distribution of the statistic. Simulation enables you to explore how the coverage probability changes when the population does not satisfy the theoretical assumptions.


Graph a step function in SAS

Last week I wrote about how to compute sample quantiles and weighted quantiles in SAS. As part of that article, I needed to draw some step functions. Recall that a step function is a piecewise constant function that jumps by a certain amount at a finite number of points.

Graph an empirical CDF

In statistics, the canonical step function is the empirical cumulative distribution function. Given a set of data values x1 ≤ x2 ≤ ... ≤ xn, the empirical distribution function (ECDF) is the step function defined by
     F(t) = (number of data values ≤ t) / n.
Notice that an ECDF is constant on each half-open interval [xi, xi+1).

The easiest way to visualize the ECDF in SAS is to use the CDFPLOT statement in PROC UNIVARIATE. The following DATA step creates a set of nine observations. Two values are repeated. The call to PROC UNIVARIATE creates a graph of the empirical CDF:

data A;
input x @@;
datalines;
3.7  1.0  2.2  4.1  5.0  1.9  3.0  2.2  4.1
;
ods select cdfplot;
proc univariate data=A;
   cdfplot x / vscale=proportion 
            odstitle="Empirical CDF" odstitle2="PROC UNIVARIATE";
   ods output cdfplot=outCDF;   /* data set contains ECDF values */
run;
Empirical cumulative distribution function (ECDF step function) as produced by PROC UNIVARIATE in SAS

The ECDF jumps by 1/n = 1/9 at each sorted data value. Because the values 2.2 and 4.1 appear twice in the data, the ECDF jumps by 2/9 at those data values. The ECDF is 0 for any point less than the minimum data value; it is 1 for any point greater than or equal to the maximum data value.
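A minimal Python sketch of the same computation confirms the jump sizes. (The data values are the nine observations from the DATA step above.)

```python
from collections import Counter

data = [3.7, 1.0, 2.2, 4.1, 5.0, 1.9, 3.0, 2.2, 4.1]
n = len(data)

def ecdf(t):
    """F(t) = (number of data values <= t) / n."""
    return sum(1 for x in data if x <= t) / n

# the ECDF jumps by count/n at each distinct data value
jumps = {x: c / n for x, c in Counter(data).items()}
print(jumps[2.2], ecdf(2.2))
```

As claimed, the ECDF jumps by 2/9 at the repeated value 2.2, and F(2.2) = 4/9 because four of the nine observations are less than or equal to 2.2.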

Create an ECDF graph manually

In the previous call to PROC UNIVARIATE, the ODS OUTPUT statement writes a SAS data set that contains the data values in sorted order and the value of the ECDF at each data value. You can use this output data set and the STEP statement in PROC SGPLOT to create your own graph of the ECDF. This gives you complete control over colors, labels, background grids, and other graphical attributes. You can also overlay other plots on the ECDF. For example, the following call to PROC SGPLOT creates an ECDF, adds a background grid, and overlays a fringe plot that shows individual data values:

title "Empirical CDF";
title2 "STEP and FRINGE Statements";
proc sgplot data=outCDF noautolegend;
   step x=ECDFX y=ECDFY;          /* variable names created by PROC UNIVARIATE */
   fringe ECDFX;
   xaxis grid label="x" offsetmin=0.05 offsetmax=0.05;
   yaxis grid min=0 label="Cumulative Proportion";
run;
Empirical CDF function (step function) visualized by using the STEP statement in PROC SGPLOT in SAS

Graph an arbitrary step function in SAS

For the ECDF, we used PROC UNIVARIATE to create a data set that contains the (X,Y) coordinates of each "corner" in the plot. For a general discontinuous function, you need to create a similar data set manually. If you can create the data set, you can use the STEP statement to visualize an arbitrary piecewise constant function.

Alternative visualization of the graph of an empirical CDF (step function)

Some users might want to omit the vertical lines in the graph in order to emphasize the discontinuous nature of the function. For example, the adjacent graph is an alternative approach to visualizing the ECDF. This graph emphasizes that the function is constant on intervals that are closed on the left and open on the right.

You can use the VECTOR statement in PROC SGPLOT to generate this graph. The VECTOR statement draws a line between two arbitrary points. The output from PROC UNIVARIATE provides the X and Y values for the plot, but you need to modify the data slightly because the VECTOR statement needs four variables: the starting and ending coordinates of each line segment.

Data format for using the VECTOR statement in PROC SGPLOT to graph a step function in SAS

The adjacent table shows the shape of the data that is suitable for graphing with the VECTOR statement. You can see that the LAG function is useful for generating the xL column from the xR column. The yL and yR columns are identical except for the first observation because the ECDF is piecewise constant. The VECTOR statement connects the points (xL, yL) and (xR, yR), where the "L" subscript refers to left-hand endpoints and the "R" subscript refers to right-hand endpoints.
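If you think of the LAG operation as pairing each point with its predecessor, the data preparation is easy to sketch in any language. Here is a Python version (the x and y values are the ECDF heights from the earlier nine-point example):

```python
# sorted distinct data values and the ECDF heights from the nine-point example
x = [1.0, 1.9, 2.2, 3.0, 3.7, 4.1, 5.0]
y = [1/9, 2/9, 4/9, 5/9, 6/9, 8/9, 1.0]

# pair each point with its successor (the lag relationship): the segment
# for the half-open interval [x_i, x_{i+1}) is drawn at constant height y_i
segments = [(xL, xR, h) for xL, xR, h in zip(x, x[1:], y)]
for s in segments:
    print(s)
```

Each tuple (xL, xR, h) corresponds to one row of the VECTOR data set: a horizontal segment from (xL, h) to (xR, h).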

/* CDF is step function. 
   Each interval [x_i, x_{i+1}) is closed on the left and open on the right */
title "Empirical CDF";
title2 "VECTOR Statement";
proc sgplot data=ECDF noautolegend;
   vector x=xR y=yR / xorigin=xL yorigin=yL noarrowheads;
   scatter x=xL y=yL / markerattrs=(symbol=CircleFilled color=black);      /* closed */
   scatter x=xR y=yR / filledoutlinedmarkers markerfillattrs=(color=white) /* open */
                       markerattrs=(symbol=CircleFilled color=black);
   xaxis grid label="x";
   yaxis grid min=0 max=1 label="Quantile";
run;

In conclusion, there are several ways to visualize an empirical CDF in SAS. You can also visualize the graph of a general piecewise constant function. You can choose either the STEP statement or the VECTOR statement in PROC SGPLOT to visualize the graph.


The Lambert W function in SAS

Branches of the real-valued Lambert W function

This article describes how you can evaluate the Lambert W function in SAS/IML software. The Lambert W function is defined implicitly: given a real value x, the function's value w = W(x) is the value of w that satisfies the equation w exp(w) = x. Thus W is the inverse of the function g(w) = w exp(w).

Because the function g has a minimum value at w=-1, there are two branches to the Lambert W function. Both branches are shown in the adjacent graph. The top branch is called the principal (or positive) branch and is denoted Wp. The principal branch has domain x ≥ -1/e and range W(x) ≥ -1. The lower branch, denoted by Wm, is defined on -1/e ≤ x < 0 and has range W(x) ≤ -1. The subscript "m" stands for "minus" because the range contains only negative values. Some authors use W0 for the upper branch and W-1 for the lower branch.

Both branches are used in applications, but the lower branch Wm is especially useful for the statistical programmer because you can use it to simulate data from heavy-tailed distributions. Briefly: for a class of heavy-tailed distributions, you can simulate data by using the inverse CDF method, which requires evaluating the Lambert W function.

Defining the Lambert W function in SAS/IML

You can download the SAS/IML functions that evaluate each branch of the Lambert W function. Both functions use an analytical approximation for the Lambert W function, followed by one or two iterations of Halley's method to ensure maximum precision:

  • The LambertWp function implements the principal branch Wp. The algorithm constructs a direct global approximation on [-1/e, ∞) based on Boyd (1998).
  • The LambertWm function implements the second branch Wm. The algorithm follows Chapeau-Blondeau and Monir (2002) (hereafter CBM). The CBM algorithm approximates Wm(x) on three different domains. The approximation uses a series expansion on [-1/e, -0.333), a rational-function approximation on [-0.333, -0.033], and an asymptotic series expansion on (-0.033, 0).
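The same scheme (a starting approximation polished by Halley iterations) ports directly to other languages. Here is a pure-Python sketch; it is not the downloadable SAS/IML code, and the simple initial guesses below are my own assumptions in place of the Boyd and CBM approximations:

```python
import math

def halley_lambertw(x, w0, iters=20):
    """Solve w*exp(w) = x by Halley's method, starting from the guess w0."""
    w = w0
    for _ in range(iters):
        f = w * math.exp(w) - x
        df = (1 + w) * math.exp(w)              # f'(w)
        ddf = (2 + w) * math.exp(w)             # f''(w)
        w -= f / (df - 0.5 * f * ddf / df)      # Halley update
    return w

def lambertWp(x):
    """Principal branch Wp on x >= -1/e; crude initial guess log(1+x)."""
    return halley_lambertw(x, math.log(1 + x))

def lambertWm(x):
    """Lower branch Wm on -1/e < x < 0; asymptotic-style initial guess."""
    L = math.log(-x)
    return halley_lambertw(x, L - math.log(-L))

print(lambertWp(1.0))     # the omega constant, about 0.5671
```

Because Halley's method converges cubically near a root, a handful of iterations from even a crude starting value reaches machine precision.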

Calling the Lambert W function in SAS/IML

Download the SAS/IML program for this article and save it to a file. Then the following SAS/IML program shows how to call the functions for the upper and lower branches and plot the graphs:

%include "C:\MyPath\";      /* specify path to file */
proc iml;
load module=(LambertWp LambertWm);
title "Lambert W Function: Principal Branch";
x = do(-1/exp(1), 1, 0.01);
Wp = LambertWp(x);
call series(x, Wp) grid={x y} other="refline 0/axis=x; refline 0/axis=y";
title "Lambert W Function: Lower Branch";
x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);
call series(x, Wm) grid={x y};

The graphs are not shown because they are indistinguishable from the graph at the beginning of this article.

You can use a simple error analysis to investigate the accuracy of these functions. After computing w = W(x), apply the inverse function g(w) = w exp(w) and see how close the result is to the input values. The LambertWp and LambertWm functions support an optional second parameter that specifies how many Halley iterations are performed. For LambertWm, only one iteration is performed by default. You can specify "2" as the second parameter to get two Halley iterations. The following statements show that the maximum error for the default computation is 2E-11 on (-1/e, -0.01). The error decreases to 5E-17 if you request two Halley iterations.

x = do(-1/exp(1), -0.01, 0.001);
Wm = LambertWm(x);            /* default precision: 1 Halley iteration */
g = Wm # exp(Wm);             /* apply inverse function */
maxError1 = max(abs(x - g));  /* maximum error */
Wm = LambertWm(x, 2);         /* higher precision: 2 Halley iterations */
g = Wm # exp(Wm);
maxError2 = max(abs(x - g));
print maxError1 maxError2;

In summary, the SAS/IML language provides an excellent environment for implementing new functions or statistical procedures that are not built into SAS. In this article I implemented methods from journal articles for evaluating the upper and lower branches of the Lambert W function, which is the inverse function of g(w) = w exp(w). Although the Lambert W function does not have a closed-form solution, you can implement an approximation to the function and then use one or two Halley iterations to quickly converge within machine precision.


Weighted percentiles

Many univariate descriptive statistics are intuitive. However, weighted statistics are less intuitive. A weight variable changes the computation of a statistic by giving more weight to some observations than to others. This article shows how to compute and visualize weighted percentiles, also known as weighted quantiles, as computed by PROC MEANS and PROC UNIVARIATE in SAS. Recall that percentiles and quantiles are the same thing: the 100pth percentile is equal to the pth quantile.

I do not discuss survey data in this article. Survey statisticians use weights to make valid inferences in survey data, and you can see the SAS documentation to learn about how to use weights to estimate variance in complex survey designs.


Weights vs frequencies

Before we calculate a weighted statistic, let's remember that a weight variable is not the same as a frequency variable. A frequency variable, which associates a positive integer with each observation, specifies that each observation is replicated a certain number of times. There is nothing unintuitive about the statistics that arise from including a frequency variable. They are the same statistics that you would obtain by duplicating each record according to the value of the frequency variable.

Weights are not frequencies. Weights can be fractional values. When comparing weighted and unweighted analyses, the key idea is this: an unweighted analysis is equivalent to a weighted analysis for which the weights are all 1. An "unweighted analysis" is really a misnomer; it should be called an "equally weighted" analysis!

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the weights. Therefore only relative weights are important, and the formulas simplify if you choose weights that sum to 1. For the remainder of this article, assume that the weights sum to unity and that an unweighted analysis has weights equal to 1/n, where n is the sample size.

Unweighted percentiles

To understand how weights change the computation of percentiles, let's review the standard unweighted computation of the empirical percentiles (or quantiles) of a set of n numbers. First, sort the data values from smallest to largest. Then construct the empirical cumulative distribution function (ECDF). Recall that the ECDF is a piecewise-constant step function that increases by 1/n at each data point. The quantity 1/n represents the fact that each observation is weighted equally in this analysis.

The quantile function is derived from the CDF function, and the quantile function for a discrete distribution is also a step function. You can use the graph of the ECDF to compute the quantiles. For example, suppose your data are
    { 1    1.9    2.2    3    3.7    4.1    5 }
The following graph shows the ECDF for these seven values:

Empirical CDF for data. The percentiles can be read off the graph.

The data values are indicated by tick marks along the horizontal axis. Notice that the ECDF jumps by 1/7 at each data value because there are seven unique values.

I should really show you the graph of the quantile function (an "inverse function" to the CDF), but you can visualize the graph of the quantile function if you rotate your head clockwise by 90 degrees. To find a quantile, start at some value on the Y axis, move across until you hit a vertical line, and then drop down to the X axis to find the datum value. For example, to find the 0.2 quantile (=20th percentile), start at Y=0.2 and move right horizontally until you hit the vertical line over the datum 1.9. Thus 1.9 is the 20th percentile. Similarly, the 0.6 quantile is the data value 3.7. (I omit details about what to do if you hit a horizontal line.)
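In code, the graphical rule amounts to taking the smallest data value whose ECDF value is at least p. Here is a short Python sketch of that rule. (Like the discussion above, it ignores the tie-handling cases, so it can differ from PROC MEANS at percentiles that land exactly on a horizontal line.)

```python
def quantile_from_ecdf(data, p):
    """Smallest data value x whose ECDF value F(x) is at least p."""
    xs = sorted(data)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= p:          # F(x) = i/n at the i-th sorted value
            return x
    return xs[-1]

data = [1, 1.9, 2.2, 3, 3.7, 4.1, 5]
print(quantile_from_ecdf(data, 0.2), quantile_from_ecdf(data, 0.6))
```

The sketch returns 1.9 for p=0.2 and 3.7 for p=0.6, matching the values read off the graph.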

Of course, SAS can speed up this process. The following call to PROC MEANS displays the 20th, 40th, 60th, and 80th percentiles:

data A;
input x wt;
datalines;
1    0.25
1.9  0.05
2.2  0.15
3.0  0.25
3.7  0.15
4.1  0.10
5    0.05
;
proc means data=A p20 p40 p60 p80;
   var x;                         /* unweighted analysis: data only */
run;
Empirical percentiles for seven data values

Weighted percentiles

The previous section shows the relationship between percentile values and the graph of the ECDF. This section describes how the ECDF changes if you specify unequal weights for the data. The change is that the weighted ECDF will jump by the (standardized) weight at each data value. Because the weights sum to unity, the CDF is still a step function that rises from 0 to 1, but now the steps are not uniform in height. Instead, data that have relatively large weights produce a large step in the graph of the ECDF function.

In the previous section, the DATA step defined a weight variable. The weights for this example are
    { 0.25    0.05    0.15    0.25    0.15    0.10    0.05 }
The following graph shows the weighted ECDF for these weights:

Weighted empirical CDF for data. The weighted percentiles can be read off the graph.

By using this weighted ECDF, you can read off the weighted quantiles. For example, to find the 0.2 weighted quantile, start at Y=0.2 and move horizontally until you hit a vertical line, which is over the datum 1.0. Thus 1.0 is the 20th weighted percentile. Similarly, the 0.6 quantile is the data value 3.0. You can confirm this by calling PROC MEANS with a WEIGHT statement, as shown below:

proc means data=A p20 p40 p60 p80;
   weight wt;                      /* weighted analysis */
   var x;
run;
Weighted percentiles
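Outside of SAS, the weighted rule is easy to state in code: the weighted percentile is the smallest data value whose cumulative standardized weight reaches p. A short Python sketch (the data and weights are the ones from the DATA step above):

```python
def weighted_quantile(values, weights, p):
    """Smallest value whose cumulative standardized weight is at least p."""
    total = sum(weights)
    cum = 0.0
    for x, w in sorted(zip(values, weights)):
        cum += w / total        # the weighted ECDF jumps by w/total at x
        if cum >= p:
            return x
    return max(values)

x  = [1, 1.9, 2.2, 3.0, 3.7, 4.1, 5]
wt = [0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05]
print(weighted_quantile(x, wt, 0.2), weighted_quantile(x, wt, 0.6))
```

With these weights, the 20th weighted percentile is 1.0 and the 60th is 3.0, in agreement with the weighted ECDF graph. When every weight is equal, the function reduces to the unweighted rule.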

An intuitive visualization of weighted percentiles

You can use a physical model to intuitively understand weighted percentiles. The model is the same as I used to visualize a weighted mean. Namely, imagine a point-mass of wi concentrated at position xi along a massless rod. Finding a weighted percentile p is equivalent to finding the first location along the rod (moving from left to right) at which the proportion of the weight is greater than p. (I omit how to handle special percentiles for which the proportion is equal to p.)

The physical model looks like the following:

Physical model of weighted percentiles:  masses on a rod

From the figure you can see that x1 is the pth quantile for p < 0.25. Similarly, x2 is the pth quantile for 0.25 < p < 0.30, and so forth.

If you want to apply these concepts to your own data, you can download the SAS program that generates the CDF graphs and computes the weighted percentiles.


Halley's method for finding roots

Edmond Halley (1656-1742) is best known for computing the orbit and predicting the return of the short-period comet that bears his name. However, like many scientists of his era, he was involved in a variety of mathematical and scientific activities. One of his mathematical contributions is a numerical method for finding the roots of a real-valued function. The technique, which is called Halley's method, is similar to Newton's method, but converges more rapidly in the neighborhood of a root.

There are dozens of root-finding methods, but Newton's method is the gold standard for finding roots of general nonlinear functions. It converges quadratically and is easy to implement because the formula requires only the evaluation of a function and its first derivative. Other competing methods do not use derivatives at all, or they use higher-order derivatives.

Halley's method falls into the latter category. Like Newton's method, it requires an initial guess for the root. It evaluates the function and its first two derivatives at an approximate location of the root and uses that information to iteratively improve the approximation. This article compares Halley's method with Newton's method and suggests a class of functions for which Halley's method is preferable.


An example of finding a root in SAS/IML

Consider the function f(x) = x exp(x). If you want to find the values of x for which f(x)=c, you can recast the problem as a root-finding problem and find the roots of the function f(x)-c. For this particular function, f(x)-c has one root if c ≥ 0 and two roots if -1/e < c < 0.

Halley's method: The roots of the function x*exp(x) + 1/(2e)

To be concrete, let c = -1/(2e) so that the equation has two roots. The function f(x)-c is shown to the right and the two roots are marked by red line segments. You can use the built-in FROOT function in SAS/IML to locate the roots. The FROOT function does not use Newton's method. Instead, it requires that you specify an interval on which to search for a root. From the graph, you can specify two intervals that bound roots, as follows:

proc iml;
/* The equation f(x)-c for f(x) = x exp(x) */
start func(x) global(g_target);
   return( x # exp(x) - g_target );
finish;
g_target = -1 / (2*constant('e')); /* parameter; function has two roots */
intervals = {-3 -2,                /* one root in [-3,-2] */
             -1  0};               /* another root in [-1,0] */
roots = froot("func", intervals);  /* find roots in each interval */
print roots;

The FROOT function is effective and efficient. It always finds an approximate root provided that you can supply a bounding interval on which the function changes sign. However, sometimes you don't know a bounding interval; you might know only an approximate value for the root. In that case, Newton-type methods are preferable.

Newton's method versus Halley's method

Halley's method is an extension of Newton's method that incorporates the second derivative of the target function. Whereas Newton's method iterates the formula xn+1 = xn - f(xn) / f′(xn), Halley's method contains an extra term in the denominator:
xn+1 = xn - f(xn) / [f′(xn) - 0.5 f(xn) f″(xn) / f′(xn)].
Notice that if f″(xn)=0, then Halley's method reduces to Newton's method.

For many functions, the computational cost of evaluating the extra term in Halley's formula does not offset the gain in convergence speed. In other words, it is often quicker and easier to use Newton's simpler formula than Halley's more complicated formula. However, Halley's method is worth implementing when the ratio f″(x) / f′(x) can be evaluated cheaply. The current function provides an example: f′(x) = (x+1) exp(x) and f″(x) = (x+2) exp(x), so the ratio is simply (x+2) / (x+1). This leads to the following SAS/IML functions. One function implements Newton's iteration and the other implements Halley's iteration:

/* Compute iterations of Newton's method for several initial guesses:
   f(x) = x#exp(x) - c */
start NewtonIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      z[i,] = x - fx / dfdx;              /* new approximate root */
   end;
   return( z );
finish;
/* Compute iterations of Halley's method for several initial guesses:
   f(x) = x#exp(x) - c */
start HalleyIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      R = (2+x) / (1+x);                  /* ratio f''(x) / f'(x) */
      z[i,] = x - fx / (dfdx - 0.5*fx#R); /* new approximate root */
   end;
   return( z );
finish;

Notice that the functions are practically identical. Halley's method computes an extra term (R) and includes that term in the iterative formula that converges to the root. I wrote the function in vectorized form so that you can pass in multiple initial guesses and the functions will iterate all guesses simultaneously. The following statements provide four initial guesses and apply both Newton's and Halley's method:

x0 = {-3 -2 -0.5 0.25};        /* multiple initial guesses */
N = NewtonIters(x0, 5);        /* compute 5 Newton iterations */
H = HalleyIters(x0, 3);        /* compute 3 Halley iterations */
print N[L="Newton's Method" c=("Guess1":"Guess4") r=("Iter=0":"Iter=5")];
print H[L="Halley's Method" c=("Guess1":"Guess4") r=("Iter=0":"Iter=3")];

The tables show that Halley's method tends to converge to a root faster than Newton's method. For the four initial conditions, Newton's method required three, four, or five iterations to converge to within 1e-6 of the root. In contrast, Halley's method used three or fewer iterations to reach the same level of convergence.

Again, for Halley's method to be useful in practice, the ratio f″(x) / f′(x) must be easy to evaluate. To generalize this example, the ratio simplifies if the function is the product of a simple function and an exponential: f(x) = P(x)*exp(x). For example, if P(x) is a polynomial, then f″ / f′ is a rational function.
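To see the two iterations side by side outside of SAS/IML, here is a minimal Python sketch for the same target function f(x) = x exp(x) + 1/(2e). (The starting guess and tolerance are arbitrary choices for illustration.)

```python
import math

c = -1 / (2 * math.e)                  # f(x) = x*exp(x) - c has two roots

def f(x):
    return x * math.exp(x) - c

def df(x):
    return (1 + x) * math.exp(x)       # f'(x)

def newton_root(x0, tol=1e-12, max_iter=50):
    x, k = x0, 0
    while abs(f(x)) > tol and k < max_iter:
        x -= f(x) / df(x)
        k += 1
    return x, k

def halley_root(x0, tol=1e-12, max_iter=50):
    x, k = x0, 0
    while abs(f(x)) > tol and k < max_iter:
        R = (2 + x) / (1 + x)          # the ratio f''(x)/f'(x) simplifies
        x -= f(x) / (df(x) - 0.5 * f(x) * R)
        k += 1
    return x, k

xn, kn = newton_root(-0.5)
xh, kh = halley_root(-0.5)
print(xn, kn, xh, kh)                  # same root; Halley typically needs fewer iterations
```

From the same starting guess, both methods converge to the same root, and the Halley iteration reaches the tolerance in fewer steps, consistent with its cubic convergence rate.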

In summary, Halley's method is a powerful alternative to Newton's method for finding roots of a function f for which the ratio f″(x) / f′(x) has a simple expression. In that case, Halley's root-finding method is easy to implement and converges to roots of f faster than Newton's method for the same initial guess.
