Halley's method for finding roots

Edmond Halley (1656-1742) is best known for computing the orbit and predicting the return of the short-period comet that bears his name. However, like many scientists of his era, he was involved in a variety of mathematical and scientific activities. One of his mathematical contributions is a numerical method for finding the roots of a real-valued function. The technique, which is called Halley's method, is similar to Newton's method, but converges more rapidly in the neighborhood of a root.

There are dozens of root-finding methods, but Newton's method is the gold standard for finding roots of general nonlinear functions. It converges quadratically and is easy to implement because the formula requires only the evaluation of a function and its first derivative. Other competing methods do not use derivatives at all, or they use higher-order derivatives.

Halley's method falls into the latter category. Like Newton's method, it requires an initial guess for the root. It evaluates the function and its first two derivatives at an approximate location of the root and uses that information to iteratively improve the approximation. This article compares Halley's method with Newton's method and suggests a class of functions for which Halley's method is preferable.

Halley's root-finding method: An alternative to Newton's method #SASTip Click To Tweet

An example of finding a root in SAS/IML

Consider the function f(x) = x exp(x). If you want to find the values of x for which f(x)=c, you can recast the problem as a root-finding problem and find the roots of the function f(x)-c. For this particular function, the equation f(x)-c has one root if c ≥ 0 and two roots if -1/e < c < 0.

Halley's method: The roots of the function x*exp(x) + 1/(2e)

To be concrete, let c = -1/(2e) so that the equation has two roots. The function f(x)-c is shown to the right and the two roots are marked by red line segments. You can use the built-in FROOT function in SAS/IML to locate the roots. The FROOT function does not use Newton's method. Instead, it requires that you specify an interval on which to search for a root. From the graph, you can specify two intervals that bound roots, as follows:

proc iml;
/* The equation f(x)-c for f(x) = x exp(x) */
start func(x) global(g_target);
   return x # exp(x) - g_target;
finish;
 
g_target = -1 / (2*constant('e')); /* parameter; function has two roots */
intervals = {-3 -2,                /* one root in [-3,-2] */
             -1  0};               /* another root in [-1,0] */
roots = froot("func", intervals);  /* find roots in each interval */
print roots;
t_halley1

The FROOT function is effective and efficient. It always finds an approximate root provided that you can supply a bounding interval on which the function changes signs. However, sometimes you don't know a bounding interval, you only know an approximate value for the root. In that case, Newton-type methods are preferable.

Newton's method versus Halley's method

Halley's method is an extension of Newton's method that incorporates the second derivative of the target function. Whereas Newton's method iterates the formula xn+1 = xn - f(xn) / f′(xn), Halley's method contains an extra term in the denominator:
xn+1 = xn - f(xn) / [f′(xn+1) - 0.5 f(xn) f″(xn) / f′(xn)].
Notice that if f″(xn)=0, then Halley's method reduced to Newton's method.

For many functions, the computational cost of evaluating the extra term in Halley's formula does not offset the gain in convergence speed. In other words, it is often quicker and easier to use Newton's simpler formula than Halley's more complicated formula. However, Halley's method is worth implementing when the ratio f″(x) / f′(x) can be evaluated cheaply. The current function provides an example: f′(x) = (x+1) exp(x) and f″(x) = (x+2) exp(x), so the ratio is simply (x+2) / (x+1). This leads to the following SAS/IML functions. One function implements Newton's iteration and the other implements Halley's iteration:

/* Compute iterations of Newton's method for several initial guesses:
   f(x) = x#exp(x) - c */
start NewtonIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      z[i,] = x - fx / dfdx;              /* new approximate root */
   end;
   return z;
finish;
 
/* Compute iterations of Halley's method for several initial guesses:
   f(x) = x#exp(x) - c */
start HalleyIters(x0, numIters=1) global(g_target);
   z = j(numIters+1, ncol(x0));
   z[1,] = x0;
   do i = 2 to numIters+1;
      x = z[i-1,];
      fx = x # exp(x) - g_target;         /* f(x)   */
      dfdx = (1+x) # exp(x);              /* f'(x)  */
      R = (2+x) / (1+x);                  /* ratio f''(x) / f'(x) */
      z[i,] = x - fx / (dfdx - 0.5*fx#R); /* new approximate root */
   end;
   return z;
finish;

Notice that the functions are practically identical. Halley's method computes an extra term (R) and includes that term in the iterative formula that converges to the root. I wrote the function in vectorized form so that you can pass in multiple initial guesses and the functions will iterate all guesses simultaneously. The following statements provide four initial guesses and apply both Newton's and Halley's method:

x0 = {-3 -2 -0.5 0.25};        /* multiple initial guesses */
N = NewtonIters(x0, 5);        /* compute 5 Newton iterations */
H = HalleyIters(x0, 3);        /* compute 3 Halley iterations */                
rNames = "Iter=0":"Iter=5";
print N[L="Newton's Method" c=("Guess1":"Guess4") r=rNames];
print H[L="Halley's Method" c=("Guess1":"Guess4") r=rNames];
t_halley2

The tables show that Halley's method tends to converge to a root faster than Newton's method. For the four initial conditions, Newton's method required three, four, or five iterations to converge to within 1e-6 of the root. In contrast, Haley's method used three or fewer iterations to reach the same level of convergence.

Again, for Halley's method to be useful in practice, the ratio f″(x) / f′(x) must be easy to evaluate. To generalize this example, the ratio simplifies if the function is the product of a simple function and an exponential: f(x) = P(x)*exp(x). For example, if P(x) is a polynomial, then f″ / f′ is a rational function.

In summary, Halley's method is a powerful alternative to Newton's method for finding roots of a function f for which the ratio f″(x) / f′(x) has a simple expression. In that case, Halley's root-finding method is easy to implement and converges to roots of f faster than Newton's method for the same initial guess.

Post a Comment

Create an animation with the BY statement in PROC SGPLOT

Create an animation by using the BY statement in the PROC SGPLOT

It is easy to use PROC SGPLOT and BY-group processing to create an animated graph in SAS 9.4. Sanjay Matange previously discussed how to create an animated plot in SAS 9.4, but he used a macro loop to call PROC SGPLOT many times.

It is often easier to use the BY statement in SAS procedures to create many graphs. Someone recently asked me how I created an animation that shows level sets of a contour plot. This article explains how to create an animation by using the BY statement in PROC SGPLOT.

An animation requires that you create a sequence of images. In SAS 9.4, you can create an animated GIF by using the ODS PRINTER destination. ODS does not care how the images are generated. They can be created by a macro loop. Or, as shown below, they can be generated by using the BY statement in PROC SGPLOT, SGRENDER, or any other procedure in SAS.

As an example, I will create the graph at the top of this article, which shows the annual time series for the stock price of three US companies for 20 consecutive years. The data are contained in the Sashelp.Stocks data set. The following DATA step adds two new variables: Year and Month. The data are then sorted according to Date, which also sorts the data by Year.

data stocks;
   set sashelp.stocks;
   Month = month(date);      /* 1, 2, 3, ..., 12 */
   Year = year(date);        /* 1986, 1987, ..., 2005 */
run;
 
proc sort data=stocks; by date; run;

I will create an animation that contains 20 frames. Each frame will be a graph that shows the stock performance for the three companies in a particular year. You can use PROC MEANS to discover that the stock prices are within the range [10, 210], so that range is used for the vertical axis:

ods graphics / imagefmt=GIF width=4in height=3in;     /* each image is 4in x 3in GIF */
options papersize=('4 in', '3 in')                    /* set size for images */
        nodate nonumber                               /* do not show date, time, or frame number */
        animduration=0.5 animloop=yes noanimoverlay   /* animation details */
        printerpath=gif animation=start;              /* start recording images to GIF */
ods printer file='C:\AnimGif\ByGroup\Anim.gif';  /* images saved into animated GIF */
 
ods html select none;                                 /* suppress screen output */
proc sgplot data=stocks;
title "Stock Performance";
   by year;                                           /* create 20 images, one for each year */
   series x=month y=close / group=stock;              /* each image is a time series */
   xaxis integer values=(1 to 12);                         
   yaxis min=10 max=210 grid;                         /* set common vertical scale for all graphs */
run;
ods html select all;                                  /* restore screen output */
 
options printerpath=gif animation=stop;               /* stop recording images */
ods printer close;                                    /* close the animated GIF file */

The BY statement writes a series of images. They are placed into the animated GIF file that you specify on the FILE= option in the ODS PRINTER statement.

A few tricks are worth mentioning:
  • Use the ODS GRAPHICS statement to specify the size of the image in some physical measurement (inches or centimeters). On the OPTIONS statement, specify those same measurements.
  • You can control the animation by using SAS system options. To create an animated GIF, use OPTIONS PRINTERPATH=GIF ANIMATION=START before you generate the image files. Use OPTIONS PRINTERPATH=GIF ANIMATION=STOP after you generate the image files.
  • Use the ANIMDURATION= option to specify the time interval (in seconds) that each frame appears. Typical values are 0.1 to 0.5.
  • Use the ANIMLOOP= option to specify whether the animation should repeat after reaching the end
  • Use ODS HTML SELECT NONE to prevent the animation frames from appearing in your HTML output. (If you use the LISTING output, replace "HTML" with "LISTING.")
  • By default, the levels of the BY group are automatically displayed in the TITLE2 field of the graph. You can turn off this behavior by specifying the NOBYLINE option. You can use the expression #BYVAL(year) in a TITLE or FOOTNOTE statement to incorporate the BY-group level into a title or footnote.

You can use a browser to view the image. As I did in this blog post, you can embed the image in a web page.

Have fun creating your animations! Leave a comment and tell me about your animated creations.

Post a Comment

The smooth bootstrap method in SAS

Last week I showed how to use the simple bootstrap to randomly resample from the data to create B bootstrap samples, each containing N observations. The simple bootstrap is equivalent to sampling from the empirical cumulative distribution function (ECDF) of the data. An alternative bootstrap technique is called the smooth bootstrap. In the smooth bootstrap you add a small amount of random noise to each observation that is selected during the resampling process. This is equivalent to sampling from a kernel density estimate, rather than from the empirical density.

The example in this article is adapted from Chapter 15 of Wicklin (2013), Simulating Data with SAS.

The bootstrap method in SAS/IML

My previous article used the bootstrap method to investigate the sampling distribution of the skewness statistic for the SepalWidth variable in the Sashelp.Iris data. I used PROC SURVEYSELECT to resample the data and used PROC MEANS to analyze properties of the bootstrap distribution. You can also use SAS/IML to implement the bootstrap method.

The following SAS/IML statements creates 5000 bootstrap samples of the SepalWidth data. However, instead of computing a bootstrap distribution for the skewness statistic, this program computes a bootstrap distribution for the median statistic. The SAMPLE function enables you to resample from the data.

data sample(keep=x);
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* Basic bootstrap confidence interval for median  */
%let NumSamples = 5000;                 /* number of bootstrap resamples */
proc iml;
use Sample;  read all var {x};  close Sample;        /* read data */
call randseed(12345);                   /* set random number seed */
obsStat = median(x);                    /* compute statistic on original data */
s = sample(x, &NumSamples // nrow(x));  /* bootstrap samples: 50 x NumSamples */
D = T( median(s) );                     /* bootstrap distribution for statistic */
call qntl(q, D, {0.025 0.975});         /* basic 95% bootstrap CI */
results = obsStat || q`;
print results[L="Bootstrap Median" c={"obsStat" "P025" "P975"}];
Simple bootstrap confidence interval for median

The SAS/IML program is very compact. The MEDIAN function computes the median for the original data. The SAMPLE function generates 5000 resamples; each bootstrap sample is a column of the s matrix. The MEDIAN function then computes the median of each column. The QNTL subroutine computes a 95% confidence interval for the median as [28, 30]. (Incidentally, you can use PROC UNIVARIATE to compute distribution-free confidence intervals for standard percentiles such as the median.)

The bootstrap distribution of the median

The following statement create a histogram of the bootstrap distribution of the median:

title "Bootstrap Distribution for Median";
call histogram(D) label="Median";           /* create histogram in SAS/IML */
Bootstrap distribution of the median. The data are rounded values.

I was surprised when I first saw a bootstrap distribution like this. The distribution contains discrete values. More than 80% of the bootstrap samples have a median value of 30. The remaining samples have values that are integers or half-integers.

This distribution is typical of the bootstrap distribution for a percentile. Three factors contribute to the shape:

  • The measurements are rounded to the nearest millimeter. Thus the data are discrete integers.
  • The sample median is always a data value or (for N even) the midpoint between two data values. In fact, this statement is true for all percentiles.
  • In the Sample data, the value 30 is not only the median, but is the mode. Consequently, many bootstrap samples will have 30 as the median value.
The smooth bootstrap can analyze percentiles of rounded data #StatWisdom #SASTip Click To Tweet

Smooth bootstrap

Although the bootstrap distribution for the median is correct, it is somewhat unsatisfying. Widths and lengths represent continuous quantities. Consequently, the true sampling distribution of the median statistic is continuous.

The bootstrap distribution would look more continuous if the data had been measured with more precision. Although you cannot change the data, you can change the way that you create bootstrap samples. Instead of drawing resamples from the (discrete) ECDF, you can randomly draw samples from a kernel density estimate (KDE) of the data. The resulting samples will not contain data values. Instead, they will contains values that are randomly drawn from a continuous KDE.

You have to make two choices for the KDE: the shape of the kernel and the bandwidth. This article explores two possible choices:

  • Uniform kernel: A recorded measurement of 30 mm means that the true value of the sepal was in the interval [29.5, 30.5). In general, a recorded value of x means that the true value is in [x-0.5, x+0.5). Assuming that any point in that interval is equally likely leads to a uniform kernel with bandwidth 0.5.
  • Normal kernel: You can assume that the true measurement is normally distributed, centered on the measured value, and is very likely to be within [x-0.5, x+0.5). For a normal distribution, 95% of the probability is contained in ±2σ of the mean, so you could choose σ=0.25 and assume that the true value for the measurement x is in the distribution N(x, 0.25).

For more about the smooth bootstrap, see Davison and Hinkley (1997) Bootstrap Methods and their Application.

Smooth bootstrap in SAS/IML

For the iris data, the uniform kernel seems intuitively appealing. The following SAS/IML program defines a function named SmoothUniform that randomly chooses B samples and adds a random U(-h, h) variate to each data point. The medians of the columns form the bootstrap distribution.

/* randomly draw a point from x. Add noise from U(-h, h) */
start SmoothUniform(x, B, h);
   N = nrow(x) * ncol(x);
   s = Sample(x, N // B);               /* B x N matrix */
   eps = j(B, N);                       /* allocate vector */
   call randgen(eps, "Uniform", -h, h); /* fill vector */
   return( s + eps );                   /* add random uniform noise */
finish;
 
s = SmoothUniform(x, &NumSamples, 0.5); /* columns are bootstrap samples from KDE */
D = T( median(s) );                     /* median of each col is bootstrap distrib */
BSEst = mean(D);                        /* bootstrap estimate of median */
call qntl(q, D, {0.025 0.975});         /* basic 95% bootstrap CI */
results = BSEst || q`;
print results[L="Smooth Bootstrap (Uniform Kernel)" c={"Est" "P025" "P975"}];
Smooth bootstrap estimate and confidence interval for the median

The smooth bootstrap distribution (not shown) is continuous. The mean of the distribution is the bootstrap estimate for the median. The estimate for this run is 29.8. The central 95% of the smooth bootstrap distribution is [29.77, 29.87]. The bootstrap estimate is close to the observed median, but the CI is much smaller than the earlier simple CI. Notice that the observed median (which is computed on the rounded data) is not in the 95% CI from the smooth bootstrap distribution.

Many researchers in density estimation state that the shape of the kernel function does not have a strong impact on the density estimate (Scott (1992), Multivariate Density Estimation, p. 141). Nevertheless, the following SAS/IML statements define a function called SmoothNormal that implements a smoothed bootstrap with a normal kernel:

/* Smooth bootstrap with normal kernel and sigma = h */
start SmoothNormal(x, B, h);
   N = nrow(x) * ncol(x);
   s = Sample(x, N // B);                /* B x N matrix */
   eps = j(B, N);                        /* allocate vector */
   call randgen(eps, "Normal", 0, h);    /* fill vector */
   return( s + eps );                    /* add random normal variate */
finish;
 
s = SmoothNormal(x, &NumSamples, 0.25);  /* bootstrap samples from KDE */

The mean of this smooth bootstrap distribution is 29.89. The central 95% interval is [29.86, 29.91]. As expected, these values are similar to the values obtained by using the uniform kernel.

Summary of the smooth bootstrap

In summary, the SAS/IML language provides a compact and efficient way to implement the bootstrap method for a univariate statistic such as the skewness or median. A visualization of the bootstrap distribution of the median reveals that the distribution is discrete due to the rounded data values and the statistical properties of percentiles. If you choose, you can "undo" the rounding by implementing the smooth bootstrap method. The smooth bootstrap is equivalent to drawing bootstrap samples from a kernel density estimate of the data. The resulting bootstrap distribution is continuous and gives a smaller confidence interval for the median of the population.

Post a Comment

Formats for p-values and odds ratios in SAS

Last week I showed some features of SAS formats, including the fact that you can use formats to bin a continuous variable without creating a new variable in the DATA step. During the discussion I mentioned that it can be confusing to look at the output of a formatted variable if you don't realize that a format has been applied.

Two formats that sometimes puzzle new SAS programmers are the PVALUEw.d format, which displays p-values, and the ODDSRw.d, which displays odds ratios. These formats appear in many SAS statistical tables. They are the reasons that a table might display a very small p-value or odds ratio with the string "< 0.001." This article describes these formats and explains how to interpret extreme odds ratios.

Formatted p-values and odds ratios

The following call to PROC LOGISTIC displays two tables. One has a column for p-values, the other displays odds ratios:

proc logistic data=sashelp.cars plots=none;
where Origin in ("Asia" "USA");
   class Type;
   model Origin = Type MPG_City;
   ods select ParameterEstimates OddsRatios;
   ods output ParameterEstimates=PE OddsRatios=OR;
run;
t_pvalfmt1

I've overlaid a red rectangle on columns that have special formats. The ParameterEstimates table has a column that is formatted by using the PVALUE6.4 format. The OddsRatios table has three columns that are formatted by using the ODDSR8.3 format.

The cells that are highlighted in yellow show the effect of the formats. In the ParameterEstimates table, the p-value for the MPG_City variable is very small, so it is displayed by using the formatted values "<.0001". In the OddsRatios table, the estimate of the odds ratio for the "Hybrid" vehicles versus the "Wagon" vehicles (the reference level) is very large. It is displayed as ">999.999". Furthermore, the 95% confidence interval for the parameter is displayed as [<0.001, >999.999], which you can interpret as "any value in the interval (0, ∞)."

If you know about these formats, you can interpret the meaning of the highlighted cells. You can always discover the formats used in an ODS table by using the ODS OUTPUT statement to write the tables to a SAS data set and then use PROC CONTENTS to display the variables and formats in the table, as follows:

ods select Position(persist);
proc contents data=PE order=varnum; run;
proc contents data=OR order=varnum; run;
ods select all;

The PVALUE and ODDSR formats

The PVALUEw.d and ODDSRw.d formats are very simple. Small values are displayed by using a "less than" symbol. Here "small" means less than 10-d, where d is the number of decimals in the format specification. Usually d=3 or d=4. The ODDSRw.d format has a similar behavior for large numbers. The formatted values are displayed by using the "greater than" symbol.

The following DATA step demonstrates how the PVALUEw.d and ODDSRw.d formats display very small numbers.

data FmtExample(drop=d);
do d = -5 to 3;
   raw = 10**d;
   oddsRatio = raw;
   if d<=0 then pVal = raw;   /* a p-value must be less than 1 */
   else pVal = .;
   output;
end;
format raw BEST8. oddsRatio ODDSR8.3 pVal PVALUE6.4;
run;
 
options missing=" ";          /* display missing values as blanks */
proc print noobs; run;
options missing=".";          /* reset display of missing values */
Formatted display of p-values and odds ratios

The first column shows the raw unformatted values, which are powers of 10. The second column shows those values formatted by using the ODDSR8.3 format. Values that are less than 10-3 or greater than 103 are displayed by using a less than or greater than symbol. The third column shows the values formatted by using the PVALUE6.4 format. Values that are less than 10-4 are displayed by using a less than symbol.

What does ">999.999" mean for an odds ratio?

For the statistical programmer, the question is not "why is SAS displaying funny-looking numbers," but "what do these numbers tell me about my model and my data?"

For a p-value, small numbers indicate statistical significance. This is usually good. It means that an effect is significant or that there is evidence to reject some null hypothesis.

For an odds ratio, medium-sized numbers appear most often. For example, an odds ratio of 10 (or, equivalently, 1/10) might enable you to conclude that "patients in the experimental group are 10 times more likely to survive than patients in the placebo group." Seeing very large numbers (like 104) or very tiny number (10-4) often indicates that something is wrong. Perhaps your model is misspecified, or perhaps the data is degenerate in some way and does not provide any evidence for or against a statistical hypothesis.

As an example, let's try to understand why the odds ratio is so extreme for the example in this article. The model is trying to predict whether a vehicle is manufactured in Asia or the US. An explanatory factor is the Type variable, and the OddsRatio table indicates something strange about the "Hybrid" level. When using a categorical variable to predict a categorical response, it is often useful to look at the cross-tabulation of the data for the response and explanatory variables:

proc freq data=sashelp.cars;
where Origin in ("Asia" "USA");
   tables origin * Type / nocum nopercent norow nocol;
run;
t_pvalfmt3

The output from PROC FREQ shows that the "Hybrid" level of the Type variable has only three observations. Furthermore, all three observations are for Asian-manufactured vehicles. The model tries to predict the probability that a car is Asian, given the Type of the car. However, the data indicate that a hybrid car is "infinitely more likely" to be Asian because all hybrid vehicles in the data set are Asian.

If you want a more rigorous explanation, see the PROC LOGISTIC documentation for how the odds ratio is estimated. For this model, the log of the odds ratio equals zero, which makes the odds ratio undefined. If you look in the SAS log, you will see that PROC LOGISTIC issued a warning that something was wrong with the model:

WARNING: There is possibly a quasi-complete separation of data points. The maximum 
         likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results 
         shown are based on the last maximum likelihood iteration. Validity of 
         the model fit is questionable.
Post a Comment

Compute a bootstrap confidence interval in SAS

A common question is "how do I compute a bootstrap confidence interval in SAS?" As a reminder, the bootstrap method consists of the following steps:

  • Compute the statistic of interest for the original data
  • Resample B times from the data to form B bootstrap samples. How you resample depends on the null hypothesis that you are testing.
  • Compute the statistic on each bootstrap sample. This creates the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
  • Use the approximate sampling distribution to obtain bootstrap estimates such as standard errors, confidence intervals, and evidence for or against the null hypothesis.

The papers by Cassell ("Don't be Loopy", 2007; "BootstrapMania!", 2010) describe ways to bootstrap efficiently in SAS. The basic idea is to use the DATA step, PROC SURVEYSELECT, or the SAS/IML SAMPLE function to generate the bootstrap samples in a single data set. Then use a BY statement to carry out the analysis on each sample. Using the BY-group approach is much faster than using macro loops.

To illustrate bootstrapping in Base SAS, this article shows how to compute a simple bootstrap confidence interval for the skewness statistic by using the bootstrap percentile method. The example is adapted from Chapter 15 of Simulating Data with SAS, which discusses resampling and bootstrap methods in SAS. SAS also provides the %BOOT and %BOOTCI macros, which provide bootstrap methods and several kinds of confidence intervals.

Compute a bootstrap confidence interval in Base #SAS. #Statistics Click To Tweet

The accuracy of a statistical point estimate

The following statements define a data set called Sample. The data are measurements of the sepal width for 50 randomly chosen iris flowers of the species iris Virginica. The call to PROC MEANS computes the skewness of the sample:

data sample(keep=x);
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* 1. compute value of the statistic on original data: Skewness = 0.366 */
proc means data=sample nolabels Skew;  var x;  run;

The sample skewness for these data is 0.366. This estimates the skewness of sepal widths in the population of all i. Virginica. You can ask two questions: (1) How accurate is the estimate, and (2) Does the data indicate that the distribution of the population is skewed? An answer to (1) is provided by the standard error of the skewness statistic. One way to answer question (2) is to compute a confidence interval for the skewness and see whether it contains 0.

PROC MEANS does not provide a standard error or a confidence interval for the skewness, but the next section shows how to use bootstrap methods to estimate these quantities.

Resampling with PROC SURVEYSELECT

For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The following statements generate 5000 bootstrap samples by repeatedly drawing 50 random observations (with replacement) from the original data:

%let NumSamples = 5000;       /* number of bootstrap resamples */
/* 2. Generate many bootstrap samples */
proc surveyselect data=sample NOPRINT seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

The output data set represents 5000 samples of size 50, but the output data set contains fewer than 250,000 observations. That is because the SURVEYSELECT procedure generates a variable named NumberHits that records the frequency of each observation in each sample. You can use this variable on the FREQ statement of many SAS procedures, including PROC MEANS. If the SAS procedure that you are using does not support a frequency variable, you can use the OUTHITS option on the PROC SURVEYSELECT statement to obtain a data set that contains 250,000 observations.

BY group analysis of bootstrap samples

The following call to PROC MEANS computes 5000 skewness statistics, one for each of the bootstrap samples. The NOPRINT option is used to suppress the results from appearing on your monitor. (You can read about why it is important to suppress ODS during a bootstrap computation.) The 5000 skewness statistics are written to a data set called OutStats for subsequent analysis:

/* 3. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq noprint;
   by SampleID;
   freq NumberHits;
   var x;
   output out=OutStats skew=Skewness;  /* approx sampling distribution */
run;

Visualize the bootstrap distribution

The bootstrap distribution tells you how the statistic (in this case, the skewness) might vary due to random sampling variation. You can use a histogram to visualize the bootstrap distribution of the skewness statistic:

title "Bootstrap Distribution";
%let Est = 0.366;
proc sgplot data=OutStats;
   label Skewness= ;
   histogram Skewness;
   /* Optional: draw reference line at observed value and draw 95% CI */
   refline &Est / axis=x lineattrs=(color=red) 
                  name="Est" legendlabel="Observed Statistic = &Est";
   refline -0.44737 0.96934  / axis=x lineattrs=(color=blue) 
                  name="CI" legendlabel="95% CI";
   keylegend "Est" "CI";
run;
Bootstrap distribution with 95% bootstrap confidence interval in SAS

In this graph, the REFLINE statement is used to display (in red) the observed value of the statistic for the original data. A second REFLINE statement plots (in blue) an approximate 95% confidence interval for the skewness parameter, which is computed in the next section. The bootstrap confidence interval contains 0, thus you cannot conclude that the skewness parameter is significantly different from 0.

Compute a bootstrap confidence interval in SAS

The standard deviation of the bootstrap distribution is an estimate for the standard error of the statistic. If the sampling distribution is approximately normal, you can use this fact to construct the usual Wald confidence interval about the observed value of the statistic. That is, if T is the observed statistic, then the endpoints of the 95% two-sided confidence interval are T ± 1.96 SE. (Or use the so-called bootstrap t interval by replacing 1.96 with tα/2, n-1.) The following call to PROC MEANS produces the standard error (not shown):

proc means data=OutStats nolabels N StdDev;
   var Skewness;
run;

However, since the bootstrap distribution is an approximate sampling distribution, you don't need to rely on a normality assumption. Instead, you can use percentiles of the bootstrap distribution to estimate a confidence interval. For example, the following call to PROC UNIVARIATE computes a two-side 95% confidence interval by using the lower 2.5th percentile and the upper 97.5th percentile of the bootstrap distribution:

/* 4. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
   var Skewness;
   output out=Pctl pctlpre =CI95_
          pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
          pctlname=Lower Upper;
run;
 
proc print data=Pctl noobs; run;
95% bootstrap confidence interval

As mentioned previously, the 95% bootstrap confidence interval contains 0. Although the observed skewness value (0.366) might not seem very close to 0, the bootstrap distribution shows that there is substantial variation in the skewness statistic in small samples.

The bootstrap percentile method is a simple way to obtain a confidence interval for many statistics. There are several more sophisticated methods for computing a bootstrap confidence interval, but this simple method provides an easy way to use the bootstrap to assess the accuracy of a point estimate. For an overivew of bootstrap methods, see Davison and Hinkley (1997) Bootstrap Methods and their Application.

Post a Comment

Use SAS formats to bin numerical variables

SAS formats are flexible, dynamic, and have many uses. For example, you can use formats to count missing values and to change the order of a categorical variable in a table or plot. Did you know that you can also use SAS formats to recode a variable or to bin a numerical variable into categories? This can be very convenient because you do not need to create a new variable in the data set; you merely apply a format to an existing variable.

Income categories: Are you middle class?

Many people use several IF-THEN/ELSE statements in the DATA step (or in a DATA step view) to create a new discrete variable that represents binned values of a continuous variable. That is a fine technique, but an alternative technique is to create a user-defined format that bins a continuous variable. One advantage of a custom format is that you can apply the format to multiple variables in multiple data sets.

For example, suppose that you want to define income categories such as "working class," "middle class," and the ultra-rich "1 percenters." According to a 2012 Money magazine article, the following cut points divide US household income into seven categories:

/* 2012 income categories for US according to Money magazine */
proc format;
value IncomeFmt  
      low   -<  23000 = "Poverty"        /* < 23 thousand         */
      23000 -<  32500 = "Working"        /* [ 23,  32.5) thousand */
      32500 -<  60000 = "Lower Middle"   /* [ 32.5, 60) thousand  */
      60000 -< 100000 = "Middle"         /* [ 60, 100) thousand   */
     100000 -< 150000 = "Upper Middle"   /* [100, 150) thousand   */
     150000 -< 250000 = "5 Percenter"    /* [150, 250) thousand   */
     250000 -   high  = "1 Percenter";   /* > 250 thousand        */
run;

The call to PROC FORMAT creates a custom format called IncomeFmt. When you assign the IncomeFmt format to a numerical variable, SAS will look at the value of each observation and determine the formatted value from the raw value. For example, a value of 18,000 is less than 23,000, so that value is formatted as "Poverty." A value of 85,000 is in the half-open interval [60000, 100000), so that value is formatted as "Middle."

The following DATA step defines the incomes for 26 fictitious individuals. The IncomeFmt format is assigned to the Income variable:

data incomes;
length Name $10.;
input Name Income @@;
format Income IncomeFmt.;     /* assign IncomeFmt format to Income variable */
datalines;
Amy        65100 Brad      146500 
Carlos    113300 Dimtri     28800 
Eduardo   233300 Felicity   14600 
Guo        43400 Hector    141700 
Irene      53400 Jacob     170300 
Katerina   43100 Liu        66800 
Michael    15800 Nancy      30900 
Oscar      31800 Pablo      65800 
Quentin    40000 Rick       62200 
Stephan    32900 Tracy      64000 
Umberto   124000 Victoria  220700 
Wanda     263800 Xie         9300 
Yolanda    23400 Zachary    93800 
;

The Income variable is a continuous variable, but the format bins each value into one of seven discrete values. Consequently, SAS procedures can analyze the Income variable as if it were a discrete variable. For example, you can count the number of individuals in each income category by calling PROC FREQ:

proc freq data=incomes; 
   tables Income / nocum norow nocol;
run;
Use SAS format to bin a numeric variable

Assigning or unassigning formats at run time

The underlying data values are not lost. You can use a FORMAT statement in a SAS procedure to temporarily assign or unassign a format. If you remove the format, you can analyze the underlying raw data. For example, the following call to PROC UNIVARIATE analyzes the raw incomes:

proc univariate data=incomes;
   format Income;     /* remove the format for this analysis */
   var Income;
run;

In a similar way, if you specify the Income variable on a CLASS statement in a regression procedures, the formatted values are used for the analysis. However, if you do NOT include it on the CLASS statement, then the variable is treated as a continuous variable and the unformatted values are used.

Subset data by using formatted values

If you run PROC PRINT on the income data, it LOOKS like the Income variable is a character variable. Furthermore, it is analyzed like a character variable when used in some SAS procedures such as PROC FREQ. Consequently, you might forget that the Income variable is actually numeric. However, if you treat Income as a character variable in the DATA set or a WHERE clause, then SAS will report an error. For example, the following WHERE clause is incorrect:

proc print data=incomes; 
where Income in ("5 Percenter", "1 Percenter"); /* WRONG: Income is numeric */
run;
ERROR: WHERE clause operator requires compatible variables.

SAS reports an error because the WHERE clause cannot compare the raw (numeric) values of the Income variable with elements of a set that contains two strings. When you see an error message like this, use PROC CONTENTS to investigate the attributes of the variable:

ods select Position;
proc contents data=incomes order=varnum; run;
The format associated with a SAS variable

The output from PROC CONTENTS informs you that the Income variable is numeric and displays the name of the format that is attached to it.

If you know the cutoff values that are used for the format, you could create a valid WHERE clause that uses numeric values: where Income GE 150000. However, usually it makes more sense to create a valid WHERE clause by using the PUT statement to apply the format to the raw data and compare formatted values:

/* use formatted values to subset data */
proc print data=incomes; 
where put(Income, IncomeFmt.) in ("5 Percenter", "1 Percenter");
run;
Use a format to subset data in SAS

You can use other DATA step functions when constructing WHERE clauses. A typical example is when a variable is a SAS date. For example, the Sashelp.Air data set contains a variable named Date. You can use the following WHERE clause to analyze the subset of data that corresponds to the month of January in years prior to 1955:

proc print data=Sashelp.Air;
where month(date)=1 and year(date)<1955;  /* all January dates prior to 1955 */
run;

Summary

As shown in this article, SAS formats are very useful and flexible:

  • You can use a custom format to bin a continuous variable into categories.
  • Within a SAS procedure, you can temporarily assign or unassign a format to change the way that the data are analyzed.
  • The WHERE clause looks at raw data values, so use the PUT function in a WHERE clause if you want to subset the data according to the formatted values.
Post a Comment

Video: Writing packages: A new way to distribute and use SAS/IML programs

My presentation at SAS Global Forum 2016 was titled "Writing Packages: A New Way to Distribute and Use SAS/IML Programs." The paper was published in the conference proceedings several months ago, but I recently recorded a short video that gives an overview of using and creating packages in SAS/IML 14.1:



If your browser does not support embedded video, you can go directly to the video on YouTube.

I'm excited about the opportunities that packages provide. Every year hundreds of scholarly papers are published that use SAS/IML, and I hope that future authors will adopt packages as a standard way to share their work with others.

If you prefer to read about packages, I've written two short introductory blog posts:

For complete details about how to create a package, see the "Packages" chapter of the SAS/IML User's Guide.

Post a Comment

Compute highest density regions in SAS

Bivariate kernel density estimate

In a scatter plot, the regions where observations are packed tightly are areas of high density. A contour plot or heat map of a bivariate kernel density estimate (KDE) is one way to visualize regions of high density.

A SAS customer asked whether it is possible to use SAS to approximate regions that enclose the highest density for a KDE. He also wanted the areas of these regions. I don't know the customer's application, but I can imagine this computation might be useful for ecology. For example, if you count the locations of some endangered species of plant, you might want to estimate a region that contains the highest density and compute the area of that region.

There have been several papers about computing highest density regions (HDRs) for data. Rob Hyndman (1996, TAS), argues that HDRs "are often the most appropriate subset to use to summarize a probability distribution." Hyndman creates "HDR boxplots," which use contours of a KDE to summarize a distribution. A 2011 thesis by A. Fadallah presents the one-dimensional algorithm (p. 6) for finding intervals that contain the highest density.

Hyndman provides a formal definition of an HDR (p. 120) in terms of random variables and probability. For this article, an intuitive definition is sufficient: Given a probability 0 < p < 1 and a density function f, the p100% HDR is the interior of the level set of f that contains probability p.

Recall the intimate connection between densities and probability. For any region, the probability is the integral of the density function over that region. Thus to compute HDRs we need to consider integrals inside a region defined by a level set of the density.

Density contours and regions of high density. #Statistics #SASTip Click To Tweet

Bivariate kernel density estimates

Recall that the KDE procedure can estimate the density of bivariate data. For example, the following statements estimate the bivariate density of the SepalLength and SepalWidth variables in the Sashelp.Iris data set, which contain the dimensions of sepals for 150 iris flowers:

/* define rectangular area on which to estimate density */
%let xmin = 35;   %let xmax = 85;    /* horizontal range */
%let ymin = 15;   %let ymax = 45;    /* vertical range */
 
title "Bivariate Kernel Density Estimate";
proc kde data=sashelp.iris;
bivar SepalLength(gridL=&xmin gridU=&xmax)    /* optional: bwm=0.8 ngrid=100 */
      SepalWidth(gridL=&ymin gridU=&ymax) /   /* optional: bwm=0.8 ngrid=100 */
      out=KDEOut plots=ContourScatter;
ods output Controls=Controls;
run;

The KDE is shown at the top of this article. The regions of high density are shown in red. The highest contours of the density function are the boundaries of the roughly elliptical red regions. Contours of lower density values are peanut-shaped (the whitish color), and the lowest density contours are amoeba-shaped (light blue in color).

These colored regions are level sets of a density function, and they enclose the highest density regions. The rest of this article shows how to compute quantities that are associated with the region enclosed by a density contour.

An algorithm for approximating highest density regions

The KDE procedure computes the density estimate on a regular grid of points. (By default, the grid is 60 x 60.) The area of each grid cell is some value, A. If h is the height of the density function at, say, the bottom right corner of the cell, then you can estimate the probability in the cell as the volume of the rectangular prism with volume V = A h. For the union of many cells, you can approximate the probability (the volume under the density surface) by the Riemann sum of rectangular prisms. Consequently, given a probability p, the following algorithm provides a rough "pixelated" approximation to the density contour that contains probability p:

  1. Sort the grid points in decreasing order by the value of the density function. Let h1h2 ≥ ... ≥ hm be the sorted density values, where m is the number of grid points.
  2. Start with h1, the highest value of the density. Let V1 = A h1 be the approximate volume under the density surface for this cell.
  3. If V1 < p, compute the volume over the cell with the next highest density value. Define V2 = V1 + A h2.
  4. Repeat this accumulation process until you find the first k that Vkp. The union of the first k cells is an approximation to the p100% HDR.

It is worth noting that you can also use this algorithm to visualize the region inside level sets: simply approximate the region for which the density exceeds some specified value. For this second scenario, there is no need to keep track of the probability.

Compute HDRs in SAS

Let's implement this algorithm in SAS. The previous call to the KDE procedure evaluates the density estimate at a 60 x 60 grid of points. By default, the horizontal points are evenly spaced in the interval [min(x), max(x)], but you can (and should) use the GRIDL= and GRIDU= options to expand the lower and upper bounds, as shown previously. Similarly, you can expand the grid range in the vertical direction. You should choose the upper and lower bounds so that the estimated density is essentially zero on the border of the specified region.

The number of points in each direction and the range of each variable are shown in the "Controls" table:

t_kdeHDR

Notice that the call to PROC KDE used the OUT= option to write the 3600 density values to a SAS data set called KDEOut. The ODS OUTPUT statement was used to write the "Control" table to a SAS data set.

You can use SAS or a hand computation to calculate the area of each cell in the regular grid. I will skip the computation and merely store the result in a macro variable:

%let CellArea = 0.43091;      /* area of each cell in the grid */

With that definition, the following SAS statement accumulate the probability and area for the estimated density. To help with the visualization, a fake observation that has density=0 is added to the output of PROC KDE. This ensures that gradient color ramps always use zero as the base of the color scale:

proc sort data=KDEOut;
   by descending density;
run;
 
/* ensure that 0 is the lower limit of the density scale */
data FakeObs; value1=.;  value2=.; density=0; run;
 
/* accumulate probability and area */
data KDEOut;
   set FakeObs KDEOut;         /* add fake observation */
   Area + &CellArea;           /* accumulate area */
   prob + &CellArea * Density; /* accumulate probability */
run;

You can use a WHERE clause to visualize the region that contains the highest density and for which the region contains a specified probability mass. For example, the following call to SGPLOT uses a scatter plot to approximate the 50% HDR. The scatter plot colors markers by the value of the density variable.

title "Approximate 50% HDR";
proc sgplot data=KDEOut aspect=1;
where prob <= 0.5;
   label value1="Sepal Length (mm)"  value2="Sepal Width (mm)";
   styleattrs wallcolor=CXF0F0F0;             /* light gray background */
   scatter x=Value1 y=Value2 / markerattrs=(symbol=SquareFilled size=7)
           colorresponse=Density colormodel=ThreeColorRamp;
   xaxis min=&xmin max=&xmax grid;
   yaxis min=&ymin max=&ymax grid;
run;
Approximate 50% HDR based on kernel density estimate

The 50% HDR is a peanut-shaped region that contains two regions of highest density. In this graph I used the ThreeColorRamp as the gradient color ramp because that is the ramp used by PROC KDE. However, the whitish region does not show up very well, so you might prefer the ThreeColorAltRamp, which uses black instead of white.

In a similar way you can compute an 80% or 95% HDR. You can download the SAS program that computes the results in this article. The program contains statements that compute a 95% HDR.

Animating the highest density regions

You can compute an HDR for any specified probability. The following image shows an animation that was created by using PROC SGPLOT. The animation visualizes the HDRS, starting at a low probability level and ending at a high probability level.

Animation of highest density  regions

Compute the area of an HDR

The previous DATA step also computed the area of the HDR. The following PROC MEANS statement computes the area of the 50% HRD for the sepal widths and lengths. The region covers about 169 square units. If you want to compute the area for a 90% or 95% HDR, just change the value in the WHERE clause. This computation is valid regardless of the number of components in the region.

proc means data=KDEOut max;
where prob <= 0.5;         /* 0.5 is for the 50% HDR */
   var Area;
run;

Closing comments

This article started with data, generated a KDE, and then found HDRs as the interior of level sets of the KDE. However, the level set computations never look at the data, so the computations are valid for any density function. The approximation is rough because it only uses the density values on a uniform grid of points.

For fun, the attached program also uses this method to compute HDRs for bivariate normal data. I wanted to see how the HDRs compare with the elliptical prediction regions. For the examples that I looked at the HDRS are slightly bigger, with a relative error in the 10-15% range.

I do not have any practical experience using HDRs, but I wanted to share this computation in case someone finds it useful. I had never heard of HDRs before last week, so I'd be interested in hearing about how they might be used. I welcome any comments about this approximation.

Post a Comment

Female world leaders by year of election

This week Hillary Clinton became the first woman to be nominated for president of the US by a major political party. Although this is a first for the US, many other countries have already passed this milestone. In fact, 60 countries have already elected women as presidents and prime ministers. For example, the UK was previously led by Margaret Thatcher (1979–1990) and is currently led by Theresa May.

The CNN web site published a list of the 60 countries that have been led by a woman president or prime minister. The list is long and requires that you scroll up and down to compare different countries. To help visualize the data, I decided to turn that list into a graphic. You can download the SAS program that contains the data and creates a graph of female world leaders.

I wanted the graph to show the first year that each country elected a female leader. Consequently, countries like Norway and the UK only appear once on the list even though they have been led by women multiple times.

Female world leaders, 1960-2016

I thought about several different graph types, but finally settled on the scatter plot to the left (click to enlarge). The graph shows the cumulative number of countries that have elected a female president or prime minister.

The vertical axis shows the rank of each country on the list, which makes it easy to find which country was first (Sri Lanka in 1960), second (India), third (Israel), and so forth. The tenth country was Norway. The fiftieth country was Denmark.

You can also easily find out which countries elected their first female leader in a particular year, since that information is plotted on the horizontal axis. However, the graph is not so good at finding a particular country unless you know the approximate election year. Can you find Peru?

If you create the HTML version of this graph in SAS, the graph contains tool tips so that you can hover the mouse over a country's name and find out the name of the woman who was elected. The graph on this page shows that Corazon Aquino was the first female president of the Philippines.

How many countries have elected a female leader? Who was first? #DataViz Click To Tweet

The graph shows that the 1990s was a breakout decade in which many countries elected their first female leader. This trend is better shown by using a bar graph that augments the previous information. The bar graph shows that about 15 countries were added to the list during each of the last three decades.

Election of female leaders by decade

Although countries did not elect female leaders until the 1960s, female rulers are not new to the world. Monarchs such as Nefertiti, Queen Elizabeth I, and Catherine the Great are some famous world leaders from previous centuries. Some women were so influential that they have epochs named after them, such as the Elizabethan era and the Victorian era.

There are still three years left in the 2010s, so the current decade is likely to be the decade in which the most countries elected a woman for the first time. Which country will be number 61 on the list? Time will tell.

Post a Comment

How to visualize a kernel density estimate

A kernel density estimate (KDE) is a nonparametric estimate for the density of a data sample. A KDE can help an analyst determine how to model the data: Does the KDE look like a normal curve? Like a mixture of normals? Is there evidence of outliers in the data?

In SAS software, there are two procedures that generate kernel density estimates. PROC UNIVARIATE can create a KDE for univariate data; PROC KDE can create KDEs for univariate and bivariate data and supports several options to choose the kernel bandwidth.

Kernel density estimate (KDE) and component densities

The KDE is a finite mixture distribution. It is a weighted sum of small density "bumps" that are centered at each data point. The shape of the bumps are determined by the choice of a kernel function. The width of the bumps are determined by the bandwidth.

In textbooks and lecture notes about kernel density estimation, you often see a graph similar to the one at the left. The graph shows the kernel density estimate (in blue) for a sample of 10 data values. The data values are shown in the fringe plot along the X axis. The orange curves are the component densities. Each orange curve is a scaled version of a normal density curve centered at a datum.

This article shows how to create this graph in SAS. You can use the same principles to draw the component densities for other finite mixture models.

The kernel bandwidth

The first step is to decide on a bandwidth for the component densities. The following statements define the data and use PROC UNIVARIATE to compute the bandwidth by using the Sheather-Jones plug-in method:

data sample;
input x @@;
datalines;
46 60 24 15 17 14 21 59 22 16 
;
 
proc univariate data=sample;
   histogram x / kernel(c=SJPI);
run;
Histogram and kernel density estimate (KDE)

The procedure create a histogram with a KDE overlay. The legend of the graph gives a standardized kernel bandwidth of c=0.27, but that is not the bandwidth you want. As explained in the documentation, the kernel bandwidth is derived from the normalized bandwidth by the formula λ = c IQR n-1/5, where IQR is the interquartile range and n is the number of nonmissing observations. For this data, IQR = 30 and n=10, so λ ≈ 5. To save you from having to compute these values, the SAS log displays the following NOTE:

NOTE: The normal kernel estimate for c=0.2661 has a bandwidth of 5.0377

The UNIVARIATE procedure can use several different kernel shapes. By default, the procedure uses a normal kernel. The rest of this article assumes a normal kernel function, although generalizing to other kernel shapes is straightforward.

Compute the component densities

Because the kernel function is centered at each datum, one way to visualize the component densities is to evaluate the kernel function on a symmetric interval about x=0 and then translate that component for every data point. In the following SAS/IML program, the vector w contains evenly spaced points in the interval [-3λ, 3λ], where λ is the bandwidth of the kernel function. (This interval contains 99.7% of the area under the normal curve with standard deviation λ.) The vector k is the normal density function evaluated on that interval, scaled by 1/n where n is the number of nonmissing observations. The quantity 1/n is the mixing probability for each component; the KDE is obtained by summing these components.

The program creates an output data set called component that contains three numerical variables. The ID variable is an ID variable that identifies which observation is being used. The z variable is a translated copy of the w variable. The k variable does not change because every component has the same shape and bandwidth.

proc iml;
use sample;  read all var {x};  close;
n = countn(x);
mixProb = 1/n;
 
lambda = 5;                                 /* bandwidth from PROC UNIVARIATE */
w = do(-3*lambda, 3*lambda, 6*lambda/100);  /* evenly spaced; 3 std dev */
k = mixProb * pdf("Normal", w, 0, lambda);  /* kernel = normal pdf centered at 0 */
 
ID= .; z = .;
create component var {ID z k};             /* create the variables */
do i = 1 to n;
   ID = j(ncol(w), 1, i);                  /* ID var */
   z = x[i] + w;                           /* center kernel at x[i] */
   append;
end;
close component;

Compute the kernel density estimate

The next step is to sum the component densities to create the KDE. The easy way to get the KDE in a data set is to use the OUTKERNEL= option on the HISTOGRAM statement in PROC UNIVARIATE. Alternatively, you can create the full KDE in SAS/IML, as shown below.

The range of the data is [14, 60]. You can extend that range by the half-width of the kernel components, which is 15. Consequently the following statements use the interval [0, 75] as a convenient interval on which to sum the density components. The actual summation is easy in SAS/IML because you can pass a vector of positions to the PDF function i Base SAS.

The sum of the kernel components is written to a data set called KDE.

/* finite mixture density is weighted sum of kernels at x[i] */
a = 0; b = 75;    /* endpoints of interval [a,b] */
t = do(a, b, (b-a)/200);
kde = 0*t;
do i = 1 to n;
   kde = kde + pdf("Normal", t, x[i], lambda) * mixProb;
end;
create KDE var {t kde}; append; close;
quit;

Visualize the KDE

You can merge the original data, the individual components, and the KDE curve into a single SAS data set called All. Use the SGPLOT procedures to overlay the three elements. The individual components are plotted by using the SERIES plot with a GROUP= option. A second SERIES plot graphs the KDE curve. A FRINGE statement plots the positions of each datum as a hash mark. The plot is shown at the top of this article.

data All;
merge sample component KDE;
run;
 
title "Kernel Density Estimate as Weighted Sum of Component Densities";
proc sgplot data=All noautolegend;
   series x=z y=k / group=ID lineattrs=GraphFit2(thickness=1); /* components */
   series x=t y=kde /lineattrs=GraphFit;                       /* KDE curve  */
   fringe x;                                      /* individual observations */
   refline 0 / axis=y;
   xaxis label="x";
   yaxis label="Density" grid;
run;

You can use the same technique to visualize other finite mixture models. However, the FMM procedure creates these plots automatically, so you might never need to create such a plot if you use PROC FMM. The main difference for a general finite mixture model is that the component distributions can be from different families and usually have different parameters. Therefore you will need to maintain a vector of families and parameters. Also, the mixing probabilities usually vary between components.

In summary, the techniques in this article are useful to teachers and presenters who want to visually demonstrate how choosing a kernel shape and bandwidth gives rise to the kernel density estimate.

Post a Comment