Compute a bootstrap confidence interval in SAS

A common question is "how do I compute a bootstrap confidence interval in SAS?" As a reminder, the bootstrap method consists of the following steps:

  • Compute the statistic of interest for the original data
  • Resample B times from the data to form B bootstrap samples. How you resample depends on the null hypothesis that you are testing.
  • Compute the statistic on each bootstrap sample. This creates the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
  • Use the approximate sampling distribution to obtain bootstrap estimates such as standard errors, confidence intervals, and evidence for or against the null hypothesis.

The papers by Cassell ("Don't be Loopy", 2007; "BootstrapMania!", 2010) describe ways to bootstrap efficiently in SAS. The basic idea is to use the DATA step, PROC SURVEYSELECT, or the SAS/IML SAMPLE function to generate the bootstrap samples in a single data set. Then use a BY statement to carry out the analysis on each sample. Using the BY-group approach is much faster than using macro loops.

To illustrate bootstrapping in Base SAS, this article shows how to compute a simple bootstrap confidence interval for the skewness statistic by using the bootstrap percentile method. The example is adapted from Chapter 15 of Simulating Data with SAS, which discusses resampling and bootstrap methods in SAS. SAS also supplies the %BOOT and %BOOTCI macros, which implement bootstrap methods and several kinds of confidence intervals.


The accuracy of a statistical point estimate

The following statements define a data set called Sample. The data are measurements of the sepal width for 50 randomly chosen iris flowers of the species Iris virginica. The call to PROC MEANS computes the skewness of the sample:

data sample(keep=x);
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* 1. compute value of the statistic on original data: Skewness = 0.366 */
proc means data=sample nolabels Skew;  var x;  run;

The sample skewness for these data is 0.366. This estimates the skewness of sepal widths in the population of all I. virginica. You can ask two questions: (1) How accurate is the estimate, and (2) Do the data indicate that the distribution of the population is skewed? An answer to (1) is provided by the standard error of the skewness statistic. One way to answer question (2) is to compute a confidence interval for the skewness and see whether it contains 0.

PROC MEANS does not provide a standard error or a confidence interval for the skewness, but the next section shows how to use bootstrap methods to estimate these quantities.


Generate bootstrap samples

For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The following statements generate 5000 bootstrap samples by repeatedly drawing 50 random observations (with replacement) from the original data:

%let NumSamples = 5000;       /* number of bootstrap resamples */
/* 2. Generate many bootstrap samples */
proc surveyselect data=sample NOPRINT seed=1
     out=BootSSFreq(rename=(Replicate=SampleID)) /* output data set */
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 use if you do not want a frequency variable */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

The output data set represents 5000 samples of size 50, but the output data set contains fewer than 250,000 observations. That is because the SURVEYSELECT procedure generates a variable named NumberHits that records the frequency of each observation in each sample. You can use this variable on the FREQ statement of many SAS procedures, including PROC MEANS. If the SAS procedure that you are using does not support a frequency variable, you can use the OUTHITS option on the PROC SURVEYSELECT statement to obtain a data set that contains 250,000 observations.
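
For reference, a sketch of the expanded form follows. The only changes are the OUTHITS option and a new output data set name (BootLong, which is arbitrary):

proc surveyselect data=sample NOPRINT seed=1 OUTHITS
     out=BootLong            /* one row per "hit": 250,000 observations */
     method=urs samprate=1 reps=&NumSamples;
run;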

BY group analysis of bootstrap samples

The following call to PROC MEANS computes 5000 skewness statistics, one for each of the bootstrap samples. The NOPRINT option is used to suppress the results from appearing on your monitor. (You can read about why it is important to suppress ODS during a bootstrap computation.) The 5000 skewness statistics are written to a data set called OutStats for subsequent analysis:

/* 3. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq noprint;
   by SampleID;
   freq NumberHits;
   var x;
   output out=OutStats skew=Skewness;  /* approx sampling distribution */
run;

Visualize the bootstrap distribution

The bootstrap distribution tells you how the statistic (in this case, the skewness) might vary due to random sampling variation. You can use a histogram to visualize the bootstrap distribution of the skewness statistic:

title "Bootstrap Distribution";
%let Est = 0.366;
proc sgplot data=OutStats;
   label Skewness= ;
   histogram Skewness;
   /* Optional: draw reference line at observed value and draw 95% CI */
   refline &Est / axis=x lineattrs=(color=red) 
                  name="Est" legendlabel="Observed Statistic = &Est";
   refline -0.44737 0.96934  / axis=x lineattrs=(color=blue) 
                  name="CI" legendlabel="95% CI";
   keylegend "Est" "CI";
run;
[Figure: Bootstrap distribution with 95% bootstrap confidence interval]

In this graph, the REFLINE statement is used to display (in red) the observed value of the statistic for the original data. A second REFLINE statement plots (in blue) an approximate 95% confidence interval for the skewness parameter, which is computed in the next section. The bootstrap confidence interval contains 0, thus you cannot conclude that the skewness parameter is significantly different from 0.

Compute a bootstrap confidence interval in SAS

The standard deviation of the bootstrap distribution is an estimate for the standard error of the statistic. If the sampling distribution is approximately normal, you can use this fact to construct the usual Wald confidence interval about the observed value of the statistic. That is, if T is the observed statistic, then the endpoints of the 95% two-sided confidence interval are T ± 1.96 SE. (Or use the so-called bootstrap t interval by replacing 1.96 with the critical value t_(α/2, n-1).) The following call to PROC MEANS produces the standard error (not shown):

proc means data=OutStats nolabels N StdDev;
   var Skewness;
run;
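
If you capture the standard error in a data set, the Wald interval is a one-line computation. The following sketch is an illustration, not part of the original program; it uses the observed skewness (&Est = 0.366) that was defined earlier:

proc means data=OutStats noprint;
   var Skewness;
   output out=SE std=StdErr;     /* bootstrap std dev estimates the SE */
run;
data Wald;
   set SE;
   Lower = &Est - 1.96*StdErr;   /* 95% Wald lower limit */
   Upper = &Est + 1.96*StdErr;   /* 95% Wald upper limit */
run;
proc print data=Wald noobs;  var Lower Upper;  run;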

However, since the bootstrap distribution is an approximate sampling distribution, you don't need to rely on a normality assumption. Instead, you can use percentiles of the bootstrap distribution to estimate a confidence interval. For example, the following call to PROC UNIVARIATE computes a two-sided 95% confidence interval by using the lower 2.5th percentile and the upper 97.5th percentile of the bootstrap distribution:

/* 4. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
   var Skewness;
   output out=Pctl pctlpre =CI95_
          pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
          pctlname=Lower Upper;
run;
proc print data=Pctl noobs; run;
[Figure: 95% bootstrap confidence interval]

As mentioned previously, the 95% bootstrap confidence interval contains 0. Although the observed skewness value (0.366) might not seem very close to 0, the bootstrap distribution shows that there is substantial variation in the skewness statistic in small samples.

The bootstrap percentile method is a simple way to obtain a confidence interval for many statistics. There are several more sophisticated methods for computing a bootstrap confidence interval, but this simple method provides an easy way to use the bootstrap to assess the accuracy of a point estimate. For an overview of bootstrap methods, see Davison and Hinkley (1997) Bootstrap Methods and their Application.


Use SAS formats to bin numerical variables

SAS formats are flexible, dynamic, and have many uses. For example, you can use formats to count missing values and to change the order of a categorical variable in a table or plot. Did you know that you can also use SAS formats to recode a variable or to bin a numerical variable into categories? This can be very convenient because you do not need to create a new variable in the data set; you merely apply a format to an existing variable.

Income categories: Are you middle class?

Many people use several IF-THEN/ELSE statements in the DATA step (or in a DATA step view) to create a new discrete variable that represents binned values of a continuous variable. That is a fine technique, but an alternative technique is to create a user-defined format that bins a continuous variable. One advantage of a custom format is that you can apply the format to multiple variables in multiple data sets.
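
For comparison, the DATA step approach looks something like the following sketch. The input data set name is hypothetical, and the cutoffs shown are the first few from the example that follows:

/* a sketch of the IF-THEN/ELSE approach; 'Raw' is a hypothetical data set */
data binned;
   set Raw;
   length IncomeCat $12;
   if      Income < 23000 then IncomeCat = "Poverty";
   else if Income < 32500 then IncomeCat = "Working";
   else if Income < 60000 then IncomeCat = "Lower Middle";
   else                        IncomeCat = "Middle+";   /* and so on */
run;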

For example, suppose that you want to define income categories such as "working class," "middle class," and the ultra-rich "1 percenters." According to a 2012 Money magazine article, the following cut points divide US household income into seven categories:

/* 2012 income categories for US according to Money magazine */
proc format;
value IncomeFmt  
      low   -<  23000 = "Poverty"        /* < 23 thousand         */
      23000 -<  32500 = "Working"        /* [ 23,  32.5) thousand */
      32500 -<  60000 = "Lower Middle"   /* [ 32.5, 60) thousand  */
      60000 -< 100000 = "Middle"         /* [ 60, 100) thousand   */
     100000 -< 150000 = "Upper Middle"   /* [100, 150) thousand   */
     150000 -< 250000 = "5 Percenter"    /* [150, 250) thousand   */
     250000 -   high  = "1 Percenter";   /* >= 250 thousand       */
run;

The call to PROC FORMAT creates a custom format called IncomeFmt. When you assign the IncomeFmt format to a numerical variable, SAS will look at the value of each observation and determine the formatted value from the raw value. For example, a value of 18,000 is less than 23,000, so that value is formatted as "Poverty." A value of 85,000 is in the half-open interval [60000, 100000), so that value is formatted as "Middle."
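
You can check the lookup with a short DATA step. This sketch uses arbitrary test values, not data from the article:

data _null_;
   do raw = 18000, 85000, 300000;
      category = put(raw, IncomeFmt.);   /* look up the formatted value */
      put raw= category=;                /* e.g., raw=18000 category=Poverty */
   end;
run;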

The following DATA step defines the incomes for 26 fictitious individuals. The IncomeFmt format is assigned to the Income variable:

data incomes;
length Name $10;
input Name Income @@;
format Income IncomeFmt.;     /* assign IncomeFmt format to Income variable */
datalines;
Amy        65100 Brad      146500 
Carlos    113300 Dimtri     28800 
Eduardo   233300 Felicity   14600 
Guo        43400 Hector    141700 
Irene      53400 Jacob     170300 
Katerina   43100 Liu        66800 
Michael    15800 Nancy      30900 
Oscar      31800 Pablo      65800 
Quentin    40000 Rick       62200 
Stephan    32900 Tracy      64000 
Umberto   124000 Victoria  220700 
Wanda     263800 Xie         9300 
Yolanda    23400 Zachary    93800 
;
run;

The Income variable is a continuous variable, but the format bins each value into one of seven discrete values. Consequently, SAS procedures can analyze the Income variable as if it were a discrete variable. For example, you can count the number of individuals in each income category by calling PROC FREQ:

proc freq data=incomes; 
   tables Income / nocum norow nocol;
run;
[Figure: Use a SAS format to bin a numeric variable]

Assigning or unassigning formats at run time

The underlying data values are not lost. You can use a FORMAT statement in a SAS procedure to temporarily assign or unassign a format. If you remove the format, you can analyze the underlying raw data. For example, the following call to PROC UNIVARIATE analyzes the raw incomes:

proc univariate data=incomes;
   format Income;     /* remove the format for this analysis */
   var Income;
run;

In a similar way, if you specify the Income variable on a CLASS statement in a regression procedure, the formatted values are used for the analysis. However, if you do NOT include it on the CLASS statement, then the variable is treated as a continuous variable and the unformatted values are used.
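
For example, the following sketch copies Income so that the formatted copy defines the groups while the raw copy is analyzed. (The copy does not inherit the format.)

data incomes2;
   set incomes;
   RawIncome = Income;    /* the copy does not inherit the format */
run;
proc means data=incomes2 n mean maxdec=0;
   class Income;          /* CLASS uses the seven formatted categories */
   var RawIncome;         /* VAR analyzes the raw dollar values */
run;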

Subset data by using formatted values

If you run PROC PRINT on the income data, it LOOKS like the Income variable is a character variable. Furthermore, it is analyzed like a character variable when used in some SAS procedures such as PROC FREQ. Consequently, you might forget that the Income variable is actually numeric. However, if you treat Income as a character variable in the DATA step or in a WHERE clause, then SAS will report an error. For example, the following WHERE clause is incorrect:

proc print data=incomes; 
where Income in ("5 Percenter", "1 Percenter"); /* WRONG: Income is numeric */
run;
 
ERROR: WHERE clause operator requires compatible variables.

SAS reports an error because the WHERE clause cannot compare the raw (numeric) values of the Income variable with elements of a set that contains two strings. When you see an error message like this, use PROC CONTENTS to investigate the attributes of the variable:

ods select Position;
proc contents data=incomes order=varnum; run;
[Figure: The format associated with a SAS variable]

The output from PROC CONTENTS informs you that the Income variable is numeric and displays the name of the format that is attached to it.

If you know the cutoff values that are used for the format, you could create a valid WHERE clause that uses numeric values: where Income GE 150000. However, usually it makes more sense to create a valid WHERE clause by using the PUT function to apply the format to the raw data and compare formatted values:

/* use formatted values to subset data */
proc print data=incomes; 
where put(Income, IncomeFmt.) in ("5 Percenter", "1 Percenter");
run;
[Figure: Use a format to subset data in SAS]

You can use other DATA step functions when constructing WHERE clauses. A typical example is when a variable is a SAS date. For example, the Sashelp.Air data set contains a variable named Date. You can use the following WHERE clause to analyze the subset of data that corresponds to the month of January in years prior to 1955:

proc print data=Sashelp.Air;
where month(date)=1 and year(date)<1955;  /* all January dates prior to 1955 */
run;


As shown in this article, SAS formats are very useful and flexible:

  • You can use a custom format to bin a continuous variable into categories.
  • Within a SAS procedure, you can temporarily assign or unassign a format to change the way that the data are analyzed.
  • The WHERE clause looks at raw data values, so use the PUT function in a WHERE clause if you want to subset the data according to the formatted values.

Video: Writing packages: A new way to distribute and use SAS/IML programs

My presentation at SAS Global Forum 2016 was titled "Writing Packages: A New Way to Distribute and Use SAS/IML Programs." The paper was published in the conference proceedings several months ago, but I recently recorded a short video that gives an overview of using and creating packages in SAS/IML 14.1:

If your browser does not support embedded video, you can go directly to the video on YouTube.

I'm excited about the opportunities that packages provide. Every year hundreds of scholarly papers are published that use SAS/IML, and I hope that future authors will adopt packages as a standard way to share their work with others.

If you prefer to read about packages, I've written two short introductory blog posts.

For complete details about how to create a package, see the "Packages" chapter of the SAS/IML User's Guide.


Compute highest density regions in SAS

[Figure: Bivariate kernel density estimate]

In a scatter plot, the regions where observations are packed tightly are areas of high density. A contour plot or heat map of a bivariate kernel density estimate (KDE) is one way to visualize regions of high density.

A SAS customer asked whether it is possible to use SAS to approximate regions that enclose the highest density for a KDE. He also wanted the areas of these regions. I don't know the customer's application, but I can imagine this computation might be useful for ecology. For example, if you count the locations of some endangered species of plant, you might want to estimate a region that contains the highest density and compute the area of that region.

There have been several papers about computing highest density regions (HDRs) for data. Rob Hyndman (1996, TAS) argues that HDRs "are often the most appropriate subset to use to summarize a probability distribution." Hyndman creates "HDR boxplots," which use contours of a KDE to summarize a distribution. A 2011 thesis by A. Fadallah presents the one-dimensional algorithm (p. 6) for finding intervals that contain the highest density.

Hyndman provides a formal definition of an HDR (p. 120) in terms of random variables and probability. For this article, an intuitive definition is sufficient: Given a probability 0 < p < 1 and a density function f, the 100p% HDR is the interior of the level set of f that contains probability p.

Recall the intimate connection between densities and probability. For any region, the probability is the integral of the density function over that region. Thus to compute HDRs we need to consider integrals inside a region defined by a level set of the density.


Bivariate kernel density estimates

Recall that the KDE procedure can estimate the density of bivariate data. For example, the following statements estimate the bivariate density of the SepalLength and SepalWidth variables in the Sashelp.Iris data set, which contain the dimensions of sepals for 150 iris flowers:

/* define rectangular area on which to estimate density */
%let xmin = 35;   %let xmax = 85;    /* horizontal range */
%let ymin = 15;   %let ymax = 45;    /* vertical range */
title "Bivariate Kernel Density Estimate";
proc kde data=sashelp.iris;
bivar SepalLength(gridL=&xmin gridU=&xmax)    /* optional: bwm=0.8 ngrid=100 */
      SepalWidth(gridL=&ymin gridU=&ymax) /   /* optional: bwm=0.8 ngrid=100 */
      out=KDEOut plots=ContourScatter;
ods output Controls=Controls;
run;

The KDE is shown at the top of this article. The regions of high density are shown in red. The highest contours of the density function are the boundaries of the roughly elliptical red regions. Contours of lower density values are peanut-shaped (the whitish color), and the lowest density contours are amoeba-shaped (light blue in color).

These colored regions are level sets of a density function, and they enclose the highest density regions. The rest of this article shows how to compute quantities that are associated with the region enclosed by a density contour.

An algorithm for approximating highest density regions

The KDE procedure computes the density estimate on a regular grid of points. (By default, the grid is 60 x 60.) The area of each grid cell is some value, A. If h is the height of the density function at, say, the bottom right corner of the cell, then you can estimate the probability in the cell by the volume V = A h of the rectangular prism over that cell. For the union of many cells, you can approximate the probability (the volume under the density surface) by the Riemann sum of rectangular prisms. Consequently, given a probability p, the following algorithm provides a rough "pixelated" approximation to the density contour that contains probability p:

  1. Sort the grid points in decreasing order by the value of the density function. Let h1 ≥ h2 ≥ ... ≥ hm be the sorted density values, where m is the number of grid points.
  2. Start with h1, the highest value of the density. Let V1 = A h1 be the approximate volume under the density surface for this cell.
  3. If V1 < p, compute the volume over the cell with the next highest density value. Define V2 = V1 + A h2.
  4. Repeat this accumulation process until you find the first k such that Vk ≥ p. The union of the first k cells is an approximation to the 100p% HDR.

It is worth noting that you can also use this algorithm to visualize the region inside level sets: simply approximate the region for which the density exceeds some specified value. For this second scenario, there is no need to keep track of the probability.

Compute HDRs in SAS

Let's implement this algorithm in SAS. The previous call to the KDE procedure evaluates the density estimate at a 60 x 60 grid of points. By default, the horizontal points are evenly spaced in the interval [min(x), max(x)], but you can (and should) use the GRIDL= and GRIDU= options to expand the lower and upper bounds, as shown previously. Similarly, you can expand the grid range in the vertical direction. You should choose the upper and lower bounds so that the estimated density is essentially zero on the border of the specified region.

The number of points in each direction and the range of each variable are shown in the "Controls" table:


Notice that the call to PROC KDE used the OUT= option to write the 3600 density values to a SAS data set called KDEOut. The ODS OUTPUT statement was used to write the "Controls" table to a SAS data set.

You can use SAS or a hand computation to calculate the area of each cell in the regular grid. I will skip the computation and merely store the result in a macro variable:

%let CellArea = 0.43091;      /* area of each cell in the grid */
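
If you want to verify the value, a quick sketch of the computation follows. It assumes the default 60 x 60 grid over the ranges that were specified by the GRIDL= and GRIDU= options:

data _null_;
   dx = (&xmax - &xmin) / (60-1);   /* spacing of 60 grid points across [35,85] */
   dy = (&ymax - &ymin) / (60-1);   /* spacing of 60 grid points across [15,45] */
   CellArea = dx * dy;              /* = 0.43091 */
   put CellArea=;
run;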

With that definition, the following SAS statements accumulate the probability and area for the estimated density. To help with the visualization, a fake observation that has density=0 is added to the output of PROC KDE. This ensures that gradient color ramps always use zero as the base of the color scale:

proc sort data=KDEOut;
   by descending density;
run;
/* ensure that 0 is the lower limit of the density scale */
data FakeObs; value1=.;  value2=.; density=0; run;
/* accumulate probability and area */
data KDEOut;
   set FakeObs KDEOut;         /* add fake observation */
   Area + &CellArea;           /* accumulate area */
   prob + &CellArea * Density; /* accumulate probability */
run;

You can use a WHERE clause to visualize the highest density region that contains a specified probability mass. For example, the following call to PROC SGPLOT uses a scatter plot to approximate the 50% HDR. The scatter plot colors markers by the value of the density variable.

title "Approximate 50% HDR";
proc sgplot data=KDEOut aspect=1;
where prob <= 0.5;
   label value1="Sepal Length (mm)"  value2="Sepal Width (mm)";
   styleattrs wallcolor=CXF0F0F0;             /* light gray background */
   scatter x=Value1 y=Value2 / markerattrs=(symbol=SquareFilled size=7)
           colorresponse=Density colormodel=ThreeColorRamp;
   xaxis min=&xmin max=&xmax grid;
   yaxis min=&ymin max=&ymax grid;
run;
[Figure: Approximate 50% HDR based on kernel density estimate]

The 50% HDR is a peanut-shaped region that contains two regions of highest density. In this graph I used the ThreeColorRamp as the gradient color ramp because that is the ramp used by PROC KDE. However, the whitish region does not show up very well, so you might prefer the ThreeColorAltRamp, which uses black instead of white.

In a similar way you can compute an 80% or 95% HDR. You can download the SAS program that computes the results in this article. The program contains statements that compute a 95% HDR.

Animating the highest density regions

You can compute an HDR for any specified probability. The following image shows an animation that was created by using PROC SGPLOT. The animation visualizes the HDRs, starting at a low probability level and ending at a high probability level.

[Figure: Animation of highest density regions]

Compute the area of an HDR

The previous DATA step also computed the area of the HDR. The following PROC MEANS statement computes the area of the 50% HDR for the sepal widths and lengths. The region covers about 169 square units. If you want to compute the area for a 90% or 95% HDR, just change the value in the WHERE clause. This computation is valid regardless of the number of components in the region.

proc means data=KDEOut max;
where prob <= 0.5;         /* 0.5 is for the 50% HDR */
   var Area;
run;

Closing comments

This article started with data, generated a KDE, and then found HDRs as the interior of level sets of the KDE. However, the level set computations never look at the data, so the computations are valid for any density function. The approximation is rough because it only uses the density values on a uniform grid of points.

For fun, the attached program also uses this method to compute HDRs for bivariate normal data. I wanted to see how the HDRs compare with the elliptical prediction regions. For the examples that I looked at, the HDRs are slightly larger, with relative errors in the 10-15% range.

I do not have any practical experience using HDRs, but I wanted to share this computation in case someone finds it useful. I had never heard of HDRs before last week, so I'd be interested in hearing about how they might be used. I welcome any comments about this approximation.


Female world leaders by year of election

This week Hillary Clinton became the first woman to be nominated for president of the US by a major political party. Although this is a first for the US, many other countries have already passed this milestone. In fact, 60 countries have already elected women as presidents and prime ministers. For example, the UK was previously led by Margaret Thatcher (1979–1990) and is currently led by Theresa May.

The CNN web site published a list of the 60 countries that have been led by a woman president or prime minister. The list is long and requires that you scroll up and down to compare different countries. To help visualize the data, I decided to turn that list into a graphic. You can download the SAS program that contains the data and creates a graph of female world leaders.

I wanted the graph to show the first year that each country elected a female leader. Consequently, countries like Norway and the UK only appear once on the list even though they have been led by women multiple times.

[Figure: Female world leaders, 1960-2016]

I thought about several different graph types, but finally settled on the scatter plot to the left (click to enlarge). The graph shows the cumulative number of countries that have elected a female president or prime minister.

The vertical axis shows the rank of each country on the list, which makes it easy to find which country was first (Sri Lanka in 1960), second (India), third (Israel), and so forth. The tenth country was Norway. The fiftieth country was Denmark.

You can also easily find out which countries elected their first female leader in a particular year, since that information is plotted on the horizontal axis. However, the graph is not so good at finding a particular country unless you know the approximate election year. Can you find Peru?

If you create the HTML version of this graph in SAS, the graph contains tool tips so that you can hover the mouse over a country's name and find out the name of the woman who was elected. The graph on this page shows that Corazon Aquino was the first female president of the Philippines.


The graph shows that the 1990s was a breakout decade in which many countries elected their first female leader. This trend is better shown by using a bar graph that augments the previous information. The bar graph shows that about 15 countries were added to the list during each of the last three decades.

[Figure: Election of female leaders by decade]

Although countries did not elect female leaders until the 1960s, female rulers are not new to the world. Monarchs such as Nefertiti, Queen Elizabeth I, and Catherine the Great are some famous world leaders from previous centuries. Some women were so influential that they have epochs named after them, such as the Elizabethan era and the Victorian era.

There are still three years left in the 2010s, so the current decade is likely to be the decade in which the most countries elected a woman for the first time. Which country will be number 61 on the list? Time will tell.


How to visualize a kernel density estimate

A kernel density estimate (KDE) is a nonparametric estimate for the density of a data sample. A KDE can help an analyst determine how to model the data: Does the KDE look like a normal curve? Like a mixture of normals? Is there evidence of outliers in the data?

In SAS software, there are two procedures that generate kernel density estimates. PROC UNIVARIATE can create a KDE for univariate data; PROC KDE can create KDEs for univariate and bivariate data and supports several options to choose the kernel bandwidth.

[Figure: Kernel density estimate (KDE) and component densities]

The KDE is a finite mixture distribution. It is a weighted sum of small density "bumps" that are centered at each data point. The shape of the bumps is determined by the choice of a kernel function. The width of the bumps is determined by the bandwidth.

In textbooks and lecture notes about kernel density estimation, you often see a graph similar to the one at the left. The graph shows the kernel density estimate (in blue) for a sample of 10 data values. The data values are shown in the fringe plot along the X axis. The orange curves are the component densities. Each orange curve is a scaled version of a normal density curve centered at a datum.

This article shows how to create this graph in SAS. You can use the same principles to draw the component densities for other finite mixture models.

The kernel bandwidth

The first step is to decide on a bandwidth for the component densities. The following statements define the data and use PROC UNIVARIATE to compute the bandwidth by using the Sheather-Jones plug-in method:

data sample;
input x @@;
datalines;
46 60 24 15 17 14 21 59 22 16 
;
run;
 
proc univariate data=sample;
   histogram x / kernel(c=SJPI);
run;
[Figure: Histogram and kernel density estimate (KDE)]

The procedure creates a histogram with a KDE overlay. The legend of the graph gives a standardized kernel bandwidth of c=0.27, but that is not the bandwidth you want. As explained in the documentation, the kernel bandwidth is derived from the normalized bandwidth by the formula λ = c · IQR · n^(-1/5), where IQR is the interquartile range and n is the number of nonmissing observations. For these data, IQR = 30 and n = 10, so λ ≈ 5. To save you from having to compute these values, the SAS log displays the following NOTE:

NOTE: The normal kernel estimate for c=0.2661 has a bandwidth of 5.0377
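
You can verify the computation with a quick DATA step; the values are taken from the log note above:

data _null_;
   c = 0.2661;  IQR = 30;  n = 10;
   lambda = c * IQR * n**(-1/5);   /* = 5.037, which matches the log */
   put lambda=;
run;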

The UNIVARIATE procedure can use several different kernel shapes. By default, the procedure uses a normal kernel. The rest of this article assumes a normal kernel function, although generalizing to other kernel shapes is straightforward.

Compute the component densities

Because the kernel function is centered at each datum, one way to visualize the component densities is to evaluate the kernel function on a symmetric interval about x=0 and then translate that component for every data point. In the following SAS/IML program, the vector w contains evenly spaced points in the interval [-3λ, 3λ], where λ is the bandwidth of the kernel function. (This interval contains 99.7% of the area under the normal curve with standard deviation λ.) The vector k is the normal density function evaluated on that interval, scaled by 1/n where n is the number of nonmissing observations. The quantity 1/n is the mixing probability for each component; the KDE is obtained by summing these components.

The program creates an output data set called component that contains three numerical variables. The ID variable is an ID variable that identifies which observation is being used. The z variable is a translated copy of the w variable. The k variable does not change because every component has the same shape and bandwidth.

proc iml;
use sample;  read all var {x};  close;
n = countn(x);
mixProb = 1/n;
lambda = 5;                                 /* bandwidth from PROC UNIVARIATE */
w = do(-3*lambda, 3*lambda, 6*lambda/100);  /* evenly spaced; 3 std dev */
k = mixProb * pdf("Normal", w, 0, lambda);  /* kernel = normal pdf centered at 0 */
ID= .; z = .;
create component var {ID z k};             /* create the variables */
do i = 1 to n;
   ID = j(ncol(w), 1, i);                  /* ID var */
   z = x[i] + w;                           /* center kernel at x[i] */
   append;                                 /* write the component for x[i] */
end;
close component;

Compute the kernel density estimate

The next step is to sum the component densities to create the KDE. The easy way to get the KDE in a data set is to use the OUTKERNEL= option on the HISTOGRAM statement in PROC UNIVARIATE. Alternatively, you can create the full KDE in SAS/IML, as shown below.
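
For reference, a minimal sketch of the OUTKERNEL= approach follows; the output data set name is arbitrary:

proc univariate data=sample;
   histogram x / kernel(c=SJPI) outkernel=UniKernel;  /* write KDE curve to a data set */
run;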

The range of the data is [14, 60]. You can extend that range by the half-width of the kernel components, which is 15. Consequently the following statements use the interval [0, 75] as a convenient interval on which to sum the density components. The actual summation is easy in SAS/IML because you can pass a vector of positions to the PDF function in Base SAS.

The sum of the kernel components is written to a data set called KDE.

/* finite mixture density is weighted sum of kernels at x[i] */
a = 0; b = 75;    /* endpoints of interval [a,b] */
t = do(a, b, (b-a)/200);
kde = 0*t;
do i = 1 to n;
   kde = kde + pdf("Normal", t, x[i], lambda) * mixProb;
end;
create KDE var {t kde}; append; close;
quit;

Visualize the KDE

You can merge the original data, the individual components, and the KDE curve into a single SAS data set called All. Use the SGPLOT procedure to overlay the three elements. The individual components are plotted by using a SERIES plot with the GROUP= option. A second SERIES plot graphs the KDE curve. A FRINGE statement plots the positions of each datum as a hash mark. The plot is shown at the top of this article.

data All;
merge sample component KDE;
run;
 
title "Kernel Density Estimate as Weighted Sum of Component Densities";
proc sgplot data=All noautolegend;
   series x=z y=k / group=ID lineattrs=GraphFit2(thickness=1); /* components */
   series x=t y=kde /lineattrs=GraphFit;                       /* KDE curve  */
   fringe x;                                      /* individual observations */
   refline 0 / axis=y;
   xaxis label="x";
   yaxis label="Density" grid;
run;

You can use the same technique to visualize other finite mixture models. However, the FMM procedure creates these plots automatically, so you might never need to create such a plot if you use PROC FMM. The main difference for a general finite mixture model is that the component distributions can be from different families and usually have different parameters. Therefore you will need to maintain a vector of families and parameters. Also, the mixing probabilities usually vary between components.

In summary, the techniques in this article are useful to teachers and presenters who want to visually demonstrate how choosing a kernel shape and bandwidth gives rise to the kernel density estimate.


Statistical model building and the SELECT procedures in SAS

Last week I read an interesting paper by Bob Rodriguez: "Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT Software." In it, Rodriguez summarizes five modern techniques for building predictive models and highlights recent SAS/STAT procedures that implement those techniques. The paper discusses the following high-performance (HP) procedures in SAS/STAT software:

  • The GLMSELECT procedure builds parsimonious linear regression models. This procedure supports modern effect selection techniques such as LASSO, LAR, and the elastic net. (Technically, the GLMSELECT procedure pre-dates the HP procedures. However, it is multithreaded. The HPREG procedure provides similar functionality in a distributed environment.)
  • The HPGENSELECT procedure builds generalized linear models such as Poisson models, zero-inflated models, Tweedie models, and more. The procedure supports the selection of effects.
  • The QUANTSELECT procedure builds quantile regression models and features effect selection.
  • The GAMPL procedure fits generalized additive models. I have previously shown the power of the GAMPL procedure to fit binary response data.
  • The HPSPLIT procedure builds classification and regression trees. Tree-based procedures are a popular choice when you want a nonparametric model that is interpretable in terms of business rules.

If you are unfamiliar with these newer procedures, I encourage you to read the Rodriguez paper. Although they are designed for distributed computations, you can also run these procedures in single-machine mode.


Thoughts on the "SELECT" procedures

The GLMSELECT, HPGENSELECT, and (HP)QUANTSELECT procedures support choosing a small number of effects from among a large list of possible effects. Some statisticians call these "variable selection" procedures, but "effect selection" is a more appropriate term because an important use case is to use the procedures to select important interaction effects from the list of all second order interactions. Regardless of what you call them, the procedures automatically select a small number of effects that provide a good predictive model from among hundreds or thousands of possible effects. You can either accept the selected model or use traditional modeling techniques to refine the model from among the selected candidate effects.
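
As a concrete illustration, the following sketch asks GLMSELECT to select from all main effects and two-way interactions by using the LASSO. The data and variables are from Sashelp.Cars, and the options are illustrative rather than a recommendation:

proc glmselect data=sashelp.cars;
   class Origin Type;
   model MPG_City = Origin|Type|Horsepower|Weight @2   /* all 2-way interactions */
         / selection=lasso(choose=AICC);               /* LASSO; choose by AICC */
run;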

While thinking about the use of the word "SELECT" in the names of these procedures, it occurred to me that there is another SAS/STAT procedure that contains the word SELECT, and that is the SURVEYSELECT procedure. However, the SURVEYSELECT procedure is different in that it randomly selects observations (rows in the design matrix) whereas the previous procedures select variables (columns in the design matrix).

This is not the only example of this observation-variable dichotomy in SAS/STAT. The cluster procedures all have "CLUS" as part of their names. The ACECLUS, CLUSTER, FASTCLUS, and MODECLUS procedures all attempt to group observations into clusters. However, the VARCLUS procedure is a dimension reduction technique that groups variables together.


Do you write unnecessary SAS statements?

I'm addicted to you.
You're a hard habit to break.
Such a hard habit to break.
—  Chicago, "Hard Habit To Break"

Habits are hard to break. For more than 20 years I've been putting semicolons at the end of programming statements in SAS, C/C++, and Java/Javascript. But lately I've been working in a computer language that does not require semicolons. Nevertheless, my fingers have a mind of their own, and I catch myself typing unnecessary semicolons out of habit.

I started thinking about superfluous statements in the SAS language. Some programmers might argue that if the program still runs correctly, then unnecessary statements are inconsequential. However, as a general rule I think it is a good programming practice to avoid writing unnecessary statements.

Here are a few examples of unnecessary SAS statements. Can you think of more?

A RUN statement after a DATALINES statement

The doc for the DATALINES statement in the SAS DATA step states: "The DATALINES statement is the last statement in the DATA step. Use a null statement (a single semicolon) to indicate the end of the input data." In other words, you do not need a RUN statement after the semicolon to run the DATA step. The following example runs correctly and creates a data set:

data A;
input x @@;
datalines;
1 2 3 4 5 6
;                              /* <== no RUN statement required */

How many times have you seen a RUN statement after a DATALINES statement? Countless! I've even seen examples in the SAS documentation that use this unnecessary statement.

A semicolon after a macro call

If you define a macro that contains a complete set of valid SAS statements, you do not need another semicolon when you call the macro. For example, the following example is valid:

%macro TOP(dsname);
   proc print data=&dsname(obs=5); run;
%mend;
 
%TOP(sashelp.class)            /* <== no semicolon required */

It's not a big deal if you type the semicolon because a semicolon is the null statement. It has no performance implications. But for some reason it bothers me when I catch myself doing it.

A RUN statement in a fully interactive procedure

In a fully interactive procedure, statements are executed immediately. The RUN statement has no effect. Examples include PROC IML, PROC SQL, and PROC OPTMODEL. You use the QUIT statement to exit these procedures, which means that the RUN statement is never needed. The following program is correct and runs three statements. In interactive mode, each statement gets run when the SAS parser reaches the semicolon that ends the statement.

proc sql;
create table Example (x num, y num);          /* statement 1 */
insert into Example
   values(1, 2)  values(3, 4)  values(5, 6);  /* statement 2 */
select *
   from Example;                              /* statement 3 */
quit;     /* <== exit the interactive procedure; no RUN statement required */

A RUN statement in (some) procedures that support RUN-group processing

Some SAS procedures are partly interactive. Procedures such as PROC DATASETS, PROC REG, and PROC GLM support RUN-group processing. For these procedures, the RUN statement defines blocks of statements that get executed, but the procedure remains running until it encounters a QUIT statement.

Many SAS/STAT procedures interpret QUIT to mean "run the most recent statements and then quit." For these procedures, you do not need a RUN statement before you call QUIT. The following statements run a regression analysis and then quit the procedure:

proc glm data=sashelp.class;
model weight = height;
quit;    /* <== No RUN statement; Runs previous statements, then quits */

Unfortunately, SAS procedures are not completely consistent in implementing the QUIT statement. In some SAS procedures the QUIT statement means "ignore the most recent statements and quit." The canonical examples are the traditional SAS/GRAPH procedures such as PROC GPLOT. In the following program, the first PLOT statement creates a scatter plot because it is followed by a RUN statement. However, the second plot statement is not followed by a RUN statement, so it is ignored and the second plot is not produced.

proc gplot data=sashelp.class;
   plot weight*height;
run;     /* <== executes previous PLOT statement; does not quit */
   plot weight*age;
quit;    /* <== ignores previous PLOT statement, then quits */
/* use RUN; QUIT; to produce the second plot */

If you aren't sure how a procedure behaves with regards to RUN-group processing, it is always safe to use the RUN and QUIT statements in tandem.
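
For example, the following pattern runs the pending statements and then ends the procedure, regardless of how the procedure interprets QUIT:

proc reg data=sashelp.class;
   model Weight = Height;
run;      /* execute the pending statements */
quit;     /* then end the procedure */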

When to include optional statements?

The previous sections describe unnecessary statements that I like to skip. However, sometimes I include optional statements in my programs for clarity, readability, or to practice defensive programming.

SAS supports many optional statements. When you omit an optional statement, the procedure does some default behavior. For example, if you omit the VAR statement, most procedures run on all numerical variables (for example, PROC MEANS) or on all variables (for example, PROC PRINT). When I want the default behavior, I skip the VAR statement.

Another statement that is technically optional is the RUN statement for a sequence of procedures. Because the next call to a procedure or DATA step will always end the previous procedure, you can technically omit the RUN statement for all but the last procedure. This means that the following program is valid, although I do not recommend this style of programming:

data class;
   set sashelp.class(where=(sex='M'));  /* 'class' is the _LAST_ data set */
proc means;                             /* DATA= _LAST_ */
proc print;                             /* DATA= _LAST_ */

If I'm feeling lazy, I might write these statements during the early exploratory phase of a data analysis. However, for serious work I terminate every procedure by using a RUN or QUIT statement. Skipping a RUN statement can lead to undesirable interactions with global statements such as the TITLE statement and ODS statements.

Your thoughts?

There is much more that can be said about these topics. What are your thoughts?

  • Are there unnecessary statements that you write out of habit?
  • Are there optional statements that you always include because it makes the program clearer?

Color markers in a scatter plot by a third variable in SAS

One of my favorite new features in PROC SGPLOT in SAS 9.4m2 is the addition of the COLORRESPONSE= and COLORMODEL= options to the SCATTER statement. By using these options, it is easy to color markers in a scatter plot so that the colors indicate the values of a continuous third variable.

If you are running an earlier release of SAS, you can use the Graph Template Language (GTL) to create a scatter plot where the markers are colored by a continuous response variable.

If you have a discrete variable (like gender), you can use the GROUP= option to color markers.


Color markers by a continuous response variable

You can use the COLORRESPONSE= option to visualize the values of a third variable by using colored markers in a scatter plot. For example, the following statements create a scatter plot of weight versus height for 19 students. Each marker is assigned a color that reflects the age of the student.

title "Markers Colored by Age";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        datalabel=Name;                            /* add labels to markers */
run;
[Figure: Markers colored by a third variable]

Click on the image for a larger version. By default, the plot uses a three-color gradient ramp. The smallest value of age (Joyce, age 11) is colored blue. The largest value of age (Philip, age 16) is colored red. Markers that correspond to ages near the midrange (in this case, 13.5) are colored black. The gradient color ramp is shown on the right side of the plot. The plot shows that shorter and lighter students tend to be younger than taller and heavier students.

Use a different color ramp

The default color ramp is called ThreeColorAltRamp and is shown above. (The "Alt" version is suitable for use on a light-colored background.) You can change the color ramp by using the COLORMODEL= option in the SCATTER statement. The ODS graphics system provides several other two- and three-color color ramps. One is called the TwoColorRamp color ramp, which uses white as a color for low values and blue as a color for high values. Because the background of the graph is white, some markers will be hard to see unless you take additional actions. The following statements change the background color to a light gray and put a black outline around each marker so that an almost-white marker is still visible against the background:

title2 "Colormodel = TwoColorRamp";
proc sgplot data=sashelp.class;
styleattrs wallcolor=CXF0F0F0;                     /* light gray background */
scatter x=height y=weight / colorresponse=age 
        markerattrs=(symbol=CircleFilled size=14)  /* big filled markers */
        filledoutlinedmarkers                      /* add outlines */
        colormodel=TwoColorRamp;                   /* white-blue ramp */
run;
[Figure: Markers colored by a two-color gradient ramp]

Define a custom color ramp

You can also use the COLORMODEL= option to create a custom color ramp. There are multiple ways to specify colors in SAS, including English words and hexadecimal integers that begin with the prefix 'CX.' For example, you could create a four-color ramp by using the following COLORMODEL= option:

   colormodel=(blue green orange red);       /* custom 4-color ramp */

Be aware that not all color ramps are created equal. If your color ramp contains colors that vary a lot in brightness and intensity, the viewer's eye will be drawn towards certain markers and away from others. This can bias the way that the viewer perceives the data.

I wrote a previous article that discussed making wise choices when you create a custom color ramp. In the following statements, I use colors from a six-color "spectral" color ramp in which all colors are perceptually equivalent. (I use the PALETTE function in SAS/IML to generate perceptually balanced color ramps.)

title2 "Custom Colormodel";
proc sgplot data=sashelp.class;
scatter x=height y=weight / colorresponse=age
   markerattrs=(symbol=CircleFilled size=14)
   colormodel=(CX3288BD CX99D594 CXE6F598 CXFEE08B CXFC8D59 CXD53E4F);
xaxis grid; yaxis grid;
run;
[Figure: Markers colored by a custom gradient color ramp]

Change the location of the gradient legend

By default the gradient legend appears on the right side of the plot. You can change the location and a few other attributes of the gradient legend by using the GRADLEGEND statement in SAS 9.4m2. For example, if you want the gradient legend to appear at the bottom of the plot and want to modify the title for the legend, you add the following SGPLOT statement:

gradlegend / position=bottom title="Age (yrs)";

In SAS 9.4m3, the COLORRESPONSE= and COLORMODEL= options are not only available in the SCATTER statement, but also are available in other statements, including the DOT, HBAR, HIGHLOW, SERIES, VBAR, VECTOR, and WATERFALL statements.
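
For example, the following sketch (assuming a SAS 9.4m3 session) colors the bars of a bar chart by the mean of a third variable. The COLORSTAT= option is assumed here to set the summary statistic for the color mapping:

title "Mean City MPG, Bars Colored by Mean Horsepower";
proc sgplot data=sashelp.cars;
   vbar Origin / response=MPG_City stat=mean
        colorresponse=Horsepower colorstat=mean   /* color by mean horsepower */
        colormodel=TwoColorRamp;
run;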


Absorbing Markov chains in SAS

Last week I showed how to represent a Markov transition matrix in the SAS/IML matrix language. I also showed how to use matrix multiplication to iterate a state vector, thereby producing a discrete-time forecast of the state of the Markov chain system. This article shows that the expected behavior of a Markov chain can often be determined just by performing linear algebraic operations on the transition matrix.


Absorbing Markov chains

Whereas the system in my previous article had four states, this article uses an example that has five states. The ith row of the following transition matrix gives the probabilities of transitioning from State i to any other state:

proc iml;
/* i_th row contains transition probabilities from State i */
P = { 0    0.5  0    0.5  0,
      0.5  0    0.5  0    0,
      0    0.5  0    0    0.5,
      0    0    0    1    0,
      0    0    0    0    1 };

For example, the last row of the matrix indicates that if the system is in State 5, the probability is 1 that it stays in State 5. This is the definition of an absorbing state. An absorbing state is common for many Markov chains in the life sciences. For example, if you are modeling how a population of cancer patients might respond to a treatment, possible states include remission, progression, or death. Death is an absorbing state because dead patients have probability 1 that they remain dead. The non-absorbing states are called transient. The current example has three transient states (1, 2, and 3) and two absorbing states (4 and 5).

If a Markov chain has an absorbing state and every initial state has a nonzero probability of transitioning to an absorbing state, then the chain is called an absorbing Markov chain. The Markov chain determined by the P matrix is absorbing. For an absorbing Markov chain, you can discover many statistical properties of the system just by using linear algebra. The formulas and examples used in this article are taken from the online textbook by Grinstead and Snell.

The first step for analyzing an absorbing chain is to permute the rows and columns of the transition matrix so that all of the transient states are listed first and the absorbing states are listed last. (The P matrix is already in this form.) If there are k absorbing states, the transition matrix in block form looks like the following:

    P = ( Q  R )
        ( 0  I )

The bottom right block of the transition matrix is a k x k identity matrix and represents the k absorbing states. The top left block contains the probabilities of transitioning between transient states. The upper right block contains the probabilities of transitioning from a transient state to an absorbing state. The lower left block contains zeros because there is no chance of transitioning from an absorbing state to any other state.

The following SAS/IML statements show how to extract the Q and R matrices from the P matrix:

k = sum(vecdiag(P)=1);      /* number of absorbing states */
nt = ncol(P) - k;           /* number of transient states */
Q = P[1:nt, 1:nt];          /* prob of transitions between transient states */ 
R = P[1:nt, nt+1:ncol(P)];  /* prob of transitions to absorbing state */

Expected behavior of absorbing Markov chains

By definition, all initial states for an absorbing system will eventually end up in one of the absorbing states. The following questions are of interest. If the system starts in the transient state i, then:

  1. What is the expected number of steps the system spends in transient state j?
  2. What is the expected number of steps before entering an absorbing state?
  3. What is the probability that the system will be absorbed into the jth absorbing state?

The answers to these questions are obtained by defining the so-called fundamental matrix, which is N = (I - Q)^(-1). The fundamental matrix answers the first question: the (i,j)th entry of N is the expected number of steps that the system spends in transient state j if it starts in transient state i:

transState = char(1:nt);        /* labels of transient states */
absState = char(nt+1:ncol(P));  /* labels of absorbing states */
/* Fundamental matrix gives the expected time that the system is 
   in transient state j if it starts in transient state i */
N = inv(I(nt) - Q);  
print N[L="Expected Time in State j" r=transState c=transState];
[Figure: Expected time in state j for an absorbing Markov chain]

The first row indicates that if the system starts in State 1, then on average it will spend 1.5 units of time in State 1 (including the initial time period), 1 unit of time in State 2, and 0.5 units of time in State 3 before transitioning to an absorbing state. Similarly, if the system starts in State 2, you can expect 1 unit, 2 units, and 1 unit of time in States 1, 2, and 3, respectively, before transitioning to an absorbing state.

Obviously, if you sum up the values for each row, you get the expected number of steps until the system transitions to an absorbing state:

/* Expected time before entering an absorbing state if the system
   starts in the transient state i */
t = N[,+];
print t[L="Expected Time until Absorption" r=transState];
[Figure: Expected time until absorption for an absorbing Markov chain]

The remaining question is, to me, the most interesting. If the system starts in a transient state, you know that it will eventually transition into one of the k absorbing states. But which one? What is the probability that the system ends in the jth absorbing state if it starts in the ith transient state? The answer is obtained by the matrix multiplication N*R because the matrix N is the expected number of steps in each transient state and R contains the probability of transitioning from a transient state to an absorbing state. For our example, the computation is as follows:

/* The probability that an absorbing chain will be absorbed
   into j_th absorbing state if it starts in i_th transient state */
B = N*R;
print B[L="Absorption Probabilities" r=transState c=absState];
[Figure: Absorption probabilities for an absorbing Markov chain with two absorbing states]

The first row of the matrix indicates that if the system starts in State 1, it will end up in State 4 three quarters of the time and in State 5 one quarter of the time. The second row indicates that if the system starts in State 2, there is a 50-50 chance that it will end up in State 4 or State 5.

Because this Markov chain is a stochastic process, you cannot say with certainty whether the system will eventually be absorbed into State 4 or State 5. However, starting the system in State 1 means that there is a higher probability that the final state will be State 4. Similarly, starting in State 3 means a higher probability that the final state will be in State 5.

There are many other statistics that you can compute for absorbing Markov chains. Refer to the references for additional computations.

