Counting observations in two-dimensional bins

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of observations in each bin. The BIN function associates a bin with each observation, and the TABULATE subroutine computes the count for each bin. I have previously blogged about how to use this technique to count the observations in bins of equal or unequal width.
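For example, here is a minimal sketch of that technique. The data and cut points below are made up for illustration; they are not from the earlier articles:

proc iml;
x = {1.2, 0.5, 3.1, 2.7, 0.2, 1.9, 2.2, 3.8};  /* made-up data */
cutpts = do(0, 4, 1);                /* bin endpoints 0, 1, 2, 3, 4 */
b = bin(x, cutpts);                  /* assign a bin number (1-4) to each observation */
call tabulate(levels, count, b);     /* count the observations in each nonempty bin */
print levels count;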


You can also count the number of observations in two-dimensional bins. Specifically, you can divide the X direction into kx bins, divide the Y direction into ky bins, and count the number of observations in each of the kx × ky rectangles. In a previous article, I described how to bin 2-D data by using the SAS/IML language and produced the scatter plot at the left. The graph displays the birth weight of 50,000 babies versus the relative weight gain of their mothers for data in the Sashelp.bweight data set.

I recently remembered that you can also count observations in 2-D bins by using the KDE procedure. The BIVAR statement in the KDE procedure supports an OUT= option, which writes a SAS data set that contains the counts in each bin. You can specify the number of bins in each direction by using the NGRID= option on the BIVAR statement, as shown by the following statements:

ods select none;
proc kde data=sashelp.bweight;
   bivar MomWtGain(ngrid=11) Weight(ngrid=7) / out=KDEOut;
run;
ods select all;
 
proc print data=KDEOut(obs=10 drop=density);
run;

As shown in the output, the VALUE1 and VALUE2 variables contain the coordinates of the center of each bin. The COUNT variable contains the bin count. You can read the counts into SAS/IML and print them as a matrix. In the data set, the Y variable changes the fastest as you go down the rows. Consequently, you need to use the SHAPECOL function rather than the SHAPE function to form the 7 x 11 matrix of counts:

proc iml;
kX = 11; kY = 7;
use KDEOut; read all var {Value1 Value2 Count}; close;
M = shapecol(Count, kY, kX);       /* reshape into kY x kX matrix */
/* create labels for cells in X and Y directions */
idx = do(1,nrow(Value1),kY);        /* location of X labels */
labelX = putn(Value1[idx], "5.1");
labelY = putn(Value2[1:kY], "5.0"); 
print M[c=labelX r=labelY];

As I explained in Monday's blog post, there are two standard ways to choose bins for a histogram: the "midpoints" method and the "endpoints" method. If you compare the bin counts from PROC KDE to the bin counts from my SAS/IML program, you will see small differences. That is because the KDE procedure uses the "midpoints" algorithm for subdividing the data range, whereas I used the "endpoints" algorithm for my SAS/IML program.

Last week I showed how to use heat maps to visualize matrices in SAS/IML. This matrix of counts is begging to be visualized with a heat map. Because the counts vary across several orders of magnitude (from 10⁰ to more than 10⁴), a linear color ramp will not be an effective way to visualize the raw counts. Instead, transform the counts to a log scale and create a heat map of the log-counts:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="ThreeColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";

The heat map shows the log-count of each bin. If you prefer a light-to-dark heat map for the density, use the "TwoColor" value for the COLORRAMP= option. The heat map is a great way to see the distribution of counts at a glance. It enables you to see the approximate values of most cells, and you can easily determine cells where there are many or few observations. Of course, if you want to know the exact value of the count in each rectangular cell, look at the tabular output.
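For example, the light-to-dark variant mentioned above requires changing only the COLORRAMP= value in the earlier call; everything else stays the same:

call HeatmapCont(log10(1+M)) 
                 xvalues=labelX yvalues=labelY
                 colorramp="TwoColor" legendtitle="Log(Count)"
                 title="Counts in Each Bin";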


Choosing bins for histograms in SAS

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing "round numbers" for the location and width of bins. Common "round" bin widths include 1, 2, 2.5, and 5, as well as these numbers multiplied by a power of 10.

The default bin width and locations tend to work well for 95% of the data that I plot, but sometimes I decide to override the default choices. This article describes how to set the width and location of bins in histograms that are created by the UNIVARIATE and SGPLOT procedures in SAS.

Why override the default bin locations?

The most common reason to override the default bin locations is because the data have special properties. For example, sometimes the data are measured in units for which the common "round numbers" are not optimal:

  • For a histogram of time measured in minutes, a bin width of 60 is a better choice than a width of 50. Bin widths of 15 and 30 are also useful.
  • For a histogram of time measured in hours, 6, 12, and 24 are good bin widths.
  • For days, a bin width of 7 is a good choice.
  • For a histogram of age (or other values that are rounded to integers), the bins should align with integers.

You might also want to override the default bin locations when you know that the data come from a bounded distribution. If you are plotting a positive quantity, you might want to force the histogram to use 0 as the leftmost endpoint. If you are plotting percentages, you might want to force the histogram to choose 100 as the rightmost endpoint.

To illustrate these situations, let's manufacture some data with special properties. The following DATA step creates two variables. The T variable represents time measured in minutes. The program generates times that are normally distributed with a mean of 120 minutes, then rounds these times to the nearest five-minute mark. The U variable represents a proportion between 0 and 1; it is uniformly distributed and rounded to two decimal places.

data Hist(drop=i);
label T = "Time (minutes)" U = "Uniform";
call streaminit(1);
do i = 1 to 100;
   T = rand("Normal", 120, 30); /* normal with mean 120  */
   T = round(T, 5);             /* round to nearest five-minute mark */
   U = rand("Uniform");         /* uniform on (0,1) */
   U = floor(100*U) / 100;      /* round down to nearest 0.01 */
   output;
end;
run;

How do we control the location of histogram bins in SAS? Read on!

Custom bins with PROC UNIVARIATE: An example of a time variable

I create histograms with PROC UNIVARIATE when I am interested in also computing descriptive statistics such as means and quantiles, or when I want to fit a parametric distribution to the data. The following statements create the default histogram for the time variable, T:

title "Time Data (N=100)";
ods select histogram(PERSIST);  /* show ONLY the histogram until further notice */
proc univariate data=Hist;
histogram T / odstitle=title odstitle2="Default Bins";
run;

The default bin width is 20 minutes, which is not horrible, but not as convenient as 15 or 30 minutes. The first bin is centered at 70 minutes; a better choice would be 60 minutes.

The HISTOGRAM statement in PROC UNIVARIATE supports two options for specifying the locations of bins. The ENDPOINTS= option specifies the endpoints of the bins; the MIDPOINTS= option specifies the midpoints of the bins. The following statements use these options to create two customized histograms for which the bin widths are 30 minutes:

proc univariate data=Hist;
histogram T / midpoints=(60 to 210 by 30)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram T / endpoints=(60 to 210 by 30)  odstitle=title odstitle2="Endpoints";
run;

The histogram on the left has bins that are centered at 30-minute intervals. This histogram makes it easy to estimate that about 40 observations have times of approximately 120 minutes. The counts for other half-hour increments are similarly easy to estimate. In contrast, the histogram on the right has bins whose endpoints are 60, 90, 120,... minutes. With this histogram, it is easy to see that about 35 observations have times that are between 90 and 120 minutes. Similarly, you can estimate the number of observations that are greater than three hours or less than 90 minutes.

Both histograms are equally correct. The one you choose should depend on the questions that you want to ask about the data. Use midpoints if you want to know how many observations are near a particular value; use endpoints if you want to know how many observations are between two values.

If you run the SAS statements that create the histogram on the right, you will see the warning message
WARNING: The ENDPOINTS= list was extended to accommodate the data.
This message informs you that you specified the last endpoint as 210, but that additional bins were created to display all of the data.

Custom bins for a bounded variable

As mentioned earlier, if you know that values are constrained within some interval, you might want to choose histogram bins that incorporate that knowledge. The U variable has values that are in the interval [0,1), but of course PROC UNIVARIATE does not know that. The following statements create a histogram of the U variable with the default bin locations:

title "Bounded Data (N=100)";
proc univariate data=Hist;
histogram U / odstitle=title odstitle2="Default Bins";
run;

The default histogram shows seven bins with a bin width of 0.15. From a statistical point of view, this is an adequate histogram. The histogram indicates that the data are uniformly distributed and, although it is not obvious, the left endpoint of the first bin is at 0. However, from a "human readable" perspective, this histogram can be improved. The following statements use the MIDPOINTS= and ENDPOINTS= options to create histograms that have bin widths of 0.2 units:

proc univariate data=Hist;
histogram U / midpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Midpoints";
run;
 
proc univariate data=Hist;
histogram U / endpoints=(0 to 1 by 0.2)  odstitle=title odstitle2="Endpoints";
run;
ods select all;  /* turn off PERSIST; restore normal output */

The histogram on the left is not optimal for these data. Because we created uniformly distributed data in [0,1], we know that the expected count in the leftmost bin (which is centered at 0) is half the expected count of an inner bin. Similarly, the expected count in the rightmost bin (which is centered at 1) is half the count of an inner bin because no value can exceed 1. Consequently, this choice of midpoints is not very good. For these data, the histogram on the right is better at revealing that the data are uniformly distributed and are within the interval [0,1).

Custom bins with PROC SGPLOT

If you do not need the statistical power of the UNIVARIATE procedure, you might choose to create histograms with PROC SGPLOT. The SGPLOT procedure supports the BINWIDTH= and BINSTART= options on the HISTOGRAM statement. The BINWIDTH= option specifies the width for the bins. The BINSTART= option specifies the center of the first bin.

I recommend that you specify both the BINWIDTH= and BINSTART= options, and that you choose the bin width first. Be aware that not all specifications result in a valid histogram. If you make a mistake when specifying the bins, you might see the following message
WARNING: The specified BINWIDTH= value will be ignored in order to accommodate the data.
That message usually means that the minimum value of the data was not contained in a bin. For a bin width of h, the BINSTART= value must be less than xmin + h/2, where xmin is the minimum value of the data.
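For example, the following hypothetical misspecification (not part of the original example) uses a bin width of 0.2 and centers the first bin at 0.5. Because the minimum of U is near 0, the BINSTART= value exceeds xmin + h/2, so the procedure issues the warning and ignores the requested width:

proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.5;  /* first bin is [0.4, 0.6], which misses min(U) */
run;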

By default, the axis does not show a tick mark for every bin, but you can force that behavior by using the SHOWBINS option. The following statements call the SGPLOT procedure to create histograms for the time-like variable, T. The results are again similar to the custom histograms that are shown in the previous section:

title "Time Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=60 showbins; /* center first bin at 60 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram T / binwidth=15 binstart=52.5;  /* 52.5 = 45 + h/2 */
xaxis values=(45 to 180 by 15);           /* alternative way to place ticks */
run;

The following statements call the SGPLOT procedure to create histograms for the bounded variable, U. The results are similar to those created by the UNIVARIATE procedure:

title "Bounded Data (N=100)";
title2 "Midpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0 showbins;  /* center first bin at 0 */
run;
 
title2 "Endpoints";
proc sgplot data=Hist;
histogram U / binwidth=0.2 binstart=0.1;      /* center first bin at h/2 */
xaxis values=(0 to 1 by 0.2);       
run;

In summary, for most data the default bin width and location result in a histogram that is both statistically useful and easy to read. However, the default choices can lead to a less-than-optimal visualization if the data have special properties, such as being time intervals or being bounded. In those cases, it makes sense to choose a bin width and a location of the first bin that reveal your data's special properties. For the UNIVARIATE procedure, use the MIDPOINTS= or ENDPOINTS= options on the HISTOGRAM statement. For the SGPLOT procedure, use the BINWIDTH= and BINSTART= options to create a histogram with custom bins.


Analyzing activity-tracker data: How many steps per day do YOU take?

My wife got one of those electronic activity trackers a few months ago and has been diligently walking every day since then. At the end of the day she sometimes reads off how many steps she walked, as measured by her activity tracker. I am always impressed at how many steps she racks up during the day through a combination of regular activity and scheduled walks.

There has been a lot written about the accuracy of electronic activity trackers. Personally, I don't worry about accuracy as long as the errors are in the 10% range. The purpose of these tools is to give people an idea of how much they are moving and to encourage them to get off the couch. Whether someone walked 4.2 miles or 3.8 isn't as important as the fact that she walked about 4 miles.

Because my wife's daily numbers seem so impressive, I decided to download some data from other people who use the same device. The device can be linked to a person's Twitter account and programmed to tweet a summary of the person's activity level each day. The Tweets all have a common hashtag and format, so it is easy to download a few hundred tweets and prepare them for data analysis. In an act of statistical voyeurism, I present a descriptive analysis of the activity level of 231 random people who use a particular brand of activity tracker, as reported by their tracker for Sunday, August 17, 2014. You can download the SAS program that analyzes these data.

Distribution of steps

The tracker records steps. The histogram at the left shows the distribution of the number of steps taken for the 231 subjects. A kernel density estimate is overlaid, and various quantiles are displayed in the inset. The histogram shows that about 25% of the people in the sample walked fewer than 4,000 steps, and about half of the people walked fewer than 7,100 steps. About a quarter of the people walked more than 11,600 steps, and the I-am-very-active award goes to those people who walked more than 18,000 steps; they are above the 95th percentile of this sample.

The tail of the distribution falls off rapidly, which means that there is a low probability of finding someone who walks more than 30,000 steps per day. In other words, this is not a fat-tailed distribution. On the contrary, a person in the upper percentiles is working her tail off!

How many steps does it take to walk a mile?


The tracker also reports distance in miles. The basic device uses an algorithm to convert those steps into an approximate distance, so, as they say, your distance may vary (nyuck-nyuck!). The device does not know your exact stride length.

The scatter plot to the left shows the relationship between distance and steps. For people who take many steps, there is substantial variation in the reported distance. Probably some of those people were running or jogging, which changes the length of the stride. The line indicates predicted distance for a given number of steps, as calculated by a robust regression algorithm.
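The full analysis is in the downloadable program mentioned above. As a generic sketch (not necessarily the method used in that program), you could fit a robust line in SAS with PROC ROBUSTREG; the data set and variable names below are placeholders:

proc robustreg data=Tracker method=mm;  /* Tracker, Steps, and Distance are placeholder names */
   model Distance = Steps;              /* robust fit of distance as a function of steps */
run;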

These data can help to answer the question, "How many steps does it take to walk a mile?" Based on these data, it takes an average person 2,272 steps to walk a mile. Of course, shorter people will require more steps and taller people fewer, and there is the whole debate about how accurate these trackers are at estimating distance. Nevertheless, 2,272 steps is a good estimate for a mile. For a simpler number, you can estimate that 10,000 steps is about four miles.

These data also enable you to estimate the length of the average stride, which is 2.32 feet, or about 71 centimeters per step.
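As a quick check on that arithmetic, you can convert 2,272 steps per mile into feet and centimeters per step by using nothing but unit conversions:

data _null_;
stepsPerMile = 2272;
ftPerStep = 5280 / stepsPerMile;   /* 5,280 feet per mile ==> about 2.32 feet per step */
cmPerStep = 30.48 * ftPerStep;     /* 30.48 cm per foot   ==> about 71 cm per step */
put ftPerStep= 5.2 cmPerStep= 5.1;
run;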

What do you think of these data? If you use a fitness tracking device, how many steps do you take each day? Leave a comment.


Creating heat maps in SAS/IML

In a previous blog post, I showed how to use the graph template language (GTL) in SAS to create heat maps with a continuous color ramp. SAS/IML 13.1 includes the HEATMAPCONT subroutine, which makes it easy to create heat maps with continuous color ramps from SAS/IML matrices. Typical usage includes creating heat maps of data matrices, correlation matrices, or covariance matrices. A more advanced usage is using matrices to visualize the results of computational algorithms or computer experiments.

Data matrices

Heat maps provide a convenient way to visualize many observations and/or many variables at a single glance. For example, suppose that you want to compare and contrast the 24 trucks in the Sashelp.Cars data by using the 10 numerical variables in the data set. The following SAS/IML statements read the data into a matrix and sort the matrix according to the value of the Invoice variable (that is, by price). The Model variable, which identifies each truck, is sorted by using the same criterion:

proc iml;
use Sashelp.Cars where(type="Truck");
read all var _NUM_ into Trucks[c=varNames r=Model];
close Sashelp.Cars;
 
call sortndx(idx, Trucks[,"Invoice"]);     /* find indices that sort the data */
Trucks = Trucks[idx,]; Model = Model[idx]; /* sort the data and the models    */

The original variables are all measured on different scales. For example, the Invoice variable has a mean value of $23,000, whereas the EngineSize variable has a mean value of 4 liters. Most of the values in the matrix are less than 300. Consequently, a heat map of the raw values would not be illuminating: The price variables would be displayed as dark cells (high values) and the rest of the matrix would be almost uniformly light (small values, relative to the prices). Fortunately, the HEATMAPCONT subroutine supports the SCALE="Col" option to standardize each variable to have mean 0 and unit variance. With that option, a heat map reveals the high, middle, and low values of each variable:

call HeatmapCont(Trucks) scale="Col"; /* standardize cols */

The HEATMAPCONT output shows the following default values:

  • The default color ramp is the 'TwoColor' color ramp. With my ODS style, white is used to display small values and dark blue is used to display large values. The color ramp indicates that the standardized values are approximately in the range [-2, 3], which probably indicates that several variables have positive skewness.
  • The axes are labeled by using the row numbers of the 24 x 10 matrix.
  • The Y axis "points down." In other words, the heat map displays the matrix as it would be printed: the top of the heat map displays the first few rows and the bottom of the heat map displays the last few rows.

A problem with this initial heat map is that we don't know which trucks correspond to the rows, nor do we know which variables correspond to the columns. You can use the XVALUES= and YVALUES= options to add that information to the heat map. Also, let's use a three-color color ramp and center the range of the color ramp so that white represents the mean value for each variable and red (blue) represents positive (negative) deviations from the mean. The resulting image is shown below:

call HeatmapCont(Trucks) scale="Col"           /* standardize cols    */
     xvalues=varNames yvalues=Model            /* label rows and cols */
     colorramp="ThreeColor" range={-3 3}       /* color range [-3,3]  */
     legendtitle = "Standardized values" title="Truck Data";

This visualization of the data enables us to draw several conclusions about the data. In general, expensive cars are powerful (large values of EngineSize, Cylinders, and Horsepower), large in size (large values of Weight, Wheelbase, and Length), and get poor fuel economy (small values of MPG_City and MPG_Highway). However, two vehicles seem unusual. Compared with similarly priced trucks, the Baja is smaller and more fuel efficient. Compared with similarly priced trucks, the Tundra is more powerful and larger in size.

Correlation matrices

Because the data matrix is sorted by price, it looks like price variables are positively correlated with all variables except for the fuel economy variables. Let's see if that is the case by computing the correlation matrix and displaying it as a heat map:

corr = corr(Trucks);
call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data";

The correlation matrix confirms our initial impression. The price variables are strongly correlated with the variables that indicate the power and size of the trucks. The price is negatively correlated with the fuel efficiency variables.

One of the advantages of computing correlation analysis in the SAS/IML language is that you can easily change the order of the variables. For example, the following statements move the fuel efficiency variables to the end, so that all of the positively correlated variables are grouped together:

newOrder = (1:5) || (8:10) || (6:7);
corr = corr[newOrder, newOrder];
varNames = varNames[newOrder];
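To see the effect of the new ordering, you can redraw the heat map for the permuted matrix. The following call simply repeats the earlier HEATMAPCONT syntax with the reordered correlation matrix and variable names:

call HeatmapCont(corr)
     xvalues=varNames yvalues=varNames
     colorramp="ThreeColor" range={-1 1}
     title="Correlations in Truck Data (Reordered)";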

Can you think of other applications of visualizing matrices by using the HEATMAPCONT subroutine? Leave a comment.


Creating a basic heat map in SAS


Heat maps have many uses. In a previous article, I showed how to use heat maps with a discrete color ramp to visualize matrices that have a small number of unique values, such as certain covariance matrices and sparse matrices. You can also use heat maps with a continuous color ramp to visualize correlation matrices or data matrices.

This article shows how to use a heat map with a continuous color ramp to visualize a function of two variables. A heat map of a simple cubic function is shown to the left. In statistical applications the function is often a quantity—such as the mean or variance—that varies as a function of two parameters. Heat maps are like contour plots without the contours; you can use a heat map to display smooth or nonsmooth quantities.

You can create a heat map by using the HEATMAPPARM statement in the graph template language (GTL). The following template defines a simple heat map:

proc template;              /* basic heat map with continuous color ramp */
define statgraph heatmap;
dynamic _X _Y _Z _T;        /* dynamic variables */
 begingraph;
 entrytitle _T;             /* specify title at run time (optional) */
  layout overlay;
    heatmapparm x=_X y=_Y colorresponse=_Z /  /* specify variables at run time */
       name="heatmap" primary=true
       xbinaxis=false ybinaxis=false;  /* tick marks based on range of X and Y */
    continuouslegend "heatmap";
  endlayout;
endgraph;
end;
run;

The template is rendered into a graphic when you call the SGRENDER procedure and provide a source of values (the data). The data should contain three variables: two coordinate variables that define a two-dimensional grid and one response variable that provides a value at each grid point. I have used dynamic variables in the template to specify the names of variables and the title of the graph. Dynamic variables are specified at run time by using PROC SGRENDER. If you use dynamic variables, you can use the template on many data sets.

The following DATA step creates X and Y variables on a uniform two-dimensional grid. The value of the cubic function at each grid point is stored in the Z variable. These data are visualized by calling the SGRENDER procedure and specifying the variable names by using the DYNAMIC statement. The image is shown at the top of this article.

/* sample data: a cubic function of two variables */
%let Step = 0.04;
data A;
do x = -1 to 1 by &Step;
   do y = -1 to 1 by &Step;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;
 
proc sgrender data=A template=Heatmap; 
   dynamic _X='X' _Y='Y' _Z='Z' _T="Basic Heat Map";
run;

The GTL template uses default values for most options. The default color ramp is a continuous three-color ramp. For the ODS style I am using, the default ramp uses blue to display low values, white to display middle values, and red to display high values. The XBINAXIS=false and YBINAXIS=false options override the default placement of tick marks on the axes. By default, the tick marks are placed at data values of the X and Y variables, which are multiples of 0.04 for these data. The following image shows part of the heat map that results if I omit the XBINAXIS= and YBINAXIS= options. I prefer my tick labels to depend on the data range, rather than the particular value of &Step that I used to generate the data.


The HEATMAPPARM statement has many options that you can use to control features of the heat map. Because the heat map is an important tool in visualizing matrices, SAS/IML 13.1 supports functions for creating heat maps directly from matrices. I will discuss one of these functions in my next blog post.


Guiding numerical integration: The PEAK= option in the SAS/IML QUAD subroutine

One of the things I enjoy about blogging is that I often learn something new. Last week I wrote about how to optimize a function that is defined in terms of an integral. While developing the program in the article, I made some mistakes that generated SAS/IML error messages. By understanding my errors, I learned something new about the default options for the QUAD subroutine in SAS/IML software.

The QUAD subroutine supports the PEAK= option, but until last week I had never used the option. (Obviously the default value works well most of the time, since I have been using the QUAD routine successfully for many years!) I discovered that the PEAK= option can be important to ensuring that a numerical integration problem converges to an accurate answer.

The default value of the PEAK= option is 1. But what does that mean? Here's what I gleaned from the documentation:

  • The PEAK= option is used to scale the integrand and to determine whether the adaptive integration algorithm needs to perform additional iterations.
  • The option specifies a value of x where the magnitude of the integrand f(x) is relatively large. For integrations on infinite and semi-infinite intervals, the value of the PEAK= option specifies where the function has a lot of "mass." For example, the standard normal probability density function is centered at x=0, so you can specify PEAK=0 to integrate that function on (–∞, ∞); a short sketch of this case appears after the list. This example illustrates why the option is named "PEAK."
  • For best results, choose the PEAK= option to be in the interior of the interval of integration. Don't choose a value that is too close to the boundary. For example, on a finite domain [a, b], the midpoint of the interval is often a good choice: PEAK=(a + b)/2.
  • The PEAK= value should not be a location where the integrand is zero. That is, if PEAK=v, then make sure that f(v) ≠ 0.
  • In spite of its name, the value of the PEAK= option does not need to be the location of the maximum value of the integrand. For integration on a finite domain you can specify almost any value in the interval, subject to the previous rules.
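To illustrate the infinite-interval case mentioned in the list, here is a small sketch (my own example, not from the original article) that integrates the standard normal density on (–∞, ∞). The special missing values .M and .P denote minus and plus infinity, and PEAK=0 tells the subroutine where most of the mass is located:

proc iml;
start NormalPDF(x);
   return( exp(-x##2/2) / sqrt(2*constant("pi")) );  /* standard normal density */
finish;
 
call quad(w, "NormalPDF", {.M .P}) peak=0;  /* integrate on (-infinity, infinity) */
print w;                                    /* should be very close to 1 */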

The remainder of this article uses the quadratic function f(x) = 1 – x² to illustrate how you can control the QUAD subroutine by using the PEAK= option.

Avoid zeros of the integrand

The following SAS/IML statements define the function to integrate. The integrand has a zero at the value x = 1, which is the default value for the PEAK= option. If you attempt to integrate this function on the interval [0, 2], you see the following scary-looking error message:

proc iml;
start Func(x) global(a);
   return( 1 - x##2 );         /* note that Func(1) = 0 */
finish;
 
call quad(w, "Func", {0 2});  /* default: PEAK=1 */
Convergence could not be attained over the subinterval (0 , 2 )
 
The function might have one of the following:
     1) A slowly convergent or a divergent integral.
     2) Strong oscillations.
     3) A small scale in the integrand: in this case you can change the 
        independent variable in the integrand, or
        provide the optional vector describing roughly the mean and 
        the standard deviation of the integrand.
     4) Points of discontinuities in the interior:
        in this case, you can provide a vector containing
        the points of discontinuity and try again.

The message says that the integration did not succeed, and it identifies possible problems and offers some suggestions. For this example, the integral is not divergent, oscillatory, or discontinuous, so we should look at the third item in the list. The message seems to assume that the integrand is a statistical probability distribution, which is not true for our example. For general functions, the advice could be rewritten as "... provide values for the PEAK= and SCALE= options. The SCALE= value should provide a horizontal estimate of scale; the integrand evaluated at the PEAK= value should provide a vertical estimate of scale."

For the example, –3 ≤ f(x) ≤ 1 when x is in the interval [0, 2], so the integrand has unit scale in the vertical direction. You can choose almost any value of x for the PEAK= option, but you should avoid values where the integrand is very small. The following statement uses PEAK=0.5, but you could also use PEAK= 1.5 or PEAK=0.1234.

call quad(w, "Func", {0 2}) peak=0.5;  
print w;

In a similar way, you should make sure that the integrand is defined at the PEAK= value. Avoid singular points and points of discontinuity.

Avoid the boundaries of the interval

Here's another helpful tip: choose the PEAK= option so that it is sufficiently inside the interval of integration. Avoid choosing a value that is too close to the lower or upper limit of integration. The following statement chooses a value for the PEAK= option that is just barely inside the interval [0, 2]. A warning message results:

val = 2 - 1e-7;     /* value is barely inside interval [0,2] */
call quad(w, "Func", {0 2}) peak=val;
Due to the roundoff encountered in computing FUNC
      near the point     2.00000E+00
      convergence can be attained, but only
      with an estimated error     0.00000E+00

The warning message says that a problem occurs near the point x=2. Although the routine will continue and "convergence can be attained," the routine is letting you know that it cannot perform the integration as well as it would like to. If you move the PEAK= value farther away from the limits of integration, the warning goes away and the subroutine can integrate the function more efficiently.

So there you have it. Although the default value of the PEAK= option works well for most integrals, sometimes the programmer needs to provide the QUAD routine with additional knowledge of the integrand. When you provide the PEAK= option, be sure to avoid zeros and discontinuities of the integrand, and stay away from the boundaries of the integration interval.


Ten tips for learning the SAS/IML language

A SAS customer wrote, "Now that I have access to PROC IML through the free SAS University Edition, what is the best way for me to learn to program in the SAS/IML language? How do I get started with PROC IML?"

That is an excellent question, and I'm happy to offer some suggestions. The following ideas are ordered according to your level of experience with SAS/IML programming. The first few resources will help beginners get started with PROC IML. The last few suggestions will help intermediate-level programmers develop and improve their SAS/IML programming skills.

  1. Get the SAS/IML Software. If your workplace does not license the SAS/IML product, you can download the free SAS University Edition onto your laptop for learning SAS/IML. Even if SAS/IML software is available at work, you might want to download the University Edition if you plan to learn and practice at night or on weekends.
  2. Work through the "Getting Started" chapter of the book Statistical Programming with SAS/IML Software. The chapter is available as a free excerpt from the book's Web page. Notice that I say "work through," not "read." Run the programs as you read. Change the numbers in the examples.
  3. Work through the first six chapters of the SAS/IML User's Guide. A few years ago I revised this documentation to make it more readable, especially the sections about reading and writing data.
  4. Download the SAS/IML tip sheets. By keeping a tip sheet on your desk, you can easily remind yourself of the syntax for common SAS/IML statements and functions.
  5. Subscribe to The DO Loop blog. Most Mondays I blog about elementary topics that do not require advanced programming skills. I also discuss DATA step programs, statistical graphics, and SAS/STAT procedures. I've written more than 100 blog posts that are tagged as "Getting Started."
  6. Program, program, program. The way to learn any programming language is to start writing programs in that language. When I was a university professor, I used to tell my students "Math is not a spectator sport." Programming is similar: In order to get better at programming, you need to practice programming. Many of the previous tips provided you with pre-written programs that you can modify and extend. The paper "Rediscovering SAS/IML Software: Modern Data Analysis for the Practicing Statistician" includes intermediate-level examples that demonstrate the power of the SAS/IML language.
  7. Use the SAS/IML Support Community. When you start writing programs, you will inevitably have questions. The SAS/IML Support Community is a discussion forum where you can post code and ask for help. As you gain experience, try answering questions posted by others!
  8. Think about efficiency. A difference between a novice programmer and an experienced programmer is that the experienced programmer can write efficient programs. In a matrix-vector language such as SAS/IML, that means vectorizing programs: using matrix operations instead of loops over variables and observations. Many programming tips and techniques in the first four chapters of Statistical Programming with SAS/IML Software deal with efficiency issues. As you gain experience, study the efficiency examples and vectorization examples in my blog.
  9. Use the SAS/IML Studio programming interface. I am more productive when I use SAS/IML Studio than when I use PROC IML in the SAS windowing environment (display manager). I like the color-coded program editor and the ability to develop and run multiple SAS/IML programs simultaneously. I like the debugging features, and the dynamically linked graphics are often useful for understanding relationships in data.
  10. Use the SAS/IML File Exchange. The SAS/IML File Exchange is a Web site where you can search for useful programs to use, study, or modify. The exchange is a little like those "Leave a penny; take a penny" bowls at cash registers. If you have written a cool program, contribute it so that others can use it. If you need a function that performs a particular analysis, download it from the site. The site launched in mid-2014, so we need contributions from many SAS/IML programmers before the site will become useful. Do you have a program that you can contribute?

Becoming a better SAS/IML programmer does not happen overnight. Merely reading books and blogs will not make you better. However, the ten tips in this article point out resources that you can use to improve your skills. So roll up your sleeves and start programming!


Define an objective function that evaluates an integral in SAS

The SAS/IML language is used for many kinds of computations, but three important numerical tasks are integration, optimization, and root finding. Recently a SAS customer asked for help with a problem that involved all three tasks. The customer had an objective function that was defined in terms of an integral. The problem was to find the roots and the maximum value of the function.

An objective function that involves an integral

When constructing an example, it is always helpful to use a problem for which the exact answer is known. For this article, I will define the objective function as follows:

F(x; a) = ∫_0^x (a – x y²) dy

To make the example more realistic, I've included a parameter, a. For each value of x, this function requires evaluating a definite integral. (Integrals with an upper limit of x arise when integrating probability density functions.) The integrand is the function a – x y². The dummy variable is y, so a and x are both treated as constant parameters when evaluating the integrand. Here are some useful facts about this function:

  • The integrand is a polynomial in y, so the integral can be evaluated explicitly. The simplified expression for F is F(x; a) = a x – x⁴/3. (A short derivation appears after this list.)
  • The two real roots of F can be found explicitly. They are 0 and the cube root of 3a.
  • The location of the maximum of F can be found explicitly. It is the cube root of 3a/4.
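For reference, here is the derivation promised in the first item. Because the integrand is a polynomial in y (with a and x held constant), the integral can be computed directly:

\[ F(x; a) = \int_0^x \left(a - x y^2\right)\, dy = a x - x \cdot \frac{x^3}{3} = a x - \frac{x^4}{3} \]

Setting F(x; a) = 0 gives the roots x = 0 and x = (3a)^(1/3). Setting F′(x; a) = a – (4/3)x³ = 0 gives the location of the maximum, x = (3a/4)^(1/3).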

The remainder of this article demonstrates how to find the roots and maximum value of this function, which is defined in terms of an integral.

Defining the function in the SAS/IML language

When evaluating the integral, the values of x and a are constant. In order to integrate a function by using the QUAD subroutine, the module that defines the integrand must have exactly one argument, which supplies a value for the dummy variable, y. Consequently, list other parameters in the GLOBAL statement of the SAS/IML module that defines the integrand, as follows:

proc iml;
/* Define the integrand, which will be called by QUAD subroutine.
   The integrand is a function of ONE variable: the "dummy" variable y.
   All other parameters are listed in the GLOBAL clause. */
start Integrand(y) global(a, x);
   return( a - x # y##2 );
finish;

You can now define a SAS/IML module that evaluates the function F. I like to use param as the name of the argument to an objective function. The name helps me to remember that param is going to be manipulated by root-finding or nonlinear programming (NLP) algorithms in order to find the roots and extrema of the function. Here's how I would define the objective function that evaluates F:

/* Define a SAS/IML function that computes an integral. 
   This module will be called by a root-finding or NLP routine. */
start ObjFunc(param) global(a, x);
   x = param;                         /* copy to global symbol */
   limits = 0 || x;                   /* limits of integration: [0,x] */
   call quad(w, "Integrand", limits); 
   return( choose(x>=0, w, -w) );     /* trick to handle x<0 */
finish;

I used a few tricks and techniques to define this module:

  • The only argument to the function is param. This is the quantity that will be used to search for roots and extrema. It corresponds to the x in the mathematical expression F(x; a).
  • The argument to the module cannot be named x because I have reserved that name for the global variable that is used by the INTEGRAND module. The SAS/IML syntax does not permit you to use the same name in the argument list and in the GLOBAL list for the same module. Instead, the OBJFUNC module copies the value of param to the x variable to ensure that the global variable remains in synch.
  • For this function, the argument determines the limits of integration. If x is always positive, you don't need to do anything special. However, if you want to permit negative values for x, then you need to use the programming trick in my article about how to evaluate an integral where the limits of integration are reversed.

Visualize the function

When tasked with finding the roots or extrema of a function, I recommend that you visualize the function. This is useful for estimating the number of roots, the location of roots, and finding an initial guess for the optimum. Although visualization can be difficult when the objective function depends on a large number of variables, in this case the visualization requires merely graphing the function.

To draw the graph, you must specify values for any parameters. I will set a = 4/3. For that value of a, the function F has roots at 0 and the cube root of 4 (≈ 1.5874). The maximum of F occurs at x = 1.

/* visualize the objective function */
a = 4/3;
t = do(-1, 2, 0.05);          /* graph the function on [-1, 2] */
F = j(1, ncol(t));
do i = 1 to ncol(t);
   F[i] = ObjFunc(t[i]);      /* evaluate the objective function at each x */ 
end;
title "Objective Function for a = 4/3";
call Series(t, F) grid={x y} label={"x" "F(x)"}
                  other="refline 0 / axis=y"; /* add reference line at y=0 */

The graph indicates that the function has two roots, one near x=0 and the other near x=1.6. The function appears to have one maximum, which is located near x=1. Of course, these values depend on the value of the parameter a.

Graphing the function is a "sanity check" that gives us confidence that there are no mistakes in the definition of the objective function. It is very important to test and debug the module that defines the objective function before you try to find a maximum! Many people post programs to the SAS/IML Support Community and report that "the optimization reports a strange error." Sometimes they ask whether there is a problem with the SAS/IML optimizing routine. To paraphrase Shakespeare, usually the fault lies not in our optimizers, but in our objective function. Always verify that the objective function is defined correctly before you call an NLP routine.

Find roots of the function

The objective function is ready to work with. The following statements use the FROOT function to find roots on the intervals [–1, 1] and [1, 2]:

/* find roots on [-1, 1] and [1, 2] */
intervals = {-1 1,
              1 2};
roots = froot("ObjFunc", intervals);    /* SAS/IML 12.1 */
print roots[format=6.4];

As stated earlier, the roots are at 0 and the cube root of 4 (≈ 1.5874).

Find a maximum of the function

To find the maximum, you can use one of the nonlinear programming functions in SAS/IML. In previous optimization problems I have used the NLPNRA subroutine, which uses the Newton-Raphson method to compute a maximum. However, Newton-Raphson is probably not the best choice for this problem because the function evaluation requires computing an integral, and therefore contains a small amount of numerical error. Furthermore, unless I "cheat" and use the exact solution to compute exact derivatives, the Newton-Raphson algorithm would have to use finite difference approximations to compute the derivative and second derivative at each point. For this problem these finite-difference computations are likely to be expensive and not very accurate.

Instead, I will use the Nelder-Mead simplex method, which is implemented in the NLPNMS subroutine. The simplex method solves an optimization problem without using any derivatives. This method can be a good choice for optimizing functions that have low precision, such as the result of evaluating an integral or the result of a simulation.

x0  = 0.5;      /* initial guess for maximum */
opt = {1,       /* find maximum of function  */
       0};      /* no printing */
call nlpnms(rc, Optimum, "ObjFunc", x0, opt); /* find maximum */
print Optimum[format=6.4];

As discussed previously, the maximum occurs at x=1.

This article has shown how to define a SAS/IML module that computes an integral in order to evaluate the function. The tips and techniques in this article should enable you to find the roots and extrema of complicated functions that arise in computational statistics.


Stigler's seven pillars of statistical wisdom

Wisdom has built her house;
She has hewn out her seven pillars.
     – Proverbs 9:1

At the 2014 Joint Statistical Meetings in Boston, Stephen Stigler gave the ASA President's Invited Address. In forty short minutes, Stigler laid out his response to the age-old question "What is statistics?" His answer was not a pithy aphorism, but rather a presentation of seven principles that form the foundation of statistical thought. Here are Stigler's seven pillars, with a few of my own thoughts thrown in:

  1. Aggregation: It sounds like an oxymoron that you can gain knowledge by discarding information, yet that is what happens when you replace a long list of numbers by a sum or mean. Every day the news media reports a summary of billions of stock market transactions by reporting a single weighted average of stock prices: the Dow Jones Industrial Average. Statisticians aggregate, and policy makers and business leaders use these aggregated values to make complex decisions.
  2. The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the "root n rule." The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.
  3. Likelihood: Some people say that statistics is "the science of uncertainty." One of the pillars of statistics is being able to confidently state how good a statistical estimate is. Hypothesis tests and p-values are examples of how statisticians use probability to carry out statistical inference.
  4. Intercomparisons: When analyzing data, statisticians usually make comparisons that are based on differences among the data. This is different than in some fields, where comparisons are made against some ideal "gold standard." Well-known analyses such as ANOVA and t-tests utilize this pillar.
  5. Regression and multivariate analysis: Children that are born to two extraordinarily tall parents tend to be shorter than their parents. Similarly, if both parents are shorter than average, the children tend to be taller than the parents. This is known as regression to the mean. Regression is the best known example of multivariate analysis, which also includes dimension-reduction techniques and latent factor models.
  6. Design: R. A. Fisher, in an address to the Indian Statistical Congress (1938) said "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." A pillar of statistics is the design of experiments, and—by extension—all data collection and planning that leads to good data. Included in this pillar is the idea that random assignment of subjects to design cells improves the analysis. This pillar is the basis for agricultural experiments and clinical trials, just to name two examples.
  7. Models and Residuals: This pillar enables you to examine shortcomings of a model by examining the difference between the observed data and the model. If the residuals have a systematic pattern, you can revise your model to explain the data better. You can continue this process until the residuals show no pattern. This pillar is used by statistical practitioners every time that they look at a diagnostic residual plot for a regression model.

I agree with Stigler's choice of the seven pillars. If someone asks, "What is statistics?" I sometimes reply "It is what statisticians do!" But what do statisticians do? They apply these seven pillars of thought to convert measurements to information. They aggregate. They glean information from small samples. They use probability to report confidence in their estimates. They create ways to quantify data differences. They analyze multivariate data. They design experiments. They build and refine models.

What do you think of Stigler's pillars? Are there others that you would add? Leave a comment.


Reversing the limits of integration in SAS

In SAS software, you can use the QUAD subroutine in the SAS/IML language to evaluate definite integrals on an interval [a, b]. The integral is properly defined only for a < b, but mathematicians define the following convention, which enables you to make sense of reversing the limits of integration:

∫_b^a f(x) dx = –∫_a^b f(x) dx

Last week I was investigating a question that a SAS customer had posed to SAS Technical Support. In answering the question, it became important to know how the QUAD subroutine behaves if the left limit of integration is greater than the right limit. The QUAD subroutine supports an argument that specifies the limits of integration. The documentation states:

The simplest form of [the argument] provides the limits of the integration on one interval. The first element... should contain the left limit. The second element should be the right limit.... The elements of the vector must be sorted in an ascending order.

That's pretty explicit, so I thought I might get an error message if I specify a left limit that is greater than the right limit. However, that is not what happens! The following program shows the result of integrating the function f(x) = 4x³ on the interval [0, 1]. In the first call, I specify the limits of integration in ascending order. In the second call, I reverse the limits:

proc iml;
start Integrand(y);
   return( 4*y##3 );
finish;
 
call quad(w1, "Integrand", {0 1});   /* integral on [0 1] equals 1 */
call quad(w2, "Integrand", {1 0});   /* reverse limits */
print w1[L="QUAD on [0,1]"]          /* should get 1 */
      w2[L="QUAD on [1,0]"];         /* what do we get here? */

The QUAD subroutine does not give an error when the limits are in descending order. Instead, the routine silently sorts the limits and correctly evaluates the integral on the interval [0 1].

This seems like reasonable behavior, but for the problem that I was working on last week, I wanted to define a function where reversing the limits of integration results in reversing the sign of the integral. Now that I understood how the QUAD subroutine behaves, it was easy to write the function:

/* Compute the integral on an interval where the lower
   limit of integration is a and the upper limit is b.
   If a <= b, return the integral on [a,b].
   If a > b, return the opposite of the integral on [b,a] */
start EvalIntegral(FuncName, Limits);
   call quad(w, FuncName, Limits); 
   a = Limits[1]; b = Limits[2];
   return( choose(a<=b, w, -w) );  /* trick to handle a>b */
finish;
 
w3 = EvalIntegral("Integrand", {0 1} );
w4 = EvalIntegral("Integrand", {1 0} );
print w3[L="EvalIntegral on [0,1]"] 
      w4[L="EvalIntegral on [1,0]"];

The EvalIntegral function is very simple. It evaluates the integral with the specified limits. If the limits of integration are in the usual (ascending) order, it returns the value of the integral. If the left limit of integration is greater than the right limit, it returns the opposite of the value that was computed. In other words, it implements the mathematical convention for reversing the limits of integration.

Reversing the limits of integration is a theoretical tool that rarely comes up in practice. However, if you ever need to compute a definite integral where the limits of integration are reversed, this blog post shows how to construct a function that obeys the mathematical convention.
