Simultaneous confidence intervals for a multivariate mean

Many SAS procedures compute statistics and also compute confidence intervals for the associated parameters. For example, PROC MEANS can compute the estimate of a univariate mean, and you can use the CLM option to get a confidence interval for the population mean. Many parametric regression procedures (such as PROC GLM) can compute confidence intervals for regression parameters. There are many other examples.

If an analysis provides confidence intervals (interval estimates) for multiple parameters, the coverage probabilities apply individually for each parameter. However, sometimes it is useful to construct simultaneous confidence intervals. These are wider intervals for which you can claim that all parameters are in the intervals simultaneously with confidence level 1-α.

This article shows how to use SAS to construct a set of simultaneous confidence intervals for the population mean. The middle of this article uses some advanced multivariate statistics. If you only want to see the final SAS code, jump to the last section of this article.


The distribution of the mean vector

If the data are a random sample from a multivariate normal population, it is well known (see Johnson and Wichern, Applied Multivariate Statistical Analysis, 1992, p. 149; hereafter abbreviated J&W) that the distribution of the sample mean vector is also multivariate normal. There is a multivariate version of the central limit theorem (J&W, p. 152) that says that the mean vector is approximately normally distributed for random samples from any population, provided that the sample size is large enough. This fact can be used to construct simultaneous confidence intervals for the mean.

Recall that the most natural confidence region for a multivariate mean is a confidence ellipse. However, simultaneous confidence intervals are more useful in practice.

Confidence intervals for the mean vector

Before looking at multivariate confidence intervals (CIs), recall that many univariate two-sided CIs are symmetric intervals with endpoints b ± m*SE, where b is the value of the statistic, m is some multiplier, and SE is the standard error of the statistic. The multiplier must be chosen so that the interval has the appropriate coverage probability. For example, the two-sided confidence interval for the univariate mean has the familiar formula xbar ± tc SE, where xbar is the sample mean, tc is the critical value of the t statistic with significance level α and n-1 degrees of freedom, and SE is the standard error of the mean. In SAS, you can compute tc as quantile("t", 1-alpha/2, n-1).
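
For reference, here is a minimal DATA step sketch of that univariate computation. The numeric values of xbar, stderr, and n are hypothetical placeholders; in practice you would read them from a PROC MEANS output data set.

data UnivariateCI;
   xbar   = 10;      /* sample mean (hypothetical value) */
   stderr = 0.5;     /* standard error of the mean (hypothetical value) */
   n      = 25;      /* sample size (hypothetical value) */
   alpha  = 0.05;
   tc     = quantile("t", 1-alpha/2, n-1);   /* critical value of t(n-1) */
   Lower  = xbar - tc*stderr;
   Upper  = xbar + tc*stderr;
run;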

You can construct similar confidence intervals for the multivariate mean vector. I will show two of the approaches in Johnson and Wichern.

Hotelling T-squared statistic

As shown in the SAS documentation, the radii for the multivariate confidence ellipse for the mean are determined by critical values of an F statistic. The Hotelling T-squared statistic is a scaled version of an F statistic and is used to describe the distribution of the multivariate sample mean.

The following SAS/IML program computes the T-squared statistic for a four-dimensional sample. The Sashelp.iris data contains measurements of the size of petals and sepals for iris flowers. This subset of the data contains 50 observations for the species iris Virginica. (If you don't have SAS/IML software, you can compute the means and standard errors by using PROC MEANS, write them to a SAS data set, and use a DATA step to compute the confidence intervals.)

proc iml;
use sashelp.iris where(species="Virginica"); /* read data */
read all var _NUM_ into X[colname=varNames];
close;
 
n = nrow(X);               /* num obs (assume no missing) */
k = ncol(X);               /* num variables */
alpha = 0.05;              /* significance level */
xbar = mean(X);            /* mean of sample */
stderr = std(X) / sqrt(n); /* standard error of the mean */
 
/* Use T-squared to find simultaneous CIs for mean parameters */
F = quantile("F", 1-alpha, k, n-k);  /* critical value of F(k, n-k) */
T2 = k*(n-1)/(n-k) # F;              /* Hotelling's T-squared is scaled F */
m = sqrt( T2 );                      /* multiplier */
Lower = xbar - m # stdErr;
Upper = xbar + m # stdErr;
T2_CI = (xbar`) || (Lower`) || (Upper`);
print T2_CI[F=8.4 C={"Estimate" "Lower" "Upper"}  R=varNames];

The table shows confidence intervals based on the T-squared statistic. The formula for the multiplier is a k-dimensional version of the 2-dimensional formula that is used to compute confidence ellipses for the mean.
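
If you do not have SAS/IML software, here is a sketch of the alternative mentioned earlier: use PROC MEANS to write the means and standard errors to a data set, then compute the same multiplier in a DATA step. The _Mean and _StdErr suffixes come from the AUTONAME option; to keep the sketch short, only the interval for SepalLength is written out.

/* Sketch of the non-IML alternative: PROC MEANS plus a DATA step */
proc means data=sashelp.iris(where=(species="Virginica")) noprint;
   var SepalLength SepalWidth PetalLength PetalWidth;
   output out=Stats mean= stderr= / autoname;   /* Var_Mean, Var_StdErr */
run;
 
data T2_CI;
   set Stats;
   k = 4;                                /* number of variables */
   n = _FREQ_;                           /* sample size from PROC MEANS */
   alpha = 0.05;
   F = quantile("F", 1-alpha, k, n-k);   /* critical value of F(k, n-k) */
   m = sqrt( k*(n-1)/(n-k) * F );        /* T-squared multiplier */
   SepalLength_Lower = SepalLength_Mean - m*SepalLength_StdErr;
   SepalLength_Upper = SepalLength_Mean + m*SepalLength_StdErr;
run;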

Bonferroni-adjusted confidence intervals

It turns out that the T-squared CIs are conservative, which means that they are wider than they need to be. You can obtain a narrower confidence interval by using a Bonferroni correction to the univariate CI.

The Bonferroni correction is easy to understand. Suppose that you have k MVN mean parameters that you want to cover simultaneously. You can do it by choosing the significance level of each univariate CI to be α/k. Why? Because then the joint probability of all the parameters being covered (assuming independence) will be (1 - α/k)^k, and by Taylor's theorem (1 - α/k)^k ≈ 1 - α when (α/k) is very small. (I've left out many details! See J&W p. 196-199 for the full story.)

In other words, an easy way to construct simultaneous confidence intervals for the mean is to compute the usual two-sided CIs for significance level α/k, as follows:

/* Bonferroni adjustment of t statistic when there are k parameters */
tBonf = quantile("T", 1-alpha/(2*k), n-1);  /* adjusted critical value */
Lower = xbar - tBonf # stdErr;              
Upper = xbar + tBonf # stdErr;
Bonf_CI = (xbar`) || (Lower`) || (Upper`);
print Bonf_CI[F=8.4 C={"Estimate" "Lower" "Upper"} R=varNames];

Notice that the confidence intervals for the Bonferroni method are narrower than for the T-squared method (J&W, p. 199).

The following graph shows a scatter plot of two of the four variables. The sample mean is marked by an X. For reference, the graph includes a bivariate confidence ellipse. The T-squared confidence intervals are shown in blue. The thinner Bonferroni confidence intervals are shown in red.

Bonferroni and T-squared simultaneous confidence intervals for the mean of four-dimensional iris data

Compute simultaneous confidence intervals for the mean in SAS

The previous sections have shown that the Bonferroni method is an easy way to form simultaneous confidence intervals (CIs) for the mean of multivariate data. If you want the overall coverage probability to be at least (1 - α), you can construct k univariate CIs, each with significance level α/k.

You can use the following call to PROC MEANS to construct simultaneous confidence intervals for the multivariate mean. The ALPHA= option enables you to specify the significance level. The method assumes that the data are all nonmissing. If your data contain missing values, use listwise deletion to remove them before computing the simultaneous CIs.

/* Bonferroni simultaneous CIs. For k variables, specify alpha/k 
   on the ALPHA= option. The data should contain no missing values. */
proc means data=sashelp.iris(where=(species="Virginica")) nolabels
     alpha=%sysevalf(0.05/4)  /* use alpha/k, where k is number of variables */
     mean clm maxdec=4;
var SepalLength SepalWidth PetalLength PetalWidth;  /* k = 4 */
run;

The values in the table are identical to the Bonferroni-adjusted CIs that were computed earlier. The values in the third and fourth columns of the table define a four-dimensional rectangular region. For at least 95% of random samples drawn from the population of iris Virginica flowers, a region computed in this way will contain the vector of population means.


Discover power laws by log-transforming data

Kepler's third law for planetary bodies

A recent issue of Astronomy magazine mentioned Kepler's third law of planetary motion, which states "the square of a planet's orbital period is proportional to the cube of its average distance from the Sun" (Astronomy, Dec 2016, p. 17). The article included a graph (shown at the right) that shows the period and distance for several planetary bodies. The graph is plotted on a log-log scale and shows that the planetary data falls on a straight line through the origin.

I sometimes see Kepler's third law illustrated by using a graph of the cubed distances versus the squared periods. In a cubed-versus-squared graph, the planets fall on a straight line with unit slope through the origin. Since power transformations and log transformations are different, I was puzzled. How can both graphs be correct?

After a moment's thought, I realized that the magazine had done something very clever. Although it is true that a graph of the squared-periods versus the cubed-distances will CONFIRM the relationship AFTER it has been discovered, the magazine's graph gives insight into how a researcher can actually DISCOVER a power-law relationship in the first place! To discover the values of the exponents in a power-law relationship, log-transform both variables and fit a regression line.


How to discover a power law

Suppose that you suspect that a measured quantity Y is related by a power law to another quantity X. That is, the quantities satisfy the relationship Y^m = A X^n for some integer values of the unknown parameters m and n and a constant A. If you have data for X and Y, how can you discover the values of m and n?

One way is to use linear regression on the log-transformed data. Take the logarithms of both sides to obtain m log(Y) = log(A) + n log(X), and divide by m to get log(Y) = C + (n/m) log(X), where C = log(A)/m is a constant. You can use ordinary least squares regression to estimate values of the constant C and the ratio n/m.
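
A minimal sketch of that idea in SAS follows. The data set Have and the variables X and Y are hypothetical names for your data; the slope estimate from PROC REG estimates the ratio n/m and the intercept estimates log10(A)/m.

/* Sketch: discover a power law Y^m = A*X^n by regressing log(Y) on log(X) */
data PowerLaw;
   set Have;             /* hypothetical input data set */
   logX = log10(X);
   logY = log10(Y);
run;
 
proc reg data=PowerLaw;
   model logY = logX;    /* slope estimates n/m; intercept estimates log10(A)/m */
run;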

Planetary periods and distances

For example, let's examine how a modern statistician could quickly discover Kepler's third law by using logarithms and regression. A NASA site for teachers provides the period of revolution (in years) and the mean distance from the Sun (in astronomical units) for several planetary bodies. Some of the data (Uranus, Neptune, and Pluto) were not known to Kepler. The following SAS DATA step reads the data and computes the log (base 10) of the distances and periods:

data Kepler;
input Name $8. Period Distance;
logDistance = log10(Distance);
logPeriod   = log10(Period);
label logDistance="log(Mean Distance from Sun) (AU)"
      logPeriod  ="log(Orbital Period) (Years)";
datalines;
Mercury   0.241   0.387
Venus     0.616   0.723
Earth     1       1
Mars      1.88    1.524
Jupiter  11.9     5.203
Saturn   29.5     9.539
Uranus   84.0    19.191
Neptune  165.0   30.071
Pluto    248.0   39.457
;

The graph in Astronomy magazine plots distances on the vertical axis and periods horizontally, but it is equally valid to flip the axes. It seems more natural to compute a linear regression of the period as a function of the distance, and in fact this is how Kepler expressed his third law:

The proportion between the periodic times of any two planets is precisely one and a half times the proportion of the mean distances.

Consequently, the following call to PROC REG in SAS estimates the power relationship between the distance and period:

proc reg data=Kepler plots(only)=(FitPlot ResidualPlot);
   model logPeriod = logDistance;
run;

The linear fit is almost perfect. The R2 value (not shown) is about 1, and the root mean square error is 0.0005. The table of parameter estimates is shown. The intercept is statistically insignificant. The estimate for the ratio (n/m) is 1.49990, which looks suspiciously like a decimal approximation for 3/2. Thus a simple linear regression reveals the powers used in Kepler's third law: the second power of the orbital period is proportional to the third power of the average orbital distance.

A modern log-log plot of Kepler's third law

Not only can a modern statistician easily discover the power law, but it is easy to create a scatter plot that convincingly shows the nearly perfect fit. The following call to PROC SGPLOT in SAS creates the graph, which contains the same information as the graph in Astronomy magazine. Notice that I used custom tick labels for the log-scaled axes:

title "Kepler's Third Law";
title2 "The Squared Period Is Proportional to the Cubed Distance";
proc sgplot data=Kepler;
  scatter y=logPeriod x=logDistance / datalabel=Name datalabelpos=bottomright datalabelattrs=(size=12);
  lineparm x=0 y=0 slope=1.5 / legendlabel="log(Period) = 3/2 log(Distance)" name="line";
  yaxis grid values=(-1 0 1 2 3) valuesdisplay=("0.1" "1" "10" "100" "1000") offsetmax=0;
  xaxis grid values=(-1 0 1 2) valuesdisplay=("0.1" "1" "10" "100");
  keylegend "line" / location=inside opaque;
run;
Kepler's Third Law: For a planetary body, the square of the orbital period is proportional to the cube of the mean distance to the sun

Remark on the history of Kepler's third law

This article shows how a modern statistician can discover Kepler's third law using linear regression on log-transformed data. The regression line fits the planetary data to high accuracy, as shown by the scatter plot on a log-log scale.

It is impressive that Kepler discovered the third law without having access to these modern tools. After publishing his first two laws of planetary motion in 1609, Kepler spent more than a decade trying to find the third law. Kepler said that the third law "appeared in [his] head" in 1618.

Kepler did not have the benefit of drawing a scatter plot on a log-log scale because Descartes did not create the Cartesian coordinate system until 1637. Kepler could not perform linear regression because Galton did not describe it until the 1880s.

However, Kepler did know about logarithms, which John Napier published in 1614. According to Kevin Brown, (Reflections on Relativity, 2016) "Kepler was immediately enthusiastic about logarithms" when he read Napier's work in 1616. Although historians cannot be sure, it is plausible that Kepler used logarithms to discover his third law. For more information about Kepler, Napier, and logarithms, read Brown's historical commentary.


Append data to add markers to SAS graphs

Do you want to create customized SAS graphs by using PROC SGPLOT and the other ODS graphics procedures? An essential skill that you need to learn is how to merge, join, append, and concatenate SAS data sets that come from different sources. The SAS statistical graphics procedures (SG procedures) enable you to overlay all kinds of customized curves, markers, and bars. However, the SG procedures expect all the data for a graph to be in a single SAS data set. Therefore it is often necessary to append two or more data sets before you can create a complex graph.

This article discusses two ways to combine data sets in order to create ODS graphics. An alternative is to use the SG annotation facility to add extra curves or markers to the graph. Personally, I prefer to use the techniques in this article for simple features, and reserve annotation for adding highly complex and non-standard features.

Overlay curves


In a previous article, I discussed how to structure a SAS data set so that you can overlay curves on a scatter plot.

The diagram at the right shows the main idea of that article. The X and Y variables contain the original data, which are the coordinates for a scatter plot. Secondary information was appended to the end of the data. The X1 and Y1 variables contain the coordinates of a custom scatter plot smoother. The X2 and Y2 variables contain the coordinates of a different scatter plot smoother.

This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot. You use a SCATTER statement and two SERIES statements to create the graph. See the previous article for details.
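
For reference, a minimal sketch of that overlay, assuming the appended data set is named Wide and contains the variables in the diagram, might look like the following:

/* Sketch: overlay two precomputed smoothers on a scatter plot.
   (X, Y) = original data; (X1, Y1) and (X2, Y2) = smoother coordinates */
proc sgplot data=Wide;
   scatter x=X  y=Y  / legendlabel="Data";
   series  x=X1 y=Y1 / legendlabel="Smoother 1";
   series  x=X2 y=Y2 / legendlabel="Smoother 2";
run;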

Overlay markers: Wide form

In addition to overlaying curves, I sometimes want to add special markers to the scatter plot. In this article I will show how to add a marker that shows the location of the sample mean: use PROC MEANS to create an output data set that contains the coordinates of the sample mean, then append that data set to the original data.


The following statements use PROC MEANS to compute the sample mean for four variables in the SasHelp.Iris data set, which contains the measurements for 150 iris flowers. To emphasize the general syntax of this computation, I use macro variables, but that is not necessary:

%let DSName = Sashelp.Iris;
%let VarNames = PetalLength PetalWidth SepalLength SepalWidth;
 
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname;
run;

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. Thus the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean. As shown in the diagram in the previous section, this enables you to append the new data to the end of the old data in "wide form" as follows:

data Wide;
   set &DSName Means; /* add four new variables; pad with missing values */
run;
 
ods graphics / attrpriority=color subpixel;
proc sgplot data=Wide;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
ellipse x=SepalWidth y=PetalLength / type=mean;
scatter x=SepalWidth_Mean y=PetalLength_Mean / 
         legendlabel="Sample Mean" markerattrs=(symbol=X color=firebrick);
run;
Scatter plot with markers for sample means

The first SCATTER statement and the ELLIPSE statement use the original data. Recall that the ELLIPSE statement draws an approximate confidence ellipse for the mean of the population. The second SCATTER statement uses the sample means, which are appended to the end of the original data, and draws a red marker at the location of the sample mean.

You can use this same method to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.
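
For example, a sketch of the same technique for the sample medians might look like the following. The MEDIAN= keyword and the AUTONAME option produce variables with names such as SepalWidth_Median.

/* Sketch: mark the sample medians instead of the means */
proc means data=&DSName noprint;
var &VarNames;
output out=Medians(drop=_TYPE_ _FREQ_) median= / autoname;
run;
 
data Wide2;
   set &DSName Medians;   /* append; new variables are padded with missing values */
run;
 
proc sgplot data=Wide2;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
scatter x=SepalWidth_Median y=PetalLength_Median / 
         legendlabel="Sample Median" markerattrs=(symbol=X color=darkgreen);
run;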

Overlay markers: Long form

In some situations it is more convenient to append the secondary data in "long form." In the long form, the secondary data set contains the same variable names as in the original data. You can use the SAS data step to create a variable that identifies the original and supplementary observations. This technique can be useful when you want to show multiple markers (sample mean, median, mode, ...) by using the GROUP= option on one SCATTER statement.

The following call to PROC MEANS does not use the AUTONAME option. Therefore the output data set contains variables that have the same names as the input data. You can use the IN= data set option to create an ID variable that distinguishes the original data from the computed statistics:

/* Long form. New data has same name but different group ID */
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean=;
run;
 
data Long;
set &DSName Means(in=newdata);
if newdata then 
   GroupID = "Mean";
else GroupID = "Data";
run;

The DATA step created the GroupID variable, which has the values "Data" for the original observations and the value "Mean" for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option, but does not support multiple PLOT statements, as follows:

ods graphics / attrpriority=none;
proc sgscatter data=Long 
   datacontrastcolors=(steelblue firebrick)
   datasymbols=(Circle X);
plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID;
run;
Scatter plot matrix with markers for sample means

In conclusion, this article demonstrates a useful technique for adding markers to a graph. The technique requires that you concatenate the original data with supplementary data. Appending and merging data is a technique that is used often when creating ODS statistical graphics in SAS. It is a great technique to add to your programming toolbox.


Goodness-of-fit tests: A cautionary tale for large and small samples

In the classic textbook by Johnson and Wichern (Applied Multivariate Statistical Analysis, Third Edition, 1992, p. 164), it says:

All measures of goodness-of-fit suffer the same serious drawback. When the sample size is small, only the most aberrant behaviors will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distributions may be very small and technically unimportant to the inferential conclusions.

In short, goodness-of-fit (GOF) tests are not very informative when the sample size is very small or very large.

I thought it would be useful to create simulated data that demonstrate the statements by Johnson and Wichern. Obviously I can't show "all measures of goodness-of-fit," so this article uses tests for normality. You can construct similar examples for other GOF tests.


Small data: Only "aberrant behaviors" are rejected

As I showed last week, the distribution of a small sample can look quite different from the population distribution. A GOF test must avoid falsely rejecting the bulk of these samples, so the test necessarily rejects "only the most aberrant behaviors."

To demonstrate how GOF tests work with small samples, let's generate four samples of size N=25 from the following populations:

  • A normal N(4,2) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A gamma(4) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A shifted exponential(2) distribution. The population mean and standard deviation are 4 and 2, respectively.
  • A lognormal(1.25, 0.5) distribution. The population mean and standard deviation are 3.96 and 2.11, respectively.

The following SAS DATA step creates the four samples. The Distribution variable identifies the observations in each sample. You can use the SGPANEL procedure to visualize the sample distributions and overlay a normal density estimate, as follows:

data Rand;
call streaminit(1234);
N = 25;
do i = 1 to N;
   Distribution = "Normal     ";
   x = rand("Normal", 4, 2);   output;
   Distribution = "Gamma      ";
   x = rand("Gamma", 4);       output;
   Distribution = "Exponential";
   x = 2 + rand("Expo", 2);    output;
   Distribution = "Lognormal  ";
   x = rand("Lognormal", 1.25, 0.5); output;
end;
run;
 
proc sgpanel data=Rand;
  panelby Distribution / rows=4 layout=rowlattice onepanel novarname;
  histogram x;
  density x / type=normal;
run;
A  panel of histograms for small samples from the normal, lognormal, gamma, and exponential distributions

We know that three of the four distributions are not normal, but what will the goodness-of-fit tests say? The NORMAL option in PROC UNIVARIATE computes four tests for normality for each sample. The following statements run the tests:

ods select TestsForNormality;
proc univariate data=Rand normal;
  class Distribution;
   var x;
run;

The results (not shown) are that the exponential sample is rejected by the tests for normality (at the α=0.05 level), but the other samples are not. The samples are too small for the tests to rule out the possibility that the gamma and lognormal samples might actually be normal. This actually makes sense: the distributions of these samples do not appear to be very different from some of the normal samples in my previous blog post.

Large samples and small deviations from fit

As Johnson and Wichern said, a large sample might appear to be normal, but it might contain small deviations that cause a goodness-of-fit test to reject the hypothesis of normality. Maybe the tails are a little too fat. Perhaps there are too many or too few outliers. Maybe the values are rounded. For large samples, a GOF test has the power to detect these small deviations and therefore reject the hypothesis of normality.

I will demonstrate how rounded values can make a GOF test reject an otherwise normal sample. The following DATA step creates a random sample of size N=5000. The X variable is normally distributed; the R variable is identical to X except values are rounded to the nearest tenth.

data RandBig;
call streaminit(1234);
N = 5000;
do i = 1 to N;
   x = rand("Normal", 4, 2);
   r = round(x, 0.1);        /* same, but round to nearest 0.1 */
   output;
end;
run;

There is little difference between the X and R variables. The means and standard deviations are almost the same. The skewness and kurtosis are almost the same. Histograms of the variables look identical. Yet because the sample size is 5000, the GOF tests reject the hypothesis of normality for the R variable at the α=0.05 significance level. The following call to PROC UNIVARIATE computes the analysis for both X and R:

ods select Moments Histogram TestsForNormality;
proc univariate data=RandBig normal;
var x r;
histogram x r / Normal;
run;

Partial results are shown. The first table is for the variable X. The goodness-of-fit tests fail to reject the null hypothesis, so we correctly conclude that the X variable is consistent with a normal distribution. The second table is for the variable R. The GOF tests reject the hypothesis that R is normally distributed, merely because the values in R are rounded to the nearest 0.1 unit.

Rounded values occur frequently in practice, so you could argue that the variables R and X are not substantially different, yet normality is rejected for one variable but not for the other.

And it is not just rounding that can cause GOF tests to fail. Other small and seemingly innocuous deviations from normality would be similarly detected.

In conclusion, be aware of the cautionary words of Johnson and Wichern. For small samples, goodness-of-fit tests do not reject a sample unless it exhibits "aberrant behaviors." For very large samples, the GOF tests "invariably produce statistically significant lack of fit," regardless of whether the deviations from the target distributions are practically important.


Sampling variation in small random samples

Somewhere in my past I encountered a panel of histograms for small random samples of normal data. I can't remember the source, but it might have been from John Tukey or William Cleveland. The point of the panel was to emphasize that (because of sampling variation) a small random sample might have a distribution that looks quite different from the distribution of the population. The diversity of shapes in the panel was surprising, and it made a big impact on me. About half the histograms exhibited shapes that looked nothing like the familiar bell shape in textbooks.


In this article I recreate the panel. In the following SAS DATA step I create 20 samples, each of size N. I think the original panel showed samples of size N=15 or N=20. I've used N=15, but you can change the value of the macro variable to explore other sample sizes. If you change the random number seed to 0 and rerun the program, you will get a different panel every time. The SGPANEL procedure creates a 4 x 5 panel of the resulting histograms. Click to enlarge.

%let N = 15;
data Normal;
call streaminit(93779);
do ID = 1 to 20;
   do i = 1 to &N;
      x = rand("Normal");   output;
   end;
end;
run;
 
title "Random Normal Samples of Size &N";
proc sgpanel data=Normal;
   panelby ID /  rows=4 columns=5 onepanel;
   histogram x;
run;
Panel of histograms for random normal samples (N=15). Due to sampling variation, some histograms are not bell-shaped.

Each sample is drawn from the standard normal distribution, but the panel of histograms reveals a diversity of shapes. About half of the ID values display the typical histogram for normal data: a peak near x=0 and a range of [-3, 3]. However, the other ID values look less typical. The histograms for ID=1, 19, and 20 seem to have fewer negative values than you might expect. The distribution is very flat (almost uniform) for ID=3, 9, 13, and 16.

Because histograms are created by binning the data, they are not always the best way to visualize a sample distribution. You can create normal quantile-quantile (Q-Q) plots to compare the empirical quantiles for the simulated data to the theoretical quantiles for normally distributed data. The following statements use PROC RANK to create the normal Q-Q plots, as explained in a previous article about how to create Q-Q plots in SAS:

proc rank data=Normal normal=blom out=QQ;
   by ID;
   var x;
   ranks Quantile;
run;
 
title "Q-Q Plots for Random Normal Samples of Size &N";
proc sgpanel data=QQ noautolegend;
   panelby ID / rows=4 columns=5 onepanel;
   scatter x=Quantile y=x;
   lineparm x=0 y=0 slope=1;
   colaxis label="Normal Quantiles";
run;
Panel of quantile-quantile plots for random normal samples (N=15), which shows the sampling variation of samples

The Q-Q plots show that the sample distributions are well-modeled by a standard normal distribution, although the deviation in the lower tail is apparent for ID=1, 19, and 20. This panel shows why it is important to use Q-Q plots to investigate the distribution of samples: the bins used to create histograms can make us see shapes that are not really there. The SAS documentation includes a section on how to interpret Q-Q plots.

In conclusion, small random normal samples can display a diversity of shapes. Statisticians understand this sampling variation and routinely caution that "the standard errors are large" for statistics that are computed on a small sample. However, viewing a panel of histograms makes the concept of sampling variation more real and less abstract.


Highlight forecast regions in graphs

A SAS customer asked how to use background colors and a dashed line to emphasize the forecast region for a graph that shows a time series model. The task requires the following steps:

  • Use the ATTRPRIORITY=NONE option on the ODS GRAPHICS statement so that the line patterns (not just the colors) vary between groups, and use the STYLEATTRS statement to set the line patterns for the plot.
  • Add an indicator variable to the data set that indicates which times are in the "past" (the data region) and which times are in the "future" (the forecast region).
  • Use the BLOCK statement in PROC SGPLOT to add a background color that differentiates the past and future regions.
  • Use the SERIES statement to plot the model forecast and use the GROUP= option to visually differentiate the past predictions (solid line) from future predictions (dashed line).

A simple example

Graph that highlights a region and changes the line pattern

A simple "toy" example is the best way to show the essential features of the desired graph. The following DATA step creates a curve in two regions. For x ≤ 6.28, the curve is a sine curve. For x > 6.28, the curve is linear. These two domains correspond to the "Historical" and "Forecast" levels of the indicator variable BlockID. The graph is shown to the left.

data Example;
pi = constant('pi');
BlockID = "Historical";
do x = 0 to 6.28 by 0.01;
   y = sin(x);   output;
end;
BlockID = "Forecast";
do x = 6.28 to 8 by 0.01;
   y = x - 2*pi;   output;
end;
run;
 
ods graphics / attrpriority=none;
title "Background Colors and Line Styles for Forecast";
proc sgplot data=Example noautolegend;
styleattrs  DATACOLORS=(verylightgrey verylightred) /* region */
            DATALINEPATTERNS=(solid dash)           /* line patterns */;
block x=x block=BlockID / transparency=0.75; 
series x=x y=y / group=BlockID lineattrs=(color=black);
run;

The graph emphasizes the forecast region by using color and a line pattern. The ATTRPRIORITY=NONE option ensures that the line patterns alternate between groups. For details, see Sanjay Matange's article about the interactions between the ATTRPRIORITY= option and the STYLEATTRS statement. For rapidly oscillating models, you might want to use the DOT line pattern instead of the DASH line pattern.

I've previously written about how to use the BLOCK statement to emphasize different regions in the domain of a graph.

Of course, this example is very simplistic. The next section shows how you can apply the ideas to a more realistic example.

A time series example with forecast region highlighted

Many SAS procedures create suitable graphs when you enable ODS GRAPHICS. In particular, many SAS/ETS procedures (such as PROC ARIMA) can create graphs that look similar to this example. The following classic example is taken from the PROC ARIMA documentation. The data are the log-transformed number of passengers who flew on commercial airlines in the US between 1949 and 1961. Based on these data, the ARIMA model forecasts an additional 24 months of passenger traffic.

data seriesg;
   input x @@;
   xlog = log( x );
   date = intnx( 'month', '31dec1948'd, _n_ );
   format date monyy.;
   label xlog="log(passengers)";
datalines;
112 118 132 129 121 135 148 148 136 119 104 118
115 126 141 135 125 149 170 170 158 133 114 140
145 150 178 163 172 178 199 199 184 162 146 166
171 180 193 181 183 218 230 242 209 191 172 194
196 196 236 235 229 243 264 272 237 211 180 201
204 188 235 227 234 264 302 293 259 229 203 229
242 233 267 269 270 315 364 347 312 274 237 278
284 277 317 313 318 374 413 405 355 306 271 306
315 301 356 348 355 422 465 467 404 347 305 336
340 318 362 348 363 435 491 505 404 359 310 337
360 342 406 396 420 472 548 559 463 407 362 405
417 391 419 461 472 535 622 606 508 461 390 432
;
 
proc arima data=seriesg  plots(only)=forecast(forecasts);
   identify var=xlog(1,12);
   estimate q=(1)(12) noint method=ml;
   forecast id=date interval=month out=forearima;
run;
ARIMA model and forecast. Graph produced automatically by PROC ARIMA in SAS/ETS

You can see that the ODS graph uses a dashed line to separate the historical (data) region from the forecast region. However, the graph uses a solid line to display all predicted values, even the forecast.

In the previous PROC ARIMA call, I used the OUT= option on the FORECAST statement to create a SAS data set that contains the predicted values and confidence region. The following DATA step adds an indicator variable to the data:

data forecast;
set forearima;
if date <= '01JAN1961'd then BlockID = "Historical";
else BlockID = "Forecast";
run;

You can now create the modified graph by using the STYLEATTRS, BLOCK, and SERIES statements. In addition, a BAND statement adds the confidence limits for the predicted values. A SCATTER statement adds the data values. The XAXIS and YAXIS statements overlay a grid on the graph.

ods graphics / attrpriority=none;
title "ARIMA Model and Forecast";
proc sgplot data=forecast noautolegend;
styleattrs  DATACOLORS=(verylightgrey verylightred) /* region */
            DATALINEPATTERNS=(solid dot)            /* line patterns */;
block x=date block=BlockID / transparency=0.75; 
band x=date lower=L95 upper=U95;
scatter x=date y=xlog;
series x=date y=Forecast / group=BlockID lineattrs=(color=black);
xaxis grid display=(nolabel);
yaxis grid;
run;
Time series model with color to indicate forecast region and dotted line to indicate forecast values

The final graph is a customized version of the default graph that is created by using PROC ARIMA. The presentation highlights the forecast region by using a different background color and a different line style.

If you are content with this one-time modification, then you are done. If you want to create this graph every time that you run PROC ARIMA, read the SAS/STAT chapter "ODS Graphics Template Modification" or read Warren Kuhfeld's paper about how to modify the underlying template to customize the graph every time that the procedure runs.


Need to log-transform a distribution? There's a SAS function for that!

At a conference last week, a presenter showed SAS statements that compute the logarithm of a probability density function (PDF). The log-PDF is a common computation because it occurs when maximizing the log-likelihood function.

The presenter computed the quantity in SAS by using an expression that looked like
y = LOG(PDF("distribution", x, params));
In other words, he computed the PDF and then transformed the density by applying the LOG function.

There is a better way. It is more efficient and more accurate to use the LOGPDF function to compute the log-PDF directly.


Log-transformed distributions

SAS provides several functions for computing with log-transformed distributions. In addition to the LOGPDF function, you can use the LOGCDF function to compute the logarithm of the cumulative probability for a distribution.

Let's use the gamma distribution to see why the LOGPDF function is usually more efficient. The gamma density function with unit scale parameter is given by
f(x; a) = x^(a-1) exp(-x) / Γ(a)
where Γ(a) is the value of the gamma function evaluated at a.

The log-transform of the gamma density is
log(f(x; a)) = (a-1)log(x) – x – log(Γ(a))
This formula, which is used when you compute LOGPDF("gamma", x, 2), is cheap to evaluate and does not suffer from numerical underflow when x is large. In particular, recall that in double-precision arithmetic, exp(-x) underflows if x > 709.78, so the PDF function cannot support extremely large values of x. In contrast, the formula for the log-PDF does not experience any numerical problems when x is large. The log-PDF function is also very fast to compute.

The following DATA step illustrates how to use the LOGPDF function to compute the log-gamma density. It computes the log-gamma PDF two ways: by using the LOGPDF function and by using the log-transform of the gamma PDF. The results show that the numerical values are nearly the same, but that the PDF function fails for large values of x:

data LogPDF(drop=a);
a = 2;
do x = 100, 700, 720, 1000;
   LogOfPDF = log( pdf("Gamma", x, a) ); 
   LOGPDF = logpdf("Gamma", x, a);        /* faster and more accurate */
   diff = LOGPDF - LogOfPDF;
   output;
end;
label LOGOfPDF = "LOG(PDF(x))";
run;
 
proc print noobs label; run;
Density of the log-gamma(2) distribution for large x

By the way, SAS provides other functions that are optimized for computing the log transformation of several well-known functions and quantities:

  • The LOGBETA function computes the logarithm of the beta function. You should use logbeta(a,b) instead of log(beta(a,b)).
  • The LGAMMA function computes the logarithm of the gamma function. You should use lgamma(a) instead of log(gamma(a)).
  • The LCOMB function computes the logarithm of the number of combinations of n objects taken k at a time. You should use lcomb(n,k) instead of log(comb(n,k)).
  • The LFACT function computes the quantity log(n!) for specified values of n. You should use lfact(n) instead of log(fact(n)).
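
For example, the LFACT function handles arguments for which the factorial itself overflows double precision (roughly n > 170), so LOG(FACT(n)) cannot be evaluated but LFACT(n) can. A short sketch:

/* Sketch: log-factorial for values of n where FACT(n) overflows */
data LogFact;
do n = 10, 170, 200;
   logOfFact = log( fact(n) );   /* overflows for n > 170 */
   LFact     = lfact(n);         /* fine even for large n */
   output;
end;
run;
 
proc print data=LogFact noobs; run;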

In general, when software provides a function for directly computing the logarithm of a quantity, you should use it. The special-purpose function is typically faster, more accurate, and will handle arguments that are beyond the domain of the non-specialized function.


Visualize the ages of US presidents

Who was the oldest person elected president of the United States? How about the youngest? Who was the oldest when he left office? Let's look at some data.

Wikipedia has a page that presents a table of the presidents of the US by age. It lists the dates on which each president entered and left office, including those who died or resigned, and it provides the age of each president on those dates. However, the table is hard to scan, so I decided to import the data into SAS and create a graphic that visualizes the relevant data for all presidents.

Reading the president data into SAS

The first step is to load the data into a SAS data set. There are two SAS tricks that you can use to read the data: You can download the SAS file that reads the dates and computes each president's age upon entering and leaving office.
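
For example, a minimal sketch of the age computation, assuming a hypothetical data set Presidents with date variables BirthDate, InaugDate, and ExitDate, might use the YRDIF function:

/* Sketch: compute each president's age on two dates (hypothetical variable names) */
data Ages;
set Presidents;
AgeAtInaug = yrdif(BirthDate, InaugDate, 'AGE');   /* age when entering office */
AgeAtExit  = yrdif(BirthDate, ExitDate,  'AGE');   /* age when leaving office  */
run;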

A graph that shows the ages of presidents

I was interested in showing the age of each president when he began and left the presidency. I thought about creating a HIGHLOW plot of age versus the president's number (Washington=1 to Trump=45), but I wanted to also show the approximate years that each president served. Consequently, I decided to use a VECTOR plot with scatter plot overlays.

Ages of US Presidents

The adjacent graph (click to enlarge) shows an arrow for each president. The filled marker at the base of each arrow indicates the date and age at which each president assumed the office. The length of the arrow indicates how long each president served. (Most arrows represent four or eight years.) The open marker at the tip of each arrow indicates the date and age at which each president left office.

The graph displays the names of certain presidents: people who became president before turning 50 are labeled along the bottom of the graph, and people who became president after the age of 65 are also labeled. Notice that William Harrison (age 68) lived only a month past his inauguration. He and Garfield are mere blips on the graph. In contrast, the long arrow of FDR's 12-year presidency in the 1930s is easy to spot.
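
A bare-bones sketch of the underlying PROC SGPLOT call, assuming a hypothetical data set PresidentAges that contains the start and end dates and ages for each president, might combine a VECTOR statement with SCATTER overlays:

/* Sketch: one arrow per president from (StartDate, StartAge) to (EndDate, EndAge) */
proc sgplot data=PresidentAges noautolegend;
vector  x=EndDate y=EndAge / xorigin=StartDate yorigin=StartAge;
scatter x=StartDate y=StartAge / markerattrs=(symbol=CircleFilled);
scatter x=EndDate   y=EndAge   / markerattrs=(symbol=Circle);
xaxis grid label="Date";
yaxis grid label="Age (years)";
run;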

Oldest and youngest presidents

What can we learn from the graph?
  • You might have heard the claim that Donald Trump will be the oldest person to be inaugurated. The graph shows that Trump will be older than Ronald Reagan was at his first inauguration in 1981. However, the graph also shows that Ronald Reagan served two terms, so Reagan was older than Trump at his second inauguration. Reagan was almost 78 when he left office, which makes him the oldest person to serve as president!
  • Who was the youngest president? Many people might guess John F. Kennedy. True, he was the youngest to be elected president, but the graph shows that Teddy Roosevelt was about nine months younger when he assumed the office after McKinley's assassination.
  • On several occasions the incoming president was almost exactly the same age as the outgoing president. For example, in 2001 Bill Clinton (age 54.4 years) was succeeded by George W. Bush (age 54.5 years). Bush's arrow appears to be a continuation of Clinton's.
  • In contrast, can you find the biggest age gap between an outgoing president and his successor? Hint: It was a 26-year difference in age!
Do you think you can create a better graph that shows the ages of US presidents? Download the data and post your best ideas!

One informat to rule them all: Read any date into SAS

If you obtain data from web sites, social media, or other unstandardized data sources, you might not know the form of dates in the data. For example, the US Independence Day might be represented as "04JUL1776", "07/04/1776", "Jul 4, 1776", or "July 4, 1776." Fortunately, the ANYDTDTE informat makes it easy read dates like these into SAS.

The ANYDTDTEw. informat is a flexible alternative to older informats such as DATEw., MMDDYYw., and YYMMDDw. If your dates are in a specific form, the older informats work great and serve to document that all dates must be in that standard form. If the dates are not standardized or you need to read a string like "July 4, 1776", the ANYDTDTE informat is a godsend.

The ANYDTDTE informat for reading dates

The following SAS DATA step shows that the ANYDTDTEw. informat combines several older informats into a "super informat" that attempts to convert a character string into a date. The ANYDTDTE informat can not only replace many of the older informats, but it can also be used to convert a string like "Jul 4, 1776" into a date, as follows:

data Dates;
input @1 Style $8.
      @9 Value anydtdte12.;
format Value DATE10.;
datalines;
DATE    04JUL1776
MMDDYY  07041776
MMDDYY  07/04/1776
YYMMDD  17760704 
N/A     Jul 4, 1776
N/A     July 4, 1776
;
 
proc print noobs; run;
Result of using the ANYDTDTE informat to read strings that represent dates

As you can see, the ANYDTDTE informat reads six different strings, but converts all of them to the SAS date value that corresponds to 04JUL1776.

MMDD or DDMM? How does ANYDTDTE interpret ambiguous dates?

The string 07/04/1776 can be interpreted as "April 7, 1776" or "July 4, 1776," depending upon the local convention. Europeans tend to interpret the string as DD/MM/YYYY whereas the US convention is to use MM/DD/YYYY. How does the ANYDTDTEw. informat guess which interpretation might be correct?

The answer is that the informat looks at the DATESTYLE SAS option. By default, the DATESTYLE option uses the LOCALE system option to guess which style to use. You can use PROC OPTIONS to see the value of these options, which are printed to the SAS log:

proc options option=(DATESTYLE LOCALE) value; run;
Option Value Information For SAS Option DATESTYLE
    Value: MDY
    Scope: Default
    How option value set: Locale
 
Option Value Information For SAS Option LOCALE
    Value: EN_US
...

For my system, the DATESTYLE option is set to MDY, which means that the string "07/04/1776" will be interpreted MM/DD/YYYY. If you need to read dates that obey a different convention, you can use the global OPTIONS statement to set the DATESTYLE option:

options DATESTYLE=DMY;    /* change default style convention */
/* Restore default convention: options DATESTYLE=Locale; */
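
For example, here is a minimal sketch that shows how the DATESTYLE= option changes the interpretation of the ambiguous string from the earlier example:

options DATESTYLE=DMY;
data Ambiguous;
d = input("07/04/1776", anydtdte10.);   /* now interpreted as 7 April 1776 */
format d DATE10.;
run;
options DATESTYLE=Locale;               /* restore the default convention */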

Other "ANY" informats in SAS

There are two other SAS informats that are similar to the ANYDTDTE informat:

  • The ANYDTDTMw. informat reads a string and converts it to a SAS datetime value.
  • The ANYDTTMEw. informat reads a string and converts it to a SAS time value.

Here's a tip to help you remember these seemingly cryptic names. The first part of the name is "ANYDT", which means that the input string can be ANY datetime (DT) value. The end of the name refers to the numerical value that is produced by the informat. The resulting value can be a date (DTE), a datetime (DTM), or a time (TME) value. Thus the three informats all have the mnemonic form ANYDTXXX where the XXX suffix refers to the value that is produced.
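
For example, here is a brief sketch that uses the other two informats:

/* Sketch: the datetime and time versions of the ANYDT informats */
data Other;
dt = input("04JUL1776:09:30:00", anydtdtm19.);   /* SAS datetime value */
t  = input("09:30:00",           anydttme9.);    /* SAS time value     */
format dt DATETIME19. t TIME8.;
run;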


Visualize a torus in SAS

This article uses graphical techniques to visualize one of my favorite geometric objects: the surface of a three-dimensional torus. Along the way, this article demonstrates techniques that are useful for visualizing more mundane 3-D point clouds that arise in statistical data analysis.


Define points on a torus

A torus is the product of two circles, so it can be parameterized by two angles, usually called θ and φ. You can create a regular grid of points in the square (θ, φ) ∈ [0, 2π] x [0, 2π] and map the points into Euclidean three-space as shown in the following SAS DATA step:

data Torus;
R = 8;       /* radius to center of tube */
A = 3;       /* radius of tube */
pi = constant('pi');
step = 2*pi/50;
/* create torus as parametric image of [0, 2*pi] x [0,2*pi] */
do theta = 0 to 2*pi-step by step;
   do phi = 0 to 2*pi-step by step;
      x = (R + A*cos(phi)) * cos(theta);
      y = (R + A*cos(phi)) * sin(theta);
      z =      A*sin(phi);
      output;
   end;
end;
keep x y z;
run;
 
title "Projections of a Standard Torus";
proc sgscatter data=Torus;
   matrix x y z;
run;
Projections of torus onto coordinate planes

The SGSCATTER procedure displays the projection of the torus onto the three coordinate planes. In the (x,y) plane you can see that the object has a hole, but the projections onto the other coordinate planes are not very insightful because there is a lot of overplotting.

Rotating scatter plot in SAS/IML Studio

One way to avoid overplotting is to visualize the torus as a 3-D cloud of points. The SAS/IML Studio environment supports a rotating 3-D scatter plot, as shown to the left. In SAS/IML Studio you can interactively rotate the 3-D cloud of points, change the marker colors, and perform other techniques in exploratory data analysis.

Alternatively, if you want to look at planar projections with PROC SGSCATTER, you can rotate the torus so that the projections onto the coordinate planes are not degenerate.

Rotating a cloud of points

My previous article defined SAS/IML functions that create rotation matrices. The following SAS/IML program is almost identical to the program in the previous article, so I will not explain each statement. The program reads the Torus data set, rotates the points in a particular way, and writes the rotated coordinates to a SAS data set.

proc iml;
/* choose rotation matrix as product of canonical rotations */
load module=Rot3D;         /* see http://blogs.sas.com/content/iml/2016/11/07/rotations-3d-data.html */
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");    /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");    /* rotation matrix for (y,z) plane */ 
Ry = Rot3D( pi/6, "Y");    /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;              /* cumulative rotation */
 
use Torus;                 /* read data (points on a torus) */
read all var {x y z} into M;
close Torus;
 
Rot = M * P`;              /* apply rotation and write to data set */
create RotTorus from Rot[colname={"Px" "Py" "Pz"}];
append from Rot;
close;
QUIT;

You can use PROC SGSCATTER to project these rotated points onto the coordinate planes. The TRANSPARENCY= option creates semi-transparent markers that give the illusion of a ghostly see-through torus. This can be an effective technique for visualizing any point cloud, since it enables you to see regions in which overplotting occurs. The statements also use the COLORRESPONSE= option to color the markers by the (original) Z variable. The COLORRESPONSE= option requires SAS 9.4m3.

data Coords;
merge Torus RotTorus;
run;
 
title "Projections of a Rotated Torus";
proc sgscatter data=Coords;
matrix Px Py Pz / markerattrs=(size=12 symbol=CircleFilled)
       transparency=0.75
       colorresponse=Z  colormodel=ThreeColorRamp; /* COLORRESPONSE= requires 9.4m3 */
run;
Projections of rotated torus onto coordinate planes

The transparent markers serve a second function. The torus is composed of a sequence of circles. Therefore, the last circles (near θ = 2π) will obscure the first circles (near θ = 0) if opaque markers are used. The parametric construction is very apparent if you remove the TRANSPARENCY= option.

If you want to plot without transparency, you should sort the data, which is a standard technique in the graphics community where it is called z-ordering or depth sorting. For example, in the (Px,Py) plane you can sort by the Pz variable so that negative Pz values are plotted first and positive Pz values are plotted on top of the earlier markers. Unfortunately, you can only use this sorting trick to plot one pair of coordinates at a time. The following code demonstrates this trick for the (Px, Py) plane:

proc sort data=Coords out=SortCoords;
   by Pz;
run;
 
title "Projection of a Rotated Torus";
title2 "Sorted by Distance Above Coordinate Plane";
proc sgplot data=SortCoords;
   scatter x=Px y=Py / markerattrs=(size=12 symbol=CircleFilled)
           colorresponse=Z  colormodel=ThreeColorRamp;
run;
Projection of torus onto coordinate plane

In conclusion, PROC SGSCATTER enables you to visualize high-dimensional data via projections onto coordinate planes. I demonstrated the techniques on a 3-D torus, but the techniques apply generally to any high-dimensional data. Visualization tricks in this article include:

  • If the projections of the data onto coordinate planes are degenerate, rotate the points.
  • Use semi-transparency to reduce the effect of overplotting.
  • Use the COLORRESPONSE= option to color the markers by one of the variables. This requires SAS 9.4m3.
  • If you do not want to use transparency, sort the data in a direction perpendicular to the projected plane. That makes the rendering look more realistic.

Your high-dimensional point clouds might not look as cool as this torus, but if you use these visualization tips, the data will be easier to interpret and understand. The SGSCATTER procedure in SAS is an easy-to-use routine for investigating relationships among several variables.
