The post Scale a density curve to match a histogram appeared first on The DO Loop.
This article discusses how to scale a probability density curve so that it fits appropriately on a histogram, as shown in the graph to the right. By definition, a probability density curve is scaled so that the area under the curve equals 1. However, a histogram might show counts or percentages for each bin, so it has a different vertical scale. One way to align these two graphical objects is to rescale the density curve. SAS procedures perform this scaling automatically, but it is helpful to understand how the scaling works. This article explains how to rescale a density curve so that it matches the scale of a histogram. Several examples in SAS help to illustrate the concepts.
The formula to rescale a density curve is implicitly included in
the PROC UNIVARIATE documentation.
However, I remember my puzzlement when I first encountered the formulas many years ago.
The formulas in the documentation include two symbols that seem out of place: a symbol for the bin width (h) of the histogram and a symbol (v) that is related to the VSCALE= option on the HISTOGRAM statement. (If you are unfamiliar with the VSCALE= option, see the appendix.)
For example,
the following formula specifies the normal density curve that is overlaid on a histogram when you use the NORMAL option on the HISTOGRAM statement:
\(
p(x) = hv \frac{1}{\sigma \sqrt{2\pi }} \exp \left(-\frac{1}{2} (\frac{x - \mu }{\sigma })^{2}\right)
\)
Without the h and v symbols, the formula is the normal probability density function with mean μ and standard deviation σ. So why are the symbols h and v present? The answer is that you need these factors to rescale the density function to match the histogram's scale. These symbols also appear in the documentation of other probability distributions and in the formula for a kernel density estimate.
Again, I want to emphasize that SAS performs the scaling automatically. SAS procedures do not require you to perform any manual calculations.
A primary use of a histogram is as an empirical estimate for the density of the probability distribution of the data. In physics, we learn that a density is always a ratio. In 3-D, it is mass per unit volume. In 2-D, it is mass per unit area, and in 1-D it is mass per unit length. In statistics, we replace "mass" with "probability." Hence, a one-dimensional probability density for a continuous random variable is probability per unit length.
When histograms are first introduced in school, they are often displayed on the "count scale," which means that the height of each bar represents the number of observations that fall into each histogram bin, which has width h. This scale is used by the histogram at the top of this article. On this scale, the heights of the bars vary between 0 and n, the sample size.
Another popular scale is the proportion scale. Let c[k] be the number of observations in the k-th bin, which is the interval [t, t+h) for some value of t. The proportion c[k]/n estimates the probability that a continuous random variable is in the k-th bin. Therefore, you obtain an estimated probability density by dividing by the length of the bin: (c[k]/n) / h. If f(t) is the underlying probability density function that generates the data, then we expect that (asymptotically) f(t) ≈ c[k]/(nh) when t is inside the k-th bin.
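As a quick numeric check of the relation f(t) ≈ c[k]/(nh), the following sketch (in Python rather than SAS, purely for illustration) bins simulated normal data and compares the scaled bin counts to the true density:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 100_000, 0.25                       # sample size and bin width
x = rng.normal(size=n)                     # simulated N(0,1) data
edges = np.arange(-4.0, 4.0 + h/2, h)
counts, _ = np.histogram(x, bins=edges)    # c[k] = count in the k-th bin

centers = (edges[:-1] + edges[1:]) / 2
est = counts / (n * h)                     # density estimate: c[k]/(n*h)
pdf = np.exp(-centers**2 / 2) / np.sqrt(2*np.pi)   # true density f(t)
# est approximates pdf; the error shrinks as n grows and h shrinks
```

The estimate is close to the true density in every bin, which is the sense in which a histogram is an empirical density estimate.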
That formula enables us to understand the symbols v and h in the PROC UNIVARIATE documentation. The factor v depends on the VSCALE= option: v = n for the count scale (VSCALE=COUNT), v = 1 for the proportion scale (VSCALE=PROPORTION), and v = 100 for the percentage scale (VSCALE=PERCENT). In all cases, for a histogram that has bin width h, you can align the probability density curve with the histogram by multiplying the curve by v*h.
The appearance of v in the formulas makes intuitive sense because you must adjust for the scale of the histogram axis (counts, proportion, or percentage). But what is the intuition behind why the bin width, h, appears as a multiplying factor for the probability density?
To answer that question, let's perform a thought experiment. Suppose you have data that are distributed uniformly on [-1,1], and suppose you create a histogram on the proportion scale for these data. The heights of the bars depend on the bin width: the density of the uniform distribution on [-1,1] is 1/2, so a bin of width h contains approximately the proportion h/2 of the observations. On the proportion scale, therefore, each bar has height approximately h/2. For example, bins of width h=0.2 produce bars of height about 0.1, whereas bins of width h=0.1 produce bars of height about 0.05. This argument shows that for uniform data, you expect the heights of the bars to be proportional to the bin width.
If you prefer to see graphs rather than mathematical arguments, the following SAS statements create a set of uniform data and plot two histograms. The bin width for the first histogram is twice that of the second histogram. Accordingly, the bars for the first histogram are approximately twice as tall.
data Unif;
   call streaminit(1234);
   do i = 1 to 5000;
      x = rand("Uniform", -1, 1);
      output;
   end;
run;

title "Bin Width h=0.2";
proc univariate data=Unif;
   var x;
   histogram x / vscale=proportion endpoints=(-1 to 1 by 0.2) grid odstitle=title;
   ods select histogram;
run;

title "Bin Width h=0.1";
proc univariate data=Unif;
   var x;
   histogram x / vscale=proportion endpoints=(-1 to 1 by 0.1) grid odstitle=title;
   ods select histogram;
run;
Notice that the first histogram has a vertical axis [0, 0.1] whereas the scale of the second histogram is [0, 0.05].
Notice that the proportion scale for a histogram is not the same as the natural scale for a probability distribution. For any probability distribution, the area under the curve is exactly 1. However, you can look at the two graphs and see that the area of the bars is not 1.
This explains why h appears as a multiplicative factor when you scale a probability density. I have only presented an intuitive argument for a uniform distribution, but the argument holds more generally. For example, the following SAS code repeats the experiment for normally distributed data:
data Normal;
   call streaminit(1234);
   do i = 1 to 5000;
      x = rand("Normal", 0, 0.25);
      output;
   end;
run;

title "Bin Width h=0.2";
proc univariate data=Normal;
   var x;
   histogram x / vscale=proportion endpoints=(-1 to 1 by 0.2) grid odstitle=title;
   ods select histogram;
run;

title "Bin Width h=0.1";
proc univariate data=Normal;
   var x;
   histogram x / vscale=proportion endpoints=(-1 to 1 by 0.1) grid odstitle=title;
   ods select histogram;
run;
Again, we see that the height of the histogram is proportional to the bin width, h. When h is bigger, there are fewer bars, but they tend to be taller. The scale on the first histogram is [0, 0.3]. When h is small, there are more bars, each with a smaller height. The scale on the second histogram is [0, 0.15].
For the proportion scale, the total area of the bars is h. Since the area under a probability density curve is 1, you need to multiply the density curve by h to align it with the histogram on the proportion scale.
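To make this concrete, here is a small numeric sketch (in Python, for illustration only) that verifies both claims for normally distributed data: on the proportion scale, the total bar area is approximately h, and the bar heights approximately match h times the density:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)                 # simulated N(0,1) data
results = {}
for h in (0.2, 0.1):
    edges = np.arange(-4.0, 4.0 + h/2, h)
    counts, _ = np.histogram(x, bins=edges)
    prop = counts / n                  # proportion-scale bar heights
    area = np.sum(prop * h)            # total bar area = sum(height * width) ~ h
    centers = (edges[:-1] + edges[1:]) / 2
    scaled_pdf = h * np.exp(-centers**2 / 2) / np.sqrt(2*np.pi)   # h * f(t)
    results[h] = (area, np.max(np.abs(prop - scaled_pdf)))
```

For each bin width, the total bar area is close to h, and the scaled density h*f(t) lies close to the tops of the bars, which is exactly the alignment that the v*h factor achieves.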
The documentation for PROC UNIVARIATE includes formulas for overlaying a probability density function on a histogram. The formulas all contain scaling parameters h and v, which multiply the density function. This article explains why those parameters are present.
The HISTOGRAM statement in PROC UNIVARIATE supports the VSCALE= option. Valid values for the VSCALE= option are COUNT, PROPORTION, and PERCENT. As the name implies, the VSCALE= option specifies the scale of the vertical axis of the histogram.
You can control the vertical scale by using the VSCALE= option. By default, VSCALE=PERCENT. Notice that the shape of a histogram does not change when you modify the VSCALE= option. Only the numbers on the vertical axis change.
]]>The post A bootstrap confidence interval for an R-square statistic appeared first on The DO Loop.
There are many ways to bootstrap in SAS. For regression models, the most common approach is to use case resampling. However, for many statistics that are associated with a linear regression model, there is a convenient shortcut for generating the bootstrap distributions in SAS: Use the MODELAVERAGE statement in PROC GLMSELECT to perform bootstrap resampling.
To illustrate the bootstrap method, the following SAS statements create a sample of size 10,000 observations from the Sashelp.BWeight data set, which contains data about pregnant mothers and their babies. The call to PROC GLMSELECT fits a regression model for predicting the weight of the baby from an intercept and eight effects: seven main effects and one quadratic interaction.
%let N = 10000;   /* sample size */
data BWeight;
   set sashelp.bweight(obs=&N);
   keep Weight Black Married Boy MomSmoke CigsPerDay MomAge MomWtGain;
run;

proc glmselect data=BWeight seed=12345;
   model Weight = Black Married Boy MomSmoke CigsPerDay MomAge MomAge*MomAge MomWtGain
         / selection=none;
   ods select ModelInfo NObs Dimensions FitStatistics;
   ods output FitStatistics=Fit1;
run;
The output shows some information about the data set and the model. There are nine parameters in the model: eight for the specified effects and one for the intercept. The FitStatistics table shows seven common statistics for linear regression models, including the R-square statistic, which is R^{2} = 0.1097 for this model. The goal of this article is to provide a 95% confidence interval for R^{2}.
The bootstrap method can estimate CIs for any of those seven statistics (and for the regression coefficients), so let's put the name of the statistic in a macro variable and put the point estimate into another macro variable. Subsequent statements will refer only to these macro variables, which means that the code in this article can also bootstrap the other fit statistics such as RMSE, Adjusted R-square, AIC, and so forth.
/* put label and value into macro variables */
%let STAT = R-Square;
data _null_;
   set Fit1;
   where Label1="&STAT";
   call symputx("StatVal", nValue1);
run;
%put &STAT = &StatVal;
R-Square = 0.1097377994
As mentioned earlier, you can use the MODELAVERAGE statement in PROC GLMSELECT to obtain a bootstrap distribution of the R-square statistic. I am only going to use a small number (B=1,000) of bootstrap resamples because I want to quickly visualize the bootstrap distribution. Because the sample is large (N=10,000) and because the R-square value is not very extreme (R2=0.11), I suspect that the distribution will look approximately normal and be centered near the observed value of R^{2}. If so, I can use the Olkin-Finn formula to obtain an asymptotic estimate for the CI. If the distribution does NOT appear to be normal, I won't use the Olkin-Finn formula but instead compute additional bootstrap estimates and obtain a bootstrap CI that has more digits of accuracy.
/* Use PROC GLMSELECT for bootstrap estimates and CI of R-square */
%let NumSamples = 1000;   /* number of bootstrap resamples */
proc glmselect data=BWeight seed=12345;
   model Weight = Black Married Boy MomSmoke CigsPerDay MomAge MomAge*MomAge MomWtGain
         / selection=none;
   modelaverage alpha=0.05 sampling=URS nsamples=&NumSamples   /* use URS for bootstrap */
                details tables(only)=(ParmEst);
   ods select ModelAverageInfo AvgParmEst;   /* display only bootstrap summary */
   ods output FitStatistics=FitStats;        /* (SLOW in 9.4!) save bootstrap distribution */
run;

/* from the many bootstrap statistics, keep only the R-square statistics */
data MyBootDist;
   set FitStats(where=(Label1="&STAT"));
   label nValue1="&STAT";
run;

title "Bootstrap Distribution of R-Square";
title2 "N=&N; Num Boot Sample=&NumSamples";
proc sgplot data=MyBootDist noautolegend;
   histogram nValue1;
   density nValue1 / type=normal(mu=&StatVal);
   refline &StatVal / axis=x label="Observed R^2" lineattrs=(color=red);
run;
On the MODELAVERAGE statement, I use SAMPLING=URS to obtain bootstrap samples (samples with replacement). I use the DETAILS option to produce an R-square statistic for each bootstrap sample. The ODS OUTPUT statement writes the 1,000 FitStatistics tables to a data set named FitStats that has 7*1000 observations. I am only interested in the R-square statistic, so I use a DATA step to subset the data. I then call PROC SGPLOT to display a histogram and overlay a normal density curve that is centered at the observed value of R^{2}. The graph indicates that the distribution of the R-square statistic appears to be normal, although there seems to be a small bias between the observed statistic and the bootstrap estimate.
Let's use the percentile method to estimate a 95% bootstrap confidence interval. The lower limit of the CI is estimated by the 2.5th percentile of the bootstrap estimates. The upper limit is estimated by the 97.5th percentile. You can use PROC UNIVARIATE to estimate these quantiles:
/* Use approx sampling distribution to make statistical inferences */
proc univariate data=MyBootDist noprint;
   var nValue1;
   output out=Pctl mean=BootR2 std=BootSE   /* compute 95% bootstrap confidence interval */
          pctlpre =Boot_
          pctlpts =2.5 97.5
          pctlname=LCL UCL;
run;

proc print data=Pctl noobs; run;
According to this small bootstrap computation, an estimate for the standard error of R-square is 0.006. An estimate for a 95% confidence interval is [0.098, 0.123].
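The percentile computation itself is simple. The following sketch (in Python, for illustration; the simulated vector is a hypothetical stand-in for the bootstrap estimates in the FitStats data set) shows the essence of the method:

```python
import numpy as np

rng = np.random.default_rng(12345)
# hypothetical stand-in for the 1,000 bootstrap R-square estimates
# (roughly normal around 0.11 with standard error 0.006, as in this example)
boot_r2 = rng.normal(loc=0.11, scale=0.006, size=1000)

boot_se = boot_r2.std(ddof=1)                   # bootstrap estimate of the standard error
lcl, ucl = np.percentile(boot_r2, [2.5, 97.5])  # percentile-method 95% CI
```

The 2.5th and 97.5th sample percentiles of the bootstrap statistics are the lower and upper confidence limits; no normality assumption is required.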
Because the distribution seems to be approximately normal, the next section computes the same quantities by using the Olkin-Finn formula.
The Olkin-Finn formula is an asymptotic (large sample) formula that assumes that the R-square statistic is normal and centered at the observed value of R^{2}. Both of these assumptions appear to be satisfied here. To apply the Olkin-Finn formula to obtain a 100(1-α)% confidence interval, you need to supply the following information: the sample size (n), the number of parameters in the model, not counting the intercept (k), the observed R^{2} statistic, and the significance level (α).
The following DATA step applies the Olkin-Finn formula:
/* apply the Olkin-Finn formula */
data AsympR2;
   alpha = 0.05;    /* obtain 100*(1-alpha)% confidence interval */
   n = &N;          /* sample size */
   k = 8;           /* number of parameters, not including intercept */
   R2 = &StatVal;   /* observed R^2 for sample */
   SE = sqrt((4*R2*((1-R2)**2)*((n-k-1)**2))/((n**2-1)*(n + 3)));
   /* use normal quantile to estimate CI (could also use t quantile) */
   z = quantile("Normal", 1-alpha/2);
   OF_LCL = R2 - z*SE;
   OF_UCL = R2 + z*SE;
run;

title "Asymptotic Confidence Interval for R-Square";
proc print noobs data=AsympR2;
   var R2 SE OF_LCL OF_UCL;
run;
For this regression model, the Olkin-Finn estimate of the standard error of R-square is 0.006. An estimate for a 95% CI is [0.098, 0.121]. This estimate for the standard error is the same as the bootstrap estimate. The lower confidence limit is also the same. The upper confidence limits are almost the same (0.123 versus 0.121). Thus, the Olkin-Finn estimates and the bootstrap estimates are very close to each other for these data and for this model.
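If you want to double-check the Olkin-Finn arithmetic outside of SAS, the formula is easy to evaluate directly. The following sketch (in Python, using the values from this example) reproduces the standard error and the normal-quantile CI:

```python
from math import sqrt

# Olkin-Finn standard error and normal-quantile CI, using the values from
# this example: n = 10000 observations, k = 8 parameters, observed R^2 ~ 0.1097
n, k, R2 = 10_000, 8, 0.1097
z = 1.959964   # 97.5th percentile of the standard normal distribution
SE = sqrt((4*R2*(1-R2)**2*(n-k-1)**2) / ((n**2 - 1)*(n + 3)))
LCL, UCL = R2 - z*SE, R2 + z*SE
print(round(SE, 3), round(LCL, 3), round(UCL, 3))   # 0.006 0.098 0.121
```

The hand computation agrees with the PROC PRINT output to three decimal places.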
The Olkin-Finn formula should be used for large samples. What happens if we reduce the size of the sample? To investigate this question, change the sample size by specifying the macro value of N:
%let N = 500;   /* new sample size */
The rest of the program does not need to be modified. When you rerun the program, you get the following histogram that shows the bootstrap distribution versus a normal approximation centered at the observed R^{2} value.
For this sample, there is a large bias between the observed R^{2} value and the mean of the bootstrap distribution. (The distribution also appears to be slightly skewed to the right.) The Olkin-Finn formula does not account for that bias because the Olkin-Finn interval is centered at the observed R^{2} value. Whereas the bootstrap estimate of the CI is [0.072, 0.156], the Olkin-Finn estimate is [0.048, 0.144], which is shifted to the left. In this case, I would recommend using a bootstrap estimate of the CI instead of using the Olkin-Finn estimate.
This article shows how to use the MODELAVERAGE statement in PROC GLMSELECT in SAS to obtain bootstrap estimates for the standard error and confidence interval for the R-square statistic in a linear regression model. With a minor modification, the same program enables you to bootstrap other common fit statistics, such as RMSE or AIC. The bootstrap estimates make very few assumptions and are valid for both large and small samples.
The article also compares the bootstrap estimates to estimates from the Olkin-Finn formula. The Olkin-Finn estimates are intended for large samples and assume that the sampling distribution of the R-square statistic is normal. Furthermore, bias can be an issue when applying the Olkin-Finn formula. I suggest performing a small bootstrap analysis first to check if the bootstrap distribution of the R-square statistic appears to be normal and is centered at the observed value.
]]>The post The distribution of the R-square statistic appeared first on The DO Loop.
In general, when you receive advice on the internet, you should always scrutinize the information. In this case, the Olkin-Finn formula does not match my intuition and experience. The formula assumes that the sampling distribution of the R-square statistic is normal, but that is not correct in all cases. For simple one-variable regression, the R-square value is equal to the squared Pearson correlation between the explanatory variable, X, and the response variable, Y. Furthermore, the distribution of the correlation statistic is fairly complicated and is not normal.
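That claim about simple regression is easy to verify numerically. The following sketch (in Python, for illustration) fits a one-variable least-squares model and confirms that R^{2} equals the squared Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5*x + rng.normal(size=200)     # simulated linear relationship

b1, b0 = np.polyfit(x, y, 1)         # one-variable least-squares fit
resid = y - (b0 + b1*x)
r2 = 1 - resid.var() / y.var()       # regression R-square = 1 - SSE/SST
r = np.corrcoef(x, y)[0, 1]          # Pearson correlation between X and Y
# r2 equals r**2, up to floating-point error
```

This identity is exact for simple linear regression with an intercept, which is why the sampling distribution of R^{2} inherits the non-normal behavior of the correlation statistic.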
To better understand the sampling distribution of the R-square statistic and when it is approximately normal, I ran a simulation study. This article presents the results of the simulation study and discusses when to use and when to avoid using the Olkin-Finn formula.
A confidence interval is based on the sampling distribution of a statistic on random samples of data. Often (but not always) the sampling distribution depends on the distribution of the data. In order to obtain a formula, researchers often make simplifying assumptions. Common assumptions include that the data are multivariate normal, that the model is correctly specified, that the sample is very large, and so forth. The last assumption (large sample) is often used to obtain formulas that are asymptotically valid as the sample size increases. I eventually found a reference that states that the Olkin-Finn formula should not be used for small samples (degrees of freedom less than 60), so it is definitely an asymptotic formula.
By definition, the R^{2} value is in the interval [0,1]. As we will see, the distribution of the R^{2} statistic is highly skewed when R^{2} is near 0 or 1. Consequently, the distribution isn't normal, and you shouldn't use the Olkin-Finn formula when R^{2} is near 0 or 1. However, "nearness" depends on the sample size: when the sample size is large you can use the formula on a wider interval than when the sample size is small.
The best way to understand the distribution of the R^{2} statistic is to use a simulation. For data, I will choose a bivariate normal (BVN) sample for which the Pearson correlation is ρ. As mentioned earlier, if (X,Y) ~ BVN(0, ρ), then the regression of Y onto X has an R^{2} value equal to the squared correlation between X and Y (=ρ^{2}). For simplicity, I will break the simulation study into two procedure calls. First, for each value of ρ, use PROC IML to simulate 1,000 samples of size n=100 from BVN(0,ρ). Then, I will use PROC REG to run a regression on each sample and save the R^{2} statistic into an output data set.
/* output a simulation data set: For each rho, generate 1,000 random samples
   of size N for which (X,Y) are bivariate normal with correlation rho */
%let NumSamples = 1000;
%let N = 100;
proc iml;
N = &N;   /* sample size */
mu = {0 0};
Sigma = I(2);
rhoList = round( do(0, 0.9, 0.1), 0.1) || 0.95;
rho = j(N,1);  SampleID = j(N,1);
create Sim var {'rho' 'SampleID' 'Y' 'X'};
do i = 1 to ncol(rhoList);
   Sigma[{2 3}] = rhoList[i];   /* set off-diagonal elements of cov matrix */
   rho[,] = rhoList[i];
   do ID = 1 to &NumSamples;
      Z = randnormal(N, mu, Sigma);
      X = Z[,1];  Y = Z[,2];
      SampleID[,] = ID;
      append;
   end;
end;
close;
QUIT;

/* run a procedure to obtain statistics for each random sample. In this case,
   obtain the R-square statistic for the regression model Y = b0 + b1*X */
proc reg data=Sim noprint outest=PE;
   by rho SampleID;
   model Y = X / rsquare;   /* add R-square statistic to OUTEST= data set */
quit;
At this point, the PE data set contains a variable named _RSQ_. There is a value of _RSQ_ for each sample, and there are 1,000 samples for each ρ value. To visualize the distribution of the R-square statistic, you can use box plots, as follows:
/* plot the distribution of the R-square statistics for each rho */
proc sgplot data=PE noautolegend;
   vbox _RSQ_ / category=rho nooutliers;
   scatter x=rho y=_RSQ_ / transparency=0.8 jitter;
   yaxis grid min=0 max=1;
run;
For n=100, the box plots show that the distribution of the R^{2} statistic is skewed except when ρ is near 1/sqrt(2) ≈ 0.71. Equivalently, ρ^{2} is near 0.5, which is as far as ρ^{2} can be from 0 and 1. To make the distributions easier to visualize, you can create a panel of histograms:
title "Distribution of R-Square Statistics";
title2 "n = &n";
proc sgpanel data=PE noautolegend;
   where rho in (0, 0.2, 0.4, 0.6, 0.8);
   panelby rho / rows=5 onepanel uniscale=column;
   histogram _RSQ_ / binwidth=0.02 binstart=0.01;
   density _RSQ_ / type=kernel;
   colaxis min=0 offsetmin=0.01;
run;
The graph demonstrates that for a small sample (n=100), the distribution of the R^{2} statistic is not normal when R^{2} is near 0. However, a normal approximation seems reasonable for ρ=0.6 and ρ=0.8. If you increase the sample size to, say, n=500, and rerun the program, you will notice that the R^{2} distributions become narrower and more normal. The distribution remains skewed for extreme values of ρ.
This article reminds us that most formulas have assumptions behind them. Sometimes formulas are applicable to our data, and sometimes they are not. In this article, I use simulation to visualize the sampling distribution of the R^{2} statistic for a simple linear regression on a small set of bivariate normal data. The simulations suggest that you should use care when using the Olkin-Finn formula for small samples. The Olkin-Finn formula is asymptotic and assumes large n. Even for large samples, the formula assumes that the sampling distribution of R-square is approximately normal, which means that you shouldn't use it when R^{2} is near 0 or near 1.
I have no issues with using the Olkin-Finn formula for large data sets when the observed R-square statistic is not too extreme. But what should you do to obtain a confidence interval for small and mid-sized data? One possibility is to use a bootstrap analysis to estimate a confidence interval. In SAS, the GLMSELECT procedure supports the MODELAVERAGE statement, which enables you to perform a simple bootstrap analysis of many regression statistics. A bootstrap analysis does not require large samples and can correctly handle situations where the sampling distribution of R-square is skewed.
As discussed in this article, the Olkin-Finn formula provides an asymptotic confidence interval for R-square. The following text is from p. 88 of Cohen et al., 2003:
The text uses R_{2} for the observed estimate, which is slightly confusing, but I will stick with their notation.
If R^{2} is the point estimate for a linear regression that has k parameters (not counting the intercept)
and the sample has n observations, the formula approximates the sampling distribution by a normal distribution that is centered at
R^{2} and has a standard deviation that is equal to
SE = sqrt( (4 R^{2} (1-R^{2})^{2} (n-k-1)^{2}) / ((n^{2}-1)(n + 3)) )
To obtain a 100(1-α)% confidence interval, you can use a normal quantile for the α level, or you can use a quantile from a t distribution with n-k-1 degrees of freedom. The following SAS DATA step shows both estimates:
/* Online citation for formula:
   Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
   Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.)
   The formula for the squared standard error of R^2 seems to originate from
   Olkin and Finn (1995), "Correlations Redux", Psychological Bulletin, 118(1), pp. 155-164.
   This is an asymptotic interval, so best for n - k - 1 > 60.
   For simple 1-var regression, R-sq = corr(X,Y)**2, so for
      rho = 0.2, R-sq = 0.04
      rho = 0.6, R-sq = 0.36
*/
%let N = 50;   /* smaller than recommended */
data AsympR2;
   alpha = 0.05;   /* obtain 100*(1-alpha)% confidence interval */
   n = &N;         /* sample size */
   k = 1;          /* number of parameters, not including intercept */
   do R2 = 0.04, 0.36;   /* observed R^2 for sample */
      SE = sqrt((4*R2*((1-R2)**2)*((n-k-1)**2))/((n**2-1)*(n + 3)));
      /* use normal quantile to estimate CI (too conservative) */
      z = quantile("Normal", 1-alpha/2);
      LCL_N = R2 - z*SE;
      UCL_N = R2 + z*SE;
      /* use t quantile to estimate CI */
      t = quantile("T", 1-alpha/2, n-k-1);
      LCL_T = R2 - t*SE;
      UCL_T = R2 + t*SE;
      output;
   end;
run;

title "Asymptotic Confidence Interval (Normal Quantile)";
proc print noobs data=AsympR2;
   var SE z LCL_N UCL_N;
   ID R2;
run;

title "Asymptotic Confidence Interval (t Quantile)";
proc print noobs data=AsympR2;
   var SE t LCL_T UCL_T;
   ID R2;
run;
Notice that a naïve application of the Olkin-Finn formula can lead to a confidence interval (CI) that includes negative values of R-square, although R-square can never be negative. You can compare the two examples to the panel of histograms that is shown earlier in this article. When ρ=0.2, the R-square value is 0.04; when ρ=0.6, the R-square value is 0.36. You can see how the Olkin-Finn CI compares to the true distribution in the histograms.
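You can verify this pitfall with a direct computation. The following sketch (in Python, using the same values as the DATA step above) shows that the naïve normal-quantile CI for R^{2}=0.04 extends below zero:

```python
from math import sqrt

# Naive Olkin-Finn CI with a normal quantile, for the small sample (n=50, k=1)
# used in the DATA step above
n, k, z = 50, 1, 1.959964
cis = {}
for R2 in (0.04, 0.36):     # R2 = rho**2 for rho = 0.2 and rho = 0.6
    SE = sqrt((4*R2*(1-R2)**2*(n-k-1)**2) / ((n**2 - 1)*(n + 3)))
    cis[R2] = (R2 - z*SE, R2 + z*SE)
# for R2 = 0.04, the lower limit is negative, although R-square cannot be negative
```

For R^{2}=0.36, the interval stays positive, which is consistent with the more symmetric histogram for ρ=0.6.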
]]>The post Visualize a multivariate regression model when using spline effects appeared first on The DO Loop.
This article explains how to score the model on new data and how to visualize the predicted values by plotting them against the explanatory variable that you used to generate the spline effects. Although the model uses spline effects, everything in this article also applies to models that do not use splines.
I have previously discussed how to use the EFFECT statement in SAS regression procedures to build regression models that use cubic splines to model nonlinear relationships between the response variable and an explanatory variable. For procedures that do not support an EFFECT statement, you can write the spline basis functions to a data set and then use those effects as explanatory variables in a model.
Visualizing a regression model that has multiple covariates can be a challenge because the "obvious" visualization does not work! The "obvious" (but wrong!) visualization uses an OUTPUT statement to write the predicted values to a data set, then plots the predicted values versus one of the continuous variables. But the following example shows that this naïve approach does not give you an effective visualization when there are multiple covariates:
/* sort the data and choose a subset of 4-, 6-, and 8-cylinder vehicles */
proc sort data=sashelp.cars out=cars(where=(Type^="Hybrid" and Cylinders in (4,6,8)));
   by EngineSize;
run;

/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures:
   GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
proc glmselect data=cars;
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   class Cylinders;
   model mpg_city = spl Cylinders Weight / selection=none;   /* include categorical and continuous covariates */
   output out=GLSOut pred=p_MPG_City;   /* try to use predicted values on original data */
quit;

title "Predicted Values at Data Values from OUTPUT Statement";
title2 "This Output Is Not Useful!";
proc sgplot data=GLSOut;
   scatter x=EngineSize y=MPG_City;
   series x=EngineSize y=p_MPG_City / group=Cylinders;
   xaxis grid;
   yaxis grid label="Predicted MPG (City)";
run;
This graph looks strange, but the strangeness is not because the model uses splines. Every multivariate regression model looks similar when you project the predicted values onto one of the covariates. The reason is that two observations that have the same (or nearly the same) value for the horizontal variable in the graph (EngineSize, in this example) might have very different values for one of the other covariates. This means that for every horizontal value, there are multiple response values. Consequently, when you use a SERIES statement to plot the predicted values, the line looks jagged as it traces out multiple Y values for each X value.
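A tiny numeric example (in Python, with made-up numbers) illustrates the issue: two observations with identical values of the plotting variable can have different predicted values because a second covariate differs:

```python
import numpy as np

# Hypothetical two-covariate linear model: pred = b0 + b1*x1 + b2*x2
x1 = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])               # duplicated "EngineSize"-like values
x2 = np.array([100.0, 300.0, 100.0, 300.0, 100.0, 300.0])   # second covariate differs
beta = np.array([2.0, 5.0, -0.01])                          # made-up coefficients
X = np.column_stack([np.ones(len(x1)), x1, x2])
pred = X @ beta
# x1[0] == x1[1], yet pred[0] != pred[1]: plotting pred versus x1 with a
# connected line therefore draws vertical jumps (a "jagged" curve)
```

Any line plot of pred versus x1 must trace both predicted values at each repeated x1 value, which is exactly the jaggedness in the graph above.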
You can see the range of predicted values by using PROC PRINT. The following statements display the predicted values for observations at the extreme right edge of the graph:
proc print data=GLSOut;
   where EngineSize=6.0;
   var Make Model EngineSize Cylinders Weight MPG_City p_MPG_City;
run;
The table shows that the Weight values are different for the five vehicles that have 6.0-liter engines. Because the Weight variable is included in the model, the predicted values are also different. Thus, when EngineSize=6, the SERIES statement draws a vertical segment that spans the range [9.84, 14.34] in the vertical direction. The same explanation is valid for other values of EngineSize, which leads to the jagged line.
The solution to this problem is to use a sliced fit plot. In a sliced fit plot, you evaluate the model at a set of points that varies only one continuous variable: a sequence of values for the continuous variable on the X axis, a fixed reference value for each of the other continuous covariates, and (optionally) each level of a classification variable.
The good news is that there is a straightforward way to obtain a sliced fit plot in SAS! Many SAS regression procedures support the STORE statement, which enables you to create an item store for the model. You can use the EFFECTPLOT statement in PROC PLM to create a sliced fit plot. For more control over the visualization, you can use the SCORE statement in PROC PLM to evaluate the model at any set of points that you want.
This works because the item store contains not only the parameter estimates for the spline basis, but also the transformation that generates the spline effects from the original explanatory variable.
For example, the following model includes spline effects. The program uses the STORE statement to save the model to an item store, then uses PROC PLM to plot the predicted values against the EngineSize variable for various values of the covariates:
proc glmselect data=cars;
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   class Cylinders Origin;
   model mpg_city = spl Cylinders Weight / selection=none;   /* include categorical and continuous covariates */
   store out=GLSModel;
run;

title "Predicted Values at Specified Values from PROC PLM";
title2 "Model Uses Splines";
proc plm restore=GLSModel;
   effectplot slicefit(x=EngineSize sliceby=Cylinders) / at(Weight=3500);
run;
Notice the information that you can supply to the SLICEFIT keyword in the EFFECTPLOT statement: the X= option specifies the variable for the horizontal axis (EngineSize), the SLICEBY= option specifies a classification variable (Cylinders), which produces one curve for each level, and the AT option specifies fixed values for the other covariates (Weight=3500).
You can also use PROC PLM to score the model on a new set of data. This is useful when you want complete control over the values at which to evaluate the model. To score the model at arbitrary locations, generate the scoring data and use the SCORE statement in PROC PLM, as follows:
/* Create a scoring data set. For cars, mean(Weight)=3570 and
   median(Weight)=3473, so use 3500 as "round" value. */
data ScoreIn;
   Weight = 3500;              /* a value close to the mean */
   do Cylinders = 4,6,8;       /* important modes */
      do EngineSize = 1.5 to 6 by 0.1;
         output;
      end;
   end;
run;

/* score the model by using PROC PLM */
proc plm restore=GLSModel;
   score data=ScoreIn out=ScoreOut;
quit;
You can use the ScoreOut data set to create your own sliced fit plot by using PROC SGPLOT. Scoring the data enables you to overlay observations, add labels, and enhance the plot in other ways. For example, you can combine the predicted values with the original data and overlay them, as follows:
/* you can combine the original data and the predicted values to visualize the sliced model */
data Combine;
   set cars ScoreOut(rename=(
       Weight     = scoreWeight
       Cylinders  = scoreCylinders
       EngineSize = scoreEngineSize ));
run;

title "Model Scored at Weight=3500";
proc sgplot data=Combine;
   scatter x=EngineSize y=MPG_City / markerattrs=(symbol=CircleFilled)
           colorresponse=Weight colormodel=(yellow red);
   series x=scoreEngineSize y=Predicted / group=scoreCylinders name="pred" nomissinggroup;
   xaxis grid;
   yaxis grid label="MPG (City)";
   gradlegend / position=bottom;
   keylegend "pred" / title="Cylinders" across=1 position=right;
run;
This graph shows a potential pitfall of evaluating a model at the mean or median of a covariate. For these data, the EngineSize and Weight variables are highly correlated (r=0.83). A consequence of that fact is that small vehicles tend to have small engines and small weights; large vehicles tend to have large engines and large weights. (There are also sports cars, which have large engines and small weights.) When you evaluate the model at the value Weight=3500 and overlay the predicted values on the data, the fit does not seem to pass through the bulk of the data. It under-predicts the MPG for small cars and over-predicts the MPG for large cars. This happens because many vehicles have weights that are far from 3,500 pounds. You should not expect the predicted MPG_City (which is for the reference value, Weight=3500) to be close to the observed MPG_City when the observed weight is far from the reference value.
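If you want to verify the correlation for yourself, a quick check (a sketch; it assumes the same Cars data that was used to fit the model) is:

/* check the correlation between EngineSize and Weight */
proc corr data=cars noprob;
   var EngineSize Weight;
run;

The NOPROB option suppresses the p-values so that the output shows only the correlation estimates.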
This article shows how to use the EFFECTPLOT statement in PROC PLM to visualize multivariate regression models, with an emphasis on models that include spline effects. The EFFECTPLOT SLICEFIT statement enables you to specify the variable to use for the X axis and the values of the covariates at which to evaluate the model. Optionally, you can specify a classification variable to get several curves, one for each level of the classification variable.
In a similar way, you can use the SCORE statement in PROC PLM to evaluate the model at an arbitrary set of values for the explanatory variable. When used correctly and in conjunction with PROC SGPLOT, this gives you absolute control over the visualization of multivariate regression models.
The ideas in this article also apply to models that do NOT use spline effects.
The post Visualize a multivariate regression model when using spline effects appeared first on The DO Loop.
]]>The post Find the label of a variable in SAS appeared first on The DO Loop.
]]>However, you can easily look up a label. This article shows how to look up a variable's label and copy it into a macro variable so that it can be used in reports and graphs. I recently used this technique to display a variable's label on the axis of a graph.
Did you know that there are DATA step functions in SAS that get the attributes of variables?
For example, the following DATA step uses the VLABEL function, followed by a call to the SYMPUT subroutine to copy the value into a macro variable.
%let DSName = Sashelp.Iris;   /* data set name */
%let Var = SepalLength;       /* response variable */

/* look up the label of the variable; create a macro variable */
data _null_;
   set &DSName(obs=1);         /* load attributes of variables */
   label = vlabel(&Var);       /* get the variable's label */
   call symput("Labl", label);
run;

%put &=Labl;
LABL=Sepal Length (mm)
The call to the VLABEL function is straightforward. My implementation reads one observation from the input data. There is, in fact, a sneaky way to load the attributes without reading ANY observations, but it is somewhat esoteric. For the interested reader, I direct you to the common SAS trick for getting the number of observations into a macro variable, which uses an "IF 0 THEN SET" clause to set the variable attributes and uses a STOP statement to handle the fact that the DATA step never encounters an end-of-file condition. See the link for details.
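For reference, here is a sketch of that trick applied to the label lookup. (This sketch is my illustration, not code from the linked article.) The IF 0 THEN SET statement copies the variable attributes into the program data vector at compile time without reading any observations, and the STOP statement prevents the DATA step from iterating:

/* sketch: get the label without reading any observations */
data _null_;
   if 0 then set &DSName;      /* compile time: load variable attributes only */
   label = vlabel(&Var);       /* attributes are available at run time */
   call symput("Labl", label);
   stop;                       /* prevent the DATA step from iterating */
run;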
The previous DATA step is short and easy to understand, but I can hear some of my readers exclaim, "But wait! What about DICTIONARY tables?"
Experienced SAS programmers might use DICTIONARY tables for this task. You can get metadata about variables (formats, lengths, labels, etc.) by using PROC SQL to examine the Dictionary.columns table, or by using the DATA step and the Sashelp.vcolumn table. DICTIONARY tables are powerful resources, but the VLABEL function is much simpler in this instance. For a comparison, I include an Appendix that demonstrates how to use dictionaries.
You can use the VLABEL function in a SAS DATA step to get the label for a variable. Similar functions exist to get other variable attributes, such as formats (VFORMAT), type (VTYPE), and name (VNAME). These functions provide a simple way to get a variable's metadata. An example that uses DICTIONARY tables is shown in the Appendix.
If you have ever been to a SAS conference, you know that DICTIONARY tables are a favorite topic for SAS programmers. DICTIONARY tables are read-only tables that provide information about the state of the SAS session, including libraries, data sets, variables, and system options.
For completeness, here is some DATA step code that gets the label for a variable. The first DATA step extracts the libref and the data set name from a two-level specification of the data set. In some situations, you can skip this step.
%let DSName = Sashelp.Iris;   /* two-level data set name */
%let Var = SepalLength;       /* response variable */

/* if necessary, extract libname and member name from two-level name */
data _null_;
   k = find("&DSName", ".");                      /* find the position of the dot */
   if k > 0 then
      lib = upcase(substr("&DSName", 1, k-1));    /* if two-level name */
   else
      lib = "WORK";                               /* if one-level name */
   mem = upcase(substr("&DSName", k+1));
   call symput("Lib", lib);    /* LIB contains the libref */
   call symput("Mem", mem);    /* MEM contains the data set name */
run;
The result of the previous DATA step is to parse a two-level data set name and create two macro variables: LIB contains the libref and MEM contains the name of the data set. You can use these values to access the DICTIONARY table that contains information about columns. Here is the DATA step approach:
/* look up the label of the original variable; create macro variable */
data _null_;
   set Sashelp.vcolumn(where=(libname="&Lib" and memname="&Mem" and
                              upcase(name)=upcase("&Var")));
   call symput("Labl", label);
   put label;
run;

%put &=Labl;
If you prefer, you can also use PROC SQL to obtain this information:
/* Alternative: use PROC SQL and the DICTIONARY tables */
proc sql;
   select label into :Labl
   from dictionary.columns
   where libname="&Lib" and memname="&Mem" and upcase(name)=upcase("&Var");
quit;

%put &=Labl;
]]>The post Create filled density plots in SAS appeared first on The DO Loop.
]]>A SAS programmer wanted to visualize density estimates for some univariate data. The data had several groups, so he wanted to create a panel of density estimates, which you can easily do by using PROC SGPANEL in SAS. However, the programmer's boss wanted to see filled density estimates, such as shown to the right. These graphs are also known as shaded density plots. This article shows how to create a graph like this in SAS.
The first step in any visualization is to prepare data that are suitable to visualize. In my blog, I often use data sets in the SASHELP library, which includes familiar data sets such as Sashelp.Cars, Sashelp.Heart, and Sashelp.Iris. For this task, I will use Sashelp.BWeight, which includes birthweights for 50,000 live singleton births in the US, along with information about the mothers. One of the variables is MomSmoke, which is a binary (0/1) variable that records whether the mother smokes cigarettes. It is well known that babies of smoking mothers tend to weigh less than babies of non-smoking mothers.
You can use PROC SGPANEL to visualize the weight of the babies for smoking and non-smoking mothers. The following statements overlay a histogram and a kernel density estimate (KDE) for each group. To make it easier for you to use your own data, I include a few macro variables. You can specify the names of your data set and the variable whose distribution you want to visualize. You can also specify the classification (grouping) variable and the number of levels, which equals the number of rows in the panel of density estimates.
/* generalize the problem: Specify the name of the data set, the name of the
   response variable, and the name of the classification variable */
%let DSName = Sashelp.BWeight;   /* data set name */
%let Var = Weight;               /* response variable */
%let ClassVar = MomSmoke;        /* classification variable */

ods graphics / width=480px height=480px;
title "Panel from PROC SGPANEL";
proc sgpanel data=&DSName noautolegend;
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right;
   histogram &Var;
   density &Var / type=kernel;   /* or TYPE=NORMAL for a normal estimate */
   rowaxis grid;
   colaxis grid;
run;
This is the graph that I recommend that you use to demonstrate the empirical difference between the weights of babies for smoking and non-smoking mothers. However, the SAS programmer wanted to display ONLY the KDE and wanted to fill the area underneath the KDE curve. Unfortunately, the DENSITY statement does not support a "FILL" option. However, the BAND statement can be used to display a curve and to fill the area beneath the curve.
I have previously shown how to use the BAND statement to display the upper or lower tail densities for a distribution. You can use the same ideas to display a panel of shaded density estimates. However, I first want to show that you can get a nice-looking panel by using only PROC UNIVARIATE.
The HISTOGRAM statement supports many options and suboptions. Many people use the KERNEL option to overlay a kernel density estimate on a histogram. However, you can use the NOBARS option to hide the histogram bars, leaving only the KDE. Furthermore, you can use the KERNEL(FILL) suboption to display shaded density curves. Lastly, you can use the STATREF= option to display a reference line at a statistic, such as the mean of each group:
title "Panel of Kernel Density Estimates";
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nocurvelegend
                    nobars          /* hide histogram */
                    kernel(fill)    /* display KDE, or use NORMAL(FILL) */
                    statref=mean    /* optional: display mean for each group */
                    ncol=1 nrow=2   /* optional: set NROW=(number of levels in ClassVar) */
                    ;
   ods select histogram;
run;
This panel of KDEs includes reference lines for the mean of each group, which can be an effective way to visualize the difference between the means of the groups. The histograms do not appear, and the KDEs are filled. If you do not like the default layout of ticks on the horizontal axes, you can use the ENDPOINTS= or MIDPOINTS= options to change the tick locations.
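For example, a hypothetical specification might look like the following sketch. The MIDPOINTS= values shown here are illustrative, not from the original article; choose values that are appropriate for your data:

/* sketch: use MIDPOINTS= to control the bins and tick locations */
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / nobars kernel(fill)
                    midpoints=(500 to 5500 by 500);   /* illustrative values */
   ods select histogram;
run;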
Although PROC UNIVARIATE does a good job creating a panel of shaded density estimates, PROC SGPANEL supports more ways to control colors, placement of headers, and other attributes. The HISTOGRAM statement in PROC UNIVARIATE supports the OUTKERNEL= option, which you can use to write a data set that contains the coordinates for the KDE curves of each group. You can then use PROC SGPANEL to read that data and display a panel of curves.
If you want to obtain shaded density estimates, use the BAND statement to specify the upper limit of a band. Use zero (0) as the lower limit. The following statements create a basic panel of shaded KDEs.
/* Write the KDEs to a data set */
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nobars overlay
                    kernel outkernel=OutKer;   /* write kernel estimates to data set */
   ods select histogram;   /* ignore this graph */
run;

/* Read KDE coordinates into PROC SGPLOT or PROC SGPANEL.
   If you use LAYOUT=ROWLATTICE, you do not need to specify
   the levels of the grouping variable */
proc sgpanel data=OutKer;
   *label _VALUE_ = "Weight of Baby (g)" _PERCENT_="Percent";   /* optional: set labels */
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right onepanel;
   band x=_VALUE_ upper=_Percent_ lower=0;
   rowaxis grid offsetmin=0;
   colaxis grid;
run;
For brevity, I did not change any default attributes in this plot, other than adding grid lines. The graph at the top of this article includes small changes to the axis labels and the tick marks.
If you want to try out this idea on your own data, change the values of the macro variables and rerun only the program in the previous section. For example, here is the same code applied to the SepalLength variable in the Sashelp.Iris data, grouped by the Species variable:
/* run the program on different data */
%let DSName = Sashelp.Iris;   /* data set name */
%let Var = SepalLength;       /* response variable */
%let ClassVar = Species;      /* classification variable */

/* Write the KDEs to a data set */
proc univariate data=&DSName;
   class &ClassVar;
   var &Var;
   histogram &Var / odstitle=title nobars overlay
                    kernel outkernel=OutKer;   /* write kernel estimates to data set */
   ods select histogram;
run;

/* Read KDE coordinates into PROC SGPLOT or PROC SGPANEL.
   If you use LAYOUT=ROWLATTICE, you do not need to specify
   the levels of the grouping variable */
title "Distribution of &Var Variable for Each Level of &ClassVar";
proc sgpanel data=OutKer;
   label _VALUE_ = "Sepal Length (mm)" _PERCENT_="Percent";
   panelby &ClassVar / layout=ROWLATTICE rowheaderpos=right onepanel;
   band x=_VALUE_ upper=_Percent_ lower=0;
   rowaxis grid offsetmin=0;
   colaxis grid;
run;
Instead of hard-coding the label, you could look up the label for the response variable, copy it into a macro variable, and automatically use it on the LABEL statement. This is a nice application of using dictionaries, so I will demonstrate it in a future article.
This article shows how to create a panel of filled kernel density estimates for subpopulations that belong to groups. The UNIVARIATE procedure can create this panel by itself, but for more control you might want to write the density estimates to a SAS data set and use PROC SGPANEL to display the KDEs. To display filled KDEs, you need to use the BAND statement instead of the SERIES statement.
]]>The post On the correctness of a discrete simulation appeared first on The DO Loop.
]]>I recently demonstrated this technique by simulating many tosses for a pair of fair six-sided dice. The sum can be any of the values {2, 3, 4, ..., 12}, and the theoretical probabilities for this discrete distribution are known for each value. For example, the sum is 2 with probability 1/36, the sum is 3 with probability 2/36, and so forth. In a previous article, I simulated 1,000 tosses and used PROC FREQ in SAS to compare the empirical distribution of the sums to the theoretical distribution. For the example, the p-value for the chi-square test is large, which indicates that the empirical and theoretical distributions are similar.
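The theoretical probabilities follow a discrete triangular pattern: there are (6 − |k − 7|) ways to roll a sum of k, so P(sum = k) = (6 − |k − 7|)/36. A short DATA step (a sketch, not from the original article) tabulates these values, which match the probabilities used later in the chi-square test:

/* tabulate the theoretical probabilities for the sum of two fair dice */
data DiceProb;
   do k = 2 to 12;
      numWays = 6 - abs(k - 7);   /* 1, 2, ..., 6, ..., 2, 1 ways */
      prob = numWays / 36;        /* 1/36, 2/36, ..., 6/36, ..., 1/36 */
      output;
   end;
run;

proc print data=DiceProb noobs; run;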
However, I ended the article with an unsettling question. A hypothesis test will REJECT the null hypothesis for 5% of the random samples that you generate. So, if the test rejects the null hypothesis, how do you know whether your simulation is wrong or if you just got unlucky?
The obvious answer is to generate more than one random sample. For example, if you generate 100 random samples, expect about 5% of them to reject the null hypothesis. I recently discussed scenarios in which the p-values are uniformly distributed when the null hypothesis is true. The example in that article was a test of the mean in a continuous sample. However, the ideas in that article generalize to other hypothesis tests. This article shows how to examine the distribution of the p-values for a chi-square test that assesses the correctness of a discrete distribution.
In a simulation, you specify the model from which you will draw random samples. In this example, the sum of the two (virtual) dice are supposed to have a discrete triangular distribution. The following SAS program generates 10,000 random samples. Each sample simulates tossing two dice a total of 1,000 times. After generating the samples, you can use PROC FREQ to count the proportion of times that the sum was 2, 3, ..., and 12 in each sample. You can then use a chi-square test to compare the empirical distribution to the known theoretical probabilities for the sum. Notice that the program generates all samples in a single data set and uses a BY-group analysis to analyze each sample.
/* Simulate 10,000 random samples. Each sample is 1,000 tosses
   of two six-sided dice. Report the total sum of the dice. */
%let seed = 1234;
%let NumSamples = 10000;   /* number of samples */
%let N = 1000;             /* size of each sample */

data SumDice2;
   call streaminit(&seed);
   do SampleID = 1 to &NumSamples;
      do i = 1 to &N;
         d1 = rand("Integer", 1, 6);
         d2 = rand("Integer", 1, 6);
         sum = d1 + d2;   /* simulate the sum of two six-sided dice */
         output;
      end;
   end;
   keep SampleID sum;
run;

/* For each sample, perform a chi-square test to see if the sample seems
   to come from the model with known theoretical probabilities.
   Save the chi-square statistic and p-value in a data set. */
%let probSum = 0.0277777778 0.0555555556 0.0833333333 0.1111111111 0.1388888889
               0.1666666667
               0.1388888889 0.1111111111 0.0833333333 0.0555555556 0.0277777778;

proc freq data=SumDice2 noprint;
   by SampleID;
   tables sum / nocum chisq testp=(&probSum);   /* run 10,000 chi-square tests */
   output out=ChiSqStat pchi;
run;
The ChiSqStat data set contains 10,000 chi-square statistics in the _PCHI_ variable. The distribution of _PCHI_ variable approximates the sampling distribution of the chi-square test statistic. Since these data are all simulated from the model specified in the hypothesis test, the test statistics should obey a chi-square distribution with 10 degrees of freedom. Recall that the chi-square distribution with d degrees of freedom is equivalent to a gamma(d/2, 2) distribution. Therefore, the following call to PROC UNIVARIATE overlays a chi-square(10) distribution on the sampling distribution:
/* Recall that chisq(DF) = gamma(DF/2, 2). See
   https://blogs.sas.com/content/iml/2015/12/07/proc-univariate-distributions.html
   The possible number of sums is 11, so DF = 11 - 1 = 10.
   Fit a chisq(10) distribution by using PROC UNIVARIATE */
title "Sampling Distribution of Chi-Square Statistic";
proc univariate data=ChiSqStat;
   var _PCHI_;
   histogram _PCHI_ / gamma(shape=5 scale=2) odstitle=title;   /* overlay chisq(10) */
run;
The histogram shows that a chi-square(10) distribution is a perfect fit for the sampling distribution of the chi-square statistic. As I explained in a previous article, this implies that the associated p-values are uniformly distributed. You can visualize this situation by plotting a histogram of the P_PCHI variable, as follows:
title "Distribution of P-Values for Hypothesis Test";
proc sgplot data=ChiSqStat;
   histogram P_PCHI / binwidth=0.05 binstart=0.025   /* center first bin at h/2 */
                      scale=percent transparency=0.5;
   refline 5 / axis=y;
run;
The p-values are uniformly distributed. This should give you confidence that the dice rolls are correctly simulated from the theoretical discrete probability distribution.
This article combines the ideas in two previous articles. In the first article, I showed how to simulate dice rolls in the SAS DATA step. In the second article, I showed that if data are simulated from a model for a hypothesis test, the p-values for the test are uniformly distributed. For simplicity, I showed that the p-values are uniformly distributed for a familiar hypothesis test of the mean of a continuous distribution. However, this article shows that the same ideas apply to some hypothesis tests for discrete distributions, provided that the test statistic is continuous.
]]>The post Rank, order, and sorting appeared first on The DO Loop.
]]>For clarity, this article assumes that the data do not contain any tied (duplicate) values. For simplicity, all sorts are assumed to be in ascending order.
Ranks are related to the sort order for a set of values. The smallest value is ranked 1, the next smallest is ranked 2, and so on. This is the "race ranking," where the athlete with the smallest time wins first place; the largest time is considered last place. (Golf also rewards low scores.) Mathematically, you sort the data values in ascending order and assign the rank as the ordinal position of the sorted data. If there are no tied values, the rank is unique.
For example, in the data {75, 88, 82, 94, 68}, the corresponding ranks are {2, 4, 3, 5, 1} because 68 is the first (sorted) value, 75 is the second (sorted) value, and so forth. For n unique values, the ranks are integers in the range [1,n].
In SAS you can use the RANK procedure to assign ranks. In the SAS IML language, the RANK function supports the same functionality.
proc iml;
/* for clarity, assume no tied values */
x = {75, 88, 82, 94, 68};

/* what is the sorted order? Copy x to s, then sort, so that x is not overwritten */
s = x;
call sort(s);

/* RANK is the order in which the values appear if you sort them.
   The smallest value gets rank=1; the largest gets rank=N. */
rank = rank(x);
position = T(1:nrow(x));
print position x rank s[L='sort(x)'];
For each observation in the data, the table shows the position (observation number), the data value, and the rank. The last column is the vector of sorted data values. I added arrows to visualize how the rank works. For each value, you look at the corresponding rank. That tells you the row of the last (sorted) column in which the value will appear when you sort the values in ascending order.
The order() function in R returns what I call the "sort indices." These are also called the "anti-ranks" or the "inverse ranks" because they represent the inverse operation for the ranking process. You start with the data in sorted order. You then ask, for each sorted value, what was the original position of that value. This is the sort index.
For example, in the data {75, 88, 82, 94, 68}, the sorted order is {68, 75, 82, 88, 94}. The first sort index is 5 because the first sorted value (68) was originally in the 5th position. The second sort index is 1 because the second sorted value (75) was originally in the first position. In the SAS IML language, you can use the SORTNDX subroutine to compute the sort index, as follows:
/* the sort index is the vector of indices, ndx, such that x[ndx] is sorted.
   This is equivalent to the order() function in R. */
call sortndx(ndx, x);
print position x ndx s[L='sort(x)'];
The table shows the position (observation number), the data value, and the sort index. The last column is the vector of sorted data values. Again, I added arrows to visualize how the sort index works. Note that the arrows now move from the right to the left. For each sorted value, you look at the corresponding sort index. That tells you the row (observation number) of the data value in the original order.
Both the rank and the sort index enable you to sort the data. This is most easily seen by considering the sort index. If ndx is a vector that contains the sort index for the vector x, then x[ndx] is the sorted vector. This is why the name "sort index" is used: it is the vector of indices that sorts the data.
You can also use the rank to sort the data. However, you need to use the rank on the left-hand side of an assignment statement! That is, if y is a vector that is the same dimension as x and if r is the vector of ranks, then the assignment y[r]= x puts the sorted values of x into the vector y. The following SAS IML statements show both techniques for sorting the data.
/* the sort index sorts the data when it is used as a subscript on the RHS */
z = x[ndx];
print x ndx z[L='x[ndx]'];

/* the rank sorts the data when it is used as a subscript on the LHS */
y = j(nrow(x), 1, .);   /* y is the same size as x */
y[rank] = x;
print x rank y[L='y[rank]=x'];
For both tables, the last column shows the data in (ascending) sorted order. The first table used the expression x[ndx], where ndx is the sort index. The second table used the assignment y[r]=x, where r is the rank of x.
There are two ways to express the permutation that transforms a data vector into a sorted vector. The sort index (also called the anti-rank or inverse rank) is the vector of indices that sorts the data. The rank is the position of each element after sorting the data in ascending order. The sort index and the rank are inverse operations. In SAS, you can use the SORTNDX subroutine to compute the sort index. You can use the RANK function to compute the ranks. The SORTNDX subroutine is equivalent to the order() function in R.
]]>The post The distribution of p-values under the null hypothesis appeared first on The DO Loop.
]]>I think data simulation is a great way to discuss the conditions for which p-values are (or are not) uniformly distributed. This article does the following:
This section discusses some of the theory, use, and misuse of p-values. If you are familiar with p-values or just want to jump straight to an example, you can skip this section.
P-values can be a source of confusion, misinterpretation, and misuse. Even professional statisticians get confused. Students and scientists often don't understand them. It is for these reasons that the American Statistical Association (ASA) released a statement in 2016 clarifying the appropriate uses of p-values (Wasserstein and Lazar, 2016, "The ASA Statement on p-Values: Context, Process, and Purpose", TAS). This statement was a response to several criticisms about how p-values are used in science. It was also a response to decisions by several reputable scientific journals to "ban" p-values in the articles that they publish.
I encourage the interested reader to read the six points in Wasserstein and Lazar (2016) that describe interpretations, misinterpretations, applications, and misapplications of p-values.
The ASA statement on p-values is written for practitioners. Wasserstein and Lazar (2016) define a p-value as follows: "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value." Under this definition, the p-value is one number that is computed by using one set of data. Unfortunately, this definition does not enable you to discuss the distribution of the p-value.
To discuss a distribution, you must adopt a broader perspective: You must think of a p-value as a random variable. This is discussed in the excellent article by Greenland (2019) and in the equally excellent article by Murdoch, Tsai, & Adcock (2008).
The idea is that the data (X) are a random sample from the assumed model distribution. When you generate a random sample, the data determine the test statistic for the null hypothesis. Therefore, the test statistic is a random variable. Each test statistic has a corresponding p-value, so the p-value, too, is a random variable. If the distribution of the test statistic is known, then the p-value is uniformly distributed, by definition, because it is the probability integral transform of the test statistic.
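The probability integral transform argument can be stated compactly. The following derivation (my summary, not a formula from the cited articles) assumes that the test statistic T has a continuous, strictly increasing CDF F under the null hypothesis:

```latex
% If T has continuous, strictly increasing CDF F under H0,
% then the transformed variable P = F(T) is uniform on (0,1):
\Pr\bigl( F(T) \le p \bigr)
  = \Pr\bigl( T \le F^{-1}(p) \bigr)
  = F\bigl( F^{-1}(p) \bigr)
  = p, \qquad 0 < p < 1 ,
\quad\text{so } P = F(T) \sim U(0,1).
```

A left-tailed p-value is exactly this transform of the test statistic, which is why the p-values are uniform when the test statistic follows its assumed distribution.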
This section shows a familiar example of a hypothesis test and a p-value: the one-sample t test for the mean. Recall that the one-sample t test assumes that a random sample contains independent draws from a normal distribution. (You can also use the t test for large samples from a non-normal continuous distribution, but, for simplicity, let's use normal data.) Assume that the data are drawn from N(72, σ). The null hypothesis is that H0: μ ≥ 72 versus the alternative hypothesis that μ < 72. Let's generate N=20 independent normal variates and use PROC TTEST in SAS to run the one-sided t test on the simulated data:
/* define the parameters */
%let N = 20;       /* sample size */
%let MU0 = 72;     /* true value of parameter */

data Sim1;         /* random sample from normal distrib with mu=&MU0 */
   call streaminit(1234);
   do i = 1 to &N;
      x = rand("Normal", &MU0, 10);
      output;
   end;
run;

/* H0: generating distribution has mu >= &MU0.
   Since N is small, also assume data is normal. */
proc ttest data=Sim1 alpha=0.05 H0=&MU0 sides=L plots=none;
   var X;
run;
The output includes a table of descriptive statistics (the top table) for the sample. The middle table shows inferences about the parameters, such as that the true mean is probably less than 73.31. The last table is the one I want to focus on. It summarizes the test statistic for the hypothesis test. The test statistic has the value -1.06. Under the assumption that the statistic is distributed according to the t distribution with 19 degrees of freedom, the probability that we would observe a value less than -1.06 is Pr(T < -1.06) = 0.1510.
Here's an important point: Since we simulated the data from N(72, σ), we KNOW that the null hypothesis is true, and we know that the test statistic REALLY DOES have a T(DF=19) distribution! Therefore, we are not surprised when the t test has a large p-value and we fail to reject the null hypothesis.
Recall, however, that we might get unlucky when we simulate the sample. For 5% of normal samples, the t test will produce a small p-value and we will reject the null hypothesis. For that reason, when I check the correctness of a simulation, I like to generate many random samples, not just one.
If you want to understand the distribution of the p-values, you can perform a Monte Carlo simulation study in which you generate many random samples of size N from the same distribution. You run a t test on each sample, which generates many test statistics and many p-values. The following SAS program carries out this simulation study, as suggested by Murdoch, Tsai, & Adcock (2008):
/* simulation of many random samples */
%let NumSim = 10000;   /* number of random samples */
data SimH0;
   call streaminit(1234);
   do SampleID = 1 to &NumSim;
      do i = 1 to &N;
         x = rand("Normal", &MU0, 10);
         output;
      end;
   end;
run;

ods exclude all;
proc ttest data=SimH0 alpha=0.05 H0=&MU0 sides=L plots=none;
   by SampleID;
   var X;
   ods output TTests=TTests;
run;
ods exclude none;

title "Test Statistics for One-Sample T-Test";
title2 "H0: mu >= &MU0";
proc sgplot data=TTests;
   histogram tValue / scale=percent transparency=0.5;
run;
The data satisfy the assumptions of the t test and are sampled from a normal distribution for which the true parameter value satisfies the null hypothesis. Consequently, the histogram shows that the distribution of the test statistic looks very much like a t distribution with 19 degrees of freedom.
What does this have to do with p-values? Well, a p-value is just a transformation of a test statistic. The p-value is a way to standardize the test statistic so that it can be interpreted more easily. For any t, the corresponding left-tailed p-value is p = CDF("T", t, 19), or 1 − CDF("T", t, 19) for a right-tailed p-value. As explained in an earlier article, if t is distributed as T(DF=19), then p is U(0,1). You can demonstrate that fact by plotting the p-values:
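The plotting step is analogous to the earlier histogram of the test statistics. A sketch that plots the p-values (the ProbT variable in the TTests data set that was created in the previous section) might look like this; the bins are chosen so that the first bin is centered at half the bin width:

/* sketch: plot the distribution of the p-values */
title "P-Values for One-Sample T-Test";
proc sgplot data=TTests;
   histogram probT / binwidth=0.05 binstart=0.025   /* center first bin at h/2 */
                     scale=percent transparency=0.5;
   refline 5 / axis=y;   /* expected percentage per bin for uniform p-values */
run;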
As promised, the p-values are uniformly distributed in (0,1). If you reject the null hypothesis when the p-value is less than 0.05, you will make a Type 1 error 5% of the time.
Notice that many things must happen simultaneously for the p-values to be uniformly distributed. The data must conform to the assumptions of the test and contain independent random variates from the model that is used to construct the test statistic. If so, then the test statistic follows a known t distribution. (Sometimes we only know an asymptotic distribution of the test statistic.) If all these stars align, then the p-values will be uniformly distributed.
If any of the previous assumptions are not true, then the p-values do not need to be uniformly distributed. The easiest situation to visualize is if the data are not sampled from a distribution whose mean equals the mean of the null hypothesis. For example, we generated the data from N(72, σ), but what happens if we run a t test for the hypothesis H0: μ ≥ 73? In this scenario, we expect more samples to reject the new null hypothesis. How many more? As suggested by Murdoch, Tsai, & Adcock (2008), we don't need to guess or compute, we can simply run the new hypothesis test for all 10,000 samples (which are from N(72, 10)) and plot the p-values:
/* what happens if the samples are NOT from the null hypothesis? */
%let MU1 = %sysevalf(1+&MU0);

ods exclude all;
proc ttest data=SimH0 alpha=0.05 H0=&MU1 sides=L plots=none;
   by SampleID;
   var x;
   ods output TTests=TTestsNotH0;
run;
ods exclude none;

title "P-Values for One-Sample T-Test";
title2 "H0: mu >= &MU1, but actually mu = &MU0";
proc sgplot data=TTestsNotH0;
   histogram probT / datalabel binwidth=0.05 binstart=0.025   /* center first bin at h/2 */
                     scale=percent transparency=0.5;
   refline 5 / axis=y;
run;
For the new hypothesis test, about 11.62% of the samples have a p-value that is less than 0.05. That means that if you use 0.05 as a cutoff value, you will reject the null hypothesis 11.62% of the time. This is good, because the samples are actually generated from a distribution whose mean is LESS than the hypothesized value of 73. In this case, the null hypothesis is false, so you hope to reject it. If you further increase the value used in the null hypothesis (try using 75 or 80!), the test will reject even more samples.
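You can also estimate this rejection rate directly, without SAS. The following Python sketch is a simplified analogue of the simulation above: it uses a z test with the known σ = 10 instead of PROC TTEST's t test, so its rejection rate comes out slightly higher than the 11.62% reported above, but the pattern is the same. The seed and the number of samples are arbitrary choices:

```python
import random, math

def norm_cdf(z):
    # Standard normal CDF, expressed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(1234)
n, mu, sigma = 30, 72.0, 10.0    # data are N(72, 10), n = 30 per sample
mu0 = 73.0                       # null hypothesis H0: mu >= 73 (left-tailed test)

n_samples = 10_000
reject = 0
for _ in range(n_samples):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))   # z statistic (sigma known)
    p = norm_cdf(z)                             # left-tailed p-value
    if p < 0.05:
        reject += 1

print(reject / n_samples)   # noticeably more than 0.05, because H0 is false
```

Because the true mean (72) is less than the hypothesized mean (73), the test rejects far more often than the nominal 5%.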
This article uses simulation to illustrate the oft-repeated assertion that "p-values are uniformly distributed under the null hypothesis." The article uses a one-sample t test as an example. When the data satisfy the assumptions of the test and are generated so that the null hypothesis is true, the test statistic has a known sampling distribution. The p-value is a transformation of the test statistic. When the test statistic follows its assumed distribution, the p-values are uniform. Otherwise, the p-values are not uniform, which means that using a p-value cutoff such as 0.05 will reject the null hypothesis with a probability that is different from the cutoff value.
For a second example, see "On the correctness of a discrete simulation."
The post The distribution of p-values under the null hypothesis appeared first on The DO Loop.
The post Dice and the correctness of a simulation appeared first on The DO Loop.
It is easy to simulate the sum of two six-sided dice by using the DATA step in SAS. For fair dice, each face has a probability of 1/6 of appearing. You can simulate rolling a die in either of two ways. Older SAS programmers will probably use an expression such as CEIL(6*RAND("Uniform")), which generates the integers 1–6 with equal probability. However, an equivalent (and clearer!) syntax is to use the "Integer" distribution to generate the integers directly, as follows:
/* Simulate 1,000 rolls of two dice. Report the sum of the faces. */
data Dice1;
call streaminit(1234);
do i = 1 to 1000;
   d1 = rand("Integer", 1, 6);
   d2 = rand("Integer", 1, 6);
   sum = d1 + d2;
   output;
end;
run;
The Dice1 data set contains 1,000 simulated dice rolls. Each row displays the face of each die and the sum of the faces.
When you create a simulated data set, it is important to assess whether the simulation is correct. In increasing order of complexity, the techniques that I like to use are to visualize the empirical distribution and compare it to the theoretical distribution, and to run a statistical hypothesis test.
To assess the correctness of the simulation, it is useful to know the exact distribution of the model from which you are drawing random samples. For the sum of two dice, you can use a simple counting argument to show that the discrete probability distribution is triangular:
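The counting argument is easy to verify by brute force. This short Python sketch (an illustration, not part of the SAS program) enumerates the 36 equally likely outcomes for a pair of fair dice and tallies the sums; the resulting probabilities form a triangle that rises to 6/36 at a sum of 7 and then falls symmetrically:

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 equally likely outcomes for a pair of fair dice
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
probs = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, p in probs.items():
    print(s, p, float(p))
# The probabilities rise linearly to 6/36 at s = 7, then fall: a triangle.
```

Equivalently, P(sum = s) = (6 - |s - 7|)/36 for s = 2, 3, ..., 12.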
For a discrete distribution, you can use PROC FREQ to count the number of observations for each category and to graph the frequency distribution, as follows:
proc freq data=Dice1;
   tables sum / nocum plots=freqplot(scale=percent);
run;
The FREQ procedure outputs a table of counts and percentages (not shown) and a bar chart that shows the same information. The bar chart gives you faith that the simulation is correct because the heights of the bars are close to the theoretical probabilities. The sums range from 2 to 12, with 6, 7, and 8 being the most common rolls. The distribution of the sums is roughly symmetric.
However, you might be concerned that the empirical distribution does not show a prominent peak at 7. Does the graph show the kind of variation that you should expect in a random sample of size N=1,000, or is the asymmetry showing some insidious error in the program or in the RAND function in SAS?
To answer that question, you can run a hypothesis test that tests the empirical frequencies of the samples against the expected values for the discrete triangular distribution. The null hypothesis is that the observed frequencies are from a discrete probability distribution that has a specified table of probabilities. In PROC FREQ, the TESTP= option enables you to specify the expected probabilities for each category, and the CHISQ option requests a chi-square test that compares the empirical and expected frequencies, as follows:
/* exact probabilities for the sum of two dice */
%let probSum = 0.0277777778 0.0555555556 0.0833333333 0.1111111111
               0.1388888889 0.1666666667 0.1388888889 0.1111111111
               0.0833333333 0.0555555556 0.0277777778;
proc freq data=Dice1;
   tables sum / nocum chisq testp=(&probSum);
run;
For each of the 11 possible sums, PROC FREQ displays the sample percentages and the expected percentages under the hypothesis that the sums are distributed according to the probabilities specified in the TESTP= option. You can see that the empirical percentages are close to the expected values. Furthermore, the CHISQ option produces a table that shows the result of a chi-square test for the specified proportions. The test statistic is the sum of the quantities (frequency[i] - expected[i])^2 / expected[i], where frequency[i] is the observed count and expected[i] is the expected value under the null hypothesis. For these simulated data, the p-value for the test (0.2039) is not very small, so there is no reason to reject the null hypothesis. We conclude that there is no evidence to doubt that these simulated data are drawn from the distribution for the sum of two fair six-sided dice.
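The chi-square statistic itself is simple to compute by hand. The following Python sketch is an analogue of the PROC FREQ computation, not a reproduction of it: it simulates its own rolls (with Python's RNG and an arbitrary seed, so the counts differ from the SAS run) and, because the Python standard library has no chi-square CDF, it compares the statistic to the 0.95 quantile of the chi-square distribution with 10 degrees of freedom (about 18.31) instead of computing a p-value:

```python
import random
from collections import Counter

random.seed(1234)
N = 1000
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]
observed = Counter(rolls)

# Expected counts under the triangular distribution P(s) = (6 - |s-7|)/36
expected = {s: N * (6 - abs(s - 7)) / 36 for s in range(2, 13)}

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over the 11 sums
chi_sq = sum((observed.get(s, 0) - expected[s]) ** 2 / expected[s]
             for s in range(2, 13))

# 0.95 quantile of chi-square with 10 degrees of freedom is about 18.31;
# a statistic below that value does not reject the null at alpha = 0.05
print(chi_sq, chi_sq < 18.31)
```

A statistic larger than the critical value would be evidence (at the 5% level) that the simulated sums do not follow the triangular distribution.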
In summary, this article shows how to use the DATA step to simulate tossing a pair of six-sided dice. After simulating the data, you might wonder whether the simulation was correct. You can visualize the distribution of the sample to check whether it has the correct shape. You can use statistical hypothesis tests if you need further evidence that the sample has the statistical properties that you would expect in a random sample from a known probability distribution.
But here's a dirty little secret: even if the simulation is correct, the hypothesis test will REJECT the null hypothesis 5% of the time (assuming an α = 0.05 significance level). So, if the test rejects the null hypothesis, how do you know whether your simulation is wrong or whether you just got unlucky? One thing you can do is change the random number seed and rerun the test. Did the test reject the null again? If so, you probably have a problem in your simulation. But if the second run does not reject the null, your simulation might be okay.
I am not suggesting that you rerun a test until you get a p-value that you like! In practice, you need to have confidence in your simulation. There are principled ways to check the validity of a simulation. They require that you generate many samples and look at the distribution of statistics across the random samples. I discuss these issues in a subsequent article.