The post Gershgorin discs and the location of eigenvalues appeared first on The DO Loop.
The Gershgorin Disc Theorem appears in Golub and van Loan (p. 357, 4th Ed; p. 320, 3rd Ed), where it is called the Gershgorin Circle Theorem. The theorem states that the eigenvalues of any N x N matrix, A, are contained in the union of N discs in the complex plane. The center of the i_th disc is the i_th diagonal element of A. The radius of the i_th disc is the sum of the absolute values of the off-diagonal elements in the i_th row. In symbols,
D_{i} = {z ∈ C | |z - A_{i i}| ≤ r_{i} }
where
r_{i} = Σ_{j ≠ i} |A_{i j}|.
Although the theorem holds for matrices with complex values, this article only uses real-valued matrices.
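If you want to experiment with the theorem outside of SAS, the computation is easy to sketch in NumPy. The following illustration uses a hypothetical 3 x 3 matrix (not the article's example): it computes the disc centers and radii and verifies that every eigenvalue lies in the union of the discs, as the theorem guarantees.

```python
import numpy as np

# Hypothetical 3 x 3 matrix, just to illustrate the theorem
A = np.array([[10.0,  2.0, 1.0],
              [ 1.0, -5.0, 2.0],
              [ 2.0,  1.0, 0.0]])

centers = np.diag(A)                             # disc centers A_ii
radii = np.abs(A).sum(axis=1) - np.abs(centers)  # sums of off-diagonal |A_ij| by row

eigs = np.linalg.eigvals(A)
all_contained = all(np.any(np.abs(lam - centers) <= radii + 1e-8) for lam in eigs)
print(all_contained)    # True: every eigenvalue lies in some Gershgorin disc
```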
An example of Gershgorin discs is shown to the right. The discs are shown for the following 4 x 4 symmetric matrix:
At first glance, it seems inconceivable that we can know anything about the eigenvalues without actually computing them. However, two mathematical theorems tell us quite a lot about the eigenvalues of this matrix, just by inspection. First, because the matrix is real and symmetric, the Spectral Theorem tells us that all eigenvalues are real. Second, the Gershgorin Disc Theorem says that the four eigenvalues are contained in the union of the following discs:
Although the eigenvalues for this matrix are real, the Gershgorin discs are in the complex plane. The discs are visualized in the graph at the top of this article. The true eigenvalues of the matrix are shown inside the discs.
For this example, each disc contains an eigenvalue, but that is not true in general. (For example, the matrix A = {1 −1, 2 −1} does not have any eigenvalues in the disc centered at x=1.) What is true, however, is that if a union of k discs is disjoint from the remaining discs, then that union must contain exactly k eigenvalues. For this matrix, the discs centered at x=15 and x=200 are disjoint from the others. Therefore each of those discs contains exactly one eigenvalue. The union of the other two discs must contain two eigenvalues, but, in general, the eigenvalues can be anywhere in the union of the discs.
The visualization shows that the eigenvalues for this matrix are all positive. That means that the matrix is not only symmetric but also positive definite. You can predict that fact from the Gershgorin discs because no disc intersects the negative X axis.
Of course, you don't have to perform the disc calculations in your head. You can write a program that computes the centers and radii of the Gershgorin discs, as shown by the following SAS/IML program, which also computes the eigenvalues for the matrix:
proc iml;
A = {200  30 -15  5,
      30 100   5  5,
     -15   5  55  0,
       5   5   0 15};
evals = eigval(A);                  /* compute the eigenvalues */
center = vecdiag(A);                /* centers = diagonal elements */
radius = abs(A)[,+] - abs(center);  /* sum of abs values of off-diagonal elements of each row */
discs = center || radius || round(evals, 0.01);
print discs[c={"Center" "Radius" "Eigenvalue"} L="Gershgorin Discs"];
For this example, the matrix is strictly diagonally dominant. A strictly diagonally dominant matrix is one for which the magnitude of each diagonal element exceeds the sum of the magnitudes of the other elements in the row. In symbols, |A_{i i}| > Σ_{j ≠ i} |A_{i j}| for each i. Geometrically, this means that no Gershgorin disc contains the origin, which implies that the matrix is nonsingular. So, by inspection, you can determine that this matrix is nonsingular.
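The diagonal-dominance condition is easy to check programmatically. Here is a NumPy sketch (an illustration; the article's computations use SAS/IML) that verifies strict diagonal dominance for the article's 4 x 4 matrix, that no disc touches the negative X axis, and that the matrix is therefore positive definite:

```python
import numpy as np

# The 4 x 4 symmetric matrix from the article
A = np.array([[200,  30, -15,  5],
              [ 30, 100,   5,  5],
              [-15,   5,  55,  0],
              [  5,   5,   0, 15]], dtype=float)

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)

strictly_dominant = bool(np.all(np.abs(centers) > radii))        # no disc contains 0
discs_avoid_negative_axis = bool(np.all(centers - radii > 0))    # all discs in Re(z) > 0
posdef = bool(np.all(np.linalg.eigvalsh(A) > 0))                 # symmetric => real eigenvalues
print(strictly_dominant, discs_avoid_negative_axis, posdef)
```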
The Gershgorin theorem is most useful when the diagonal elements are distinct. For repeated diagonal elements, it might not tell you much about the location of the eigenvalues. For example, all diagonal elements for a correlation matrix are 1. Consequently, all Gershgorin discs are centered at (1, 0) in the complex plane. The following graph shows the Gershgorin discs and the eigenvalues for a 10 x 10 correlation matrix. The eigenvalues of any 10 x 10 correlation matrix must be real and in the interval [0, 10], so the only new information from the Gershgorin discs is a smaller upper bound on the maximum eigenvalue.
Gershgorin's theorem can be useful for unsymmetric matrices, which can have complex eigenvalues. The SAS/IML documentation contains the following 8 x 8 block-diagonal matrix, which has two pairs of complex eigenvalues:
A = {-1  2  0      0      0      0      0  0,
     -2 -1  0      0      0      0      0  0,
      0  0  0.2379 0.5145 0.1201 0.1275 0  0,
      0  0  0.1943 0.4954 0.1230 0.1873 0  0,
      0  0  0.1827 0.4955 0.1350 0.1868 0  0,
      0  0  0.1084 0.4218 0.1045 0.3653 0  0,
      0  0  0      0      0      0      2  2,
      0  0  0      0      0      0     -2  0};
The matrix has four smaller Gershgorin discs and three larger discs (radius 2) that are centered at (-1,0), (2,0), and (0,0), respectively. The discs and the actual eigenvalues of this matrix are shown in the following graph. Not only does the Gershgorin theorem bound the magnitude of the real part of the eigenvalues, but it is clear that the magnitude of the imaginary part cannot exceed 2. In fact, this matrix has eigenvalues -1 ± 2 i, which are on the boundary of one of the discs, which shows that the Gershgorin bound is tight.
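You can confirm the boundary claim numerically. The following NumPy sketch computes the eigenvalues of the upper-left 2 x 2 block of the matrix and checks that both eigenvalues lie at distance 2 from the disc center (-1, 0), that is, exactly on the boundary of the disc:

```python
import numpy as np

# The upper-left 2 x 2 block of the block-diagonal matrix
B = np.array([[-1.0,  2.0],
              [-2.0, -1.0]])
eigs = np.linalg.eigvals(B)     # -1 + 2i and -1 - 2i
dist = np.abs(eigs - (-1.0))    # distance from the disc center (-1, 0)
print(eigs, dist)               # both eigenvalues at distance 2 (the disc radius)
```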
In summary, the Gershgorin Disc Theorem provides a way to visualize the possible location of eigenvalues in the complex plane. You can use the theorem to provide bounds for the largest and smallest eigenvalues.
I was never taught this theorem in school. I learned it from a talented mathematical friend at SAS. I use this theorem to create examples of matrices that have particular properties, which can be very useful for developing and testing software.
This theorem also helped me to understand the geometry behind "ridging", which is a statistical technique in which positive multiples of the identity are added to a nearly singular X`X matrix. The Gershgorin Disc Theorem shows that the effect of ridging a matrix is to translate all of the Gershgorin discs to the right, which moves the eigenvalues away from zero while preserving their relative positions.
You can download the SAS program that I used to create the images in this article.
There are several papers on the internet about Gershgorin discs. It is a favorite topic for advanced undergraduate projects in mathematics.
The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.
This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.
Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!
When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)
One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The collection of computed statistics approximates the true sampling distribution (under the null hypothesis), so you can use its quantiles to estimate the critical values of the null distribution.
You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.
/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu = 59;
%let sigma = 5;
%let N = 30;
%let NumSamples = 40000;

proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                            /* ranks */
u = i/N;                                 /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                           /* ECDF height at left-hand endpoints */
y = j(N, &NumSamples, .);                /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma);  /* fill with random N(mu, sigma) */
D = j(&NumSamples, 1, .);                /* allocate vector for results */
do k = 1 to ncol(y);                     /* for each sample: */
   x = y[,k];                            /* get sample x ~ N(mu, sigma) */
   call sort(x);                         /* sort sample */
   F = cdf("Normal", x, &mu, &sigma);    /* CDF of reference distribution */
   D[k] = max( F - um1, u - F );         /* D = max( D_minus, D_plus ) */
end;

title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";
The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).
How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:
/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;
The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives D_{crit} = 0.2417 for N = 30.
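The same Monte Carlo estimate can be reproduced outside of SAS. The following Python/NumPy sketch (an illustration, not the article's SAS/IML code) uses 5,000 replicates to keep the runtime short; the estimated 95th percentile of the D statistic for N = 30 lands near the tabulated critical value 0.2417:

```python
import math
import numpy as np

rng = np.random.default_rng(73)
mu, sigma, n, reps = 59.0, 5.0, 30, 5000
i = np.arange(1, n + 1)
u, um1 = i / n, (i - 1) / n               # ECDF heights at right/left endpoints

def norm_cdf(v):
    # CDF of N(mu, sigma) via the error function
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

D = np.empty(reps)
for k in range(reps):
    x = np.sort(rng.normal(mu, sigma, n))        # sample from the null model
    F = np.array([norm_cdf(v) for v in x])       # reference CDF at sorted data
    D[k] = max((F - um1).max(), (u - F).max())   # D = max(D_minus, D_plus)

crit = np.quantile(D, 0.95)   # estimated critical value for alpha = 0.05
print(round(crit, 3))
```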
You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:
Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of distribution values that exceed Dobs */
print Dobs pValue;
This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For N > 35, you can use the following asymptotic formula, developed by Smirnov (1948), in which the numerator depends only on α:
D_{crit}(α) ≈ sqrt( –ln(α/2) / (2N) )
For example, when α = 0.05 the formula gives D_{crit} ≈ 1.3581 / sqrt(N).
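Smirnov's large-sample approximation (as commonly stated, D_{crit} ≈ sqrt(−ln(α/2)/(2N)) for a two-sided test) is a one-liner to compute. Here is a short Python sketch; note that even at N = 30 the approximation is close to the exact tabulated value 0.2417:

```python
import math

def d_crit_asymptotic(alpha, n):
    # Smirnov's large-sample approximation to the two-sided critical value
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))

print(round(d_crit_asymptotic(0.05, 30), 4))   # ~0.248, vs. the exact 0.2417
```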
You might suppose that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.
In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.
The post What is Kolmogorov's D statistic? appeared first on The DO Loop.
This article focuses on the case where the reference distribution is a continuous probability distribution, such as a normal, lognormal, or gamma distribution. The D statistic can help you determine whether a sample of data appears to be from the reference distribution. Throughout this article, the word "distribution" refers to the cumulative distribution.
The letter "D" stands for "distance." Geometrically, D measures the maximum vertical distance between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function (CDF) of the reference distribution. As shown in the adjacent image, you can split the computation of D into two parts:
- D^{–} is the maximum vertical distance at locations where the reference CDF is above the ECDF.
- D^{+} is the maximum vertical distance at locations where the ECDF is above the reference CDF.
The D statistic is simply the maximum of D^{–} and D^{+}.
You can compare the statistic D to critical values of the D distribution, which appear in tables. If the statistic is greater than the critical value, you reject the null hypothesis that the sample came from the reference distribution.
In SAS, you can use the HISTOGRAM statement in PROC UNIVARIATE to compute Kolmogorov's D statistic for many continuous reference distributions, such as the normal, beta, or gamma distributions. For example, the following SAS statements simulate 30 observations from a N(59, 5) distribution and compute the D statistic as the maximal distance between the ECDF of the data and the CDF of the N(59, 5) distribution:
/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu = 59;
%let sigma = 5;
%let N = 30;

/* simulate a random sample of size N from the reference distribution */
data Test;
call streaminit(73);
do i = 1 to &N;
   x = rand("Normal", &mu, &sigma);
   output;
end;
drop i;
run;

proc univariate data=Test;
   ods select Moments CDFPlot GoodnessOfFit;
   histogram x / normal(mu=&mu sigma=&sigma) vscale=proportion;  /* compute Kolmogorov D statistic (and others) */
   cdfplot x / normal(mu=&mu sigma=&sigma) vscale=proportion;    /* plot ECDF and reference distribution */
   ods output CDFPlot=ECDF(drop=VarName CDFPlot);                /* for later use, save values of ECDF */
run;
The computation shows that D = 0.131 for this simulated data. The plot at the top of this article shows that the maximum occurs for a datum that has the value 54.75. For that observation, the ECDF is farther away from the normal CDF than at any other location.
The critical value of the D distribution when N=30 and α=0.05 is 0.2417. Since D < 0.2417, you should not reject the null hypothesis. It is reasonable to conclude that the sample comes from the N(59, 5) distribution.
Although the computation is not discussed in this article, you can use PROC NPAR1WAY to compute the statistic when you have two samples and want to determine if they are from the same distribution. In this two-sample test, the geometry is the same, but the computation is slightly different because the reference distribution is itself an ECDF, which is a step function.
Although PROC UNIVARIATE in SAS computes the D statistic (and other goodness-of-fit statistics) automatically, it is not difficult to compute the statistic from first principles in a vector language such as SAS/IML.
The key is to recall that the ECDF is a piecewise constant function that changes heights at the value of the observations. If you sort the data, the height at the i_th sorted observation is i / N, where N is the sample size. The height of the ECDF an infinitesimal distance before the i_th sorted observation is (i – 1)/ N. These facts enable you to compute D^{–} and D^{+} efficiently.
The following algorithm computes the D statistic:
1. Sort the data in increasing order.
2. Evaluate the CDF of the reference distribution at each sorted value: F_{i} = F(x_{(i)}).
3. Compute D^{–} = max_{i}( F_{i} – (i–1)/N ) and D^{+} = max_{i}( i/N – F_{i} ).
4. Compute D = max( D^{–}, D^{+} ).
The following SAS/IML program implements the algorithm:
/* Compute the Kolmogorov D statistic manually */
proc iml;
use Test;  read all var "x";  close;
N = nrow(x);                        /* sample size */
call sort(x);                       /* sort data */
F = cdf("Normal", x, &mu, &sigma);  /* CDF of reference distribution */
i = T( 1:N );                       /* ranks */
Dminus = F - (i-1)/N;
Dplus = i/N - F;
D = max(Dminus, Dplus);             /* Kolmogorov's D statistic */
print D;
The D statistic matches the computation from PROC UNIVARIATE.
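The same from-first-principles computation is just as compact in other vector-friendly languages. Here is a Python sketch using hypothetical data (not the article's simulated sample); with 0-based indexing, the ECDF jumps from k/n to (k+1)/n at the k-th sorted value:

```python
import math

mu, sigma = 59.0, 5.0
x = sorted([52.1, 55.3, 57.8, 58.9, 60.2, 61.5, 63.0, 66.4])  # hypothetical data
n = len(x)

# reference CDF of N(mu, sigma) evaluated at the sorted data
F = [0.5 * (1 + math.erf((v - mu) / (sigma * math.sqrt(2)))) for v in x]
D_minus = max(F[k] - k / n for k in range(n))        # ECDF height just before x_k is k/n
D_plus  = max((k + 1) / n - F[k] for k in range(n))  # ECDF height at x_k is (k+1)/n
D = max(D_minus, D_plus)
print(round(D, 4))
```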
The SAS/IML implementation is very compact because it is vectorized. By computing the statistic "by hand," you can perform additional computations. For example, you can find the observation for which the ECDF and the reference distribution are farthest away. The following statements find the index of the maximum for the Dminus and Dplus vectors. You can use that information to find the value of the observation at which the maximum occurs, as well as the heights of the ECDF and reference distribution. You can use these values to create the plot at the top of this article, which shows the geometry of the Kolmogorov D statistic:
/* compute locations of the maximum D+ and D-, then plot with a highlow plot */
i1 = Dminus[<:>];
i2 = Dplus [<:>];    /* indices of maxima */
/* Compute location of max, value of max, upper and lower curve values */
x1 = x[i1];  v1 = DMinus[i1];  Low1 = (i[i1]-1)/N;  High1 = F[i1];
x2 = x[i2];  v2 = Dplus [i2];  Low2 = F[i2];        High2 = i[i2]/N;
Result = (x1 || v1 || Low1 || High1) //
         (x2 || v2 || Low2 || High2);
print Result[c={'x' 'Value' 'Low' 'High'} r={'D-' 'D+'} L='Kolmogorov D'];
The observations that maximize the D^{–} and D^{+} statistics are x=54.75 and x=61.86, respectively. The value of D^{–} is the larger value, so that is the value of Kolmogorov's D.
For completeness, the following statements show how to create the graph at the top of this article. The HIGHLOW statement in PROC SGPLOT is used to plot the vertical line segments that represent the D^{–} and D^{+} statistics.
create KolmogorovD var {x x1 Low1 High1 x2 Low2 High2};
append;
close;
quit;

data ALL;
set KolmogorovD ECDF;   /* combine the ECDF, reference curve, and the D- and D+ line segments */
run;

title "Kolmogorov's D Statistic";
proc sgplot data=All;
   label CDFy = "Reference CDF" ECDFy="ECDF";
   xaxis grid label="x";
   yaxis grid label="Cumulative Probability" offsetmin=0.08;
   fringe x;
   series x=CDFx Y=CDFy / lineattrs=GraphReference(thickness=2) name="F";
   step x=ECDFx Y=ECDFy / lineattrs=GraphData1 name="ECDF";
   highlow x=x1 low=Low1 high=High1 / lineattrs=GraphData2(thickness=4) name="Dm" legendlabel="D-";
   highlow x=x2 low=Low2 high=High2 / lineattrs=GraphData3(thickness=4) name="Dp" legendlabel="D+";
   keylegend "F" "ECDF" "Dm" "Dp";
run;
Why would anyone want to compute the D statistic by hand if PROC UNIVARIATE can compute it automatically? There are several reasons:
- Speed: In a simulation study, the vectorized SAS/IML computation is much faster than calling PROC UNIVARIATE (which computes dozens of statistics) for every sample.
- Insight: Computing the statistic manually gives you access to intermediate quantities, such as the observation at which the maximum distance occurs, which you can use to visualize the geometry of the statistic.
The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.
proc iml;
X = {1 2 3, 4 5 6, 7 8 9, 10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */
In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.
However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:
/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);   /* number of rows */
   Y = j(k, 3, n);       /* k x 3 matrix */
   Z = Z // Y;           /* vertical concatenation of results */
end;
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */
Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"
If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.
/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};   /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];  /* open before the loop; the types of the variables are known */
do n = 1 to 4;
   k = n + floor(n/2);   /* number of rows */
   Z = j(k, 3, n);       /* k x 3 matrix */
   append from Z;        /* write each block of data */
end;
close;                   /* close the data set */
The following output shows the contents of the MyData3 data set, which is identical to the MyData2 data set:
Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.
I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.
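The same pattern applies outside of SAS whenever an output sink supports incremental writes. The following Python sketch (a loose analogy, not SAS/IML) writes CSV rows block by block inside a loop, instead of concatenating all blocks in memory first; the block sizes mirror the example above (1, 3, 4, and 6 rows):

```python
import csv
import io

# An in-memory buffer stands in for the output file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["A", "B", "C"])   # "create": set up the columns once, before the loop

for n in range(1, 5):
    k = n + n // 2                 # each iteration produces a different number of rows
    for _ in range(k):
        writer.writerow([n, n, n]) # "append from": write each block as it is generated

lines = buf.getvalue().strip().splitlines()
print(len(lines) - 1)              # total data rows written: 1 + 3 + 4 + 6 = 14
```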
The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.
Daymond discussed the following three criteria for choosing a model: discrimination, accuracy, and stability.
My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:
The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."
In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?
The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.
If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.
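The bootstrap recipe itself is short. Here is a generic Python sketch of estimating a standard error and a 90% percentile interval by resampling (illustrated with the sample mean of hypothetical data, rather than the Area statistic, to keep it self-contained):

```python
import random
import statistics

random.seed(73)
data = [random.gauss(0.0, 1.0) for _ in range(100)]   # hypothetical sample

B = 2000
boot_stats = []
for _ in range(B):
    resample = random.choices(data, k=len(data))      # sample with replacement
    boot_stats.append(statistics.mean(resample))      # recompute the statistic

se = statistics.stdev(boot_stats)                     # bootstrap standard error
srt = sorted(boot_stats)
lo, hi = srt[int(0.05 * B)], srt[int(0.95 * B)]       # 90% percentile interval
print(round(se, 3), round(lo, 3), round(hi, 3))
```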
You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.
I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.
Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.
Download the complete SAS program that creates the analyses and graphs in this article.
The post How to simulate data from a generalized linear model appeared first on The DO Loop.
Recall that a generalized linear model has three components:
- A linear predictor, η, which is a linear combination of the explanatory variables.
- A link function, g, which relates the mean of the response, μ, to the linear predictor: g(μ) = η.
- A probability distribution for the response variable, which has mean μ.
Notice that only the response variable is randomly generated. In a previous article about simulating data from a logistic regression model, I showed that the following SAS DATA step statements can be used to simulate data for a logistic regression model. The statements model a binary response variable, Y, which depends linearly on two explanatory variables X1 and X2:
/* CORRECT way to simulate data from a logistic model with parameters (-2.7, -0.03, 0.07) */
eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
mu = logistic(eta);               /* transform by inverse logit */
y = rand("Bernoulli", mu);        /* simulate binary response with probability mu */
Notice that the randomness occurs only during the last step when you generate the response variable. Sometimes I see a simulation program in which the programmer adds a random term to the linear predictor, as follows:
/* WRONG way to simulate logistic data. This is a latent-variable model. */
eta = -2.7 - 0.03*x1 + 0.07*x2 + rand("Normal", 0, 0.8);   /* WRONG: leads to a misspecified model! */
...
Perhaps the programmer copied these statements from a simulation of a linear regression model, but it is not correct for a fixed-effect generalized linear model. When you simulate data from a generalized linear model, use the first set of statements, not the second.
Why is the second statement wrong? Because it has too much randomness. The model that generates the data includes a latent (unobserved) variable. The model you are trying to simulate is specified in a SAS regression procedure as MODEL Y = X1 X2, but the model for the latent-variable simulation (the second one) should be MODEL Y = X1 X2 X3, where X3 is the unobserved normally distributed variable.
I haven't figured out all the mathematical ramifications of (incorrectly) adding a random term to the linear predictor prior to applying the logistic transform, but I ran a simulation that shows that the latent-variable model leads to biased parameter estimates when you fit the simulated data.
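A minimal way to see that the correct scheme behaves as intended is to fix the covariates and check that the simulated event proportion converges to μ. The following Python sketch (with hypothetical fixed covariate values; randomness enters only in the Bernoulli draw) illustrates this:

```python
import math
import random

random.seed(73)
x1, x2 = 10.0, 60.0
eta = -2.7 - 0.03*x1 + 0.07*x2       # linear predictor: eta = 1.2
mu = 1.0 / (1.0 + math.exp(-eta))    # inverse logit: mu ~ 0.769

n = 100000
events = sum(1 for _ in range(n) if random.random() < mu)   # Bernoulli(mu) draws
print(round(events / n, 3))          # close to mu, as the model requires
```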
You can download the SAS program that generates data from two models: from the correct model (the first simulation steps) and from the latent-variable model (the second simulation). I generated 100 samples (each containing 5057 observations), then used PROC LOGISTIC to generate the resulting 100 sets of parameter estimates by using the statement MODEL Y = X1 X2. The results are shown in the following scatter plot matrix.
The blue markers are the parameter estimates from the correctly specified simulation. The reference lines in the upper right cells indicate the true values of the parameters in the simulation: (β_{0}, β_{1}, β_{2}) = (-2.7, -0.03, 0.07). You can see that the true parameter values are in the center of the cloud of blue markers, which indicates that the parameter estimates are unbiased.
In contrast, the red markers show that the parameter estimates for the misspecified latent-variable model are biased. The simulated data does not come from the model that is being fit. This simulation used 0.8 for the standard deviation of the error term in the linear predictor. If you use a smaller value, the center of the red clouds will be closer to the true parameter values. If you use a larger value, the clouds will move farther apart.
For additional evidence that the data from the second simulation does not fit the model Y = X1 X2, the following graphs show the calibration plots for a data set from each simulation. The plot on the left shows nearly perfect calibration: This is not surprising because the data were simulated from the same model that is fitted! The plot on the right shows the calibration plot for the latent-variable model. The calibration plot shows substantial deviations from a straight line, which indicates that the model is misspecified for the second set of data.
In summary, be careful when you simulate data for a generalized fixed-effect linear model. The randomness only appears during the last step when you simulate the response variable, conditional on the linear predictor. You should not add a random term to the linear predictor.
I'll leave you with a thought that is trivial but important: You can use the framework of the generalized linear model to simulate a linear regression model. For a linear model, the link function is the identity function and the response distribution is normal. That means that a linear model can be simulated by using the following:
/* Alternative way to simulate a linear model with parameters (-2.7, -0.03, 0.07) */
eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
mu = eta;                         /* identity link function */
y = rand("Normal", mu, 0.7);      /* simulate Y as normal response with RMSE = 0.7 */
Thus simulating a linear model fits into the framework of simulating a generalized linear model, as it should!
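For the linear case, the same sketch in Python confirms that the simulated responses scatter about the linear predictor with the specified RMSE (here 0.7, matching the example above; the covariate ranges are hypothetical):

```python
import random
import statistics

random.seed(73)
n = 20000
resid = []
for _ in range(n):
    x1, x2 = random.uniform(0, 100), random.uniform(0, 100)
    eta = -2.7 - 0.03*x1 + 0.07*x2
    mu = eta                              # identity link
    y = random.gauss(mu, 0.7)             # normal response, RMSE = 0.7
    resid.append(y - mu)

print(round(statistics.stdev(resid), 2))  # close to 0.7 by construction
```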
Download the SAS program that generates the images in this article.
The post Encodings of CLASS variables in SAS regression procedures: A cheat sheet appeared first on The DO Loop.
The documentation section "Parameterization of Model Effects" provides a complete list of the encodings in SAS and shows how the design matrices are constructed from the levels. (The levels are the values of a classification variable.) Pasta (2005) gives examples and further discussion.
The following SAS regression procedures support the CLASS statement or a similar syntax. The columns GLM, REFERENCE, and EFFECT indicate the three most common encodings. The word "Default" indicates the default encoding. For procedures that support the PARAM= option, the column indicates the supported encodings. The word All means that the procedure supports the complete list of SAS encodings. Most procedures default to using the GLM encoding; the exceptions are highlighted.
Procedure | GLM | REFERENCE | EFFECT | PARAM= |
ADAPTIVEREG | Default | |||
ANOVA | Default | |||
BGLIMM | Default | Yes | Yes | GLM | EFFECT | REF |
CATMOD | Default | |||
FMM | Default | |||
GAM | Default | |||
GAMPL | Default | Yes | GLM | REF | |
GEE | Default | |||
GENMOD | Default | Yes | Yes | All |
GLIMMIX | Default | |||
GLM | Default | |||
GLMSELECT | Default | Yes | Yes | All |
HP regression procedures | Default | Yes | GLM | REF | |
HPMIXED | Default | |||
ICPHREG | Default | Yes | Yes | All |
LIFEREG | Default | |||
LOGISTIC | Yes | Yes | Default | All |
MIXED | Default | |||
ORTHOREG | Default | Yes | Yes | All |
PLS | Default | |||
PROBIT | Default | |||
PHREG | Yes | Default | Yes | All |
QUANTLIFE | Default | |||
QUANTREG | Default | |||
QUANTSELECT | Default | Yes | Yes | All |
RMTSREG | Default | Yes | Yes | All |
ROBUSTREG | Default | |||
SURVEYLOGISTIC | Yes | Yes | Default | All |
SURVEYPHREG | Default | Yes | Yes | All |
SURVEYREG | Default | |||
TRANSREG | Yes | Default | Yes |
A few comments:
The GLM parameterization is a singular parameterization. The other encodings are nonsingular. The "Other Parameterizations" section of the documentation gives a simple one-sentence summary of how to interpret the parameter estimates for the main effects in each encoding.
This article lists the encodings that are supported for each SAS regression procedure. I hope you will find it to be a useful reference. If I've missed your favorite regression procedure, let me know in the comments.
The post Encodings of CLASS variables in SAS regression procedures: A cheat sheet appeared first on The DO Loop.
]]>
As with all probability distributions, there are four essential functions that you need to know: the PDF, CDF, QUANTILE, and RAND functions.
A finite mixture distribution is a weighted sum of component distributions. When all of the components are normal, the distribution is called a mixture of normals. If the i_th component has parameters (μ_{i}, σ_{i}), then you can write the probability density function (PDF) of the normal mixture as
f(x) = Σ_{i} w_{i} φ(x; μ_{i}, σ_{i})
where φ is the normal PDF and the positive constants w_{i} are the mixing weights. The mixing weights must sum to 1.
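As a quick numerical check of this formula, the following Python sketch evaluates the mixture density for the three-component example used in this article and confirms that it integrates to 1 (midpoint rule over the effective support):

```python
import math

def normal_pdf(x, mu, sigma):
    """The normal density phi(x; mu, sigma)."""
    return math.exp(-0.5*((x - mu)/sigma)**2) / (sigma*math.sqrt(2*math.pi))

def mixture_pdf(x, w, mu, sigma):
    # f(x) = sum_i w_i * phi(x; mu_i, sigma_i); the weights must sum to 1
    return sum(wi*normal_pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))

w     = [0.1, 0.3, 0.6]     # mixing weights
mu    = [-6, 3, 8]          # component means
sigma = [0.5, 0.6, 2.5]     # component standard deviations

# integrate numerically over [min(mu - 5*sigma), max(mu + 5*sigma)]
lo = min(m - 5*s for m, s in zip(mu, sigma))
hi = max(m + 5*s for m, s in zip(mu, sigma))
n = 20_000
h = (hi - lo)/n
total = h * sum(mixture_pdf(lo + (i + 0.5)*h, w, mu, sigma) for i in range(n))
print(round(total, 4))   # prints: 1.0
```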
The adjacent graph shows the density function for a three-component mixture of normal distributions. The means of the components are -6, 3, and 8, respectively, and are indicated by vertical reference lines. The mixing weights are 0.1, 0.3, and 0.6. The SAS program to create the graph is in the next section.
The PDF and CDF functions in Base SAS support the "NormalMix" distribution. The syntax is a little unusual because the function needs to support an arbitrary number of components. If there are k components, the PDF and CDF functions require 3k + 3 arguments: the distribution name ("NormalMix"), the value at which to evaluate the function, the number of components (k), and then the k mixing weights, the k means, and the k standard deviations.
If you are using a model that has many components, it is tedious to explicitly list every parameter in every function call. Fortunately, there is a simple trick that prevents you from having to list the parameters. You can put the parameters into arrays and use the OF operator (sometimes called the OF keyword) to reference the parameter values in the array. This is shown in the next section.
The following example demonstrates how to compute the PDF and CDF for a three-component mixture-of-normals distribution. The DATA step shows two tricks: how to compute the effective support of the density from the component means and standard deviations, and how to use arrays and the OF operator to pass the parameters to the PDF and CDF functions.
/* PDF and CDF of the normal mixture distribution. This example specifies three components. */
data NormalMix;
array w[3]     _temporary_ ( 0.1, 0.3, 0.6); /* mixing weights */
array mu[3]    _temporary_ (-6, 3, 8);       /* mean for each component */
array sigma[3] _temporary_ (0.5, 0.6, 2.5);  /* standard deviation for each component */

/* For each component, the range [mu-5*sigma, mu+5*sigma] is the effective support. */
minX = 1e308; maxX = -1e308;       /* initialize to extreme values */
do i = 1 to dim(mu);               /* find largest interval where density > 1E-6 */
   minX = min(minX, mu[i] - 5*sigma[i]);
   maxX = max(maxX, mu[i] + 5*sigma[i]);
end;

/* Visualize the functions on the effective support.
   Use arrays and the OF operator to specify the parameters.
   An alternative syntax is to list the arguments, as follows:
   cdf = CDF('normalmix', x, 3, 0.1, 0.3, 0.6, -6, 3, 8, 0.5, 0.6, 2.5);
*/
dx = (maxX - minX)/200;
do x = minX to maxX by dx;
   pdf = pdf('normalmix', x, dim(mu), of w[*], of mu[*], of sigma[*]);
   cdf = cdf('normalmix', x, dim(mu), of w[*], of mu[*], of sigma[*]);
   output;
end;
keep x pdf cdf;
run;
As shown in the program, the OF operator greatly simplifies and clarifies the function calls. The alternative syntax, which is shown in the comments, is unwieldy.
The following statements create graphs of the PDF and CDF functions. The PDF function is shown at the top of this article. The CDF function, along with a few reference lines, is shown below.
title "PDF function for Normal Mixture Distribution";
title2 "Vertical Lines at Component Means";
proc sgplot data=NormalMix;
   refline -6 3 8 / axis=x;
   series x=x y=pdf;
run;

title "CDF function for Normal Mixture Distribution";
proc sgplot data=NormalMix;
   xaxis grid;
   yaxis min=0 grid;
   refline 0.1 0.5 0.7 0.9;
   series x=x y=cdf;
run;
The quantile function for a continuous distribution is the inverse of the CDF function. The graph of the CDF function for a mixture of normals can have flat regions when the component means are far apart relative to their standard deviations. Technically, these regions are not completely flat because the normal distribution has infinite support, but computationally they can be very flat. Because finding a quantile is equivalent to finding the root of a shifted CDF, you might encounter computational problems if you try to compute the quantile that corresponds to an extremely flat region, such as the 0.1 quantile in the previous graph.
The following DATA step computes the 0.1, 0.5, 0.7, and 0.9 quantiles for the normal mixture distribution. Notice that you can use arrays and the OF operator for the QUANTILE function:
data Q;
array w[3]     _temporary_ ( 0.1, 0.3, 0.6); /* mixing weights */
array mu[3]    _temporary_ (-6, 3, 8);       /* mean for each component */
array sigma[3] _temporary_ (0.5, 0.6, 2.5);  /* standard deviation for each component */
array p[4] (0.1, 0.5, 0.7, 0.9);             /* find quantiles for these probabilities */
do i = 1 to dim(p);
   prob = p[i];
   qntl = quantile('normalmix', prob, dim(mu), of w[*], of mu[*], of sigma[*]);
   output;
end;
keep qntl prob;
run;

proc print; run;
The table tells you that 10% of the density of the normal mixture is less than x=-3.824. That is essentially the result of the first component, which has weight 0.1 and therefore is responsible for 10% of the total density. Half of the density is less than x=5.58. Fully 70% of the density lies to the left of x=8, which is the mean of the third component. That result makes sense when you look at the mixing weights.
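As a cross-check of these quantiles, you can invert the mixture CDF numerically in any language. The following Python sketch uses bisection (a simple, robust choice precisely because the CDF can be extremely flat) and reproduces the values discussed above:

```python
import math

w     = [0.1, 0.3, 0.6]     # mixing weights
mu    = [-6, 3, 8]          # component means
sigma = [0.5, 0.6, 2.5]     # component standard deviations

def mixture_cdf(x):
    """Weighted sum of normal CDFs, computed via the error function."""
    return sum(wi * 0.5*(1 + math.erf((x - mi)/(si*math.sqrt(2))))
               for wi, mi, si in zip(w, mu, sigma))

def mixture_quantile(p, lo=-60.0, hi=60.0):
    # bisection on the monotone CDF; flat regions slow convergence but do not break it
    for _ in range(100):
        mid = 0.5*(lo + hi)
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5*(lo + hi)

q = {p: mixture_quantile(p) for p in (0.1, 0.5, 0.7, 0.9)}
for p in sorted(q):
    print(p, round(q[p], 2))
```

The computed quantiles agree with the SAS output: about -3.82 for p=0.1, about 5.58 for the median, and 8 for p=0.7.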
The RAND function does not explicitly support the "NormalMix" distribution. However, as I have shown in a previous article, you can simulate from an arbitrary mixture of distributions by using the "Table" distribution in conjunction with the component distributions. For the three-component mixture distribution, the following DATA step simulates a random sample:
/* random sample from a mixture distribution */
%let N = 1000;
data RandMix(drop=i);
call streaminit(12345);
array w[3]     _temporary_ ( 0.1, 0.3, 0.6); /* mixing weights */
array mu[3]    _temporary_ (-6, 3, 8);       /* mean for each component */
array sigma[3] _temporary_ (0.5, 0.6, 2.5);  /* standard deviation for each component */
do obsNum = 1 to &N;
   i = rand("Table", of w[*]);          /* choose the component by using the mixing weights */
   x = rand("Normal", mu[i], sigma[i]); /* sample from that component */
   output;
end;
run;

title "Random Sample for Normal Mixture Distribution";
proc sgplot data=RandMix;
   histogram x;
   refline -6 3 8 / axis=x;   /* means of component distributions */
run;
The histogram of a random sample looks similar to the graph of the PDF function, as it should.
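The same two-step method (draw a component index according to the mixing weights, then draw from that normal component) is easy to sketch in other languages. Here it is in Python, using only the standard library; the sample size and seed are arbitrary choices for illustration:

```python
import random

random.seed(12345)
w     = [0.1, 0.3, 0.6]     # mixing weights
mu    = [-6, 3, 8]          # component means
sigma = [0.5, 0.6, 2.5]     # component standard deviations

def sample_mixture(n):
    out = []
    for _ in range(n):
        i = random.choices(range(len(w)), weights=w)[0]  # choose component by weight
        out.append(random.gauss(mu[i], sigma[i]))        # sample from that component
    return out

x = sample_mixture(100_000)
mean_x = sum(x)/len(x)                     # expected value = sum(w_i * mu_i) = 5.1
frac_low = sum(v < -3 for v in x)/len(x)   # mass near the first component: about 0.1
print(round(mean_x, 2), round(frac_low, 3))
```

The sample mean is close to the theoretical mean of the mixture, and about 10% of the sample falls near the first component, in agreement with its mixing weight.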
In summary, SAS provides built-in support for working with the density (PDF), cumulative probability (CDF), and quantiles (QUANTILE) of a normal mixture distribution. You can use arrays and the OF operator to call these Base SAS functions without having to list every parameter. Although the RAND function does not natively support the "NormalMix" distribution, you can use the "Table" distribution to select a component according to the mixing weights and use the RAND("Normal") function to simulate from the selected normal component.
The post The normal mixture distribution in SAS appeared first on The DO Loop.
]]>
]]>A CUSUM test uses the cumulative sum of some quantity to investigate whether a sequence of values can be modeled as random. Here are some examples:
Whereas the CUSUM test for a binary sequence uses cumulative sums of a discrete {+1, -1} sequence, the other tests assume that the sequence is a random sequence of normally distributed values. The main idea behind the tests is the same: the test statistic measures how far the sequence has drifted away from an expected value. If the sequence drifts too far too fast, the sequence is unlikely to be random.
Let's see how the CUSUM test in PROC AUTOREG can help to identify a misspecified model. For simplicity, consider two response variables, one that is linear in time (with uncorrelated errors) and the other that is quadratic in time. If you fit a linear model to both variables, the CUSUM test can help you to see that the model does not fit the quadratic data.
In a previous article, I discussed Anscombe's quartet and created two series that have the same linear fit and correlation coefficient. These series are ideal to use for the CUSUM test because the first series is linear whereas the second is quadratic. The following calls to PROC AUTOREG fit a linear model to each variable.
ods graphics on;
/* PROC AUTOREG models a time series with autocorrelation */
proc autoreg data=Anscombe2;
   Linear: model y1 = x;      /* Y1 is linear. Model is correctly specified. */
   output out=CusumLinear cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;

proc autoreg data=Anscombe2;
   Quadratic: model y2 = x;   /* Y2 is quadratic. Model is misspecified. */
   output out=CusumQuad cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;
The AUTOREG procedure creates a panel of standard residual diagnostic plots. The panel includes a plot of the residuals and a fit plot that shows the fitted model and the observed values. For the linear data, the residual plots seem to indicate that the model fits the data well:
In contrast, the same residual panel for the quadratic data indicates a systematic pattern in the residuals:
If this were a least squares model, which assumes independence of the residuals, those residual plots would indicate that this data-model combination does not satisfy the assumptions of the least squares regression model. For an autoregressive model, however, raw residuals can be correlated and exhibit a pattern. To determine whether the model is misspecified, PROC AUTOREG supports a special kind of residual analysis that uses recursive residuals.
The recursive residual for the k_th point is formed by fitting a line to the first k-1 points and then forming a standardized residual for the k_th point. The complete formulas are in the AUTOREG documentation. Galpin and Hawkins (1984) suggest plotting the cumulative sums of the recursive residuals as a diagnostic plot. Galpin and Hawkins credit Brown, Durbin, and Evans (1975) with proposing the CUSUM plot of the recursive residuals. The statistics output from the AUTOREG procedure are different from those in Galpin and Hawkins, but the idea and purpose behind the CUSUM charts are the same.
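To make the idea concrete, here is an unscaled sketch of recursive residuals in Python (assuming NumPy; the exact standardization in PROC AUTOREG differs slightly, and the data here are hypothetical). For data that truly follow a line, the recursive residuals are essentially zero; for quadratic data fit by a line, their cumulative sum drifts rapidly away from zero:

```python
import numpy as np

def recursive_cusum(x, y):
    """Cumulative sums of (unscaled) recursive residuals for the model y = b0 + b1*x.
       Each residual compares a new point to the prediction from a fit to the
       preceding points, standardized by the leverage of the new point."""
    X = np.column_stack([np.ones_like(x), x])
    w = []
    for k in range(2, len(x)):                       # need two points to fit a line
        Xk, yk = X[:k], y[:k]
        b, *_ = np.linalg.lstsq(Xk, yk, rcond=None)  # fit to the first k points
        h = X[k] @ np.linalg.inv(Xk.T @ Xk) @ X[k]   # leverage of the new point
        w.append((y[k] - X[k] @ b) / np.sqrt(1 + h))
    return np.cumsum(w)

x = np.arange(1.0, 21.0)
cusum_lin  = recursive_cusum(x, 2 + 3*x)   # correctly specified: stays near 0
cusum_quad = recursive_cusum(x, x**2)      # misspecified: drifts far from 0
print(abs(cusum_lin[-1]) < 1e-6, cusum_quad[-1] > 100)   # prints: True True
```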
Galpin and Hawkins show a panel of nine plots that display different patterns that you might see in the CUSUM plots. I have reproduced two of the plots from the paper. (Remember, these graphs were produced in 1984!) The graph on the left shows what you should see for a correctly specified model. The cumulative sums stay within a region near the expected value of zero. In contrast, the graph on the right is one example of a CUSUM plot for a misspecified model.
The previous calls to PROC AUTOREG wrote the cumulative sums and the upper and lower boundaries of the confidence region to a data set. You can use PROC SGPLOT to create the CUSUM plot. The BAND statement is used to draw the confidence band:
ods layout gridded columns=2 advance=table;
proc sgplot data=CusumLinear noautolegend;
   band x=x lower=lower upper=upper;
   series x=x y=cusum / break markers;
   refline 0 / axis=y noclip;
   xaxis grid; yaxis grid;
run;
proc sgplot data=CusumQuad noautolegend;
   band x=x lower=lower upper=upper;
   series x=x y=cusum / break markers;
   refline 0 / axis=y noclip;
   xaxis grid; yaxis grid;
run;
ods layout end;
The graph on the left looks like a random walk on independent normal data. The cumulative sums stay within the colored confidence region. The model seems to fit the data. In contrast, the graph on the right quickly leaves the shaded region, which indicates that the model is misspecified.
In summary, there are many statistical tests that use a CUSUM statistic to determine whether deviations are random. These tests appear in many areas of statistics, including random walks, quality control, and time series analysis. For quality control, SAS supports the CUSUM procedure in SAS/QC software. For time series analysis, the AUTOREG procedure in SAS supports CUSUM charts of recursive residuals, which enable you to diagnose misspecified models.
You can download the SAS program that generates the graphs in this article.
The post A CUSUM test for autoregressive models appeared first on The DO Loop.
]]>
The CUSUM test for randomness of a binary sequence is one of the NIST tests for verifying that a random or pseudorandom generator is generating bits that are indistinguishable from truly random bits (Rukhin et al., 2000; revised 2010, pp. 2-31 through 2-33). The test is straightforward to implement. You first translate the data to {-1, +1} values. You then compute the cumulative sums of the sequence. If the sequence is random, the cumulative sum is equivalent to the position of a random walker who takes unit steps, so the sums should not move away from 0 (the expected sum) too quickly. I've previously visualized the random walk with unit steps (sometimes called a "Drunkard's walk").
Before proceeding to the CUSUM test, I should mention that this test is often used in conjunction with other tests, such as the "runs test" for randomness. That is because a perfectly alternating sequence such as 0101010101... will pass the CUSUM test even though the sequence is clearly not randomly generated. In fact, any sequence that repeatedly has k zeros followed by k ones also passes the test, provided that k is small enough.
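This weakness is easy to demonstrate. The following Python sketch computes the maximum deviation of the cumulative sums for the alternating sequence 0101... and for a repeating 000111 pattern. Both deviations are tiny, so both (clearly nonrandom) sequences pass the test:

```python
def max_cusum_deviation(bits):
    """Maximum absolute cumulative sum of the {-1, +1} version of a {0,1} sequence."""
    s, m = 0, 0
    for b in bits:
        s += 2*b - 1            # convert 0/1 to -1/+1 and accumulate
        m = max(m, abs(s))
    return m

max_alt   = max_cusum_deviation([0, 1] * 50)              # 0101... of length 100
max_block = max_cusum_deviation([0, 0, 0, 1, 1, 1] * 17)  # 000111... pattern
print(max_alt, max_block)   # prints: 1 3
```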
The NIST report contains an example of calling the CUSUM test with a sequence of length N=100. The following SAS/IML statements define a sequence of {0, 1} values, convert those values to {-1, +1}, and plot the cumulative sums:
proc iml;
eps = {1 1 0 0 1 0 0 1 0 0   0 0 1 1 1 1 1 1 0 1
       1 0 1 0 1 0 1 0 0 0   1 0 0 0 1 0 0 0 0 1
       0 1 1 0 1 0 0 0 1 1   0 0 0 0 1 0 0 0 1 1
       0 1 0 0 1 1 0 0 0 1   0 0 1 1 0 0 0 1 1 0
       0 1 1 0 0 0 1 0 1 0   0 0 1 0 1 1 1 0 0 0 };
x = 2*eps - 1;        /* convert to {-1, +1} sequence */
S = cusum(x);
title "Cumulative Sums of Sequence of {-1, +1} Values";
call series(1:ncol(S), S) option="markers" other="refline 0 / axis=y"
                          label="Observation Number";
The sequence contains 58 values of one category and 42 values of the other. For a binomial distribution with p=0.5, the probability of a sample that has proportions at least this extreme is about 13%, as shown by the computation 2*cdf("Binomial", 42, 0.5, 100);. Consequently, the proportions are not unduly alarming. However, to test whether the sequence is random, you need to consider not only the proportion of values, but also the sequence. The graph of the cumulative sums of the {-1, +1} sequence shows a drift away from the line S=0, but it is not clear from the graph whether the deviation is more extreme than would be expected for a random sequence of this length.
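The binomial probability in the preceding paragraph is easy to verify with an exact computation. Here is a quick check in Python, which sums the binomial probabilities directly instead of calling the SAS CDF function:

```python
from math import comb

# two-sided probability of seeing 42 or fewer (or 58 or more) of one category
# in 100 trials with p = 0.5
pval = 2 * sum(comb(100, k) for k in range(43)) / 2**100
print(round(pval, 3))
```

The result is about 0.13, which matches the "about 13%" figure quoted above.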
The CUSUM test gives you a way to quantify whether the sequence is likely to have occurred as a random draw from a Bernoulli(p=0.5) distribution. The test statistic is the maximum deviation from 0. As you can see from the graph, the test statistic for this sequence is 16. The NIST paper provides a formula for the probability that a statistic at least this extreme occurs in a random sequence of length N=100. I implemented a (vectorized) version of the formula in SAS/IML.
/* NIST CUSUM test for randomness in a binary {-1, +1} sequence.
   https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf
   Section 2.13. Pages 2-31 through 2-33.
   INPUT: x is sequence of {-1, +1} values. */
start BinaryCUSUMTest(x, PrintTable=1, alpha=0.01);
   S = colvec( cusum(x) );    /* cumulative sums */
   n = nrow(S);
   z = max(abs(S));           /* test statistic = maximum deviation */

   /* compute probability of this test statistic for a sequence of this length */
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   pValue = 1 - sum1 + sum2;

   /* optional: print the test results in a nice human-readable format */
   cusumTest = z || pValue;
   if PrintTable then do;
      print cusumTest[L="Result of CUSUM Test" c={"Test Statistic" "p Value"}];
      labl = "H0: Sequence is a random binary sequence";
      if pValue <= alpha then
         msg = "Reject H0 at alpha = " + char(alpha);        /* sequence does not seem random */
      else
         msg = "Do not reject H0 at alpha = " + char(alpha); /* sequence seems random */
      print msg[L=labl];
   end;
   return ( cusumTest );
finish;

/* call the function for the {-1, +1} sequence */
cusumTest = BinaryCUSUMTest(x);
According to the CUSUM test, there is not sufficient evidence to doubt that the sequence was generated from a random Bernoulli process.
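The same computation is easy to reproduce outside of SAS. The following Python sketch implements the NIST formula for the same 100-bit example sequence; `math.erf` supplies the normal CDF:

```python
import math

def norm_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def binary_cusum_test(bits):
    """NIST SP 800-22 CUSUM test (forward mode) for a {0,1} sequence.
       Returns (z, pValue), where z is the maximum absolute cumulative sum
       of the corresponding {-1, +1} sequence."""
    x = [2*b - 1 for b in bits]            # convert to {-1, +1}
    S, s = [], 0
    for v in x:                            # cumulative sums
        s += v
        S.append(s)
    n = len(S)
    z = max(abs(v) for v in S)             # test statistic = maximum deviation
    zn = z / math.sqrt(n)
    kEnd = int((n/z - 1) / 4)
    sum1 = sum(norm_cdf((4*k + 1)*zn) - norm_cdf((4*k - 1)*zn)
               for k in range(int((-n/z + 1) / 4), kEnd + 1))
    sum2 = sum(norm_cdf((4*k + 3)*zn) - norm_cdf((4*k + 1)*zn)
               for k in range(int((-n/z - 3) / 4), kEnd + 1))
    return z, 1 - sum1 + sum2

# the 100-bit example sequence from the NIST report
eps = [int(c) for c in
       "1100100100" "0011111101" "1010101000" "1000100001"
       "0110100011" "0000100011" "0100110001" "0011000110"
       "0110001010" "0010111000"]
z, pValue = binary_cusum_test(eps)
print(z, round(pValue, 6))   # prints: 16 0.219194
```

The test statistic is 16, as stated above, and the p-value (about 0.22) is far greater than 0.01, so the test does not reject the hypothesis of randomness.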
A few comments about the program:
A neat thing about the CUSUM test is that you can compute the maximum test statistic based only on the sequence length. Thus if you plan to toss a coin 100 times to determine if it is fair, you can stop tossing (with 99% confidence) if the number of heads ever exceeds the number of tails by 29. Similarly, you can stop tossing if you know that the number of excess heads cannot possibly be 29 or greater. (For example, you've tossed 80 times and the current cumulative sum is 5.) You can apply the same argument to excess tails.
In summary, this article shows how to implement the CUSUM test for randomness of a binary sequence in SAS. Only a few lines of SAS/IML are required, and you can implement the test without using any loops. Be aware that the CUSUM test is not very powerful because regular sequences can pass the test. For example, the sequence 000111000111000111... has a maximum deviation of 3.
The post The CUSUM test for randomness of a binary sequence appeared first on The DO Loop.
]]>