The post The moving block bootstrap for time series appeared first on The DO Loop.
The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.
To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a "new" time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.
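The resampling scheme is easy to prototype outside of SAS. The following NumPy sketch (the function name and synthetic data are mine, not from this article) builds the J = n-L+1 overlapping blocks and concatenates k randomly chosen blocks to form one resample:

```python
import numpy as np

def moving_block_resample(pred, resid, L, rng):
    """Return one moving block bootstrap resample of the series pred + resid."""
    n = len(resid)
    assert n % L == 0, "series length must be divisible by the block length"
    k = n // L                        # number of blocks needed to rebuild the series
    J = n - L + 1                     # number of overlapping blocks to choose from
    blocks = np.array([resid[i:i+L] for i in range(J)])  # row i holds resid[i : i+L]
    idx = rng.integers(0, J, size=k)  # choose k blocks with replacement
    return pred + blocks[idx].ravel()

rng = np.random.default_rng(12345)
n, L = 12, 3
pred = np.linspace(100.0, 111.0, n)      # synthetic trend component
resid = rng.normal(scale=0.5, size=n)    # synthetic residuals
yboot = moving_block_resample(pred, resid, L, rng)
print(yboot.shape)  # (12,)
```

Subtracting the predicted values from the resample recovers residuals in which every length-L run is a contiguous block from the original series, which is exactly the property the moving block bootstrap preserves.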
To demonstrate the moving block bootstrap in SAS, let's use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.
As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i-th row contains the residuals with indices i:i+L-1. In total, the matrix R has n-L+1 rows.
/* MOVING BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;

/* The length of the series (n) must be divisible by the block
   length (L) so that k = n/L blocks reconstruct a full series */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of random blocks to use */
if k ^= int(k) then
   ABORT "The series length is not divisible by the block length";

/* Trick: reshape data into k x L matrix. Each row is a block of length L */
P = shape(Pred, k, L);   /* there are k rows for Pred */
J = n - L + 1;           /* total number of overlapping blocks to choose from */
R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
Resid = rowvec(Resid);   /* make Resid a row vector so we don't need to transpose each row */
do i = 1 to J;
   R[i,] = Resid[ , i:i+L-1];   /* fill each row with a block of residuals */
end;
With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.
/* The moving block bootstrap repeats this process B times and
   usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};   /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;
   idx = sample(1:J, k);   /* sample of size k from the set 1:J */
   YBoot = P + R[idx,];
   append;
end;
close BootOut;
QUIT;
The BootOut data set contains B=1000 bootstrap samples. The rest of the bootstrap analysis is exactly the same as in the previous article.
This article shows how to perform a moving block bootstrap on a time series in SAS. First, you decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which in this implementation must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing k = n/L blocks of residuals (with replacement) and adding them to the predicted values. This article uses the SAS/IML language to perform the moving block bootstrap in SAS.
The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.
However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.
I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes, such as linear and cubic interpolation.
SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations.
Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.
The post The simple block bootstrap for time series in SAS appeared first on The DO Loop.
For a time series, the residuals are not independent. Rather, if you fit a model to the data, the residual at time t+i is often close to the residual at time t for small values of i. This is known as autocorrelation in the error component. Accordingly, if you want to bootstrap the residuals of a time series, it is not correct to randomly shuffle the residuals, because shuffling would destroy the autocorrelation. Instead, you need to randomly choose blocks of residuals (for example, the residuals at times t, t+1, ..., t+L-1) and use those blocks to create bootstrap resamples. You repeatedly choose random blocks until you have enough residuals to create a bootstrap resample.
There are several ways to choose blocks. This article uses non-overlapping blocks of equal length, which is called the simple block bootstrap. Other variants, such as the moving block bootstrap, use overlapping blocks.
There are many ways to fit a model to a time series and to obtain the model residuals. Trovero and Leonard (2018) discuss several modern methods to fit trends, cycles, and seasonality by using SAS 9.4 or SAS Viya. To get the residuals, you will want to fit an additive model. In this article, I will use the Sashelp.Air data and will fit a simple additive model (trend plus noise) by using the AUTOREG procedure in SAS/ETS software.
The Sashelp.Air data set has 144 months of data. The following SAS DATA step drops the first year of data, which leaves 11 years (132 months) of data. I am doing this because I am going to use blocks of size L=12, and I think the example will be clearer if there are 11 blocks of size 12 (rather than 12 blocks).
data Air;
   set Sashelp.Air;
   if Date >= '01JAN1950'd;   /* exclude first year of data */
   Time = _N_;                /* the observation number */
run;

title "Original Series: Air Travel";
proc sgplot data=Air;
   series x=Time y=Air;
   xaxis grid; yaxis grid;
run;
The graph suggests that the time series has a linear trend. The following call to PROC AUTOREG fits a linear model to the data. The predicted mean and residuals are output to the OutReg data set as the PRED and RESID variables, respectively. The call to PROC SGPLOT overlays a graph of the trend and a graph of the residuals.
/* Similar to Getting Started example in PROC AUTOREG */
proc autoreg data=Air plots=none outest=RegEst;
   AR12: model Air = Time / nlag=12;
   output out=OutReg pm=Pred rm=Resid;   /* mean prediction and residuals */
   ods select FinalModel.ParameterEstimates ARParameterEstimates;
run;

title "Mean Prediction and Residuals from AR Model";
proc sgplot data=OutReg;
   series x=Time y=Pred;
   series x=Time y=Resid;
   refline 0 / axis=y;
   xaxis values=(24 to 144 by 12) grid valueshint;
run;
The parameter estimates are shown for the linear model. On average, airlines carried an additional 2.8 thousand passengers per month during this time period. The graph shows the decomposition of the series into a linear trend and residuals. I added vertical lines to indicate the blocks of residuals that are used in the next section. The first block contains the residuals for times 13-24. The second block contains the residuals for times 25-36, and so forth until the 11th block, which contains the residuals for times 133-144.
For the simple bootstrap, the length of the blocks (L) must evenly divide the length of the series (n), which means that k = n / L is an integer. Because I dropped the first year of observations from Sashelp.Air, there are n=132 observations. I will choose the block size to be L=12, which means that there are k=11 non-overlapping blocks.
Each bootstrap resample is formed by randomly choosing k blocks (with replacement) and adding those residuals to the predicted values. Think about putting the n predicted values and residuals into matrices in row-major order: the first L observations form the first row, the next L form the second row, and so forth, so each matrix has k rows and L columns. The original series is of the form Predicted + Residuals, where the plus sign represents matrix addition. For the simple block bootstrap, each bootstrap resample is obtained by resampling the rows of the residual matrix and adding them to the rows of the predicted matrix, which produces a new series of the form Predicted + (Random Residuals). This process is shown schematically in the following figure.
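If you want to experiment with the reshape-and-resample trick outside of SAS, it can be mimicked in a few lines of NumPy. In this sketch the "predicted" and "residual" series are synthetic stand-ins of my own, not the airline data:

```python
import numpy as np

rng = np.random.default_rng(12345)
n, L = 132, 12
k = n // L                       # 11 non-overlapping blocks

pred = 100 + 2.8 * np.arange(n)              # synthetic trend (Predicted)
resid = rng.normal(scale=20.0, size=n)       # synthetic noise (Residuals)

P = pred.reshape(k, L)           # each row of P is a block of predicted values
R = resid.reshape(k, L)          # each row of R is a block of residuals

idx = rng.integers(0, k, size=k)             # choose k rows of R with replacement
yboot = (P + R[idx]).ravel()     # one bootstrap resample, flattened back to a series
print(len(yboot))  # 132
```

Every block of the resampled residuals is one of the original non-overlapping blocks, so the within-block autocorrelation structure is preserved.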
You can use the SAS/IML language to implement the simple block bootstrap. The following call to PROC IML reads in the original predicted and residual values and reshapes the vectors into k x L matrices (P and R, respectively). The SAMPLE function generates a sample (with replacement) of the vector 1:k, which is used to randomly select rows of the R matrix. To make sure that the process is working as expected, you can create one bootstrap resample and graph it. It should resemble the original series:
/* SIMPLE BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);
/* the original series is Y = Pred + Resid */
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;

/* For the Simple Block Bootstrap, the length of the series (n)
   must be divisible by the block length (L). */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of non-overlapping blocks */
if k ^= int(k) then
   ABORT "The series length is not divisible by the block length";

/* Trick: reshape data into k x L matrix. Each row is a block of length L */
P = shape(Pred, k, L);
R = shape(Resid, k, L);  /* non-overlapping residuals (also k x L) */

/* Example: generate one bootstrap resample by randomly
   selecting from the residual blocks */
idx = sample(1:nrow(R), k);   /* sample (w/ replacement) of size k from the set 1:k */
YBoot = P + R[idx,];

title "One Bootstrap Resample";
title2 "Simple Block Bootstrap";
refs = "refline " + char(do(12,nrow(Pred),12)) + " / axis=x;";
call series(Time, YBoot) other=refs;
The graph shows one bootstrap resample. The residuals from arbitrary blocks are concatenated until there are n residuals. These are added to the predicted value to create a "new" series, which is a bootstrap resample. You can generate a large number of bootstrap resamples and use them to perform inferences for time series statistics.
You can repeat the process in a loop to generate more resamples. The following statements generate B=1000 bootstrap resamples. These are written to a SAS data set (BootOut). The program uses a technique in which the results of each computation are immediately written to a SAS data set, which is very efficient. The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.
/* The simple block bootstrap repeats this process B times and
   usually writes the resamples to a SAS data set. */
B = 1000;
J = nrow(R);             /* J=k for non-overlapping blocks, but prepare for moving blocks */
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};   /* open data set outside of loop */
do i = 1 to B;
   SampleId[,] = i;      /* fill array: https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = sample(1:J, k); /* sample of size k from the set 1:k */
   YBoot = P + R[idx,];
   append;               /* append each bootstrap sample */
end;
close BootOut;
QUIT;
The BootOut data set contains B=1000 bootstrap samples. You can efficiently analyze the samples by using a BY statement. For example, suppose that you want to investigate how the parameter estimates for the trend line vary among the bootstrap samples. You can run PROC AUTOREG on each bootstrap sample by using BY-group processing. Be sure to suppress ODS output during the BY-group analysis, and write the desired statistics to an output data set (BootEst), as follows:
/* Analyze the bootstrap samples by using a BY statement. See
   https://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way.html */
proc autoreg data=BootOut plots=none outest=BootEst noprint;
   by SampleID;
   AR12: model YBoot = Time / nlag=12;
run;

/* OPTIONAL: Use PROC MEANS or PROC UNIVARIATE to estimate standard errors and CIs */
proc means data=BootEst mean stddev P5 P95;
   var Intercept Time _A:;
run;

title "Distribution of Parameter Estimates";
proc sgplot data=BootEst;
   scatter x=Intercept y=Time;
   xaxis grid; yaxis grid;
   refline 77.5402 / axis=x;
   refline 2.7956  / axis=y;
run;
The scatter plot shows the bootstrap distribution of the parameter estimates of the linear trend. The reference lines indicate the parameter estimates for the original data. You can use the bootstrap distribution for inferential statistics such as estimation of standard errors, confidence intervals, the covariance of estimates, and more.
You can perform a similar bootstrap analysis for any other statistic that is generated by any time series analysis. The important thing is that the block bootstrap is performed on some sort of residual or "noise" component, so be sure to remove the trend, seasonality, cycles, and so forth and then bootstrap the remainder.
This article shows how to perform a simple block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which (for the simple block bootstrap) must divide the total length of the series (n). Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.
In practice, the simple block bootstrap is rarely used. However, it illustrates the basic ideas for bootstrapping a time series, and it provides a foundation for more sophisticated bootstrap methods.
The post Top posts from <em>The DO Loop</em> in 2020 appeared first on The DO Loop.
Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization.
Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don't miss a single article!
The post Create a response variable that has a specified R-square value appeared first on The DO Loop.
In a previous article, I showed how to compute a vector that has a specified correlation with another vector. You can generalize that situation to obtain a vector that has a specified relationship with a linear subspace that is spanned by multiple vectors.
Recall that the correlation is related to the angle between two vectors by the formula cos(θ) = ρ, where θ is the angle between the vectors and ρ is the correlation coefficient. Therefore, correlation and "angle between" measure similar quantities. It makes sense to define the angle between a vector and a linear subspace as the smallest angle the vector makes with any vector in the subspace. Equivalently, it is the angle between the vector and its (orthogonal) projection onto the subspace.
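The identity cos(θ) = ρ is easy to verify numerically. In this NumPy sketch (random data of my own choosing, not from the article), the cosine of the angle between two centered vectors matches their Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)

xc = x - x.mean()               # center both vectors
yc = y - y.mean()
# cosine of the angle between the centered vectors
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
rho = np.corrcoef(x, y)[0, 1]   # Pearson correlation
print(np.isclose(cos_theta, rho))  # True
```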
This is shown graphically in the following figure. The vector z is not in the span of the explanatory variables. The vector w is the projection of z onto the linear subspace. As explained in the previous article, you can find a vector y such that the angle between y and w is θ, where cos(θ) = ρ. Equivalently, the correlation between y and w is ρ.
There is a connection between this geometry and the geometry of least-squares regression. In least-squares regression, the predicted response is the projection of an observed response vector onto the span of the explanatory variables. Consequently, the previous article shows how to simulate an "observed" response vector that has a specified correlation with the predicted response.
For simple linear regression (one explanatory variable), textbooks often point out that the R-square statistic is the square of the correlation between the independent variable, X, and the response variable, Y. So, the previous article enables you to create a response variable that has a specified R-square value with one explanatory variable.
The generalization to multivariate linear regression is that the R-square statistic is the square of the correlation between the predicted response and the observed response. Therefore, you can use the technique in this article to create a response variable that has a specified R-square value in a linear regression model.
To be explicit, suppose you are given explanatory variables X_{1}, X_{2}, ..., X_{k}, and a correlation coefficient, ρ. The following steps generate a response variable, Y, such that the R-square statistic for the regression of Y onto the explanatory variables is ρ^{2}:
1. Generate a random vector to use as an initial guess.
2. Use least squares to project the guess onto the span of {1, X_{1}, ..., X_{k}}. Call the projection w.
3. Use the method of the previous article to find a vector, Y, such that corr(Y, w) = ρ.
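These steps do not depend on SAS. Here is a NumPy sketch of the same algorithm (random stand-in data and my own variable names, not the article's program). It builds Y so that the R-square of regressing Y on three synthetic explanatory variables is ρ² = 0.543² ≈ 0.2948:

```python
import numpy as np

rng = np.random.default_rng(123)

def unit(v):
    return v / np.linalg.norm(v)

n, rho = 19, 0.543
X = rng.normal(size=(n, 3))      # stand-in explanatory variables X1, X2, X3
guess = rng.normal(size=n)       # step 1: random guess

# step 2: project the guess onto span(1, X1, X2, X3) via least squares
D = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(D, guess, rcond=None)
w = D @ b
e = guess - w                    # the part of the guess orthogonal to the span

# step 3: combine the two pieces so that corr(Y, w) = rho
y = rho * unit(w - w.mean()) + np.sqrt(1 - rho**2) * unit(e)

# check: the R-square of regressing y on X equals rho^2
b2, *_ = np.linalg.lstsq(D, y, rcond=None)
yhat = D @ b2
r2 = np.corrcoef(y, yhat)[0, 1] ** 2
print(round(r2, 4))  # 0.2948
```

The key fact is that e is orthogonal to the span (including the intercept column), so the least-squares projection of y is exactly the ρ-scaled component along w, which forces the squared correlation between y and its prediction to be ρ².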
The following program shows how to carry out this algorithm in the SAS/IML language:
proc iml;
/* Define or load the modules from
   https://blogs.sas.com/content/iml/2020/12/17/generate-correlated-vector.html */
load module=_all_;

/* read some data X1, X2, ... into columns of a matrix, X */
use sashelp.class;
   read all var {"Height" "Weight" "Age"} into X;   /* read data into (X1,X2,X3) */
close;

/* Least-squares fit = Project Y onto span(1,X1,X2,...,Xk) */
start OLSPred(y, _x);
   X = j(nrow(_x), 1, 1) || _x;
   b = solve(X`*X, X`*y);
   yhat = X*b;
   return yhat;
finish;

/* specify the desired correlation between Y and \hat{Y}. Equiv: R-square = rho^2 */
rho = 0.543;
call randseed(123);
guess = randfun(nrow(X), "Normal");   /* 1. make random guess */
w = OLSPred(guess, X);                /* 2. w is in Span(1,X1,X2,...) */
Y = CorrVec1(w, rho, guess);          /* 3. find Y such that corr(Y,w) = rho */
/* optional: you can scale Y any way you want ... */

/* in regression, R-square is squared correlation between Y and YHat */
corr = corr(Y||w)[2];
R2 = rho**2;
print rho corr R2;
The program uses a random guess to generate a vector Y such that the correlation between Y and the least-squares prediction for Y is exactly 0.543. In other words, if you run a regression model where Y is the response and (X1, X2, X3) are the explanatory variables, the R-square statistic for the model will be ρ^{2} = 0.2948. Let's write the Y variable to a SAS data set and run PROC REG to verify this fact:
/* Write to a data set, then call PROC REG */
Z = Y || X;
create SimCorr from Z[c={Y X1 X2 X3}];
append from Z;
close;
QUIT;

proc reg data=SimCorr plots=none;
   model Y = X1 X2 X3;
   ods select FitStatistics ParameterEstimates;
quit;
The "FitStatistics" table that is created by using PROC REG verifies that the R-square statistic is 0.2948, which is the square of the ρ value that was specified in the SAS/IML program. The ParameterEstimates table from PROC REG shows the vector in the subspace that has correlation ρ with Y. It is -1.26382 + 0.04910*X1 - 0.00197*X2 - 0.12016*X3.
Many textbooks point out that the R-square statistic in multivariable regression has a geometric interpretation: It is the squared correlation between the response vector and the projection of that vector onto the linear subspace of the explanatory variables (which is the predicted response vector). You can use the program in this article to solve the inverse problem: Given a set of explanatory variables and correlation, you can find a response variable for which the R-square statistic is exactly the squared correlation.
You can download the SAS program that computes the results in this article.
The post Find a vector that has a specified correlation with another vector appeared first on The DO Loop.
The algorithm combines a mixture of statistics and basic linear algebra. Two facts are useful: the correlation between two vectors equals the cosine of the angle between them after the vectors are centered, and any vector can be written uniquely as the sum of its projection onto a subspace and a component in the orthogonal complement of that subspace.
Given a centered vector, u, there are infinitely many vectors that have correlation ρ with u. Geometrically, you can choose any vector on a positive cone in the same direction as u, where the cone has angle θ and cos(θ)=ρ. This is shown graphically in the figure below. The plane marked \(\mathbf{u}^{\perp}\) is the orthogonal complement to the vector u. If you extend the cone through the plane, you obtain the cone of vectors that are negatively correlated with u.
One way to obtain a correlated vector is to start with a guess, z. The vector z can be uniquely represented as the sum \(\mathbf{z} = \mathbf{w} + \mathbf{w}^{\perp}\), where w is the projection of z onto the span of u, and \(\mathbf{w}^{\perp}\) is the projection of z onto the orthogonal complement.
The following figure shows the geometry of the right triangle with angle θ such that cos(θ) = ρ.
If you want the vector y to be unit length, you can read off the formula for y from the figure. The formula is
\(\mathbf{y} = \rho \mathbf{w} / \lVert\mathbf{w}\rVert + \sqrt{1 - \rho^2} \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert \)
In the figure, \(\mathbf{v}_1 = \mathbf{w} / \lVert\mathbf{w}\rVert\) and
\(\mathbf{v}_2 = \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert\).
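As a numerical check of this formula, the following NumPy sketch re-implements the projection construction (a translation of the logic, not the article's SAS/IML code) and applies it to the example x = {1,2,3}, ρ = 0.543, guess = {0,1,-1} that appears later in the article:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def corr_vec1(x, rho, guess):
    """Return a centered unit vector y such that corr(x, y) = rho."""
    u = unit(x - x.mean())           # center and normalize x
    z = unit(guess - guess.mean())   # center and normalize the guess
    w = (z @ u) * u                  # projection of z onto span(u)
    w_perp = z - w                   # component in the orthogonal complement
    # legs of the right triangle have lengths rho and sqrt(1 - rho^2)
    y = rho * unit(w) + np.sqrt(1 - rho**2) * unit(w_perp)
    if np.sign(y @ u) != np.sign(rho):
        y = -y                       # flip if the guess pointed away from x
    return y

x = np.array([1.0, 2.0, 3.0])
rho = 0.543
guess = np.array([0.0, 1.0, -1.0])
y = corr_vec1(x, rho, guess)
print(round(np.corrcoef(x, y)[0, 1], 3))  # 0.543
```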
It is straightforward to implement this projection in a matrix-vector language such as SAS/IML. The following program defines two helper functions (Center and UnitVec) and uses them to implement the projection algorithm. The function CorrVec1 takes three arguments: the vector x, a correlation coefficient ρ, and an initial guess. The function centers and scales the vectors into the vectors u and z. The vector z is projected onto the span of u. Finally, the function uses trigonometry and the fact that cos(θ) = ρ to return a unit vector that has the required correlation with x.
/* Given a vector, x, and a correlation, rho, find y such that corr(x,y) = rho */
proc iml;
/* center a column vector by subtracting its mean */
start Center(v);
   return ( v - mean(v) );
finish;
/* create a unit vector in the direction of a column vector */
start UnitVec(v);
   return ( v / norm(v) );
finish;

/* Find a vector, y, such that corr(x,y) = rho. The initial guess can be
   almost any vector that is not in span(x), orthog to span(x), and not in span(1) */
start CorrVec1(x, rho, guess);
   /* 1. Center the x and z vectors. Scale them to unit length. */
   u = UnitVec( Center(x) );
   z = UnitVec( Center(guess) );
   /* 2. Project z onto the span(u) and the orthog complement of span(u) */
   w = (z`*u) * u;
   wPerp = z - w;
   /* 3. The requirement that cos(theta)=rho results in a right triangle
         where y (the hypotenuse) has unit length and the legs have
         lengths rho and sqrt(1-rho^2), respectively */
   v1 = rho * UnitVec(w);
   v2 = sqrt(1 - rho**2) * UnitVec(wPerp);
   y = v1 + v2;
   /* 4. Check the sign of y`*u. Flip the sign of y, if necessary */
   if sign(y`*u) ^= sign(rho) then
      y = -y;
   return ( y );
finish;
The purpose of the function is to project the guess onto the green cone in the figure. However, if the guess is in the opposite direction from x, the algorithm will compute a vector, y, that has the opposite correlation. The function detects this case and flips y, if necessary.
The following statements call the function for a vector, x, and requests a unit vector that has correlation ρ = 0.543 with x:
/* Example: Call the CorrVec1 function */
x = {1,2,3};
rho = 0.543;
guess = {0, 1, -1};
y = CorrVec1(x, rho, guess);
corr = corr(x||y);
print x y, corr;
As requested, the correlation coefficient between x and y is 0.543. This process will work provided that the guess satisfies a few mild assumptions. Specifically, the guess cannot be in the span of x or in the orthogonal complement of x. The guess also cannot be a multiple of the 1 vector. Otherwise, the process will work for positive and negative correlations.
The function returns a vector that has unit length and 0 mean. However, you can translate the vector and scale it by any positive quantity without changing its correlation with x, as shown by the following example:
/* because correlation is a relationship between standardized vectors,
   you can translate and scale Y any way you want */
y2 = 100 + 23*y;      /* rescale and translate */
corr = corr(x||y2);   /* the correlation will not change */
print corr;
When y is a centered unit vector, the vector β*y has L_{2} norm β. If you want to create a vector whose standard deviation is β, use β*sqrt(n-1)*y, where n is the number of elements in y.
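A quick NumPy check of the scaling claim (a synthetic vector and my own names): starting from a centered unit vector, multiplying by β*sqrt(n-1) yields a sample standard deviation of β:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 19
# construct a centered unit vector, as returned by the CorrVec1 algorithm
y = rng.normal(size=n)
y = y - y.mean()
y = y / np.linalg.norm(y)

beta = 23.0
v = beta * np.sqrt(n - 1) * y   # scale so the sample std dev equals beta
print(round(v.std(ddof=1), 6))  # 23.0
```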
One application of this technique is to create a random vector that has a specified correlation with a given vector, x. For example, in the following program, the x vector contains the heights of 19 students in the Sashelp.Class data set. The program generates a random guess from the standard normal distribution and passes that guess to the CorrVec1 function and requests a vector that has the correlation 0.678 with x. The result is a centered unit vector.
use sashelp.class;
   read all var {"Height"} into X;
close;

rho = 0.678;
call randseed(123);
guess = randfun(nrow(x), "Normal");
y = CorrVec1(x, rho, guess);

mean = 100;
std = 23*sqrt(nrow(x)-1);
v = mean + std*y;
title "Correlation = 0.678";
title2 "Random Normal Vector";
call scatter(X, v) grid={x y};
The graph shows a scatter plot between x and the random vector, v. The correlation in the scatter plot is 0.678. The sample mean of the vector v is 100. The sample standard deviation is 23.
If you make a second call to the RANDFUN function, you can get another random vector that has the same properties. Or you can repeat the process for a range of ρ values to visualize data that have a range of correlations. For example, the following graph shows a panel of scatter plots for ρ = -0.75, -0.25, 0.25, and 0.75. The X variable is the same for each plot. The Y variable is a random vector that was rescaled to have mean 100 and standard deviation 23, as above.
The random guess does not need to be from the normal distribution. You can use any distribution.
This article shows how to create a vector that has a specified correlation with a given vector. That is, given a vector, x, and a correlation coefficient, ρ, find a vector, y, such that corr(x, y) = ρ. The algorithm in this article produces a centered vector that has unit length. You can multiply the vector by β > 0 to obtain a vector whose norm is β. You can multiply the vector by β*sqrt(n-1) to obtain a vector whose standard deviation is β.
There are infinitely many vectors that have correlation ρ with x. The algorithm uses a guess to produce a particular vector for y. You can use a random guess to obtain a random vector that has a specified correlation with x.
The post Segmented regression models in SAS appeared first on The DO Loop.
A previous article shows that for simple piecewise polynomial models, you can use the EFFECT statement in SAS regression procedures to use a spline to fit some segmented regression models. The method relies on two assumptions. First, it assumes that you know the location of the breakpoint. Second, it assumes that each model has the same parametric form on each interval. For example, the model might be piecewise constant, piecewise linear, or piecewise quadratic.
If you need to estimate the location of the breakpoint from the data, or if you are modeling the response differently on each segment, you cannot use a single spline effect. Instead, you need to use a SAS procedure such as PROC NLIN that enables you to specify the model on each segment and to estimate the breakpoint. For simplicity, this article shows how to estimate a model that is quadratic on the first segment and constant on the second segment. (This is called a plateau model.) The model also estimates the location of the breakpoint.
A SAS customer recently asked about segmented models on the SAS Support Communities. Suppose that a surgeon wants to model how long it takes her to perform a certain procedure over time. When the surgeon first started performing the procedure, it took about 3 hours (180 minutes). As she refined her technique, that time decreased. The surgeon wants to predict how long this surgery now takes, and she wants to estimate when the time reached its current plateau. The data are shown below and visualized in a scatter plot. The length of the procedure (in minutes) is recorded for 25 surgeries over a 16-month period.
data Have;
input SurgeryNo Date :mmddyy10. duration @@;
format Date mmddyy10.;
datalines;
 1 3/20/2019  182    2 5/16/2019  150    3 5/30/2019  223    4 6/6/2019   142
 5 6/11/2019  111    6 7/11/2019  164    7 7/26/2019   83    8 8/22/2019  144
 9 8/29/2019  162   10 9/19/2019   83   11 10/10/2019  70   12 10/17/2019 114
13 10/31/2019 113   14 11/7/2019   97   15 11/21/2019  83   16 12/5/2019  111
17 12/5/2019   73   18 12/12/2019  87   19 12/19/2019  86   20 1/9/2020   102
21 1/16/2020  124   22 1/23/2020   95   23 1/30/2020  134   24 3/5/2020   121
25 6/4/2020    60
;

title "Time to Perform a Surgery";
proc sgplot data=Have;
   scatter x=Date y=Duration;
run;
From the graph, it looks like the duration for the procedure decreased until maybe November or December 2019. The goal of a segmented model is to find the breakpoint and to model the duration before and after the breakpoint. For this purpose, you can use a segmented plateau model that is a quadratic model prior to the breakpoint and a constant model after the breakpoint.
A segmented plateau model is one of the examples in the PROC NLIN documentation. The documentation shows how to use constraints in the problem to eliminate one or more parameters. For example, a common assumption is that the two segments are joined smoothly at the breakpoint location, x0. If f(x) is the predictive model to the left of the breakpoint and g(x) is the model to the right, then continuity dictates that f(x0) = g(x0) and smoothness dictates that f'(x0) = g'(x0). In many cases (such as when the models are low-degree polynomials), the two constraints enable you to reparameterize the models to eliminate two of the parameters.
The PROC NLIN documentation shows the details. Suppose f(x) is a quadratic function f(x) = α + β x + γ x^{2} and g(x) is a constant function g(x) = c. Then you can use the constraints to reparameterize the problem so that (α, β, γ) are free parameters, and the other two parameters are determined by the formulas:
x0 = -β / (2 γ)
c = α - β^{2} / (4 γ)
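You can verify these reparameterization formulas numerically. The following Python snippet is a quick sanity check (not part of the SAS analysis); the parameter values are arbitrary, although they happen to match the initial guesses used later. It confirms that the derivative of f vanishes at x0 and that f(x0) equals the plateau value c:

```python
# For f(x) = alpha + beta*x + gamma*x^2, the breakpoint x0 = -beta/(2*gamma)
# is where f'(x0) = 0, and the plateau value is c = alpha - beta^2/(4*gamma).
alpha, beta, gamma = 184.0, -0.5, 0.001   # arbitrary illustrative values

f = lambda x: alpha + beta*x + gamma*x**2
fprime = lambda x: beta + 2*gamma*x

x0 = -beta / (2*gamma)            # breakpoint (about 250 for these values)
c = alpha - beta**2 / (4*gamma)   # plateau value (about 121.5)

print(x0, c)
print(abs(fprime(x0)))   # approximately 0: the segments join smoothly
print(abs(f(x0) - c))    # approximately 0: the quadratic meets the plateau
```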
You can use the ESTIMATE statement in PROC NLIN to obtain estimates and standard errors for the x0 and c parameters.
As is often the case, the hard part is to guess initial values for the parameters. You must supply an initial guess on the PARMS statement in PROC NLIN. One way to create a guess is to use a related "reduced model" to provide the estimates. For example, you can use PROC GLM to fit a global quadratic model to the data, as follows:
/* rescale by using x="days since start" as the variable */
%let RefDate = '20MAR2019'd;
data New;
set Have;
rename duration=y;
x = date - &RefDate;   /* days since first record */
run;

proc glm data=New;
model y = x x*x;
run;
These parameter estimates are used in the next section to specify the initial parameter values for the segmented model.
Recall that a SAS date is represented by the number of days since 01JAN1960. Thus, for these data, the raw Date values are approximately 22,000. Numerically speaking, it is often better to use smaller numbers in a regression problem, which is why the previous DATA step rescales the explanatory variable to be the number of days since the first surgery. You can use the parameter estimates from the quadratic model as the initial values for the segmented model:
title 'Segmented Model with Plateau';
proc nlin data=New plots=fit noitprint;
parms alpha=184 beta= -0.5 gamma= 0.001;
x0 = -0.5*beta / gamma;
if (x < x0) then
   mean = alpha + beta*x + gamma*x*x;      /* quadratic model for x <  x0 */
else
   mean = alpha + beta*x0 + gamma*x0*x0;   /* constant model for  x >= x0 */
model y = mean;
estimate 'plateau' alpha + beta*x0 + gamma*x0*x0;
estimate 'breakpoint' -0.5*beta / gamma;
output out=NLinOut predicted=Pred L95M=Lower U95M=Upper;
ods select ParameterEstimates AdditionalEstimates FitPlot;
run;
The output from PROC NLIN includes estimates for the quadratic regression coefficients and for the breakpoint and the plateau value. According to the second table, the surgeon can now perform this surgical procedure in about 98 minutes, on average. The 95% confidence interval [77, 119] suggests that the surgeon might want to schedule two hours for this procedure so that she is not late for her next appointment.
According to the estimate for the breakpoint (x0), she achieved her plateau after about 287 days of practice. However, the confidence interval is quite large, so there is considerable uncertainty in this estimate.
The output from PROC NLIN includes a plot that overlays the predictions on the observed data. However, that graph is in terms of the number of days since the first surgery. If you want to return to the original scale, you can graph the predicted values versus the Date. You can also add reference lines that indicate the plateau value and the estimated breakpoint, as follows:
/* convert x back to original scale (Date) */
data _NULL_;
plateauDate = 287 + &RefDate;      /* translate back to Date scale */
call symput('x0', plateauDate);    /* store breakpoint in macro variable */
put plateauDate DATE10.;
run;

/* plot the predicted values and CLM on the original scale */
proc sgplot data=NLinOut noautolegend;
band x=Date lower=Lower upper=Upper;
refline 97.97 / axis=y label="Plateau" labelpos=max;
refline &x0 / axis=x label="01JAN2020" labelpos=max;
scatter x=Date y=y;
series x=Date y=Pred;
yaxis label='Duration';
run;
The graph shows the breakpoint estimate, which happens to be 01JAN2020. It also shows the variability in the data and the wide prediction limits.
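The date arithmetic is easy to double-check outside of SAS. The following Python sketch (an illustration, not part of the original analysis) adds the estimated 287 days to the reference date 20MAR2019:

```python
from datetime import date, timedelta

ref_date = date(2019, 3, 20)                        # date of the first surgery
breakpoint_date = ref_date + timedelta(days=287)    # estimated breakpoint
print(breakpoint_date.isoformat())                  # 2020-01-01
```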
This article shows how to fit a simple segmented model in SAS. The model has one breakpoint, which is estimated from the data. The model is quadratic before the breakpoint, constant after the breakpoint, and joins smoothly at the breakpoint. The constraints on continuity and smoothness reduce the problem from five parameters to three free parameters. The article shows how to use PROC NLIN in SAS to solve segmented models and how to visualize the result.
You can use a segmented model for many other data sets. For example, if you are a runner and routinely run a 5k distance, you can use a segmented model to monitor your times. Are your times decreasing, or did you reach a plateau? When did you reach the plateau?
The post Segmented regression models in SAS appeared first on The DO Loop.
The post Horn's method: A simulation-based method for retaining principal components appeared first on The DO Loop.
There are other methods for deciding how many PCs to keep. Recently a SAS customer asked about a method known as Horn's method (Horn, 1965), also called parallel analysis. This is a simulation-based method for deciding how many PCs to keep. If the original data consist of N observations and p variables, Horn's method is as follows:
1. Simulate many random samples, each with N observations and p uncorrelated variables.
2. For each sample, compute the eigenvalues of the sample correlation matrix.
3. For each eigenvalue, compute the 95th percentile across the simulated samples.
4. Retain the j_th principal component only if the j_th observed eigenvalue exceeds the 95th percentile of the j_th simulated eigenvalue.
I do not know why the adjective "parallel" is used for Horn's analysis. Nothing in the analysis is geometrically parallel to anything else. Although you can use parallel computations to perform a simulation study, I doubt Horn was thinking about that in 1965. My best guess is that Horn's method is a secondary analysis that is performed "off to the side" or "in parallel" with the primary principal component analysis.
You do not need to write your own simulation method to use Horn's method (parallel analysis). Horn's parallel analysis is implemented in SAS (as of SAS/STAT 14.3 in SAS 9.4M5) by using the PARALLEL option in PROC FACTOR. The following call to PROC FACTOR uses data about US crime rates. The data are from the Getting Started example in PROC PRINCOMP.
proc factor data=Crime method=Principal
            parallel(nsims=1000 seed=54321)
            nfactors=parallel plots=parallel;
var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
The PLOTS=PARALLEL option creates the visualization of the parallel analysis. The solid line shows the eigenvalues for the observed correlation matrix. The dotted line shows the 95th percentile of the simulated data. When the observed eigenvalue is greater than the corresponding 95th percentile, you keep the factor. Otherwise, you discard the factor. The graph shows that only one principal component would be kept according to Horn's method. This graph is a variation of the scree plot, which is a plot of the observed eigenvalues.
The same information is presented in tabular form in the "ParallelAnalysis" table. The first row is the only row for which the observed eigenvalue is greater than the 95th percentile (the "critical value") of the simulated eigenvalues.
Statisticians often use statistical tests based on a null hypothesis. In Horn's method, the simulation provides the "null distribution" of the eigenvalues of the correlation matrix under the hypothesis that the variables are uncorrelated. Horn's method says that we should only accept a factor as important if it explains more variance than would be expected from uncorrelated data.
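If you want to experiment with this null distribution outside of SAS, the idea fits in a few lines. The following Python sketch (a minimal illustration using NumPy; the sample size and number of variables mimic the crime data, but the code is not tied to it) simulates eigenvalues of correlation matrices for uncorrelated data and finds the 95th percentiles:

```python
import numpy as np

rng = np.random.default_rng(12345)
N, p, n_sim = 50, 7, 500          # observations, variables, simulated samples

sim_eigen = np.empty((n_sim, p))
for i in range(n_sim):
    X = rng.standard_normal((N, p))      # uncorrelated data: the null model
    R = np.corrcoef(X, rowvar=False)     # sample correlation matrix
    sim_eigen[i] = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending order

crit = np.percentile(sim_eigen, 95, axis=0)  # Horn's critical values
# Keep the j_th principal component only if the j_th observed
# eigenvalue exceeds crit[j].
print(np.round(crit, 3))
```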
Although the PARALLEL option is supported in PROC FACTOR, some researchers suggest that parallel analysis is valid only for PCA. Saccenti and Timmerman (2017) write, "Because Horn's parallel analysis is associated with PCA, rather than [common factor analysis], its use to indicate the number of common factors is inconsistent (Ford, MacCallum, & Tait, 1986; Humphreys, 1975)." I am not an expert in factor analysis, but a basic principle of simulation is to ensure that the "null distribution" is appropriate to the analysis. For PCA, the null distribution in Horn's method (eigenvalues of a sample correlation matrix) is appropriate. However, in some common factor models, the important matrix is a "reduced correlation matrix," which does not have 1s on the diagonal.
Although the PARALLEL option in PROC FACTOR runs a simulation and summarizes the results, there are several advantages to implementing a parallel analysis yourself. For example, you can perform the analysis on the covariance (rather than correlation) matrix. Or you can substitute a robust correlation matrix as part of a robust principal component analysis.
I decided to run my own simulation because I was curious about the distribution of the eigenvalues. The graph that PROC FACTOR creates shows only the upper 95th percentiles of the eigenvalue distribution. I wanted to overlay a confidence band that indicates the distribution of the eigenvalues. The band would visualize the uncertainty in the eigenvalues of the simulated data. How wide is the band? Would you get different results if you use the median eigenvalue instead of the 95th percentile?
Such a graph is shown to the right. The confidence band was created by using a technique similar to the one I used to visualize uncertainty in predictions for linear regression models. The graph shows the distribution of each eigenvalue and connects the quantiles with straight lines. The confidence band fits well with the existing graph, even though the X axis is discrete.
Here's an interesting fact about the simulation in Horn's method. Most implementations generate B random samples, X ~ MVN(0, I(p)), but you don't actually NEED the random samples! All you need are the correlation matrices for the random samples. It turns out that you can simulate the correlation matrices directly by using the Wishart distribution. SAS/IML software includes the RANDWISHART function, which simulates matrices from the Wishart distribution. You can transform those matrices into correlation matrices, find the eigenvalues, and compute the quantiles in just a few lines of PROC IML:
/* Parallel Analysis (Horn 1965) */
proc iml;
/* 1. Read the data and compute the observed eigenvalues of the correlation */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
read all var varNames into X;
read all var "State";
close;
p = ncol(X);
N = nrow(X);
m = corr(X);              /* observed correlation matrix */
Eigenvalue = eigval(m);   /* observed eigenvalues */

/* 2. Generate random correlation matrices from MVN(0,I(p)) data and compute
      the eigenvalues. Each row of W is a p x p scatter matrix for a random
      sample of size N where X ~ MVN(0, I(p)) */
nSim = 1000;
call randseed(12345);
W = RandWishart(nSim, N-1, I(p));   /* each row stores a p x p matrix */
S = W / (N-1);                      /* rescale to form covariance matrix */
simEigen = j(nSim, p);              /* store eigenvalues in rows */
do i = 1 to nSim;
   cov = shape(S[i,], p, p);        /* reshape the i_th row into p x p */
   R = cov2corr(cov);               /* i_th correlation matrix */
   simEigen[i,] = T(eigval(R));
end;

/* 3. find 95th percentile for each eigenvalue */
alpha = 0.05;
call qntl(crit, simEigen, 1-alpha);
results = T(1:nrow(Eigenvalue)) || Eigenvalue || crit`;
print results[c={"Number" "Observed Eigenvalue" "Simul Crit Val"} F=BestD8.];
The table is qualitatively the same as the one produced by PROC FACTOR. Both tables are the results of simulations, so you should expect to see small differences in the third column, which shows the 95th percentile of the distributions of the eigenvalues.
The eigenvalues are stored as rows of the simEigen matrix, so you can estimate the 5th, 10th, ..., 95th percentiles and overlay band plots on the eigenvalue (scree) plot, as follows:
/* 4. Create a graph that illustrates Horn's method:
      Factor Number vs distribution of eigenvalues. Write results in long form. */
/* 4a. Write the observed eigenvalues and the 95th percentile */
create Horn from results[c={"Factor" "Eigenvalue" "SimulCritVal"}];
append from results;
close;

/* 4b. Visualize the uncertainty in the simulated eigenvalues. For details, see
   https://blogs.sas.com/content/iml/2020/10/12/visualize-uncertainty-regression-predictions.html */
a = do(0.05, 0.45, 0.05);          /* significance levels */
call qntl(Lower, simEigen, a);     /* lower qntls */
call qntl(Upper, simEigen, 1-a);   /* upper qntls */
Factor = col(Lower);               /* 1,2,3,...,1,2,3,... */
alpha = repeat(a`, 1, p);
create EigenQntls var {"Factor" "alpha" "Lower" "Upper"};
append;
close;
QUIT;

proc sort data=EigenQntls;
by alpha Factor;
run;

data All;
set Horn EigenQntls;
run;

title "Horn's Method (1965) for Choosing the Number of Factors";
title2 "Also called Parallel Analysis";
proc sgplot data=All noautolegend;
band x=Factor lower=Lower upper=Upper / group=alpha
     fillattrs=(color=gray) transparency=0.9;
series x=Factor y=Eigenvalue / markers name='Eigen'
       legendlabel='Observed Eigenvalue';
series x=Factor y=SimulCritVal / markers lineattrs=(pattern=dot)
       name='Sim' legendlabel='Simulated Crit Value';
keylegend 'Eigen' 'Sim' / across=1;
run;
The graph is shown in the previous section. The darkest part of the band shows the median eigenvalue. You can see that the "null distribution" of eigenvalues is rather narrow, even though the data contain only 50 observations. I thought perhaps it would be wider. Because the band is narrow, it doesn't matter much whether you choose the 95th percentile as a critical value or some other value (90th percentile, 80th percentile, and so forth). For these data, any reasonable choice for a percentile will still lead to rejecting the second factor and keeping only one principal component. Because the band is narrow, the results will not be unduly affected by whether you use few or many Monte Carlo simulations. In this article, both simulations used B=1000 simulations.
In summary, PROC FACTOR supports the PARALLEL and PLOTS=PARALLEL options for performing a "parallel analysis," which is Horn's method for choosing the number of principal components to retain. PROC FACTOR creates a table and graph that summarize Horn's method. You can also run the simulation yourself. If you use SAS/IML, you can simulate the correlation matrices directly, which is more efficient than simulating the data. If you run the simulation yourself, you can add additional features to the scree plot, such as a confidence band that shows the null distribution of the eigenvalues.
The post Horn's method: A simulation-based method for retaining principal components appeared first on The DO Loop.
The post Can you transplant an indoor Christmas tree? appeared first on The DO Loop.
]]>"O Christmas tree, O Christmas tree, how lovely are your branches!" The idealized image of a Christmas tree is a perfectly straight conical tree with lush branches and no bare spots. Although this ideal exists only on Christmas cards, forest researchers are always trying to develop trees that approach the ideal. And they use serious science and statistics to do it!
Bert Cregg, a forest researcher at Michigan State University, is a Christmas tree researcher who has been featured in Wired magazine. Cregg and his colleagues have published many papers on best practices for growing Christmas trees. One paper that caught my eye is Gooch, Nzokou, and Cregg (2009), "Effect of Indoor Exposure on the Cold Hardiness and Physiology of Containerized Christmas Trees." In this paper, Cregg and his colleagues investigate whether you can buy a live Christmas tree, keep it in the house for the holidays, and then transplant it in your yard. The authors use a statistical technique known as the median lethal dose, or LD50, to describe how bringing a potted Christmas tree into the house can affect its hardiness to freezing temperatures.
This blog post uses Cregg's data and shows how to use SAS to estimate the LD50 statistic. If you are not interested in the details, the last section of this article summarizes the results. Spoiler: An evergreen kept indoors has a reduced ability to withstand freezing temperatures after it is transplanted. In Cregg's experiment, all the trees that spent extended time indoors died within six months of transplanting.
Containerized (potted) Christmas trees are popular among consumers who want a live Christmas tree but do not want to kill the tree by cutting it down. The idea is to bring the tree indoors during the holidays, then plant it on your property. However, there is a problem with bringing a tree into a house during the winter. Evergreens naturally go dormant in the winter, which enables their needles and buds to withstand freezing temperatures. When you bring a tree into a heated space, it "wakes up." If you later move the tree outside into freezing temperatures, the needles and buds can be damaged by the cold. This damage can kill the tree.
Cregg and his colleagues set up an experiment to understand how the length of time spent indoors affects a Christmas tree's ability to withstand freezing temperatures. Trees were brought indoors for 0, 3, 7, 14, and 20 days. Cuttings from the trees were then exposed to freezing conditions: -3, -6, -9, ..., -30 degrees Celsius. The goal is to estimate the median "lethal dose" for temperature. That is, to estimate the temperature at which half of the cuttings are damaged by the cold. In pharmacological terms, the time spent indoors is a treatment and the temperature is a "dose." The trees that were never brought indoors (0 days) are a control group.
Cregg and his colleagues studied three species of Christmas trees and studied damage to both buds and needles. For brevity, I will only look at the results for buds on the Black Hills Spruce (Picea glauca).
In Table 1 (p. 75), the authors show data for the bud mortality and 50% lethality (LD50) for each treatment group (days indoors) as a function of decreasing temperatures. The data for the spruce are shown in the following SAS DATA step. Each cell in the table represents the percentage of 60 cuttings that showed damage after being exposed to a freezing temperature. By knowing there were 60 cuttings, you can compute how many cuttings were damaged.
/* Bud mortality and LD50. Data from Table 1 of Gooch, Nzokou, & Cregg (2009). */
data SpruceBudDamage;
Ntrials = 60;
input TimeIndoors @;
do Temp = -3 to -30 by -3;
   input BudMort @;
   NEvents = int(NTrials * BudMort / 100);   /* compute NEvents from mortality */
   output;
end;
label Temp = "Temperature (C)";
/* Bud Mortality (percent) as a function of temperature for each treatment */
/* Days    | --------- Temperature (C) ----------- */
/* Indoors | -3 -6 -9 -12 -15 -18 -21 -24 -27 -30  */
datalines;
 0   0 0 0 0  0 6.7   0   0  20 100
 3   0 0 0 0  0   0   0  30  80 100
 7   0 0 0 0  0   0  10  40 100 100
14   0 0 0 0  0   0   0  80 100 100
20   0 0 0 0 20  40 100 100 100 100
;
The paper says that the LD50 "was determined graphically using a pairwise plot of the exposure temperature and the percentage of bud mortality ... for each species." I don't know what that means. It sounds like they did not estimate the LD50 statistically but instead graphed the bud mortality versus the temperature and used linear interpolation (or a smoother) to estimate the temperature at which the mortality is 50%. Here is a graph of the data connected by lines:
title "Bud Mortality in Black Hills Spruce"; proc sgplot data=SpruceBudDamage; series x=Temp y=BudMort / markers group=TimeIndoors lineattrs=(thickness=2); refline 50 / axis=y lineattrs=(color=red); xaxis grid; yaxis grid; run; |
The horizontal red line in the graph is the line of 50% mortality. For each curve, a crude estimate of LD50 is the temperature at which the curve crosses that line. The graphical estimates and the estimates in the paper are shown in the following table. The estimates in the authors' paper are greater than the estimates from my graph, but I do not know why. If you use something other than linear interpolation (for example, a loess smoother), you will get different curves and different estimates.
Although these graphical estimates differ slightly from the published results, the numbers tell the same story. On average, a spruce Christmas tree that is not brought indoors is hardy to about -28C. If you bring a tree indoors for 3 days, it is hardy only to about -25C. The longer a tree is indoors, the more it loses hardiness. For trees that are indoors for 20 days, the median lethal temperature is -18C, or about 10 degrees warmer than for the control group.
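The graphical method is straightforward to code. The following Python helper is a hypothetical implementation of the linear-interpolation estimate (it is not the authors' code); the mortality values are copied from the DATA step above. It reproduces crude LD50 estimates for the control group and the 20-day group:

```python
def ld50_interp(temps, mortality):
    """Estimate LD50 by linear interpolation between the first pair of
    consecutive points where mortality crosses 50%."""
    for (t0, m0), (t1, m1) in zip(zip(temps, mortality),
                                  zip(temps[1:], mortality[1:])):
        if m0 < 50 <= m1:
            return t0 + (50 - m0) / (m1 - m0) * (t1 - t0)
    return None   # mortality never crosses 50%

temps = [-3, -6, -9, -12, -15, -18, -21, -24, -27, -30]
control   = [0, 0, 0, 0,  0, 6.7,   0,   0,  20, 100]   # 0 days indoors
indoors20 = [0, 0, 0, 0, 20,  40, 100, 100, 100, 100]   # 20 days indoors

print(ld50_interp(temps, control))     # about -28.1 C
print(ld50_interp(temps, indoors20))   # -18.5 C
```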
The graphical estimates are crude. They are based on linear interpolation between two consecutive data points: one for which the mortality is below 50% and the next for which the mortality is above 50%. The estimates ignore all other data. Furthermore, the estimates assume that the mortality is linear between those two points, which is not usually the case. The mortality curve is typically a sigmoid (or S-shaped) curve.
Fortunately, we can use statistics to address these concerns. The usual way to estimate LD50 in SAS is to use PROC PROBIT. For these data, we will perform a separate analysis for each value of the TimeIndoors variable. The INVERSECL option on the MODEL statement estimates the Temperature (and confidence limits) for a range of probability values. You can use the ODS OUTPUT statement to write the statistics to a SAS data set so that you can use PROC SGPLOT to overlay all five curves on a single graph, as follows:
proc probit data=SpruceBudDamage plots=(predpplot);
by TimeIndoors;
model NEvents / NTrials = Temp / InverseCL;
ods exclude ProbitAnalysis;
ods output ProbitAnalysis=ProbitOut;
run;

proc sgplot data=ProbitOut;
band y=Probability lower=LowerCL upper=UpperCL / group=TimeIndoors transparency=0.5;
series y=Probability x=Variable / group=TimeIndoors;
refline 0.50 / axis=y lineattrs=(color=brown);
xaxis grid values=(-30 to -3 by 3);
yaxis grid;
run;
For each treatment group (time spent indoors), the graph shows probability curves for bud mortality as a function of the outdoor temperature. The median lethal temperature is where these curves and inverse confidence intervals intersect the line of 0.5 probability. (They are called inverse confidence limits because they are limits for the X value that produces a given Y value.) You can use PROC PRINT to display the estimates for LD50 for each treatment group:
The estimates for LD50 use all the data to model the bud mortality. This method also produces 95% (inverse) confidence limits for the LD50 parameter. For one of the treatment groups (TimeIndoors=14), the confidence limits could not be produced. The documentation for PROC PROBIT discusses why this can happen.
If you want, you can use this table to estimate the differences between the LD50 values.
In summary, growing beautiful Christmas trees requires a lot of science and statistics. This article analyzes data from an experiment in which potted Christmas trees were brought indoors and later returned outdoors where they can experience freezing temperatures. Trees that are brought indoors lose some of their hardiness and can be damaged by freezing temperatures. This article shows how to use PROC PROBIT in SAS to compute the median lethal temperature (LD50), which is the temperature at which half of the trees would be damaged. After 20 days indoors, a spruce tree will lose about 10 degrees (C) of resistance to freezing temperatures.
If you are thinking about getting a containerized (live) Christmas tree, Cregg (2020) wrote an article about how to take care of it and how to prepare it for transplanting after Christmas. He suggests waiting until spring. In the 2009 experiment, 100% of the trees that had been brought indoors for 10 or more days died within six months of transplanting, as compared to 0% of the control group. This happened even though the outdoor temperature never approached the LD50 level. This experiment was done in Michigan, so the results might not apply to trees in warmer regions.
The post Can you transplant an indoor Christmas tree? appeared first on The DO Loop.
The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.
This article presents a solution for PROC LOGISTIC. At the end of this article, I present a few tips for other SAS procedures.
Here's the main idea: PROC LOGISTIC supports an INEST= option that you can use to specify initial values of the parameters. It also supports the MAXITER=0 option on the MODEL statement, which tells the procedure not to perform any iterations to try to improve the parameter estimates. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want. Furthermore, you can use the STORE statement to store the model and use PROC PLM to perform scoring, visualization, and other post-fitting analyses.
I have used this technique previously to compute parameter estimates in PROC HPLOGISTIC and use them in PROC LOGISTIC to estimate odds ratios, the covariance matrix of the parameters, and other inferential quantities that are not available in PROC HPLOGISTIC. In a similar way, PROC LOGISTIC can construct ROC curves for predictions that were made outside of PROC LOGISTIC.
As a motivating example, let's create parameter estimates by using multiple imputations. The documentation for PROC MIANALYZE has an example of using PROC MI and PROC MIANALYZE to estimate the parameters for a logistic model. The following data and analysis are from that example. The data are lengths and widths of two species of fish (perch and parkki). Missing values are artificially introduced. A scatter plot of the data is shown.
data Fish2;
title 'Fish Measurement Data';
input Species $ Length Width @@;
datalines;
Parkki 16.5 2.3265   Parkki 17.4 2.3142   .      19.8  .
Parkki 21.3 2.9181   Parkki 22.4 3.2928   .      23.2 3.2944
Parkki 23.2 3.4104   Parkki 24.1 3.1571   .      25.8 3.6636
Parkki 28.0 4.1440   Parkki 29.0 4.2340   Perch   8.8 1.4080
.      14.7 1.9992   Perch  16.0 2.4320   Perch  17.2 2.6316
Perch  18.5 2.9415   Perch  19.2 3.3216   .      19.4  .
Perch  20.2 3.0502   Perch  20.8 3.0368   Perch  21.0 2.7720
Perch  22.5 3.5550   Perch  22.5 3.3075   .      22.5  .
Perch  22.8 3.5340   .      23.5  .       Perch  23.5 3.5250
Perch  23.5 3.5250   Perch  23.5 3.5250   Perch  23.5 3.9950
.      24.0  .       Perch  24.0 3.6240   Perch  24.2 3.6300
Perch  24.5 3.6260   Perch  25.0 3.7250   .      25.5 3.7230
Perch  25.5 3.8250   Perch  26.2 4.1658   Perch  26.5 3.6835
.      27.0 4.2390   Perch  28.0 4.1440   Perch  28.7 5.1373
.      28.9 4.3350   .      28.9  .       .      28.9 4.5662
Perch  29.4 4.2042   Perch  30.1 4.6354   Perch  31.6 4.7716
Perch  34.0 6.0180   .      36.5 6.3875   .      37.3 7.7957
.      39.0  .       .      38.3  .       Perch  39.4 6.2646
Perch  39.3 6.3666   Perch  41.4 7.4934   Perch  41.4 6.0030
Perch  41.3 7.3514   .      42.3  .       Perch  42.5 7.2250
Perch  42.4 7.4624   Perch  42.5 6.6300   Perch  44.6 6.8684
Perch  45.2 7.2772   Perch  45.5 7.4165   Perch  46.0 8.1420
Perch  46.6 7.5958
;

proc format;
value $FishFmt " " = "Unknown";
run;

proc sgplot data=Fish2;
format Species $FishFmt.;
styleattrs DATACONTRASTCOLORS=(DarkRed LightPink DarkBlue);
scatter x=Length y=Width / group=Species markerattrs=(symbol=CircleFilled);
run;
The analyst wants to use PROC LOGISTIC to create a model that uses Length and Width to predict whether a fish is perch or parkki. The scatter plot shows that the parkki (dark red) tend to be less wide than the perch of the same length. For some fish in the graph, the species is not known.
Because the data contains missing values, the analyst uses PROC MI to run 25 missing value imputations, uses PROC LOGISTIC to produce 25 sets of parameter estimates, and uses PROC MI to combine the estimates into a single set of parameter estimates. See the documentation for a discussion.
/* Example from the MIANALYZE documentation
   "Reading Logistic Model Results from a PARMS= Data Set"
   https://bit.ly/394VlI7 */
proc mi data=Fish2 seed=1305417 out=outfish2;
class Species;
monotone logistic( Species= Length Width);
var Length Width Species;
run;

ods select none;
options nonotes;
proc logistic data=outfish2;
class Species;
model Species= Length Width / covb;
by _Imputation_;
ods output ParameterEstimates=lgsparms;
run;
ods select all;
options notes;

proc mianalyze parms=lgsparms;
modeleffects Intercept Length Width;
ods output ParameterEstimates=MI_PE;
run;

proc print data=MI_PE noobs;
var Parm Estimate;
run;
The parameter estimates from PROC MIANALYZE are shown. The question is: How can you use PROC LOGISTIC and PROC PLM to score and visualize this model, given that the estimates are produced outside of PROC LOGISTIC?
As mentioned earlier, a solution to this problem is to use the INEST= option on the PROC LOGISTIC statement in conjunction with the MAXITER=0 option on the MODEL statement. When used together, you can get PROC LOGISTIC to evaluate any logistic model you want, and you can use the STORE statement to create an item store that can be read by PROC PLM to perform scoring and visualization.
You can create the INEST= data set by hand, but it is easier to use PROC LOGISTIC to create an OUTEST= data set and then merely change the values for the parameter estimates, as done in the following example:
/* 1. Use PROC LOGISTIC to create an OUTEST= data set */
proc logistic data=Fish2 outest=OutEst noprint;
class Species;
model Species= Length Width;
run;

/* 2. replace the values of the parameter estimates with different values */
data inEst;
set outEst;
Intercept = -0.130560;
Length = 1.169782;
Width = -8.284998;
run;

/* 3. Use the INEST= data set and MAXITER=0 to get PROC LOGISTIC to create a model.
      Use the STORE statement to write an item store.
      https://blogs.sas.com/content/iml/2019/06/26/logistic-estimates-from-hplogistic.html */
proc logistic data=Fish2 inest=InEst;        /* read in external model */
model Species= Length Width / maxiter=0;     /* do not refine model fit */
effectplot contour / ilink;
store LogiModel;
run;
The contour plot is part of the output from PROC LOGISTIC. You could also request an ROC curve, odds ratios, and other statistics. The contour plot visualizes the regression model. For a fish of a given length, wider fish are predicted to be perch (blue) and thinner fish are predicted to be parkki (red).
Because PROC LOGISTIC writes an item store for the model, you can use PROC PLM to perform a variety of scoring tasks, visualization, and hypothesis tests. The following statements create a scoring data set and use PROC PLM to score the model and estimate the probability that each fish is a parkki:
/* 4. create a score data set */
data NewFish;
input Length Width;
datalines;
17.0 2.7
18.1 2.1
21.3 2.9
22.4 3.0
29.1 4.3
;

/* 5. predictions on the DATA scale */
proc plm restore=LogiModel noprint;
score data=NewFish out=ScoreILink predicted lclm uclm / ilink;   /* ILINK gives probabilities */
run;

proc print data=ScoreILink;
run;
According to the model, the first and fifth fish are probably perch. The second, third, and fourth fish are predicted to be parkki, although the 95% confidence intervals indicate that you should not be too confident in the predictions for the third and fourth observations.
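The scoring arithmetic itself is simple enough to verify by hand. The following Python sketch is an illustration (not the SAS procedure's internals), assuming the model predicts the probability that a fish is a parkki, the first ordered level of Species. It applies the MIANALYZE estimates to the NewFish data:

```python
import math

# MI parameter estimates (Intercept, Length, Width) from PROC MIANALYZE
b0, b_len, b_wid = -0.130560, 1.169782, -8.284998

def prob_parkki(length, width):
    """Inverse-logit scoring: P(Species = 'Parkki') under the external model."""
    eta = b0 + b_len*length + b_wid*width
    return 1.0 / (1.0 + math.exp(-eta))

new_fish = [(17.0, 2.7), (18.1, 2.1), (21.3, 2.9), (22.4, 3.0), (29.1, 4.3)]
probs = [prob_parkki(L, W) for (L, W) in new_fish]
print([round(p, 3) for p in probs])
# The 1st and 5th fish have low parkki probability (predicted perch);
# the 2nd, 3rd, and 4th have probabilities above 0.5 (predicted parkki).
```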
Unfortunately, not every regression procedure in SAS is as flexible as PROC LOGISTIC. In many cases, it might be difficult or impossible to "trick" a SAS regression procedure into analyzing a model that was produced externally. Here are a few thoughts from me and from one of my colleagues. I didn't have time to fully investigate these ideas, so caveat emptor!
This article shows how to score parametric regression models when the parameter estimates are not fit by the usual procedures. For example, multiple imputations can produce a set of parameter estimates. In PROC LOGISTIC, you can use an INEST= data set to read the estimates and use the MAXITER=0 option to suppress fitting. You can use the STORE statement to store the model and use PROC PLM to perform scoring and visualization. Other procedures have similar options, but there is not a single method that works for all SAS regression procedures.
If you use any of the ideas in this article, let me know how they work by leaving a comment. If you have an alternate way to trick SAS regression procedures into using externally supplied estimates, let me know that as well.
The post How to score a logistic regression model that was not fit by PROC LOGISTIC appeared first on The DO Loop.