Most regression models try to model a response variable by using a smooth function of the explanatory variables. However, if the data are generated from some nonsmooth process, then it makes sense to use a regression function that is not smooth. A simple way to model a discontinuous process in SAS is to use spline effects and specify repeated value for the knots.

### Discontinuous processes: More common than you might think

The classical ANOVA is one way to analyze data that are collected before and after a known event. For example, you might record gas mileage for a car before and after a tune-up. You might collect patient data before and after they undergo a medical or surgical treatment. You might have data about real estate prices before and after some natural disaster. In all these cases, you might suspect that the response changes abruptly because of the event.

To give a simple example, suppose that a driver records the fuel economy (in mile per gallon) for a car for 12 weeks. Because the car engine is coughing and knocking, the owner brings the car to a mechanic for maintenance. After the maintenance, the car seems to run better and he records the fuel economy for another six weeks. The hypothetical data are below:

data MPG; input mpg @@; week = _N_; period = ifc(week > 12, "After ", "Before"); label mpg="Miles per Gallon"; datalines; 30.5 28.1 27.1 31.2 25.2 31.1 27.7 28.2 29.6 30.6 28.9 25.9 30.6 33.0 31.2 29.7 32.7 31.1 ; |

Notice that the data contains a binary indicator variable (period) that records whether the data are from before or after the tune-up. You can use PROC GLM to perform a simple ANOVA analysis to determine whether there was a significant change in the mean fuel economy after the maintenance. The following call to PROC GLM runs an ANOVA and indicates that the mean fuel economy is about 2.7 mpg better after the tune-up.

proc glm data=MPG plots=fitplot; class period / ref=first; model mpg = period /solution; output out=out predicted=Pred; quit; |

Graphically, this regression analysis is usually visualized by using two box plots. (PROC GLM creates the box plots automatically when ODS graphics are enabled.) However, because the independent variable is time, you could also use a series plot to show the observed data and the mean response before and after the maintenance. By using the GROUP= option on the SERIES statement, you can get two lines for the "before" and "after" time periods.

title "Piecewise Constant Regression with Jump Discontinuity"; proc sgplot data=Out; block x=week block=period / transparency=0.8; scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black); series x=week y=pred / group=period lineattrs=(thickness=3) ; run; |

The graph shows that the model has a jump discontinuity at the time point at which the maintenance intervention occurred. If you include the WEEK variable in the analysis, you can model the response as a linear function of time, with a jump at the time of the tune-up.

All this is probably very familiar. However, did you know that you can use splines to model the data as a continuous function that has a kink or "corner" at the time of the maintenance event? You can use this feature when the model is continuous, but the slope changes at a known time.

### Splines for nonsmooth models

Several SAS procedures support the EFFECT statement, which enables you to build spline effects. The paper "Rediscovering SAS/IML Software" (Wicklin 2010, p. 4) has an example where splines are used to construct a highly nonlinear curve for a scatter plot.

A spline effect is determined by the placement of certain points called "knots." Often knots are evenly spaced within the range of the explanatory variable, but the EFFECT statement supports many other ways to position the knots. In fact, the documentation for the EFFECT statement says:
"If you remove the restriction that the knots of a spline must be distinct and allow repeated knots, then you can obtain functions with less smoothness and even discontinuities at the repeated knot location. For a spline of degree *d* and a repeated knot with multiplicity *m* ≤ *d*, the piecewise polynomials
that join such a knot are required to have only *d – m* matching derivatives."

The degree of a linear regression is *d*=1, so if you specify a knot position once you obtain a piecewise linear function that contains a "kink" at the knot. The following call to PROC GLIMMIX demonstrates this technique. (I use GLIMMIX because neither PROC GLM nor PROC GENMOD support the EFFECT statement.) You can manually specify the position of knots by using the KNOTMETHOD=LIST(*list*) option on the EFFECT statement.

proc glimmix data=MPG; effect spl = spline(week / degree=1 knotmethod=list(1 13 18)); /* p-w linear */ *effect spl = spline(week / degree=2 knotmethod=list(1 13 13 1); /*p-w quadratic */ model mpg = spl / solution; output out=out predicted=Pred; quit; title "Piecewise Linear Regression with Kink"; proc sgplot data=Out noautolegend; block x=week block=period / transparency=0.8; scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black); series x=week y=pred / lineattrs=(thickness=3) ; run; |

The graph shows that the model is piecewise linear, but that the slope of the model changes at week=13. In contrast, the second EFFECT statement in the PROC GLIMMIX code (which is commented out), specifies piecewise quadratic polynomials (*d*=2) and
repeats the knot at week=13. That results in two quadratic models that give the same predicted value at week=13 but the model is not smooth at that location. Try it out!

If you are using a SAS procedure that does not support the EFFECT statement, you can use the GLMMIX procedure to output the dummy variables that are associated with the spline effects. A nice paper by David Pasta (2003) describes how to use dummy variables in a variety of models. The paper was written before the EFFECT statement; many of the ideas in the paper are easier to implement by using the EFFECT statement.

Lastly, the TRANSREG procedure in SAS supports spline effects but has its own syntax. See the TRANSREG documentation, which includes an example of repeating knots to build a regression model for discontinuous data.

Have you ever needed to construct a nonsmooth regression model? Tell your story by leaving a comment.

## 4 Comments

Great topic, Rick. Regression splines are very useful but under-used in practice as they are somewhat complicated. There are different mathematical forms of the basis function -- linear, quadratic, cubic, restricted cubic or natural; different basis functions -- polynomial, bernstein polynomial, orthogonal polynomial; and perhaps most painful of all, different ways of doing regression with them -- generate basis yourself via data step (ouch), or let different procedures generate the basis column and then feed them into analysis procedure, e.g., proc logistic must be fed the design matrix, and there are procedures as you illustrated above that will generate the splines and perform regression within the same proc.

It would be great if you can share your thoughts on the intricacies involved and all the magical secrets hidden in SAS procedures. I have been using SAS for a long time but did not fully appreciate the capabilities of the EFFECT statement!

Daymond

It is hard to not think about interrupted time series or regression discontinuity with this topic/example. Both approaches provide different slopes and the latter is more difficult to interpret if the agenda is related to hypothesis testing. I am curious to see comments on when the latter would be beneficial and interpretations of the different slopes between these approaches.

Hayden, to me the example as stated, change in car mileage after a maintenance, is not one of slopes in any way. So, I would certainly use the standard analysis of variance approach to determine the change in mileage. But there are plenty of other situations where the slope might be of interest, and thus the spline fit method provided by Rick is very useful. However, the question of do you compare means or slopes is one that should be determine before the analysis is ever performed; the specific situation determines what analysis to use (slopes or means).

A wonderful topic. I've done papers on TRANSREG; I think it's an underutilzed PROC. I think the graphical output from TRANSREG is an area that could be fun to work on - I did a little in my paper.