An easier way to perform regression with restricted cubic splines in SAS

34
Maybe if we think and wish and hope and pray
It might come true.
Oh, wouldn't it be nice?

The Beach Boys

Months ago, I wrote about how to use the EFFECT statement in SAS to perform regression with restricted cubic splines. This is the modern way to use splines in a regression analysis in SAS, and it replaces the need to use older macros such as Frank Harrell's %RCSPLINE macro. I shared my blog post with a colleague at SAS and mentioned that the process could be simplified. In order to specify the placement of the knots as suggested by Harrell (Regression Modeling Strategies, 2010 and 2015), I had to use PROC UNIVARIATE to get the percentiles of the explanatory variable. "Wouldn't it be nice," I said, "if the EFFECT statement could perform that computation automatically?"

I am happy to report that the 15.1 release of SAS/STAT (SAS 9.4M6) includes a new option that makes it easy to place internal knots at percentiles of the data. You can now use the KNOTMETHOD=PERCENTILELIST option on the EFFECT statement to place knots. For example, the following statement places five internal knots at percentiles that are recommended in Harrell's book:
EFFECT spl = spline(x / knotmethod=percentilelist(5 27.5 50 72.5 95));

An example of using restricted cubic in regression in SAS

Restricted cubic splines are also called "natural cubic splines." This section shows how to perform a regression fit by using restricted cubic splines in SAS.

For the example, I use the same Sashelp.Cars data that I used in the previous article. For clarity, the following SAS DATA step renames the Weight and MPG_City variables to X and Y, respectively. If you want to graph the regression curve, you can sort the data by the X variable, but this step is not required to perform the regression.

/* create (X,Y) data from the Sashelp.Cars data. Sort by X for easy graphing. */
data Have;
   set sashelp.cars;
   rename mpg_city = Y  weight = X  model = ID;
run;
 
proc sort data=Have;  by X;  run;

The following call to PROC GLMSELECT includes an EFFECT statement that generates a natural cubic spline basis using internal knots placed at specified percentiles of the data. The MODEL statement fits the regression model and the OUTPUT statement writes an output data set that contains the predicted values. The SGPLOT procedure displays a graph of the regression curve overlaid on the data:

/* fit data by using restricted cubic splines using SAS/STAT 15.1 (SAS 9.4M6) */
ods select ANOVA ParameterEstimates SplineKnots;
proc glmselect data=Have;
   effect spl = spline(X/ details naturalcubic basis=tpf(noint)
             knotmethod=percentilelist(5 27.5 50 72.5 95) ); /* new in SAS/STAT 15.1 (SAS 9.4M6)  */
   model Y = spl / selection=none;       /* fit model by using spline effects */
   output out=SplineOut predicted=Fit;   /* output predicted values */
quit;
 
title "Restricted Cubic Spline Regression";
title2 "Five Knots Placed at Percentiles";
proc sgplot data=SplineOut noautolegend;
   scatter x=X y=Y;
   series x=X y=Fit / lineattrs=(thickness=3 color=red);
run;

In summary, the new KNOTMETHOD=PERCENTILELIST option on the EFFECT statement simplifies the process of using percentiles of a variable to place internal knots for a spline basis. The example shows knots placed at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles of an explanatory variable. These heuristic values are recommended in Harrell's book. For more details about the EFFECT statement and how the location of knots affects the regression fit, see my previous article "Regression with restricted cubic splines in SAS."

You can download the complete SAS program that generates this example, which requires SAS/STAT 15.1 (SAS 9.4M6). If you have an earlier release of SAS, the program also shows how to perform the same computations by calling PROC UNIVARIATE to obtain the location of the knots.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

34 Comments

  1. Thank you for your wonderful work. And I want to konw how to adjusted other confounders in the restricted cubic spline.

  2. Hi,

    I have 4 knots.
    Is there anyway to get the parameters for the curve between the second and third knots, third and fourth knots, etc.

    Thank you.

    • Rick Wicklin

      Yes. When you use knots in a linear regression, you will get estimates for the regression coefficients. The curve of the predicted values is therefore a linear combination of the spline basis, as shown in the article "Visualize a regression with splines." If you want to restrict the interval between the second and third knots, just add together the knots with nonzero support on that interval.

  3. Hi Rick,
    I have 9 data points and I would like to ask you if it's appropriate to apply regression with restricted cubic splines. I have the scatterplot from sgplot, but I can't add it here. The plot shows an upward quadratic then decreases linearly. I will be happy to send the display.

    Thanks!

    • Rick Wicklin

      Sorry that I was unclear. I meant "Sure, go ahead and fit a cubic spline model to 9 data points."

      If you need more help, you can post questions to the SAS Support Communities. The Communities enable you to upload data, programs, and graphics.

  4. Hi Rick,
    Thank you for this post and your comments. They are really helpful.

    Following your above SAS syntax, I wrote the following one (Model one), by specifying the values at each knot.

    /*model one*/
    proc glmselect data=Have;
       effect spl = spline(X/ details naturalcubic basis=tpf(noint)
                 knotmethod=list(2513, 3174, 3474.5, 3903, 5000) ); 
       model Y = spl / selection=none;       
       output out=SplineOut2 predicted=Fit;   
    quit;

    I then created 3 spline variables and wrote model two. The results generated by the two models are almost the same, except the coefficients of the thee spline variables (the t-values and p-values are the same. Would you please share with me your thoughts about the differences?
    Many thanks, Hongjie

    /*model two*/
    data Have2;
     set Have;
     
     if x<=2513   then x2513=0; else x2513=x-2513;
     if x<=3174   then x3174=0; else x3174=x-3174;
     if x<=3474.5 then x3474=0; else x3474=x-3474.5;
     if x<=3903   then x3903=0; else x3903=x-3903;
     if x<=5000   then x5000=0; else x5000=x-5000;
     x_sq=x*x;
     x_q=x**3;
     x2513q=x2513**3;
     x3174q=x3174**3;
     x3474q=x3474**3;
     x3903q=x3903**3;
     x5000q=x5000**3;
     
     term2=x2513q-x3903q*((5000-2513)  /(5000-3903))+x5000q*((3903-2513)  /(5000-3903));
     term3=x3174q-x3903q*((5000-3174)  /(5000-3903))+x5000q*((3903-3174)  /(5000-3903));  
     term4=x3474q-x3903q*((5000-3474.5)/(5000-3903))+x5000q*((3903-3474.5)/(5000-3903));
     run;
     
     proc reg data=Have2;
     model y=x term2 term3 term4;
     output out=out4 p=Fit;
    run;
  5. Hi Rick,

    I have to manually create basis functions for a variable but SAS doesn't seem to have a procedure to do it. I am using Frank Harrell's macro RCSPLINE but the basis variables are not identical to those created by SAS using OUTDESIGN. It seems like SAS is dividing the second basis by 4 and the last one by 4.5. Not sure where these numbers come from for the example given below. Why are SAS numbers different from the ones created by the macro?

    data simulation;
    do i=1 to 10000;
    t=rand('uniform',0,10);
    p=1/(1+exp(sin(t)));
    y=rand('bernoulli',p);
    output;
    end;
    run;

    *model a natural cubic spline;
    *and store the result in "mystore";
    proc logistic data=simulation OUTDESIGN=splines;
    effect myspline=spline(t/knotmethod=list(1,2,3,4) NATURALCUBIC);
    model y(event="1")=myspline;
    store mystore;
    run;
    /*-------------------------
    MACRO FOR CUBIC SPLINES
    -------------------------*/
    %macro rcspline(x,knot1,knot2,knot3,knot4,knot5,knot6,knot7,knot8,knot9,knot10, norm=2);
    %LOCAL j v7 k tk tk1 t k1 k2;
    %LET v7=&x; %IF %length(&v7)=8 %THEN %LET v7=%SUBSTR(&v7,1,7);
    %*get no. knots, last knot, next to last knot;
    %DO k=1 %TO 10;
    %IF %QUOTE(&&knot&k)= %THEN %GOTO nomorek;
    %END;
    %LET k=11;
    %nomorek: %LET k=%EVAL(&k-1); %LET k1=%EVAL(&k-1); %LET k2=%EVAL(&k-2);
    %IF &k<3 %THEN %PUT ERROR: <3 knots given. no spline variables CREATEd.;
    %ELSE %DO;
    %LET tk=&&knot&k;
    %LET tk1=&&knot&k1;
    DROP _kd_; _kd_=
    %IF &norm=0 %THEN 1;
    %ELSE %IF &norm=1 %THEN &tk - &tk1;
    %ELSE (&tk - &knot1)**.666666666666; ;
    %DO j=1 %TO &k2;
    %LET t=&&knot&j;
    &v7&j=max((&x-&t)/_kd_,0)**3+((&tk1-&t)*max((&x-&tk)/_kd_,0)**3
    -(&tk-&t)*max((&x-&tk1)/_kd_,0)**3)/(&tk-&tk1)%STR(;);
    %END;
    %END;
    %mend rcspline;

    **comparing to splines datasets from PROC LOGISTIC
    **t1 from PROC LOGISTIC=t1 from below divided by 4-1=3;
    DATA want; SET simulation;
    %rcspline(t,1,2,3,4);
    RUN;

  6. Thank you Rick!

    I am also looking to implement penalized spline and I was wondering if there is a way to do so in SAS. I am looking for a SAS equivalent of the R- PSPLINE functionthat can be used in PROC LOGISTIC.

  7. Harells %PSLINE macro is for plotting the restricted cubic spline. The PSPLINE function is R is a piecewise polynomial with a smoothing penalty. I am not sure if SAS has anything at the moment.

  8. I just found out that there is a PSPLINE(x) that can be used only in PROC TRANSREG. I am running conditional logit so I cannot use it in PROC LOGISTIC or PROC PHREG.

  9. Hello Rick,
    I am having trouble creating these visualizations in SAS using PROC PHREG. I tried using your code above by trying to request an output dataset with predicted fit values using the following line within PHREG: OUTPUT out =SplineOut predicted = Fit;. However, it does not seem as if PHREG supports requesting predicted fit. Is there some other way to get a similar graph in PHREG? Thanks.

  10. Hi Rick,
    I would like to ask you if you have used the %RCS_reg package, I am using it and I have found that I can run the results but I cannot generate the RCS curves. The log reports an error saying, I would like to consult you on what this can be done. Here is my code%RCS_Reg( infile= NHSALL_2, outfile= NHSALL_out,
    Dir_data= ,
    Main_spline_var= LBXVIDMS, ref_val= , AVK_msv= 0, knots_msv=5 50 95,
    Oth_spline_var1=,
    typ_reg= cox,
    dep_var= mort_all, surv_time_var= PERyear_EXM,
    adjust_var=sex1 sex2 race1 race2 race3 race4 BMI1 BMI2 BMI2 edu1
    edu2 edu3 income1 income2 income3 drinking1 drinking2
    drinking3 smoking1 smoking2 smoking3 phyacty1 phyacty2 phyacty3
    DURDM1 DURDM2 DURDM3 MEDDM1 MEDDM2 MEDDM3 MEDDM4
    hbp cvd ghpgp hbc,
    Y_ref_line= 1, max_Xaxis= ,
    exp_beta= 1, round=,
    specif_val= );
    quit;
    . The variables are converted to dummy variables except for age which is continuous.

  11. Hi Rick,
    This blog is really so helpful and appreciated. Your code for plotting linear gression with RCS works very well when there are no covariates, but how can I visualize the multivaraible-adjusted spline. Here is my code. I found this code only plots unadjusted predicted value.

    proc glmselect data=diet outdesign(addinputvars fullmodel)=SplineBasis;
    effect splT_fish= spline(raw_T_fish/ basis=tpf(noint) NATURALCUBIC details knotmethod=percentilelist(5 27.5 50 72.5 95));
    model n3pufas=splT_fish AGE SEX BMI raw_cereals raw_pulses raw_meat raw_egg raw_milk
    raw_potato raw_fruits raw_veg nuts vegoilfat seaweeds energy /selection=none;

    proc reg data=SplineBasis plots=none;
    model n3pufas=splT_fish_: ;
    output out=regout predicted=fit LCLM=Lower UCLM=Upper;

    proc sgplot data=regOut noautolegend Noborder nowall;
    scatter x=raw_T_fish y=Fit;
    series x=raw_T_fish y=Fit ;
    Series y=Lower x=raw_T_fish ;
    Series y=Upper x=raw_T_fish;

  12. Hocine Tighiouart on

    Hi Rick,
    I am fitting a restricted cubic spline using proc phreg using the effect statement, see below. My question is when you use the baseline statement to predicted survival at certain time point for different values of the variable dpgfr3m, does below work if the covariate included in the model is spl? the dataset cox1_temp has value for dpgfr3m and not spl

    proc phreg data=resample_treated;
    effect spl=spline(dpgfr3m/details naturalcubic basis=tpf(noint) knotmethod=percentilelist(5 27.5 50 72.5 95));
    model v1_fu_d*v1_ev_d(0)=spl;
    baseline covariates=cox1_temp out=cox1_temp1 survival=surv1 timelist=12 24 36 48;
    by replicate;
    run;

Leave A Reply

Back to Top