Techniques for scoring a regression model in SAS


My previous post described how to use the "missing response trick" to score a regression model. As I said in that article, there are other ways to score a regression model. This article describes using the SCORE procedure, a SCORE statement, the relatively new PLM procedure, and the CODE statement.

The following DATA step defines a small set of data. The goal of the analysis is to fit various regression models to Y as a function of X, and then evaluate each regression model on a second data set, which contains 200 evenly spaced X values.

/* the original data; fit model to these values */
data A;               
input x y @@;
datalines;
1  4  2  9  3 20  4 25  5  1  6  5  7 -4  8 12
;
 
/* the scoring data; evaluate model on these values */
%let NumPts = 200;
data ScoreX(keep=x);
min=1; max=8;
do i = 0 to &NumPts-1;
   x = min + i*(max-min)/(&NumPts-1);     /* evenly spaced values */
   output;                                /* no Y variable; only X */
end;
run;

The SCORE procedure

Some SAS/STAT procedures can output parameter estimates for a model to a SAS data set. The SCORE procedure can read those parameter estimates and use them to evaluate the model on new values of the explanatory variables. (For a regression model, the SCORE procedure performs matrix multiplication: you supply the scoring data X and the parameter estimates b and the procedure computes the predicted values p = Xb.)

The canonical example is fitting a linear regression by using PROC REG. You can use the OUTEST= option to write the parameter estimates to a data set. That data set, which is named RegOut in this example, becomes one of the two input data sets for PROC SCORE, as follows:

proc reg data=A outest=RegOut noprint;
   YHat: model y = x;    /* name of model is used by PROC SCORE */
quit;
 
proc score data=ScoreX score=RegOut type=parms predict out=Pred;
   var x;
run;

It is worth noting that the label for the MODEL statement in PROC REG is used by PROC SCORE to name the predicted variable. In this example, the YHat variable in the Pred data set contains the predicted values. If you do not specify a label on the MODEL statement, then a default name such as MODEL1 is used. For more information, see the documentation for the SCORE procedure.
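To make the p = Xb computation concrete, here is a minimal sketch that reproduces the PROC SCORE result by hand in a DATA step. It assumes that the RegOut data set contains a single row with the estimates stored in variables named Intercept and x, which is the usual layout for OUTEST= output from PROC REG. The names PredManual, b0, and b1 are hypothetical and are used only for this illustration.

```sas
/* Sketch: evaluate the fitted line manually (PredManual, b0, b1 are made-up names) */
data PredManual(drop=b0 b1);
   /* read the one row of estimates once; SET variables are retained across iterations */
   if _n_ = 1 then
      set RegOut(keep=Intercept x rename=(Intercept=b0 x=b1));
   set ScoreX;               /* the 200 evenly spaced X values */
   YHat = b0 + b1*x;         /* intercept + slope * x, i.e., one row of Xb */
run;
```

If the assumptions hold, the YHat variable in PredManual should match the YHat variable that PROC SCORE writes to the Pred data set. The coefficient variable is renamed because it has the same name (x) as the explanatory variable in the scoring data.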

The SCORE statement

Nonparametric regression procedures cannot output parameter estimates because...um...because they are nonparametric! Instead, these procedures support a SCORE statement, which enables you to specify the scoring data set directly in the procedure call. The following example shows the syntax of the SCORE statement for the TPSPLINE procedure, which fits a thin-plate spline to the data:

proc tpspline data=A;
   model y = (x);
   score data=ScoreX out=Pred;
run;

Other nonparametric procedures that support the SCORE statement include the ADAPTIVEREG procedure (new in SAS/STAT 12.1), the GAM procedure, and the LOESS procedure.
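As an illustration of the same pattern in another procedure, the following sketch uses the SCORE statement in PROC GAM to fit a univariate smoothing spline and score the ScoreX data. The output data set name PredGAM is hypothetical, and the default degrees of freedom for the spline are used, which might not be the best choice for only eight observations.

```sas
/* Sketch: SCORE statement in PROC GAM (PredGAM is a made-up data set name) */
proc gam data=A;
   model y = spline(x);              /* smoothing spline with default df */
   score data=ScoreX out=PredGAM;    /* evaluate the fitted model on the scoring data */
run;
```

The details of the SCORE statement differ slightly among procedures, so check the documentation of each procedure for the supported options and for the name of the predicted variable.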

The STORE statement and the PLM procedure

Although the STORE statement and the PLM procedure were introduced in SAS/STAT 9.22 (way back in 2010), some SAS programmers are still not aware of these features. Briefly, the idea is that sometimes a scoring data set is not available when a model is fit, so the STORE statement saves all of the information needed to recreate and evaluate the model. The saved information can be read by the PLM procedure, which includes a SCORE statement, as well as many other capabilities. A good introduction to the PLM procedure is Tobias and Cai (2010), "Introducing PROC PLM and Postfitting Analysis for Very General Linear Models."

For this example, the GLM procedure is used to fit the data. Because of the shape of the previous thin-plate spline curve, a cubic model is fit. The STORE statement is used to save the model information in an item store named WORK.ScoreExample. (I've used the WORK libref, but use a permanent libref if you want the item store to persist across SAS sessions.) Many hours or days later, you can use the PLM procedure to evaluate the model on a new set of data, as shown in the following statements:

proc glm data=A;
   model y = x | x | x;    
   store work.ScoreExample;     /* store the model */
quit;
 
proc plm restore=work.ScoreExample;
   score data=ScoreX out=Pred;  /* evaluate the model on new data */
run;

The STORE statement is supported by many SAS/STAT regression procedures, including the GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG procedures. It also applies to the RELIABILITY procedure in SAS/QC software.
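The following sketch shows how the "fit now, score later" workflow might look with a permanent item store. The libref Models and the directory path are hypothetical placeholders, as are the output variable names Pred, Lower, and Upper. The SCORE statement in PROC PLM accepts keywords such as PREDICTED=, LCLM=, and UCLM= to name the output statistics.

```sas
/* Session 1: fit the model and save it in a permanent item store */
libname Models '/path/to/models';      /* hypothetical directory */
proc glm data=A;
   model y = x | x | x;
   store Models.ScoreExample;
quit;

/* Session 2 (hours or days later): restore the model and score new data */
libname Models '/path/to/models';      /* same hypothetical directory */
proc plm restore=Models.ScoreExample;
   show parameters;                    /* display the stored parameter estimates */
   score data=ScoreX out=PredCL
         predicted=Pred lclm=Lower uclm=Upper;  /* predictions and limits for the mean */
run;
```

The SHOW statement is a convenient way to remind yourself what is in an item store before you score with it.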

The CODE statement

In SAS/STAT 12.1, the CODE statement was added to several SAS/STAT regression procedures. It is also part of the PLM procedure. The CODE statement offers yet another option for scoring data: it writes DATA step statements into a text file, and you can then use the %INCLUDE statement to insert those statements into a DATA step. In the following example, DATA step statements are written to the file glmScore.sas. That file is then included in a DATA step in order to evaluate the model on the ScoreX data:

proc glm data=A noprint;
   model y = x | x | x;  
   code file='glmScore.sas';
quit;
 
data Pred;
set ScoreX;
%include 'glmScore.sas';
run;

For this example, the predicted values are in a variable called P_y in the Pred data set. The CODE statement is supported by many predictive modeling procedures, such as the GENMOD, GLIMMIX, GLM, GLMSELECT, LOGISTIC, MIXED, PLM, and REG procedures in SAS/STAT software. In addition, the CODE statement is supported by the HPLOGISTIC and HPREG procedures in SAS High-Performance Analytics software.
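Because the CODE statement is also part of PROC PLM, you can generate scoring code from a model that was saved earlier. The following sketch assumes that the item store WORK.ScoreExample from the previous section still exists. The file name plmScore.sas and the data set name PredPLM are hypothetical.

```sas
/* Sketch: generate DATA step scoring code from a stored model */
proc plm restore=work.ScoreExample;
   code file='plmScore.sas';     /* hypothetical file name */
run;

data PredPLM;                    /* hypothetical data set name */
   set ScoreX;
   %include 'plmScore.sas';      /* creates the predicted-value variable */
run;
```

This combination lets you defer the decision about how to score: store the model once, then either use the SCORE statement in PROC PLM or generate DATA step code for deployment elsewhere.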

In summary, there are many ways to score SAS regression models. For PROC REG and linear models with an explicit design matrix, use the SCORE procedure. For nonparametric models, use the SCORE statement. For scoring data sets long after a model is fit, use the STORE statement and the PLM procedure. For scoring inside the DATA step, use the CODE statement. For regression procedures that do not support these options (such as PROC TRANSREG) use the missing value trick from my last post.

Did I leave anything out? What is your favorite technique to score a regression model? Leave a comment.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

19 Comments

  1. Hi Rick,

    Nice job! I haven't been using proc plm so this will motivate me to start using it. Does adding a "by" group change this much? That's one of the areas that I find R MUCH more confusing than SAS.

    Cheers,
    Bob

    • Rick Wicklin

      From Tobias and Cai (2010): "When you use a BY statement in the analysis that creates an item store,
      the information about BY variables and BY-group-specific modeling results is also transferred to the item store, and
      PROC PLM also automatically processes the item store in BY-group order."

  2. Michelle Homes

    Great blog post, I think you're right about people not being aware of PROC PLM and code statement... Whilst SAS code written years ago still works, enhancing it can give you time and resource benefits. These features you've succinctly highlighted will certainly help.

  3. Pingback: Three ways to add a smoothing spline to a scatter plot in SAS - The DO Loop

  4. I have been testing the new Visual Statistics for a few months and one of my tasks was to verify the CODE output from PROC IMSTAT from the lasr genmodel, glm, and logistic actions. I really like this output because it contains all the information for scoring, but having the explicit code to run as a data step allows me to modify it on the fly. The header for the code also contains useful information that I may have forgotten since the time I fit the model.

  5. Robert Feyerharm

    Let's say I've divided my analysis data into 10 cross-validation subsets, identified as k=1 thru 10. I'd like to train proc reg on 9 of the subsets and then score the 10th validation subset (and then repeat the procedure for each of the other 9 training datasets). However proc score won't allow me to score a subset of the analysis dataset as follows:

    proc reg data=A outest=RegOut noprint;
    where k<10;
    model y = x;
    quit;

    proc score data=A score=RegOut type=parms predict out=Pred;
    where k=10;
    var x;
    run;

    Is there a simple way to do this without having to create separate validation datasets? Thanks!

  6. Richard Remington

    Can PLM provide prediction intervals for predictions at new data points for a model fit with GLIMMIX? My application uses a logistic GLMM with one random effect.

    • Rick Wicklin

      See the doc for the SCORE stmt in PROC PLM, which includes this statement: "Prediction limits (LCL, UCL) are available only for statistical models that allow such limits, typically regression-type models for normally distributed data with an identity link function." If the doc isn't clear, you can discuss statistical procedures at the SAS Support Community.

  7. Pingback: Plot the conditional distribution of the response in a linear regression model - The DO Loop

  8. Pingback: Generate evenly spaced points in an interval - The DO Loop

  9. Pingback: Create a surface plot in SAS - The DO Loop

  10. Thanks for this, Rick! I'm trying to use this with the TRANSREG procedure, and just can't make it work. Have you tried it with that procedure?

    • Rick Wicklin

      The last paragraph says, "For regression procedures that do not support these options (such as PROC TRANSREG) use the missing value trick from my last post." If you have specific questions about syntax, please post to the SAS Support Communities.

  11. Pingback: Use the EFFECTPLOT statement to visualize regression models in SAS - The DO Loop

  12. Pingback: 3 ways to visualize prediction regions for classification problems - The DO Loop
