Weekday Morning Quick-trick: How to Score from PROC VARCLUS

Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, PRINCOMP, and others, have an OUT= option that saves the scores along with the input data. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that the procedure uses to score observations in the DATA= data set. The coefficients are stored in the observations where _TYPE_='SCORE'.

The only catch is that a procedure that builds its solution hierarchically, such as VARCLUS, writes score coefficients for the 1-cluster, 2-cluster, 3-cluster solutions and so on, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring: subset the score coefficients to only the _NCL_ solution you want. Then, use the same VAR statement from PROC VARCLUS in PROC SCORE.

Here is an example:
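
A minimal sketch of such a run, assuming the SASHELP.CARS sample data and an illustrative variable list (neither is necessarily what produced the results discussed here). The OUTSTAT= data set is named scorecode to match the text:

```sas
/* Sketch only: SASHELP.CARS and the VAR list are illustrative assumptions. */
proc varclus data=sashelp.cars
             maxclusters=3        /* request at most three clusters */
             outstat=scorecode;   /* save statistics, including score coefficients */
   var MSRP Invoice EngineSize Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```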

The analysis has produced three clusters.

Here is a subset of the scorecode data set, which contains the SCORE observations. The first column, _NCL_, indicates the number of clusters. You only want to score the three-cluster solution.
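
One way to list those rows yourself, assuming the OUTSTAT= data set is named scorecode as in the text:

```sas
/* Print only the score-coefficient rows; _NCL_ shows which */
/* cluster solution each row belongs to.                    */
proc print data=scorecode noobs;
   where _type_ = 'SCORE';
run;
```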


Whether you prefer to subset and replace the data, as shown here, or subset and create a new data table is up to you.
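
A sketch of the subset-and-replace version, assuming the OUTSTAT= data set is named scorecode. Keeping the rows with a missing _NCL_ is an assumption about the OUTSTAT= layout: the overall MEAN and STD rows are not tied to any one cluster solution, and PROC SCORE uses them to standardize the variables before applying the coefficients.

```sas
/* Keep the three-cluster rows. Rows with a missing _NCL_ are kept too, */
/* on the assumption that they hold the overall MEAN and STD statistics */
/* that PROC SCORE uses for standardization.                            */
data scorecode;
   set scorecode;
   where _ncl_ = 3 or _ncl_ = .;
run;
```

To create a new table instead of replacing scorecode, give the DATA statement a different name and pass that name to SCORE= in PROC SCORE.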

This code scores the original data set with the three cluster scores. You can use a different DATA= data set to score new data, as long as it contains all of the variables needed for scoring. Note that I used a VAR statement in PROC SCORE here, so that the procedure knows which columns of the scorecode data set to use for scoring. Otherwise, the _NCL_ column would pose a problem.
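
A minimal sketch of the scoring step, assuming the subset scorecode data set and the same illustrative SASHELP.CARS variables as the VARCLUS run (the output name scored is also just a placeholder):

```sas
/* Sketch only: data set names and the VAR list are illustrative assumptions. */
proc score data=sashelp.cars     /* data to score; point this at new data if needed */
           score=scorecode       /* subset OUTSTAT= data set */
           out=scored;           /* input data plus one score column per cluster */
   /* Same VAR list as in PROC VARCLUS, so _NCL_ is not treated as a scoring variable */
   var MSRP Invoice EngineSize Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;
```

The new score columns take their names from the _NAME_ values on the SCORE rows of the scorecode data set.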

Here are the new columns in the scored data set.

About Author

Catherine Truxillo

Catherine Truxillo, Ph.D. has been a Statistical Training Specialist at SAS since 2000 and has written or co-written SAS training courses for advanced statistical methods including: multivariate statistics, linear and generalized linear mixed models, multilevel models, structural equation models, imputation methods for missing data, statistical process control, design and analysis of experiments, and cluster analysis. Although she primarily works with advanced statistics topics, she also teaches SAS courses using SAS/IML (the interactive matrix language), SAS Enterprise Guide, SAS Enterprise Miner, SAS Forecast Studio, and JMP software. Before coming to SAS, Catherine completed her Ph.D. in Social Psychology with an emphasis in Statistics at The University of Texas at Austin. While at UT Austin, she completed an internship with the Math and Computer Science department's statistical consulting help desk and taught a number of undergraduate courses. While teaching and performing her own graduate research, she worked for a software usability design company conducting experiments to assess the ease-of-use of various software interfaces and website designs. Cat's personal interests include triathlon, hiking the woods near her home in North Carolina, and having tea parties with her two children.

5 Comments

  1. Thanks a lot Cat! This really saved me a lot of time. I was planning to create a manual code to multiply and add to get the linear combination for each cluster. Cheers! :)

  2. I see the point about clustering variables. But suppose we were clustering observations with PROC CLUSTER? Does the highest cluster score for a new observation imply the highest probability of cluster membership?

    • Benjamin DeKoven on

      Can you share some information on scoring using PROC CLUSTER, HPCLUSTER, or FASTCLUS? The training dataset would create the seed which can be used for cluster assignment of the new observations. How to tell if some of the new observations really do not assign well to seed clusters and a new or revised cluster is needed?

      Thank you,
      Ben DeKoven

  3. Since PROC VARCLUS is creating clusters of variables, not observations, each observation gets a cluster score for each cluster. For example, if my variables are height, weight, shoe_size, salary, and savings_balance and I have 2 clusters (height, weight and shoe_size in cluster 1, salary and savings_balance in cluster 2), then the cluster scores for each observation would be 1: your score on the combination of the size variables (whether you are generally larger or smaller than others in the sample), and 2: your score on the combination of wealth variables (whether you are generally richer or poorer than the others in the sample).
    I hope that helps!

  4. How do you compute the most probable cluster for each observation?
    Is it the component score with the highest value?
