Have you used multivariate procedures in SAS and wanted to save out scores? Some procedures, such as FACTOR, CANDISC, CANCORR, PRINCOMP, and others, have an OUT= option that writes the scores, together with the input data, to an output data set. However, to score a new data set, or to perform scoring with multivariate procedures that do not have an OUT= option, you need another way.

Scoring new data with these procedures is a snap if you know how to use PROC SCORE. The OUTSTAT= data sets from these procedures include the scoring coefficients that the procedure uses to score observations in the DATA= data set. The coefficients are stored in the observations where _TYPE_='SCORE'.

The only catch is that if you use a hierarchical method such as the VARCLUS procedure, the scoring coefficients are included for the 1-cluster, 2-cluster, 3-cluster, and further solutions, up to the final number of clusters in the analysis. PROC SCORE doesn't know which one to use.

So you have one extra (teeny-tiny) step to perform scoring: Subset the scoring coefficients to only the _NCL_ solution you want. Then, use the same VAR statement from PROC VARCLUS in PROC SCORE.

Here is an example:
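A minimal sketch of such a run, assuming the SASHELP.CARS sample data and ten of its numeric columns (my stand-ins for illustration, not the data behind this post). MAXCLUSTERS=3 asks PROC VARCLUS for at most three clusters, and OUTSTAT=scorecode saves the statistics, including the scoring coefficients for every intermediate solution.

```sas
/* Illustration only: SASHELP.CARS and this variable list are assumptions */
proc varclus data=sashelp.cars maxclusters=3 outstat=scorecode;
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;

/* Look at the scoring coefficients saved for each cluster solution */
proc print data=scorecode;
   where _TYPE_ = 'SCORE';
run;
```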

The analysis has produced three clusters.

Here is a subset of the scorecode data set, which contains SCORE observations. The first column, _NCL_, indicates the number of clusters. You want to score only the three-cluster solution.
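Subsetting to that solution can be sketched as below, assuming the OUTSTAT= data set is named scorecode and three clusters are wanted. Keeping the rows where _NCL_ is missing is a choice of this sketch: as I understand the OUTSTAT= layout, those rows carry overall statistics such as MEAN and STD, which PROC SCORE uses to standardize the variables before applying the coefficients.

```sas
/* Assumes SCORECODE was written by PROC VARCLUS with OUTSTAT=scorecode */
data scorecode;                    /* same name on DATA and SET: replaces the table */
   set scorecode;
   /* _NCL_ = 3 : rows that belong to the three-cluster solution */
   /* _NCL_ = . : overall statistics (for example MEAN and STD)  */
   where _NCL_ = 3 or _NCL_ = .;
run;
```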

Whether you prefer to subset and replace the data, as shown here, or subset and create a new data table is up to you.
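A sketch of the scoring step, assuming scorecode has already been reduced to the three-cluster solution and, purely for illustration, that the analysis variables are ten numeric columns of SASHELP.CARS. The output name scored and the variable list are my assumptions; substitute the VAR list from your own PROC VARCLUS step.

```sas
/* DATA=  : observations to score (the original data or new data)  */
/* SCORE= : coefficients for one cluster solution only             */
/* OUT=   : copy of DATA= plus one new score column per cluster    */
proc score data=sashelp.cars score=scorecode out=scored;
   /* Same VAR list as in PROC VARCLUS, so _NCL_ is left out */
   var MSRP Invoice EngineSize Cylinders Horsepower
       MPG_City MPG_Highway Weight Wheelbase Length;
run;

/* Preview the first few scored observations */
proc print data=scored(obs=5);
run;
```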

This code scores the original data set with the three cluster scores. You can use a different DATA= data set to score new data, as long as it contains all of the variables needed for scoring. Note that I used a VAR statement in PROC SCORE here, so that the procedure knows which columns of the scorecode data set hold the scoring coefficients. Otherwise, the numeric _NCL_ column would be treated as one more variable to score.

Here are the new columns in the scored data set.

## 5 Comments

Thanks a lot Cat! This really saved me a lot of time. I was planning to create a manual code to multiply and add to get the linear combination for each cluster. Cheers! :)

I see the point about clustering variables. But suppose we were clustering observations with PROC CLUSTER? Does the highest cluster score for a new observation imply the highest probability of cluster membership?

Can you share some information on scoring using PROC CLUSTER, HPCLUSTER, or FASTCLUS? The training dataset would create the seed which can be used for cluster assignment of the new observations. How to tell if some of the new observations really do not assign well to seed clusters and a new or revised cluster is needed?

Thank you,

Ben DeKoven

Since PROC VARCLUS is creating clusters of variables, not observations, each observation gets a cluster score for each cluster. For example, if my variables are height, weight, shoe_size, salary, and savings_balance and I have 2 clusters (height, weight and shoe_size in cluster 1, salary and savings_balance in cluster 2), then the cluster scores for each observation would be 1: your score on the combination of the size variables (whether you are generally larger or smaller than others in the sample), and 2: your score on the combination of wealth variables (whether you are generally richer or poorer than the others in the sample).

I hope that helps!

How do you compute the most probable cluster for each observation?

Is it the component score with the highest value?