How can you specify weights for a statistical analysis? Hmmm, that's a "weighty" question! Many people on discussion forums ask "What is a weight variable?" and "How do you choose a weight for each observation?" This article gives a brief overview of weight variables in statistics and includes examples of how weights are used in SAS.

### Different kinds of weight variables

One source of confusion is that different areas of statistics use weights in different ways. All weights are not created equal! The weights in survey statistics have a different interpretation from the weights in a weighted least squares regression.

Let's start with a basic definition.
A *weight variable* provides a value (the *weight*) for each observation in a data set.
The *i*th weight value, $w_i$, is the weight for the *i*th observation.
For most applications, a valid weight is nonnegative. A zero weight usually means that you want to exclude the observation from the analysis. Observations that have relatively large weights have more influence in the analysis than observations that have smaller weights. An unweighted analysis is the same as a weighted analysis in which all weights are 1.
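
To make these rules concrete, here is a minimal sketch of a weighted mean. It is written in plain Python rather than SAS so that it is self-contained; the function name `weighted_mean` and the data values are made up for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for nonnegative weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


x = [2.0, 4.0, 6.0, 100.0]  # hypothetical data; the last value is extreme

unweighted = sum(x) / len(x)
all_ones = weighted_mean(x, [1, 1, 1, 1])        # same as the unweighted mean
drop_last = weighted_mean(x, [1, 1, 1, 0])       # zero weight excludes x[3]
upweight_first = weighted_mean(x, [3, 1, 1, 0])  # x[0] now has more influence

print(unweighted, all_ones, drop_last, upweight_first)
```

With all weights equal to 1, the result matches the unweighted mean. A zero weight removes the extreme value from the computation, and a larger weight on the first observation pulls the estimate toward it.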

There are several kinds of weight variables in statistics. At the 2007 Joint Statistical Meetings in Denver, I discussed weighted statistical graphics for two kinds of statistical weights: survey weights and regression weights. An audience member informed me that STATA software provides four definitions of weight variables, as follows:

- **Frequency weights:** A frequency variable specifies that each observation is repeated multiple times. Each frequency value is a nonnegative integer.
- **Survey weights:** Survey weights (also called *sampling weights* or *probability weights*) indicate that an observation in a survey represents a certain number of people in a finite population. Survey weights are often the reciprocals of the selection probabilities for the survey design.
- **Analytical weights:** An analytical weight (sometimes called an *inverse variance weight* or a *regression weight*) specifies that the *i*th observation comes from a sub-population with variance $\sigma^2 / w_i$, where $\sigma^2$ is a common variance and $w_i$ is the weight of the *i*th observation. These weights are used in multivariate statistics and in meta-analyses where each "observation" is actually the mean of a sample.
- **Importance weights:** According to a STATA developer, an "importance weight" is a STATA-specific term that is intended "for programmers, not data analysts." The developer says that the formulas "may have no statistical validity" but can be useful as a programming convenience. Although I have never used STATA, I imagine that a primary use is to downweight the influence of outliers. The REWEIGHT statement in PROC REG served a similar purpose in the years before robust regression methods were implemented in SAS.

### Frequencies are not weights

I have previously argued that **a frequency variable is not a weight variable**. I provided an example that shows the distinction between a frequency variable and a weight variable in regression. Briefly,
a frequency variable is a notational convenience that enables you to compactly represent the data. A frequency variable determines the sample size (and the degrees of freedom), but using a frequency variable is always equivalent to "expanding" the data set. (To expand the data, create $f_i$ identical observations when the *i*th value of the frequency variable is $f_i$.) An analysis of the expanded data is identical to the same analysis on the original data that uses a frequency variable.
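
The equivalence between a frequency variable and the expanded data can be checked with a small sketch. The example below uses plain Python (not SAS) and made-up values; it computes the mean and sample variance both ways, using the sum of the frequencies as the sample size.

```python
from statistics import mean, variance

values = [10.0, 12.0, 15.0]  # hypothetical distinct data values
freqs = [3, 1, 2]            # hypothetical frequency variable

# Expand the data: repeat the i-th value f_i times.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]

# Frequency-based formulas: the sample size is the sum of the frequencies,
# so the variance divisor (degrees of freedom) is sum(f_i) - 1.
n = sum(freqs)
freq_mean = sum(f * x for x, f in zip(values, freqs)) / n
freq_var = sum(f * (x - freq_mean) ** 2 for x, f in zip(values, freqs)) / (n - 1)

print(freq_mean, mean(expanded))
print(freq_var, variance(expanded))
```

Both approaches give the same mean and the same sample variance, because the frequency variable changes only how the data are stored, not what the data are.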

In SAS, the FREQ statement enables you to specify a frequency variable in most procedures. Ironically, in PROC FREQ you use the WEIGHT statement to specify frequencies. Because weights can be non-integer, the WEIGHT statement enables you to analyze tables that contain expected counts, percentages, and other non-integer values.

### Have survey data? Use survey weights

If you have survey data, you should analyze it by using survey weights. The sum of the survey weights equals the population size. Using survey weights enables you to make correct inferences about the finite population that is represented by the survey.
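
As a rough illustration of how survey weights work, the sketch below builds weights as reciprocals of selection probabilities for a hypothetical stratified sample and uses them to estimate a population total and mean. The strata, population sizes, and responses are invented, and the code is plain Python rather than a SAS SURVEY procedure. It produces point estimates only; standard errors depend on the survey design and are not computed here.

```python
# Hypothetical stratified sample: each stratum has a known population size,
# and units within a stratum are assumed to be selected with equal probability.
strata = {
    "A": {"pop_size": 100, "responses": [4.0, 6.0]},
    "B": {"pop_size": 30, "responses": [1.0, 2.0, 3.0]},
}

weights, y = [], []
for stratum in strata.values():
    n_sampled = len(stratum["responses"])
    # Selection probability is n_sampled / pop_size, so the survey weight
    # (its reciprocal) is pop_size / n_sampled.
    weight = stratum["pop_size"] / n_sampled
    for response in stratum["responses"]:
        weights.append(weight)
        y.append(response)

sum_of_weights = sum(weights)  # matches the total population size
estimated_total = sum(w * v for w, v in zip(weights, y))
estimated_mean = estimated_total / sum_of_weights
unweighted_mean = sum(y) / len(y)

print(sum_of_weights, estimated_total, estimated_mean, unweighted_mean)
```

Each respondent in stratum A stands in for 50 people and each respondent in stratum B for 10, so the weighted mean leans toward stratum A, whereas the unweighted sample mean over-represents stratum B.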

In SAS, you can use the SAS SURVEY procedures to analyze survey data. The SURVEY procedures (including SURVEYMEANS, SURVEYFREQ, and SURVEYREG) also support stratified samples and strata weights.

### Inverse variance weights

Inverse variance weights are appropriate for regression and other multivariate analyses. When you include a weight variable in a multivariate analysis, the crossproduct matrix is computed as X`WX, where W is the diagonal matrix of weights and X is the data matrix (possibly centered or standardized). In these analyses, the weight of an observation is assumed to be inversely proportional to the variance of the subpopulation from which that observation was sampled. You can "manually" reproduce a lot of formulas for weighted multivariate statistics by multiplying each row of the data matrix (and the response vector) by the square root of the appropriate weight.

In particular, if you use a weight variable in a regression procedure, you get a weighted regression analysis. For regression, the right side of the normal equations is X`WY.
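
The following sketch checks the "square root of the weight" trick for a simple linear regression. It is plain Python with made-up data, not the algorithm that any SAS procedure uses internally. The helpers `normal_equations` and `solve_2x2` are hypothetical names: the first forms the weighted crossproducts, and the second solves the resulting two-parameter system.

```python
from math import sqrt


def normal_equations(X, y, w):
    """Return (X'WX, X'Wy) for a two-column design matrix X."""
    xtwx = [[sum(wi * row[j] * row[k] for row, wi in zip(X, w))
             for k in range(2)] for j in range(2)]
    xtwy = [sum(wi * row[j] * yi for row, yi, wi in zip(X, y, w))
            for j in range(2)]
    return xtwx, xtwy


def solve_2x2(a, b):
    """Solve the 2x2 linear system a * beta = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical regressor
y = [2.1, 3.9, 6.2, 7.8, 12.0]     # hypothetical response
w = [4.0, 4.0, 1.0, 1.0, 0.25]     # hypothetical weights
X = [[1.0, xi] for xi in x]        # design matrix with an intercept column

# Weighted least squares from the weighted normal equations.
beta_weighted = solve_2x2(*normal_equations(X, y, w))

# Same estimates from an unweighted fit after scaling each row of X
# (including the intercept column) and y by sqrt(w_i).
X_scaled = [[sqrt(wi) * v for v in row] for row, wi in zip(X, w)]
y_scaled = [sqrt(wi) * yi for yi, wi in zip(y, w)]
ones = [1.0] * len(y)
beta_scaled = solve_2x2(*normal_equations(X_scaled, y_scaled, ones))

print(beta_weighted, beta_scaled)
```

The two sets of estimates agree up to floating-point rounding. Note that the intercept column must be scaled along with the other columns, which is why the scaled problem is fit without adding a fresh column of ones.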

You can also use weights to analyze a set of means, such as you might encounter in meta-analysis or an analysis of means. The weight that you specify for the *i*th mean should be inversely proportional to the variance of the *i*th sample. Equivalently, the weight for the *i*th group is (approximately) proportional to the sample size of the *i*th group.
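
Here is a small sketch of why sample sizes are reasonable weights for a set of means. If every group shares a common variance, the variance of a group mean is that common variance divided by the group size, so the inverse-variance weight is proportional to the group size. The groups below are invented, and the code is plain Python rather than SAS.

```python
from statistics import mean

# Hypothetical raw data for three groups of different sizes.
groups = [
    [5.0, 7.0],
    [4.0, 6.0, 8.0, 10.0],
    [9.0, 9.0, 12.0],
]

# Suppose that only the group means and sample sizes are reported.
group_means = [mean(g) for g in groups]
sizes = [len(g) for g in groups]

# Mean of the means, weighted by sample size.
weighted = sum(n * m for n, m in zip(sizes, group_means)) / sum(sizes)

# For comparison: the mean of the pooled raw data and the unweighted
# mean of the group means.
pooled = mean([v for g in groups for v in g])
naive = mean(group_means)

print(weighted, pooled, naive)
```

The size-weighted mean of the means reproduces the mean of the pooled raw data, whereas the unweighted mean of the means does not.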

In SAS, most regression procedures support WEIGHT statements. For example, PROC REG performs a weighted least squares regression. The multivariate analysis procedures (DISCRIM, FACTOR, PRINCOMP, ...) use weights to form a weighted covariance or correlation matrix. You can use PROC GLM to perform a meta-analysis of data that are the means from previous studies.

### What happens if you "make up" a weight variable?

Analysts can (and do!) create weights arbitrarily based on "gut feelings." You might say, "I don't trust the value of this observation, so I'm going to downweight it." Suppose you assign Observation 1 twice as much weight as Observation 2 because you feel that Observation 1 is twice as "trustworthy." How does a multivariate procedure interpret those weights?

In statistics, precision is the inverse of the variance. When you use those weights you are implicitly stating that you believe that Observation 2 is from a population whose variance is twice as large as the population variance for Observation 1. In other words, "less trust" means that you have less faith in the precision of the measurement for Observation 2 and more faith in the precision of Observation 1.
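
The variance ratio follows from the definition of an analytical weight given earlier. As a short worked derivation, write $y_1$ and $y_2$ for the two observations (symbols introduced here only for illustration) and suppose that their weights satisfy $w_1 = 2 w_2$:

```latex
\operatorname{Var}(y_i) = \frac{\sigma^2}{w_i}
\quad\Longrightarrow\quad
\frac{\operatorname{Var}(y_2)}{\operatorname{Var}(y_1)}
  = \frac{\sigma^2 / w_2}{\sigma^2 / w_1}
  = \frac{w_1}{w_2}
  = 2 .
```

Only the ratio of the weights enters this calculation; the common variance $\sigma^2$ cancels.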

### Examples of weighted analyses in SAS

In SAS, many procedures support a WEIGHT statement. The documentation for the procedure describes how the procedure incorporates weights. In addition to the previously mentioned procedures, many Base SAS procedures compute weighted descriptive statistics. For some examples of weighted statistical analyses in SAS and how to interpret the results, see the following articles:

## 19 Comments

Hi Rick. Nice blog! It takes me back to discussions in graduate school. You ask, what happens when you make up weights? Analysts *ALWAYS* make up weights. One of my graduate school professors was fond of saying: "All analyses are weighted; some analyses use equal weights." This still sticks with me 30+ years later. If your barrier to doing a weighted analysis is: how do I choose the weights? You do not get out of the decision by using equal weights. Equal weights are convenient, and they are the default, but you are making an arbitrary choice when you choose to use equal weights. Embrace the WEIGHT statement!

Thanks for your thoughts. Another way to think about it is that equal weights are an assumption, but one that is reasonable in many circumstances. If you believe that all observations come from the same population, then it is reasonable to assume equal weights. The WEIGHT statement enables you to relax that assumption.

Hi

When I assign weights, why does my sample get reduced?

Sorry, but I do not understand your question. If you use positive weights, the sample size will not change. Observations for which weights are missing or nonpositive will be excluded from the analysis. If that is not clear, please post data and an example program to the SAS Support Communities.

Hi Rick,

Do you happen to know any specific techniques I could use to create weights for my data? I am looking to rank order my data based upon a few variables but I would like to assign weights to these variables before I do that.

This article is about weights for observations. You are asking about weights for variables. If you have a target variable, you can use regression techniques to assign weights (parameter estimates) for the explanatory variables. If you don't have a target variable, then principal components are another way to obtain weights for variables. The first principal component is the linear combination (=weighted combination) of the variables that explains the most variance in the data.

Hey Rick,

I'm working on a small investment side project that involves selective arbitrage (investing in x amount of X outcomes). Currently, I do not apply any weights to my investment distributions, and, therefore, have an equal ROI for each investment. To maximize returns, I'd like to use historical data on these investments to weight the capital distribution, however, I am unsure how to calculate the weights.

The data consists of historical probabilities and the year in which the investment was executed.

Do you recommend a direction in which I go in, another resource to read, or ideas on where to start?

Any help would be greatly appreciated.

Matt

Yes, do an internet search for

"efficient frontier" portfolio

The efficient frontier is the set of portfolio weights that maximize the expected return for a given level of risk.

Thanks for the quick response, Rick!

Appreciate the help.

Rick,

What regression techniques would you recommend when assigning weights to a variable that has 2 conditions: present or absent?

Thank you for your help.

Your question is somewhat vague. Perhaps you mean that you have a binary response variable (present or absent) and you want to fit a model to predict the probability of the feature being present. Statisticians often use logistic models for that purpose, and you can use weights in a logistic model. I suggest that you post your question with sample data to the SAS Support Community for additional advice.

I am developing an index of performance and I have already selected the parameters I am going to include in the index. Now the question is how to assign weights to those parameters. Please help.

That is up to you. The S&P 500 is weighted by market capitalization. The Dow 30 index is a price-weighted index. Talk to your advisor/mentor/colleagues to determine the best way to weight the components in your index.

Hi,

I have time series for 3 variables. I have checked their dependency and found that it is not negligible. They must be summed with their own weights to make a new variable. I also have a benchmark to evaluate this new variable. The problem is that I don't know how to define the proper weights to sum those 3 variables. I would really appreciate any help.

I suggest you ask your question on CrossValidated.

Any suggestions on a methodology for weighting variables in a customer satisfaction survey? We would like to model our survey off the American Customer Satisfaction Index (ACSI) which uses the ACSI structural equation model: (((X1*W1)+(X2*W2)+(X3*W3))-1)/9*100

We operate a community service project providing IT skills coaching to local adults. Our CS survey asks three questions, Quality of Customer Service?, Knowledge of Customer Service Rep? and Confidence in Skills learned?

My understanding is that the weights should be published for each year. Do an internet search for

weights in acsi survey 2018

If I have a survey, and I have the 'survey weights', and now I use these to 'expand the data' to the population, what is your position on statistical inference in such situations? I am assuming that because we have expanded to the population, that any measures of association (correlations, ORs, etc.) or regression model estimates (betas) have no variation and thus inference is out of the question.

Thanks

You should ask questions like this on a statistical discussion forum. Remember that there is uncertainty in the estimates from the survey.