This week I read an interesting blog post that led to a discussion about specifying the frequencies of observations in a regression model. In SAS software, many of the analysis procedures contain a FREQ statement for specifying frequencies and a WEIGHT statement for specifying weights in a weighted regression. This article takes a quick look at the FREQ and WEIGHT statements in regression models and explains when you should use one instead of the other.
Frequencies are not weights, although they are similar enough that confusion is inevitable. SAS customers ask about the difference so often that SAS Technical Support has a usage note on the distinction between the WEIGHT and FREQ statements in SAS. It is important for statistical software to support both methods.
In short, a "frequency variable" is one that specifies an integer count that is associated with each observation. The SAS documentation is filled with examples of specifying a frequency variable. For example, see the examples for the LOGISTIC procedure.
Here are some key points to remember:
- A frequency variable tells the procedure that there are more observations than there are rows in the data set.
- When you run a frequency analysis, your analysis should agree with the same analysis run on the "expanded data," which is the data set in which each row represents a single observation.
- A frequency variable changes the degrees of freedom in the model. This affects many statistics: means, standard errors, p-values, and so forth.
As an example, I will use the data from the Freakonometrics blog:
data FreqData;
input x y Count @@;
datalines;
18 14 74   18 19 13   18 21  7   18 23  1
23 14  6   23 19  4   23 21  2   23 23  7   23 25  4
25 14  2   25 19  3   25 21  2   25 23  2   25 25  4
27 14  1   27 19  1   27 21  2   27 23  3   27 25  3   27 27  2
29 14  2   29 21  1   29 23  3   29 25  2   29 29  6
31 14  2   31 21  2   31 23  1   31 29  1   31 31  2   31 33  1
33 23  2   33 29  1   33 31  1   33 33  3
35 19  1   35 23  1   35 33  2
37 25  1   37 27  1   37 31  1   37 35  1
;
The data contains an explanatory variable (X), a response variable (Y), and a frequency variable (Count). Assume that the goal is to perform a linear regression for these data.
First, let's expand the data by using the frequency variable. The FreqData data set has 42 observations, but the following SAS DATA step creates a data set named Expand that has 181 observations, which is the sum of the Count variable:
/* Expand original data by frequency variable */
data Expand;
keep x y;
set FreqData;
if Count<1 then delete;
do i = 1 to int(Count);
   output;
end;
run;

proc sgplot data=Expand;
scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                  transparency=0.7; /* in SAS 9.4, use the JITTER option! */
run;
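A quick sanity check on the expansion is to compare the total frequency with the number of rows in the expanded data. The following is a minimal sketch that assumes the FreqData and Expand data sets from this article. The column names SumFreq and NumExpanded are illustrative. The WHERE clause and the INT function mirror the rules used in the DATA step above.

```sas
/* Assumes the FreqData and Expand data sets created above */
proc sql;
   /* total frequency, counting only the rows that the expansion keeps */
   select sum(int(Count)) as SumFreq
      from FreqData
      where Count >= 1;

   /* number of rows in the expanded data */
   select count(*) as NumExpanded
      from Expand;
quit;
```

For these data, both queries should report 181.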
In the scatter plot, many markers are overplotted. For example, the marker at (18, 14) is plotted 74 times. Transparency is used to show that some values are repeated multiple times in the data. In addition, you might want to use jittering to help visualize overplotting in scatter plots. In SAS 9.4, there is a new JITTER option on the SCATTER statement, which makes jittering much easier than in SAS 9.3.
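If you are running SAS 9.4, a jittered version of the plot might look like the following sketch. It assumes the Expand data set created above and simply adds the JITTER option to the SCATTER statement. The amount of jitter is left at the default.

```sas
/* Assumes the Expand data set created above and SAS 9.4 or later */
proc sgplot data=Expand;
   scatter x=x y=y / jitter                 /* offset markers that share the same coordinates */
                     markerattrs=(symbol=CircleFilled size=12)
                     transparency=0.7;
run;
```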
Now that the data are expanded, let's compute a simple linear regression of Y on X and look at some of the output. This is the "gold standard" to which we will compare other attempts.
ods graphics off;
ods select NObs(persist) ANOVA(persist) ParameterEstimates(persist);
proc reg data=Expand;
model y=x;
run;
Three parts of the results deserve attention:
- The number of observations used in the analysis: n = 181
- The degrees of freedom: We are estimating two parameters, so there are n − 2 = 179 degrees of freedom for the error term
- The parameter estimates are b0 = 1.269 and b1 = 0.787, and the standard errors (which depend on the degrees of freedom!) are as shown.
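To make the dependence on the sample size explicit, here are the textbook formulas for simple linear regression. The notation is standard and is not taken from the PROC REG output: SSE is the error sum of squares, and the sums run over all n observations.

```latex
s^2 = \frac{\mathrm{SSE}}{n-2}, \qquad
\mathrm{SE}(b_1) = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad
\mathrm{SE}(b_0) = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

For the expanded data, n = 181, so the error variance is estimated by dividing SSE by 179.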
As I stated previously, we should get these same results when we use the FREQ statement on the original data, as follows:
proc reg data=FreqData;
freq Count;
model y=x;
run;
Success! The statistics for this computation are identical to the statistics for the expanded data. The only difference is that the data set has 42 observations, but the FREQ statement results in a "Sum of Frequencies" equal to 181. That number is used as the sample size (n) in the degrees of freedom computations.
Weights are not frequencies. The WEIGHT statement does not change the "sample size" or the "degrees of freedom."
You can use a WEIGHT statement when you want some observations to contribute to the model fit more than others. A canonical example is that if you identify outliers in your data, you can assign them weights that are zero or nearly zero in order to obtain a fit that is robust to the outliers. As this example indicates, weights do not have to be integers. You can also use weights when some response values are known more precisely than others. In that case, it is common to weight each observation proportional to the inverse of the variance of that observation. These are known as inverse-variance weights or accuracy weights.
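As an illustration of inverse-variance weights, here is a minimal sketch. The Measurements data set, its values, and the variable names Var and w are hypothetical and are not part of the example in this article. The sketch assumes that the variance of each response is known in advance.

```sas
/* Hypothetical data: each response y has a known measurement variance Var */
data Measurements;
input x y Var;
w = 1 / Var;      /* inverse-variance weight: precise observations get larger weights */
datalines;
1  2.1  0.04
2  3.9  0.04
3  6.2  0.25
4  7.8  0.25
5 10.3  1.00
6 11.7  1.00
;

proc reg data=Measurements;
   weight w;      /* weights do not have to be integers */
   model y = x;
run;
quit;
```

In this sketch, the first two observations have the smallest variances, so they have the most influence on the fitted line.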
In the regression context, if you use integer counts as weights, the parameter estimates are the same as when you use the counts for frequencies, but the statistics that use the sample size are different. This includes standard errors for the estimates and p-values for significance. This is shown in the following example:
proc reg data=FreqData;
weight Count;
model y=x;
run;
ods select all;
As you can see, the analysis thinks that there are only 42 observations. Although the parameter estimates are correct, almost all of the other statistics are wrong (assuming that you are trying to reproduce the analysis for the expanded data).
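The reason can be written in a few lines, assuming the usual weighted least squares formulas. Here c_i denotes the value of Count in row i, and m = 42 is the number of rows. Both statements minimize the same criterion, so the estimates and the weighted error sum of squares agree. Only the divisor for the error variance differs.

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{m} c_i \,(y_i - b_0 - b_1 x_i)^2
\quad\Longrightarrow\quad
\mathrm{SSE}_c = \sum_{i=1}^{m} c_i \,(y_i - b_0 - b_1 x_i)^2

s^2_{\mathrm{FREQ}} = \frac{\mathrm{SSE}_c}{\sum_{i=1}^{m} c_i - 2} = \frac{\mathrm{SSE}_c}{179},
\qquad
s^2_{\mathrm{WEIGHT}} = \frac{\mathrm{SSE}_c}{m - 2} = \frac{\mathrm{SSE}_c}{40}

\frac{\mathrm{SE}_{\mathrm{WEIGHT}}(b_j)}{\mathrm{SE}_{\mathrm{FREQ}}(b_j)}
= \sqrt{\frac{179}{40}} \approx 2.12
```

Under these assumptions, each standard error from the WEIGHT analysis is roughly twice as large as the corresponding standard error from the FREQ analysis.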
So the conclusion is that you should use the FREQ statement to specify integer frequencies for repeated observations. Use the WEIGHT statement when you want to decrease the influence that certain observations have on the parameter estimates.
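One way to see the two analyses side by side is to capture the parameter estimates with ODS OUTPUT and merge the tables. The sketch below assumes the FreqData data set from this article. The data set names PE_Freq, PE_Weight, and Compare and the renamed columns are illustrative. Variable, Estimate, and StdErr are the column names I expect in the ParameterEstimates table from PROC REG.

```sas
/* Assumes the FreqData data set defined earlier in this article */
proc reg data=FreqData;
   freq Count;                                /* counts treated as frequencies */
   model y=x;
   ods output ParameterEstimates=PE_Freq;
run;
quit;

proc reg data=FreqData;
   weight Count;                              /* same counts treated as weights */
   model y=x;
   ods output ParameterEstimates=PE_Weight;
run;
quit;

/* One-to-one merge: both tables list Intercept first, then x */
data Compare;
   merge PE_Freq  (keep=Variable Estimate StdErr
                   rename=(Estimate=Est_Freq StdErr=SE_Freq))
         PE_Weight(keep=Variable Estimate StdErr
                   rename=(Estimate=Est_Weight StdErr=SE_Weight));
   SE_Ratio = SE_Weight / SE_Freq;            /* how much larger the WEIGHT standard errors are */
run;

proc print data=Compare noobs;
run;
```

The Est_Freq and Est_Weight columns should agree, whereas the standard errors should differ.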