Last week there was an interesting question posted to the "Stat-Math Statistics" group on LinkedIn. The original question was a little confusing, so I'll state it in a more general form:
A population is normally distributed with a known mean and standard deviation. A sample of size N is drawn from the population. Each element of the sample is rounded to the nearest integer. The challenge: Construct a sample whose sample mean and sample standard deviation are as close as possible to the population values.
And here is my paraphrase of the actual problem that was posed:
The time required for Sally to walk home is normally distributed with mean 18 minutes and standard deviation 2 minutes. She walks home on 70 days and records her time to the nearest minute. What set of 70 values results in sample statistics that are as close as possible to the population parameters?
If you would like to try to solve the problem on your own, stop reading here. Spoilers ahead!
My approach to this problem is to use the normal density to estimate the number of observations for each value of the integer times 12, 13, ..., 24. Why those values? Because we are told that the mean is 18, and I assume that all the data will be within three standard deviations of the mean: [12, 24] = [18 - 3*2, 18 + 3*2].
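The reasoning behind this approach can be written out as a short derivation. This is only a sketch: the symbols X (a walking time), Φ (the standard normal CDF), and f (the normal density) are my notation, not part of the original problem. A time is recorded as the integer t when it falls in the unit interval around t, and the density at t approximates the probability of that interval:

```latex
\begin{align*}
E[\text{count at } t]
  &= N \, P\!\left(t - \tfrac12 \le X < t + \tfrac12\right) \\
  &= N \left[ \Phi\!\left(\frac{t + \tfrac12 - \mu}{\sigma}\right)
            - \Phi\!\left(\frac{t - \tfrac12 - \mu}{\sigma}\right) \right] \\
  &\approx N \, f(t;\, \mu, \sigma) \cdot 1,
  \qquad
  f(t;\, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\left( -\frac{(t-\mu)^2}{2\sigma^2} \right)
\end{align*}
```

For example, at t = 18 with N = 70, the density approximation gives about 13.96 and the exact interval probability gives about 13.82, so both round to 14.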
In SAS, the PDF function computes the normal density. If you multiply the normal density by the number of observations, you obtain an estimate for the expected number of observations in each unit interval. Of course, this estimate will not be an integer, so use the ROUND function to obtain integers, as follows:
data WalkTimes;
   N = 70; Mean = 18; StdDev = 2;
   keep t freq;
   do t = 12 to 24;
      pdf = N*pdf("normal", t, Mean, StdDev);   /* approximate expected number */
      freq = round(pdf);                        /* round to integer */
      output;
   end;
run;
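To look at the counts and confirm that they add up to N, one option is to print the data set with a column total. This is a minimal sketch that assumes the WalkTimes data set created by the DATA step above:

```sas
/* List the integer times and their counts; the SUM statement
   adds a total for the freq column at the bottom of the listing */
proc print data=WalkTimes noobs;
   var t freq;
   sum freq;
run;
```

If the rounding works out, the total in the last row equals 70.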
It is not clear that this approach will always produce exactly N observations, because the rounded counts need not sum to N, but it does for these parameters. What does the distribution look like and what are the sample moments? A quick call to PROC UNIVARIATE answers these questions:
proc univariate data=WalkTimes;
   freq Freq;
   var t;
   histogram t / normal midpoints=(12 to 24)
                 vscale=count barlabel=count;
   ods select Moments Histogram;
run;
The histogram plots the counts in each interval, labels each bar with its count, and overlays a normal curve. Because the NORMAL option is specified without parameters, the curve uses the sample estimates, which are nearly equal to the population values. There is close agreement between the data and the population. (The procedure also produces goodness-of-fit tests, which I do not show here. None of the tests reject the null hypothesis that the data are from a normal population.)
The Moments table shows that there are 70 observations, that the mean is exactly 18, and that the standard deviation is very close to 2. Furthermore, the skewness is exactly zero, which reflects the symmetry of the data distribution. The kurtosis value (PROC UNIVARIATE reports excess kurtosis, which is zero for a normal distribution) is close to zero, which is what you would expect for a normal sample.
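These values can be checked by hand. The sketch below assumes the counts that the rounding step produces for t = 12, ..., 24, namely 0, 1, 2, 5, 8, 12, 14, 12, 8, 5, 2, 1, 0, and it uses the usual sample variance with divisor n - 1:

```latex
\begin{align*}
n &= 14 + 2\,(12 + 8 + 5 + 2 + 1 + 0) = 70 \\
\bar{t} &= 18 \quad \text{(the counts are symmetric about 18)} \\
s^2 &= \frac{1}{n-1} \sum_t \mathrm{freq}_t \,(t - 18)^2
     = \frac{2\,(12\cdot 1 + 8\cdot 4 + 5\cdot 9 + 2\cdot 16 + 1\cdot 25)}{69}
     = \frac{292}{69} \approx 4.232 \\
s &\approx 2.057
\end{align*}
```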
I obtained the same data distribution as the person who posed the problem, so presumably he used a similar approach. Other people tried to simulate the problem by generating random numbers from N(18,2) and rounding them to integers.
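For comparison, the simulation approach might look something like the following. This is only a sketch of the idea, not anyone's actual program: the data set name SimTimes and the seed 12345 are arbitrary choices of mine.

```sas
/* Sketch: simulate 70 walking times from a normal distribution with
   mean 18 and standard deviation 2, rounded to the nearest minute.
   The seed and the data set name are arbitrary. */
data SimTimes;
   call streaminit(12345);
   do i = 1 to 70;
      t = round( rand("Normal", 18, 2) );   /* round to integer */
      output;
   end;
   keep t;
run;

/* Compare the sample mean and standard deviation to 18 and 2 */
proc means data=SimTimes n mean std;
   var t;
run;
```

Because the sample is random, the mean and standard deviation vary from seed to seed, so a single simulated sample will generally not match the population values as closely as the constructed data do.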
What approach would you use? Can you come up with a set of 70 integers that do a better job of approximating the N(18, 2) distribution?