The other day I encountered the following SAS DATA step for generating three normally distributed variables. Study it, and see if you can discover what is unnecessary (and misleading!) about this program:
data points;
   drop i;
   do i=1 to 10;
      x=rannor(34343);
      y=rannor(12345);
      z=rannor(54321);
      output;
   end;
run;
The program creates the POINTS data set. The data set contains three variables, each containing random numbers from the standard normal distribution. I'm guessing that the author of the program thinks that using rannor(12345) to define the y variable makes y independent from the x variable, which is defined by rannor(34343).
Sorry, but that is not correct.
The x, y, and z variables are, indeed, independent samples from a normal distribution, but that fact does not depend on using different seeds in the RANNOR function. In fact, in this DATA step, all random number seeds except the first one are completely ignored! Don't believe me? Run the following DATA step and compare the two data sets, as follows:
data points2;
   drop i;
   /* change all random number seeds except the first */
   x=rannor(34343);
   y=rannor(1);
   z=rannor(2);
   output;
   do i=2 to 10;
      x=rannor(10+i);
      y=rannor(100+i);
      z=rannor(1000+i);
      output;
   end;
run;

proc compare base=points compare=points2;
run;
The COMPARE Procedure
Comparison of WORK.POINTS with WORK.POINTS2
(Method=EXACT)

NOTE: No unequal values were found. All values compared are exactly equal.
All values compared are exactly equal. Every observation, every variable, down to the last bit. But except for the first observation of the x variable, the second DATA step uses completely different random number seeds! How can the POINTS2 data set be identical to the POINTS data set?
As I explained in a previous post on random number seeds in SAS, the random number seed for a DATA step (or SAS/IML program) is set by the first call. SAS ignores subsequent seeds within the same DATA step or PROC step. In my previous post, I used the newer (and better) STREAMINIT subroutine and the RAND function instead of the older RANNOR function, but the fact remains that the first random number seed determines the random number stream for the entire DATA step. That is, only the first call to the STREAMINIT subroutine is important, as shown in the following example:
data points3;
   drop i;
   do i=1 to 10;
      call streaminit(123);     /* this call is used to set the seed */
      x = rand("Normal");
      call streaminit(54321);   /* this call is ignored */
      y = rand("uniform");
      output;
   end;
run;
The program looks like it is using different streams for the normal and uniform variates, but it is not. The first call to the STREAMINIT subroutine sets the seed; subsequent calls in the same DATA step are ignored. Thus, it would be better to move call streaminit(123) to the top of the program, outside of the loop. Moving the STREAMINIT call to the top of the program will generate the same set of pseudorandom numbers.
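The rearrangement described above might look like the following sketch. The data set name POINTS4 is my own choice for illustration, and the PROC COMPARE step assumes that POINTS3 from the previous program still exists in the WORK library:

```sas
data points4;                /* POINTS4 is a hypothetical name for this sketch */
   drop i;
   call streaminit(123);     /* set the seed once, before the loop */
   do i=1 to 10;
      x = rand("Normal");
      y = rand("uniform");
      output;
   end;
run;

/* check whether the values match the POINTS3 data set */
proc compare base=points3 compare=points4;
run;
```

If the seed behaves as described in this post, PROC COMPARE should report no unequal values. With a single STREAMINIT call at the top, the program no longer suggests that x and y come from separate streams.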
For further details, see the SAS documentation, which shows an example similar to mine in which three data sets (imaginatively named A, B, and C) contain the same pseudorandom numbers.
Now that I've ranted against using different random number seeds, I will reveal that the DATA step at the beginning of my post is from an example in the SAS Knowledge Base! Yes, even experienced SAS programmers are sometimes confused by the subtleties of random number streams. There is nothing wrong with a program that uses multiple seeds, but such a program makes the reader think that all those seeds are actually doing something. They’re not.

Are you someone who uses different random number seeds for each variable in the same DATA step or PROC IML program? If so, you can safely stop. Multiple seeds do not make your random variables any more "random." Only the first seed matters.