The other day I encountered the following SAS DATA step for generating three normally distributed variables. Study it, and see if you can discover what is unnecessary (and misleading!) about this program:
data points;
drop i;
do i=1 to 10;
   x=rannor(34343);
   y=rannor(12345);
   z=rannor(54321);
   output;
end;
run;
The program creates the POINTS data set. The data set contains three variables, each containing random numbers from the standard normal distribution. I'm guessing that the author of the program thinks that using rannor(12345) to define the y variable makes y independent from the x variable, which is defined by rannor(34343).
Sorry, but that is not correct.
The x, y, and z variables are, indeed, independent samples from a normal distribution, but that fact does not depend on using different seeds in the RANNOR function. In fact, in this DATA step, all random number seeds except the first one are completely ignored! Don't believe me? Run the following DATA step and compare the two data sets:
data points2;
drop i;
/* change all random number seeds except the first */
x=rannor(34343); y=rannor(1); z=rannor(2);
output;
do i=2 to 10;
   x=rannor(10+i);
   y=rannor(100+i);
   z=rannor(1000+i);
   output;
end;
run;

proc compare base=points compare=points2;
run;
The COMPARE Procedure
Comparison of WORK.POINTS with WORK.POINTS2
(Method=EXACT)

NOTE: No unequal values were found. All values compared are exactly equal.
All values compared are exactly equal. Every observation, every variable, down to the last bit. But except for the first observation of the x variable, the second DATA step uses completely different random number seeds! How can the POINTS2 data set be identical to the POINTS data set?
As I explained in a previous post on random number seeds in SAS, the random number seed for a DATA step (or SAS/IML program) is set by the first call. SAS ignores subsequent seeds within the same DATA step or PROC step. In my previous post, I used the newer (and better) STREAMINIT function and the RAND function instead of the older RANNOR function, but the fact remains that the first random number seed determines the random number stream for the entire DATA step. That is, only the first call to the STREAMINIT subroutine is important, as shown in the following example:
data points3;
drop i;
do i=1 to 10;
   call streaminit(123);     /* this call is used to set the seed */
   x = rand("Normal");
   call streaminit(54321);   /* this call is ignored */
   y = rand("uniform");
   output;
end;
run;
The program looks like it is using different streams for the normal and uniform variates, but it is not. The first call to the STREAMINIT subroutine sets the seed; later calls in the same DATA step are ignored. Thus, it would be better to move call streaminit(123) to the top of the program, outside of the loop. Moving the STREAMINIT call to the top of the program will generate the same set of pseudorandom numbers.
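Here is a minimal sketch of that rearrangement. The data set name POINTS4 is my own choice for illustration; everything else reuses the seed and distributions from the previous program. If only the first STREAMINIT call matters, PROC COMPARE should report that POINTS4 and POINTS3 contain exactly the same values.

```sas
data points4;
drop i;
call streaminit(123);        /* set the seed once, before the loop */
do i=1 to 10;
   x = rand("Normal");
   y = rand("uniform");
   output;
end;
run;

/* compare with POINTS3, which called STREAMINIT inside the loop */
proc compare base=points3 compare=points4;
run;
```

Putting the single STREAMINIT call first makes it obvious to the reader which seed controls the stream.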
For further details, see the SAS documentation, which shows an example similar to mine in which three data sets (imaginatively named A, B, and C) contain the same pseudorandom numbers.
Now that I've ranted against using different random number seeds, I will reveal that the DATA step at the beginning of my post is from an example in the SAS Knowledge Base! Yes, even experienced SAS programmers are sometimes confused by the subtleties of random number streams. There is nothing wrong with a program that uses multiple seeds, but such a program makes the reader think that all those seeds are actually doing something. They’re not.
Are you someone who uses different random number seeds for each variable in the same DATA step or PROC IML program? If so, you can safely stop. Multiple seeds do not make your random variables any more "random." Only the first seed matters.
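For example, the program at the top of this post could be written with a single seed. The following is only a sketch: the macro variable name seed and the data set name POINTS5 are hypothetical names that I chose for illustration. I kept the older RANNOR function so that the result can be checked against the POINTS data set. Because only the first seed is used, PROC COMPARE should find no differences.

```sas
%let seed = 34343;           /* one seed for the whole DATA step */

data points5;
drop i;
do i=1 to 10;
   x=rannor(&seed);
   y=rannor(&seed);          /* same seed: no pretense of a separate stream */
   z=rannor(&seed);
   output;
end;
run;

/* check that the single-seed program reproduces the original data */
proc compare base=points compare=points5;
run;
```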
6 Comments
I seem to recall that separate independent seeds and hence "independent" random series of numbers could be called by using the "call" version of the random number functions. I had some code at one time that demonstrated that as true ... but I don't have it easily available right now.
Yes, that is correct. The documentation is at http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p026ygl6toz3tgn14lt4iu6cl5bb.htm
The various seeds were sometimes necessary to overcome the fact that the older RAN* functions have a relatively short period. Fortunately, the newer RAND function has such a long period that you no longer need to worry about using multiple seeds.
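As a rough sketch of the CALL version mentioned above (the data set name STREAMS and the variable names seed1 and seed2 are made up for illustration), each CALL RANNOR statement takes its own seed variable and updates it on every call, so the two variables are driven by separate streams:

```sas
data streams;
drop i;
seed1 = 34343;               /* seed variable for the x stream */
seed2 = 12345;               /* seed variable for the y stream */
do i=1 to 10;
   call rannor(seed1, x);    /* updates seed1 and returns a normal variate in x */
   call rannor(seed2, y);    /* updates seed2 and returns a normal variate in y */
   output;
end;
run;
```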
My code also demonstrates that the first random number seed determines the random number stream in PROC IML when using the RANDGEN function.
Hi
I have a question: how are the initial seeds calculated by SAS in the FASTCLUS procedure?
Those are not random number seeds. They are points used as the centers of clusters. See http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_fastclus_sect002.htm