Last week I was asked a simple question: "How do I choose a seed for the random number functions in SAS?" The answer might surprise you: use any seed you like. Each seed of a well-designed random number generator is likely to give rise to a stream of random numbers that has good statistical properties, so you can view the various streams as statistically equivalent.

### Random means random

To be clear, I am talking about using a seed value to initialize a modern, high-quality, pseudorandom number generator (RNG). For example, in SAS you can use the STREAMINIT subroutine to initialize the Mersenne twister algorithm that is used by the RAND function. If you are still using the old-style RANUNI or RANNOR functions in SAS, please read the article "Six reasons you should stop using the RANUNI function to generate random numbers."

A seed value specifies a particular stream from a set of possible random number streams. When you specify a seed, SAS generates the same set of pseudorandom numbers every time you run the program. However, there is no intrinsic reason to prefer one stream over another. The stream for seed=12345 is just as random as the stream for the nine-digit prime number 937162211.
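As a minimal sketch of that reproducibility (the loop length and variable names are illustrative choices, not from the original post), the following DATA step should write the same three values to the log every time it is run, because the seed fixes the stream:

```sas
/* Run this step twice: the log should show identical values each time */
data _null_;
   call streaminit(12345);      /* any valid seed selects one fixed stream */
   do i = 1 to 3;
      x = rand("uniform");      /* next value from the seeded stream */
      put i= x=;
   end;
run;
```

Replacing 12345 with 937162211 gives a different set of three values, but those values are likewise identical from run to run.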

Some people see the number 937162211 and think that it looks "more random" than 12345. They then assume that the random number stream that follows from CALL STREAMINIT(937162211) is "more random" than the random number stream for CALL STREAMINIT(12345). Nope, random means random.
In modern pseudorandom generators, the streams for different seeds should have similar statistical properties. Furthermore, many RNGs use the base-2 representation of the seed for initialization and (12345)_{10} = (11000000111001)_{2} looks pretty random! In fact, if you avoid powers of 2, the base-2 representations of most base-10 numbers "look random."
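To see the base-2 digits for yourself, one option is the BINARYw. format. The sketch below is only an illustration (the variable names and the width of 32 are arbitrary choices); it prints both seeds from this article as zero-padded 32-bit strings:

```sas
data _null_;
   length bits $32;
   do seed = 12345, 937162211;
      bits = put(seed, binary32.);   /* base-2 digits, left-padded with zeros */
      put seed= bits=;
   end;
run;
```

For seed=12345, the string should end with the 14 digits 11000000111001 shown above.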

### Initialization: Hard for researchers, easy for users

Researchers who specialize in random number generators might criticize what I've said as overly simplistic. There have been many research papers written about how to take a 32-bit integer and use that information to initialize an RNG whose internal state contains more than 32 bits. There have been cases where an RNG was published and the authors later modified the initialization routine because certain seeds did not result in streams that were sufficiently random. There have been many discussions about how to create a seed initialization algorithm that is easy to call and that almost always results in a high-quality stream of random numbers.

These are hard problems, but fortunately researchers have developed ways to initialize a stream from a seed so that there is a high probability that the stream will have excellent statistical properties. The relevant question for many SAS programmers is "can I use 12345 or my telephone number as seed values, or do I always need to type a crazy-looking nine-digit sequence?" My response is that there is no reason to prefer the crazy-looking seed over an easy-to-type sequence such as your phone number, your birthday, or the first few digits of pi.

### Choosing a random seed

If you absolutely insist on using a "random seed," SAS can help. If you call the STREAMINIT subroutine with the value 0, then SAS will use the date, time of day, and possibly other information to manufacture a seed when you call the RAND function. SAS puts the seed value into the SYSRANDOM system macro variable. That means you can use %PUT to display the seed that SAS created, as follows:

    data _null_;
       call streaminit(0);   /* generate seed from system clock */
       x = rand("uniform");
    run;
    %put &=SYSRANDOM;

    SYSRANDOM=1971603567

Every time you run this program, you will get a different seed value, which you can then use as the seed for a subsequent program.

A second method is to use the RAND function to generate a random integer between 1 and 2^{31}-1, which is the range of valid seed values for the Mersenne twister generator in SAS 9.4M4. The following program generates a random seed value:

    data _null_;
       call streaminit(0);
       seed = ceil( (2**31 - 1)*rand("uniform") );
       put seed=;
    run;

    seed=1734176512

Both of these methods will generate a seed for you. However, the randomly generated seed does not provide any benefit. For a modern, high-quality, pseudorandom number generator, the stream should have good statistical properties regardless of the seed value. Using a random seed value does not make a stream "more random" than a seed that is easier to type.

## 9 Comments

When you say "there is no reason to prefer the crazy-looking seed over an easy-to-type sequence such as your phone number, your birthday, or the first few digits of pi," does that mean you would be comfortable with someone using the same seed (their birthday) in all their programs? Say I pull a random sample for Project 1, and Project 9 also requires a random sample. Of course it shouldn't matter if Project 1 and Project 9 use the same sequence of random numbers to select their sample. But still, if I used my birthday for every seed, I would worry that if my birthday resulted in an unusual-looking random stream, that stream would be used by every project. So it feels "safer" to use a different seed in different programs, so each gets an independent stream. But of course "feelings" about random numbers are often wrong. An "unusual-looking random stream" is still a random stream. And I appreciate your main point that the randomness of the stream does not depend on the seed.

Suppose you pick a different seed for Project 9. Do you now feel more confident that Project 9 is not affected by an "unusual" stream? I don't. Using one random sample is the same as basing a conclusion on a single sample of data, and we know that sampling variability means that we will reject a true null hypothesis (commit a Type 1 error) by chance 5% of the time.

Fortunately, with simulation we don't need to rely on one random sample. I always recommend that programmers change the seed a few times and make sure the results are similar and consistent. I've written about how basing a conclusion on a single "unlucky" seed can lead to wrong conclusions. See the posts "How to lie with a simulation" and "Monte Carlo estimates of pi and an important statistical lesson."
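One way to act on that recommendation is sketched below. The macro name `MeanForSeed`, the sample size of 10,000, and the choice of a uniform mean as the statistic are all assumptions made for illustration. Each seed gets its own DATA step so that each CALL STREAMINIT starts a fresh stream:

```sas
%macro MeanForSeed(seed);          /* hypothetical helper macro */
data _null_;
   call streaminit(&seed);
   total = 0;
   do i = 1 to 10000;
      total = total + rand("uniform");
   end;
   mean = total / 10000;           /* should be near 0.5 for every seed */
   put "seed=&seed " mean=;
run;
%mend MeanForSeed;

%MeanForSeed(12345)
%MeanForSeed(937162211)
%MeanForSeed(1729)
```

If the three means in the log agree to roughly two decimal places, the result is not sensitive to the seed. A large disagreement would suggest that the sample size is too small to support a conclusion.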

Taking that recommendation to its logical conclusion leads to Monte Carlo simulation, in which you generate many (maybe 10,000 or 100,000) samples.

In many cases, you can quantify the accuracy of a Monte Carlo study that simulates N independent samples. Statistical theory (and some assumptions) suggests that the accuracy of the Monte Carlo estimate is asymptotically proportional to StdErr/sqrt(N), where StdErr is the standard error of the statistic. Thus if you generate 10,000 samples, sqrt(N) = 100 and the Monte Carlo error is on the order of 1% of the standard error. The impact of any particular seed becomes less important as you generate more samples. In the Monte Carlo scenario, the initial seed really has almost no impact on the conclusion.
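The arithmetic behind the 1% figure can be written out as follows. The constant of proportionality $c$ is an assumption of this sketch; it depends on the statistic and is not given in the discussion above:

```latex
\text{Monte Carlo error} \;\approx\; c \, \frac{\mathrm{StdErr}}{\sqrt{N}},
\qquad
N = 10{,}000 \;\Rightarrow\; \frac{1}{\sqrt{N}} = \frac{1}{100} = 0.01 .
```

Because the error shrinks like $1/\sqrt{N}$, quadrupling the number of samples only halves it.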

Another reason you may want to specify a seed and stick with it is repeatability. For example, if you are publishing work involving random simulations, this will enable readers and reviewers to replicate what you've done. Repeated random trials can validate your work; using a constant seed can give transparency to the numbers and visualizations you publish.

Great post, Rick! You concentrate on the RAND family of functions. Older functions such as the twins RANUNI and UNIFORM and the other twins RANNOR and NORMAL rely on a less sophisticated methodology. They enable you to specify multiple seeds and multiple streams, which can get you into trouble if you pick the wrong seeds. The only reason I mention this is because if you scroll not quite half way down in http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001281561.htm to Example 1, you can see some cool and crazy graphs that show what can happen when you pick the wrong seeds. Perhaps in some future blog you can explain how the wrong seeds can give you a pattern that looks like an exploding bug! I have no idea.

In my online marketing research book, all of my seeds used to have a theme. I invited readers to figure out what they had in common. No one ever did.

-- Warren

I do not intend to devote any additional posts to the linear congruential generator (the RNG for RANUNI). The crazy graphs indicate correlations between streams. In more sophisticated RNGs, the streams for two different seeds are either independent (the ideal goal) or have a high probability of being nearly independent (the Mersenne twister).

When developing I always use the same seed, but when the program goes into production, I change the seed to 0.
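A minimal sketch of that workflow follows. The macro variable `seed`, the data set name `Sample`, and the sample size are hypothetical names chosen for illustration. Writing SYSRANDOM to the log, as described in the article, means that a production run that uses seed 0 can still be reproduced later:

```sas
%let seed = 12345;   /* hypothetical macro variable: fixed for development; set to 0 in production */

data Sample;
   call streaminit(&seed);
   do i = 1 to 100;
      x = rand("normal");
      output;
   end;
run;

%put &=SYSRANDOM;    /* when seed is 0, shows the seed that SAS generated (per the article) */
```

To rerun a production job exactly, copy the logged value into the `%let` statement.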

My go-to seed is a very dull number: I always use 1729.

Because a lot of work today is "cross-cultural" in the sense of using SAS, Python, R, and Matlab, to name just a few, it would be great if SAS could output the 624 numbers needed to start the classical Mersenne Twister. I think R, Python, and Matlab do so already. Ideally the output would be in a format that plugs easily into the others.

Hi Rick,

I have a SAS proc MI as below:

    proc mi data=wide_data2 out=data3 nimpute=5 seed=12345
            MIN=. . . . . . . . . 0 0 0 0 0 0
            MAX=. . . . . . . . . 7 7 7 7 7 7
            MINMAXITER=1000;
       class &STRATACAT;
       FCS;
       by TRT01PN;
       var &STRATA WEEK_2 WEEK_4 WEEK_8 WEEK_12 WEEK_14;
    RUN;

We want to specify a random seed so that we can reproduce the results in the future. But the program has a ‘by’ statement, which means that MI processes each ‘TRT01PN’ group separately. We found that the first ‘TRT01PN’ group used the random seed 12345 as specified, but the second ‘TRT01PN’ group used a ‘random’ random seed. Is there a way to specify multiple random seeds, one per ‘by’ group, so that the MI results are completely reproducible? Many thanks,

Sherry

There is only one seed and it is the one you specify. The seed specifies a stream, and that stream is used to process all BY groups consecutively. This is a standard technique that guarantees that each random sample for each BY group is independent. To learn more about independence in random samples, see "Independence and overlap in streams of random numbers."