This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance,

## Tag: **Simulation**

A classical problem in elementary probability asks for the expected lengths of line segments that result from randomly selecting k points along a segment of unit length. It is both fun and instructive to simulate such problems. This article uses simulation in the SAS/IML language to estimate solutions to the

If you toss a coin 28 times, you would not be surprised to see three heads in a row, such as ...THHHTH.... But what about eight heads in a row? Would a sequence such as THHHHHHHHTH... be a rare event? This question popped into my head last weekend as I

Last week I was asked a simple question: "How do I choose a seed for the random number functions in SAS?" The answer might surprise you: use any seed you like. Each seed of a well-designed random number generator is likely to give rise to a stream of random numbers,

A SAS customer asked how to simulate data from a three-parameter lognormal distribution as specified in the PROC UNIVARIATE documentation. In particular, he wanted to incorporate a threshold parameter into the simulation. Simulating lognormal data is easy if you remember an important fact: if X is lognormally distributed, then Y=log(X)

The 2017 edition of SAS Global Forum, the largest annual SAS user group meeting, will be held at the Swan and Dolphin Resort in Orlando, Florida on April 2-5. Among the many analytic talks at SAS Global Forum 2017, several focus on operations research topics like optimization and simulation. If

Monte Carlo techniques have many applications, but a primary application is to approximate the probability that some event occurs. The idea is to simulate data from the population and count the proportion of times that the event occurs in the simulated data. For continuous univariate distributions, the probability of an

In a previous article, I showed how to simulate data for a linear regression model with an arbitrary number of continuous explanatory variables. To keep the discussion simple, I simulated a single sample with N observations and p variables. However, to use Monte Carlo methods to approximate the sampling distribution

This article shows how to simulate a data set in SAS that satisfies a least squares regression model for continuous variables. When you simulate to create "synthetic" (or "fake") data, you (the programmer) control the true parameter values, the form of the model, the sample size, and magnitude of the

How can you generate data that contains outliers in a simulation study? The contaminated normal distribution is a simple but useful distribution you can use to simulate outliers. The distribution is easy to explain and understand, and it is also easy to implement in SAS. What is a contaminated normal

In the classic textbook by Johnson and Wichern (Applied Multivariate Statistical Analysis, Third Edition, 1992, p. 164), it says: All measures of goodness-of-fit suffer the same serious drawback. When the sample size is small, only the most aberrant behaviors will be identified as lack of fit. On the other hand,

Somewhere in my past I encountered a panel of histograms for small random samples of normal data. I can't remember the source, but it might have been from John Tukey or William Cleveland. The point of the panel was to emphasize that (because of sampling variation) a small random sample

The 2016 INFORMS Annual Meeting will be held at the Music City Center and Omni Nashville Hotel in downtown Nashville, TN on November 13-16, with pre-conference events starting on Saturday, November 12. SAS will be a major participant in this conference. Over two dozen people from SAS will attend, with

When simulating data or testing algorithms, it is useful to be able to generate patterns of missing data. This article shows how to generate random and systematic patterns of missing values. In other words, this article shows how to replace nonmissing data with missing data. Generate a random pattern of

Although statisticians often assume normally distributed errors, there are important processes for which the error distribution has a heavy tail. A well-known heavy-tailed distribution is the t distribution, but the t distribution is unsuitable for some applications because it does not have finite moments (means, variance,...) for small parameter values.

The article uses the SAS DATA step and Base SAS procedures to estimate the coverage probability of the confidence interval for the mean of normally distributed data. This discussion is based on Section 5.2 (p. 74–77) of Simulating Data with SAS. What is a confidence interval? Recall that a confidence

Analytics Experience 2016 will be held on Sept. 12-14, 2016 at the Bellagio in Las Vegas, NV. There will be a great number of excellent talks and demonstrations at the conference, covering many aspects of SAS analytics and many practical applications. Several of these sessions deal directly with the use

This year's SAS Global Forum conference will take place April 18-21 at The Venetian in Las Vegas. For SAS/OR, SAS staff will present two Super Demos and three papers:

I saw an interesting mathematical result in Wired magazine. The original article was about mathematical research into prime numbers, but the article included the following tantalizing fact: If Alice tosses a [fair]coin until she sees a head followed by a tail, and Bob tosses a coin until he sees two

SAS will have a major presence at the 2016 INFORMS Conference on Business Analytics and Operations Research, which will be held at the Hyatt Regency Grand Cypress hotel in Orlando, FL on April 10-12. Many SAS staff will participate in this conference. SAS/OR, the SAS Global Academic Program, and JMP

Last week I showed how to generate random points uniformly inside a 2-d circular region. That article showed that the distance of a point to the circle's center cannot be distributed uniformly. Instead, you should use the square root of a uniform variate to generate 2-D distances to the origin.

It is easy to generate random points that are uniformly distributed inside a rectangle. You simply generate independent random uniform values for each coordinate. However, nonrectangular regions are more complicated. An instructive example is to simulate points uniformly inside the ball with a given radius. The two-dimensional case is to

There are several ways to simulate multinomial data in SAS. In the SAS/IML matrix language, you can use the RANDMULTINOMIAL function to generate samples from the multinomial distribution. If you don't have a SAS/IML license, I have previously written about how to use the SAS DATA step or PROC SURVEYSELECT

Today is March 14th, which is annually celebrated as Pi Day. Today's date, written as 3/14/16, represents the best five-digit approximation of pi. On Pi Day, many people blog about how to approximate pi. This article uses a Monte Carlo simulation to estimate pi, in spite of the fact that

Many simulation and resampling tasks use one of four sampling methods. When you draw a random sample from a population, you can sample with or without replacement. At the same time, all individuals in the population might have equal probability of being selected, or some individuals might be more likely

How do you sample with replacement in SAS when the probability of choosing each observation varies? I was asked this question recently. The programmer thought he could use PROC SURVEYSELECT to generate the samples, but he wasn't sure which sampling technique he should use to sample with unequal probability. This

Die gescorten Rentiere scharren heute schon ganz nervös mit den Hufen. Bald geht es los! Sie freuen sich schon so auf die Reise. Überall auf der Erde ist es so schön geschmückt, alles leuchtet und blinkt! Und vielleicht liegt sogar ein bisschen Schnee. Heute wird mal wieder simuliert. Unsere bisherigen

The FREQ procedure in SAS supports computing exact p-values for many statistical tests. For small and mid-sized problems, the procedure runs very quickly. However, even though PROC FREQ uses efficient methods to avoid unnecessary computations, the computational time required by exact tests might be prohibitively expensive for certain tables. If

How do you simulate a contingency table that has a specified row and column sum? Last week I showed how to simulate a random 2 x 2 contingency table when the marginal frequencies are specified. This article generalizes to random r x c frequency tables (also called cross-tabulations) that have the same marginal row

When modeling and simulating data, it is important to be able to articulate the real-life statistical process that generates the data. Suppose a friend says to you, "I want to simulate two random correlated variables, X and Y." Usually this means that he wants data generated from a multivariate distribution,