I recently showed how to use simulation to estimate the power of a statistical hypothesis test. The example (a two-sample t test for the difference of means) is a simple SAS/IML module that is very fast. Fast is good because often you want to perform a sequence of simulations over

# Author

A previous article about standardizing data in groups shows how to simulate data from two groups. One sample (with n1=20 observations) is simulated from an N(15, 5) distribution whereas a second (with n2=30 observations) is simulated from an N(16, 5) distribution. The sample means of the two groups are close

A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization: Perform a different standardization

Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square.

The iml action was introduced in Viya 3.5. As shown in a previous article, the iml action supports ways to implement the map-reduce paradigm, which is a way to distribute a computation by using multiple threads. The map-reduce paradigm is ideal for “embarrassingly parallel” computations, which are composed of many

When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to

The Kronecker product (also called the direct product) is a binary operation that combines two matrices to form a new matrix. The Kronecker product appears in textbooks about the design of experiments and multivariate statistics. The Kronecker product seems intimidating at first, but often one of the matrices in the

Last month a SAS programmer asked how to fit a multivariate Gaussian mixture model in SAS. For univariate data, you can use the FMM Procedure, which fits a large variety of finite mixture models. If your company is using SAS Viya, you can use the MBC or GMM procedures, which

I recently showed how to compute within-group multivariate statistics by using the SAS/IML language. However, a principal of good software design is to encapsulate functionality and write self-contained functions that compute and return the results. What is the best way to return multiple statistics from a SAS/IML module? A convenient

The multivariate normal distribution is used frequently in multivariate statistics and machine learning. In many applications, you need to evaluate the log-likelihood function in order to compare how well different models fit the data. The log-likelihood for a vector x is the natural logarithm of the multivariate normal (MVN) density

A previous article introduces the MAPREDUCE function in the iml action. (The iml action was introduced in Viya 3.5.) The MAPREDUCE function implements the map-reduce paradigm, which is a two-step process for distributing a computation to multiple threads. The example in the previous article adds a set of numbers by

The iml action in SAS Viya (introduced in Viya 3.5) provides a set of general programming tools that you can use to implement a custom parallel algorithm. This makes the iml action different than other Viya actions, which use distributed computations to solve specific problems in statistics, machine learning, and

Testing people for coronavirus is a public health measure that reduces the spread of coronavirus. Dr. Anthony Fauci, a US infectious disease expert, recently mentioned the concept of "pool testing." The verb "to pool" means "to combine from different sources." In a USA Today article, Dr. Deborah Birx, the coordinator

A previous article discusses the pooled variance for two or groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the

The first time I saw a formula for the pooled variance, I was quite confused. It looked like Frankenstein's monster, assembled from bits and pieces of other quantities and brought to life by a madman. However, the pooled variance does not have to be a confusing monstrosity. The verb "to

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in

Sometimes in matrix computations, it is important to display the nonzero elements of a matrix. This can be useful for visualizing the structure of a sparse matrix (one that has many zeros) and it is also useful when describing a matrix algorithm (such as Gaussian elimination) that introduces zeros at

A previous article shows how to use the iml action to read a CAS data table into an IML matrix. This article shows how to write a CAS table from data in an IML matrix. You can read an overview of the iml action, which was introduced in SAS Viya

A previous article compares a SAS/IML program that runs in PROC IML to the same program that runs in the iml action. (You can read an overview of the iml action.) The example in the previous article was very simple and did not read or write data. This article compares

A previous article provides an introduction and overview of the iml action, which is available in SAS Viya 3.5. The article compares the iml action to PROC IML and states that most PROC IML programs can be modified to run in iml action. This article takes a closer look at

This article introduces the iml action, which is available in SAS Viya 3.5. The iml action supports most of the same syntax and functionality as the SAS/IML matrix language, which is implemented in PROC IML. With minimal changes, most programs that run in PROC IML also run in the iml

A SAS customer asked how to specify interaction effects between a classification variable and a spline effect in a SAS regression procedure. There are at least two ways to do this. If the SAS procedure supports the EFFECT statement, you can build the interaction term in the MODEL statement. For

I recently read an article that describes ways to compute confidence intervals for the difference in a percentile between two groups. In Eaton, Moore, and MacKenzie (2019), the authors describe a problem in hydrology. The data are the sizes of pebbles (grains) in rivers at two different sites. The authors

In a previous article, I discussed the definition of the Kullback-Leibler (K-L) divergence between two discrete probability distributions. For completeness, this article shows how to compute the Kullback-Leibler divergence between two continuous distributions. When f and g are discrete distributions, the K-L divergence is the sum of f(x)*log(f(x)/g(x)) over all

The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning is to measure how distributions in a parametric family differ from a data distribution. This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a

If you have been learning about machine learning or mathematical statistics, you might have heard about the Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution. This article explains the Kullback–Leibler divergence and shows

This article shows how to perform two-dimensional bilinear interpolation in SAS by using a SAS/IML function. It is assumed that you have observed the values of a response variable on a regular grid of locations. A previous article showed how to interpolate inside one rectangular cell. When you have a

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in

This article shows how to find local maxima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for

I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to