Do you ever use a permutation matrix to change the order of rows or columns in a matrix? Did you know that there is a more efficient way in matrix-oriented languages such as SAS/IML, MATLAB, and R? Remember the following tip: Never multiply with a large permutation matrix! Instead, use

## Tag: **Statistical Programming**

A SAS programmer recently asked why his SAS program and his colleague's R program display different estimates for the quantiles of a very small data set (less than 10 observations). I pointed the programmer to my article that compares the nine common definitions for sample quantiles. The article has a

To get better at something, you need to practice. That maxim applies to sports, music, and programming. If you want to be a better programmer, you need to write many programs. This article provides an example of forming the intersection of items in a SAS/IML list. It then provides several

After my recent articles on simulating data by using copulas, many readers commented about the power of copulas. Yes, they are powerful, and the geometry of copulas is beautiful. However, it is important to be aware of the limitations of copulas. This article creates a bizarre example of bivariate data,

Do you know what a copula is? It is a popular way to simulate multivariate correlated data. The literature for copulas is mathematically formidable, but this article provides an intuitive introduction to copulas by describing the geometry of the transformations that are involved in the simulation process. Although there are

A recent article about how to estimate a two-dimensional distribution function in SAS inspired me to think about a related computation: a 2-D cumulative sum. Suppose you have numbers in a matrix, X. A 2-D cumulative sum is a second matrix, C, such that the C[p,q] gives the sum of

This article shows how to estimate and visualize a two-dimensional cumulative distribution function (CDF) in SAS. SAS has built-in support for this computation. Although the bivariate CDF is not used as much as the univariate CDF, the bivariate version is still a useful tool in understanding the probable values of

This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is

A previous article showed how to simulate multivariate correlated data by using the Iman-Conover transformation (Iman and Conover, 1982). The transformation preserves the marginal distributions of the original data but permutes the values (columnwise) to induce a new correlation among the variables. When I first read about the Iman-Conover transformation,

Simulating univariate data is relatively easy. Simulating multivariate data is much harder. The main difficulty is to generate variables that have given univariate distributions but also are correlated with each other according to a specified correlation matrix. However, Iman and Conover (1982, "A distribution-free approach to inducing rank correlation among

Many nonparametric statistical methods use the ranks of observations to compute distribution-free statistics. In SAS, two procedures that use ranks are PROC NPAR1WAY and PROC CORR. Whereas the SPEARMAN option in PROC CORR (which computes rank correlation) uses only the "raw" tied ranks, PROC NPAR1WAY uses transformations of the ranks,

For many univariate statistics (mean, median, standard deviation, etc.), the order of the data is unimportant. If you sort univariate data, the mean and standard deviation do not change. However, you cannot sort an individual variable (independently) if you want to preserve its relationship with other variables. This statement is

When data contain outliers, medians estimate the center of the data better than means do. In general, robust estimates of location and sale are preferred over classical moment-based estimates when the data contain outliers or are from a heavy-tailed distribution. Thus, instead of using the mean and standard deviation of

I previously wrote about how to understand standardized regression coefficients in PROC REG in SAS. You can obtain the standardized estimates by using the STB option on the MODEL statement in PROC REG. Several readers have written to ask whether I could write a similar article about the STDCOEF option

You can standardize a numerical variable by subtracting a location parameter from each observation and then dividing by a scale parameter. Often, the parameters depend on the data that you are standardizing. For example, the most common way to standardize a variable is to subtract the sample mean and divide

Odani's truism is a mathematical result that says that if you want to compare the fractions a/b and c/d, it often is sufficient to compare the sums (a+d) and (b+c) rather than the products a*d and b*c. (All of the integers a, b, c, and d are positive.) If you

There are many statistics that measure whether two continuous random variables are independent or whether they are related to each other in some way. The most well-known statistic is Pearson's correlation, which is a parametric measure of the linear relationship between two variables. A related measure is Spearman's rank correlation,

A SAS customer wanted to compute the cumulative distribution function (CDF) of the generalized gamma distribution. For any continuous distribution, the CDF is the integral of the probability density function (PDF), which usually has an explicit formula. Accordingly, he wanted to compute the CDF by using the QUAD function in

I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and

In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact: If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere. I was

Imagine an animal that is searching for food in a vast environment where food is scarce. If no prey is nearby, the animal's senses (such as smell and sight) are useless. In that case, a reasonable search strategy is a random walk. The animal can choose a random direction, walk/swim/fly

The inverse gamma distribution is a continuous probability distribution that is used in Bayesian analysis and in some statistical models. The inverse gamma distribution is closely related to the gamma distribution. For any probability distribution, it is essential to know how to compute four functions: the PDF function, which returns

Years ago, I wrote about how to compute the incomplete beta function in SAS. Recently, a SAS programmer asked about a similar function, called the incomplete gamma function. The incomplete gamma function is a "special function" that arises in applied math, physics, and statistics. You should not confuse the gamma

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling

As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal

When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set

Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When

Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires some estimates about