The collinearity problem is to determine whether three points in the plane lie along a straight line. You can solve this problem by using middle-school algebra. An algebraic solution requires three steps. First, name the points: p, q, and r. Second, find the parametric equation for the line that passes

# Author

Plot rates, not counts. This maxim is often stated by data visualization experts, but often ignored by practitioners. You might also hear the related phrases "plot proportions" or "plot percentages," which mean the same thing but express the idea alliteratively. An example in a previous article about avoiding alphabetical ordering

Converting a program from one language to another can be a challenge. Even if the languages share many features, there is often syntax that is valid in one language that is not valid in another. Recently, a SAS programmer was converting a program from R to SAS IML. He reached

Howard Wainer, who used to write the "Visual Revelations" column in Chance magazine, often reminded his readers that "we are almost never interested in seeing Alabama first" (2005, Graphic Discovery, p. 72). His comment is a reminder that when we plot data for a large number of categories (states, countries,

Sometimes it is helpful to display a table of statistics directly on a graph. A simple example is displaying the number of observations and the mean or median on a histogram. In SAS, the term inset is used to describe a table that is displayed on a graph. This article

In several previous articles, I've shown how to use SAS to fit models to data by using maximum likelihood estimation (MLE). However, I have not previously shown how to obtain standard errors for the estimates. This article combines two previous articles to show how to obtain MLE estimates and the

A previous article shows how to use Monte Carlo simulation to approximate the sampling distribution of the sample mean and sample median. When x ~ N(0,1) are normal data, the sample mean is also normal, and there are simple formulas for the expected value and the standard error of the

An elementary course in statistics often includes a discussion of the sampling distribution of a statistic. The canonical example is the sampling distribution of the sample mean. For samples of size n that are drawn from a normal distribution (X ~ N(μ, σ)), the sample mean is normally distributed as

A previous article discusses the birthday problem and its generalizations. The classic birthday problem asks, "In a room that contains N people, what is the probability that two or more people share a birthday?" The probability is much higher than you might think. For example, in a room that contains

The birthday-matching problem (also called the birthday paradox or simply the birthday problem) is a classic problem in probability. Simply stated, the birthday-matching problem asks, "If there are N people in a room, what is the chance that two of them have the same birthday?" The problem is sometimes called
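As a rough illustration of the classic birthday-matching question, here is a minimal sketch in Python (not SAS). It assumes 365 equally likely birthdays and independent people. The function name `birthday_match_probability` is hypothetical and is not taken from the article.

```python
def birthday_match_probability(n, days=365):
    """Probability that at least two of n people share a birthday.

    Assumes birthdays are independent and uniform over `days` days
    (an idealization; real birthdays are not exactly uniform).
    """
    if n > days:
        return 1.0  # pigeonhole principle: a match is certain
    p_no_match = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match

# Print the probability of a match for a few room sizes
for n in (10, 23, 50):
    print(n, round(birthday_match_probability(n), 4))
```

Under these assumptions, the probability first exceeds 1/2 at N = 23, which is why the result is often called a paradox.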

Recently I wrote about a numerical analysis problem: the accurate computation of log(1+x) when x is close to 0. A naive computation of log(1+x) loses accuracy if you call the LOG function, which is why the SAS language provides the built-in LOG1PX function for this computation. In addition, I showed that you

SAS supports a special function for the accurate evaluation of log(1+x) when x is near 0. The LOG1PX function is useful because a naive computation of log(1+x) loses accuracy when x is near 0. This article demonstrates two general approximation techniques that are often used in numerical analysis: the Taylor
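The loss of accuracy can be seen in a few lines. The following sketch uses Python's `math.log1p` as a stand-in for the SAS LOG1PX function (an assumption made for illustration; it is not the SAS implementation). The helper `log1p_taylor` is a hypothetical name for a truncated Taylor polynomial, which is one standard approximation technique near 0.

```python
import math

def log1p_taylor(x, terms=4):
    """Truncated Taylor polynomial x - x^2/2 + x^3/3 - ... for log(1+x).

    Only reasonable for |x| much smaller than 1.
    """
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, terms + 1))

x = 1e-17
naive = math.log(1 + x)    # 1 + x rounds to exactly 1.0 in double precision
accurate = math.log1p(x)   # evaluates log(1+x) without forming 1 + x

print("naive   :", naive)
print("log1p   :", accurate)
print("Taylor  :", log1p_taylor(x))
```

For this value of x, the naive expression returns zero because the sum 1 + x cannot be distinguished from 1, whereas the dedicated function and the Taylor polynomial both return a value close to x.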

The documentation for Python's SciPy package provides a table that concisely summarizes functions that are associated with continuous probability distributions. This article provides a similar table for SAS functions. For more information on the CDF, PDF, quantile, and random-variate functions, see "Four essential functions for statistical programmers." SAS functions for

A previous article shows ways to perform efficient BY-group processing in the SAS IML language. BY-group processing is a SAS-ism for what other languages call group processing or subgroup processing. The main idea is that the data set contains several discrete variables such as sex, race, education level, and so

One thing I have learned about rank-based statistics over the years is "Be careful of tied values!" On multiple occasions, I have been asked, "Why doesn't the SAS result for [NAME] statistic agree with my hand calculation?" The answer is sometimes because of the way that tied values are handled.

Many useful matrices in applied math and statistics have a banded structure. Examples include diagonal matrices, tridiagonal matrices, banded matrices, and Toeplitz matrices. An example of an unsymmetric Toeplitz matrix is shown to the right. Notice that the matrix is constant along each diagonal, including sub- and superdiagonals. Recently, I

The other day I was trying to numerically integrate the function f(x) = sin(x)/x on the domain [0,∞). The graph of this function is shown to the right. In SAS, you can use the QUAD subroutine in SAS IML software to perform numerical integration. Some numerical integrators have difficulty computing

Did you know that you can embed one graph inside another by using PROC SGPLOT in SAS? A typical example is shown to the right. The large graph shows kernel density estimates for the distribution of the Cholesterol variable among male and female patients in a heart study. The small

I don't often use the SG annotation facility in SAS for adding annotations to statistical graphics, but when I do, I enjoy the convenience of the SG annotation macros. I can never remember the details of the SG annotation commands, but I know that the SG annotation macros will create

Many SAS procedures support a BY statement that enables you to perform an analysis for each unique value of a BY-group variable. The SAS IML language does not support a BY statement, but you can program a loop that iterates over all BY groups. You can emulate BY-group processing by

There are many ways to model a set of raw data by using a continuous probability distribution. It can be challenging, however, to choose the distribution that best models the data. Are the data normal? Lognormal? Is there a theoretical reason to prefer one distribution over another? The SAS has

Does anyone write paper checks anymore? According to researchers at the Federal Reserve Bank of Atlanta (Greene, et al., 2020), the use of paper checks has declined 63% among US consumers since the year 2000. The researchers surveyed more than 3,000 consumers in 2017-2018 and discovered that only 7% of

I have previously written about how to efficiently generate points uniformly at random inside a sphere (often called a ball by mathematicians). The method uses a mathematical fact from multivariate statistics: If X is drawn from the uncorrelated multivariate normal distribution in dimension d, then S = r*X / ||X|| has

A previous article shows how to use the MODELAVERAGE statement in PROC GLMSELECT in SAS to perform a basic bootstrap analysis of the regression coefficients and fit statistics. A colleague asked whether PROC GLMSELECT can construct bootstrap confidence intervals for the predicted mean in a regression model, as described in

I've written many articles about bootstrapping in SAS, including several about bootstrapping in regression models. Many of the articles use a very general bootstrap method that can bootstrap almost any statistic that SAS can compute. The method uses PROC SURVEYSELECT to generate B bootstrap samples from the data, uses the

It has been more than a decade since SAS 9.3 changed the default ODS destination from the old LISTING destination to more modern destinations such as HTML. One of the advantages of modern output destinations is support for Unicode symbols, superscripts, subscripts, and for formatting text by using boldface, italics,

In ordinary least squares regression, there is an explicit formula for the confidence limit of the predicted mean. That is, for any observed value of the explanatory variables, you can create a 95% confidence interval (CI) for the predicted response. This formula assumes that the model is correctly specified and

A SAS programmer wanted to use PROC SGPLOT in SAS to visualize a regression model. The programmer wanted to visualize confidence limits for the predicted mean at certain values of the explanatory variable. This article shows two options for adding confidence limits to a scatter plot. You can use a

The acceptance-rejection method (sometimes called rejection sampling) is a method that enables you to generate a random sample from an arbitrary distribution by using only the probability density function (PDF). This is in contrast to the inverse CDF method, which uses the cumulative distribution function (CDF) to generate a random
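To make the idea concrete, here is a minimal Python sketch of acceptance-rejection sampling. The target density f(x) = 6x(1-x) on [0,1], the uniform proposal, and the names `target_pdf` and `rejection_sample` are all assumptions chosen for illustration; they are not taken from the article.

```python
import random

def target_pdf(x):
    """Illustrative target density on [0, 1]: f(x) = 6 x (1 - x)."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, seed=None):
    """Draw n values from target_pdf by acceptance-rejection.

    Proposal: Uniform(0, 1), whose density is g(x) = 1.
    Envelope constant: M = 1.5, the maximum of target_pdf (at x = 0.5),
    so that target_pdf(x) <= M * g(x) everywhere on [0, 1].
    """
    rng = random.Random(seed)
    M = 1.5
    accepted = []
    while len(accepted) < n:
        x = rng.random()          # candidate from the proposal
        u = rng.random()          # uniform variate for the accept test
        if u * M <= target_pdf(x):
            accepted.append(x)    # accept with probability f(x) / (M g(x))
    return accepted

sample = rejection_sample(10_000, seed=1)
print("sample mean:", sum(sample) / len(sample))  # target mean is 0.5
```

Only the PDF is evaluated; no CDF or inverse CDF is needed. The price is wasted candidates: with this envelope, about one in three candidates is rejected on average, because the acceptance rate is 1/M.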

There are dozens of common probability distributions for a continuous univariate random variable. Familiar examples include the normal, exponential, uniform, gamma, and beta distributions. Where did these distributions come from? Well, some mathematician needed a model for a stochastic process and wrote down the equation for the distribution, typically by