Many nonparametric statistical methods use the ranks of observations to compute distribution-free statistics. In SAS, two procedures that use ranks are PROC NPAR1WAY and PROC CORR. Whereas the SPEARMAN option in PROC CORR (which computes rank correlation) uses only the "raw" tied ranks, PROC NPAR1WAY uses transformations of the ranks,
Tag: Statistical Programming
For many univariate statistics (mean, median, standard deviation, etc.), the order of the data is unimportant. If you sort univariate data, the mean and standard deviation do not change. However, you cannot sort an individual variable (independently) if you want to preserve its relationship with other variables. This statement is
When data contain outliers, medians estimate the center of the data better than means do. In general, robust estimates of location and sale are preferred over classical moment-based estimates when the data contain outliers or are from a heavy-tailed distribution. Thus, instead of using the mean and standard deviation of
I previously wrote about how to understand standardized regression coefficients in PROC REG in SAS. You can obtain the standardized estimates by using the STB option on the MODEL statement in PROC REG. Several readers have written to ask whether I could write a similar article about the STDCOEF option
You can standardize a numerical variable by subtracting a location parameter from each observation and then dividing by a scale parameter. Often, the parameters depend on the data that you are standardizing. For example, the most common way to standardize a variable is to subtract the sample mean and divide
Odani's truism is a mathematical result that says that if you want to compare the fractions a/b and c/d, it often is sufficient to compare the sums (a+d) and (b+c) rather than the products a*d and b*c. (All of the integers a, b, c, and d are positive.) If you
There are many statistics that measure whether two continuous random variables are independent or whether they are related to each other in some way. The most well-known statistic is Pearson's correlation, which is a parametric measure of the linear relationship between two variables. A related measure is Spearman's rank correlation,
A SAS customer wanted to compute the cumulative distribution function (CDF) of the generalized gamma distribution. For any continuous distribution, the CDF is the integral of the probability density function (PDF), which usually has an explicit formula. Accordingly, he wanted to compute the CDF by using the QUAD function in
I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and
In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact: If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere. I was
Imagine an animal that is searching for food in a vast environment where food is scarce. If no prey is nearby, the animal's senses (such as smell and sight) are useless. In that case, a reasonable search strategy is a random walk. The animal can choose a random direction, walk/swim/fly
The inverse gamma distribution is a continuous probability distribution that is used in Bayesian analysis and in some statistical models. The inverse gamma distribution is closely related to the gamma distribution. For any probability distribution, it is essential to know how to compute four functions: the PDF function, which returns
Years ago, I wrote about how to compute the incomplete beta function in SAS. Recently, a SAS programmer asked about a similar function, called the incomplete gamma function. The incomplete gamma function is a "special function" that arises in applied math, physics, and statistics. You should not confuse the gamma
This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling
As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total
On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal
When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set
Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number
To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When
Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires some estimates about
The expected value of a random variable is essentially a weighted mean over all possible values. You can compute it by summing (or integrating) a probability-weighted quantity over all possible values of the random variable. The expected value is a measure of the "center" of a probability distribution. You can
A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic. To quantify the uncertainty, SAS procedures
A previous article shows how to use a recursive formula to compute exact probabilities for the Poisson-binomial distribution. The recursive formula is an O(N2) computation, where N is the number of parameters for the Poisson-binomial (PB) distribution. If you have a distribution that has hundreds (or even thousands) of parameters,
When working with a probability distribution, it is useful to know how to compute four essential quantities: a random sample, the density function, the cumulative distribution function (CDF), and quantiles. I recently discussed the Poisson-binomial distribution and showed how to generate a random sample. This article shows how to compute
The Poisson-binomial distribution is a generalization of the binomial distribution. For the binomial distribution, you carry out N independent and identical Bernoulli trials. Each trial has a probability, p, of success. The total number of successes, which can be between 0 and N, is a binomial random variable. The distribution
Many textbooks and research papers present formulas that involve recurrence relations. Familiar examples include: The factorial function: Set Fact(0)=1 and define Fact(n) = n*Fact(n-1) for n > 0. The Fibonacci numbers: Set Fib(0)=1 and Fib(1)=1 and define Fib(n) = Fib(n-1) + Fib(n-2) for n > 1. The binomial coefficients (combinations
A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to
Matrix balancing is an interesting problem that has a long history. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. One of the early algorithms for matrix balancing is known as the RAS algorithm, but it is also
On discussion forums, many SAS programmers ask about the best way to generate dummy variables for categorical variables. Well-meaning responders offer all sorts of advice, including writing your own DATA step program, sometimes mixed with macro programming. This article shows that the simplest and easiest way to generate dummy variables
Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square.