Blogs

Tag: Statistical Programming

Programming Tips

Rick Wicklin | January 25, 2021

How to compute the incomplete gamma function in SAS

Years ago, I wrote about how to compute the incomplete beta function in SAS. Recently, a SAS programmer asked about a similar function, called the incomplete gamma function. The incomplete gamma function is a "special function" that arises in applied math, physics, and statistics. You should not confuse the gamma

Read More

Analytics | Programming Tips

Rick Wicklin | January 20, 2021

The stationary block bootstrap in SAS

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling

Read More

Analytics | Learn SAS | Programming Tips

Rick Wicklin | January 13, 2021

The moving block bootstrap for time series

As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total

Read More

Programming Tips

Rick Wicklin | January 11, 2021

Blog posts from 2020 that deserve a second look

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal

Read More

Programming Tips

Rick Wicklin | December 21, 2020

Create a response variable that has a specified R-square value

When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set

Read More

Analytics | Programming Tips

Rick Wicklin | December 17, 2020

Find a vector that has a specified correlation with another vector

Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number

Read More

Analytics | Data Visualization

Rick Wicklin | November 18, 2020

Create scoring data when regressors are correlated

To help visualize regression models, SAS provides the EFFECTPLOT statement in several regression procedures and in PROC PLM, which is a general-purpose procedure for post-fitting analysis of linear models. When scoring and visualizing a model, it is important to use reasonable combinations of the explanatory variables for the visualization. When

Read More

Programming Tips

Rick Wicklin | November 9, 2020

Robust statistics for skewness and kurtosis

Intuitively, the skewness of a unimodal distribution indicates whether a distribution is symmetric or not. If the right tail has more mass than the left tail, the distribution is "right skewed." If the left tail has more mass, the distribution is "left skewed." Thus, estimating skewness requires some estimates about

Read More

Programming Tips

Rick Wicklin | November 4, 2020

The expected value of the tail of a distribution

The expected value of a random variable is essentially a weighted mean over all possible values. You can compute it by summing (or integrating) a probability-weighted quantity over all possible values of the random variable. The expected value is a measure of the "center" of a probability distribution. You can

Read More

Analytics | Programming Tips

Rick Wicklin | October 26, 2020

Confidence intervals for eigenvalues of a correlation matrix

A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic. To quantify the uncertainty, SAS procedures

Read More

Analytics | Programming Tips

Rick Wicklin | October 7, 2020

The Poisson-binomial distribution for hundreds of parameters

A previous article shows how to use a recursive formula to compute exact probabilities for the Poisson-binomial distribution. The recursive formula is an O(N^2) computation, where N is the number of parameters for the Poisson-binomial (PB) distribution. If you have a distribution that has hundreds (or even thousands) of parameters,

Read More

Programming Tips

Rick Wicklin | September 30, 2020

Density, CDF, and quantiles for the Poisson-binomial distribution

When working with a probability distribution, it is useful to know how to compute four essential quantities: a random sample, the density function, the cumulative distribution function (CDF), and quantiles. I recently discussed the Poisson-binomial distribution and showed how to generate a random sample. This article shows how to compute

Read More

Analytics | Programming Tips

Rick Wicklin | September 28, 2020

The Poisson-binomial distribution

The Poisson-binomial distribution is a generalization of the binomial distribution. For the binomial distribution, you carry out N independent and identical Bernoulli trials. Each trial has a probability, p, of success. The total number of successes, which can be between 0 and N, is a binomial random variable. The distribution

Read More

Learn SAS | Programming Tips

Rick Wicklin | September 23, 2020

Working with recurrence relations in SAS

Many textbooks and research papers present formulas that involve recurrence relations. Familiar examples include: The factorial function: Set Fact(0)=1 and define Fact(n) = n*Fact(n-1) for n > 0. The Fibonacci numbers: Set Fib(0)=1 and Fib(1)=1 and define Fib(n) = Fib(n-1) + Fib(n-2) for n > 1. The binomial coefficients (combinations

Read More

Analytics | Learn SAS

Rick Wicklin | September 21, 2020

Regression with inequality constraints on parameters

A previous article discussed how to solve regression problems in which the parameters are constrained to be a specified constant (such as B1 = 1) or are restricted to obey a linear equation such as B4 = –2*B2. In SAS, you can use the RESTRICT statement in PROC REG to

Read More

Analytics | Programming Tips

Rick Wicklin | September 8, 2020

Matrix balancing: Update matrix cells to match row and column sums

Matrix balancing is an interesting problem that has a long history. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. One of the early algorithms for matrix balancing is known as the RAS algorithm, but it is also

Read More

Learn SAS | Programming Tips

Rick Wicklin | August 31, 2020

The best way to generate dummy variables in SAS

On discussion forums, many SAS programmers ask about the best way to generate dummy variables for categorical variables. Well-meaning responders offer all sorts of advice, including writing your own DATA step program, sometimes mixed with macro programming. This article shows that the simplest and easiest way to generate dummy variables

Read More

Learn SAS | Programming Tips

Rick Wicklin | August 5, 2020

Submatrices of matrices

Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square.

Read More

Programming Tips

Rick Wicklin | July 29, 2020

Simulate regression models that incorporate CLASS parameterizations

When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to

Read More

Advanced Analytics | Machine Learning | Programming Tips

Rick Wicklin | July 23, 2020

Fit a multivariate Gaussian mixture model by using the expectation-maximization (EM) algorithm

Last month a SAS programmer asked how to fit a multivariate Gaussian mixture model in SAS. For univariate data, you can use the FMM Procedure, which fits a large variety of finite mixture models. If your company is using SAS Viya, you can use the MBC or GMM procedures, which

Read More

Learn SAS | Programming Tips

Rick Wicklin | July 20, 2020

Compute within-group multivariate statistics and store them in a list

I recently showed how to compute within-group multivariate statistics by using the SAS/IML language. However, a principle of good software design is to encapsulate functionality and write self-contained functions that compute and return the results. What is the best way to return multiple statistics from a SAS/IML module? A convenient

Read More

Advanced Analytics | Programming Tips

Rick Wicklin | July 15, 2020

How to evaluate the multivariate normal log likelihood

The multivariate normal distribution is used frequently in multivariate statistics and machine learning. In many applications, you need to evaluate the log-likelihood function in order to compare how well different models fit the data. The log-likelihood for a vector x is the natural logarithm of the multivariate normal (MVN) density

Read More

Analytics | Data Visualization | Learn SAS

Rick Wicklin | July 1, 2020

Pooled, within-group, and between-group covariance matrices

A previous article discusses the pooled variance for two or more groups of univariate data. The pooled variance is often used during a t test of two independent samples. For multivariate data, the analogous concept is the pooled covariance matrix, which is an average of the sample covariance matrices of the

Read More

Analytics | Programming Tips

Rick Wicklin | June 24, 2020

The Kolmogorov D distribution and exact critical values

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in

Read More

Advanced Analytics | Machine Learning

Rick Wicklin | May 26, 2020

The Kullback–Leibler divergence between discrete probability distributions

If you have been learning about machine learning or mathematical statistics, you might have heard about the Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution. This article explains the Kullback–Leibler divergence and shows

Read More

Learn SAS | Programming Tips

Rick Wicklin | March 18, 2020

Print SAS/IML variables with formats

A SAS/IML programmer asked about the best way to print multiple SAS/IML variables when each variable needs a different format. He wanted the output to resemble the "Parameter Estimates" table that is produced by PROC REG and other SAS/STAT procedures. This article shows four ways to print SAS/IML vectors in

Read More

Analytics | Programming Tips

Rick Wicklin | March 16, 2020

Predict a random integer: The tradeoff between bias and variance

Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a

Read More

Advanced Analytics | Data Visualization | Programming Tips

Rick Wicklin | March 9, 2020

ROC curves for a binormal sample

In a previous article, I discussed the binormal model for a binary classification problem. This model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the Negative population is less than the mean of scores for the Positive population. I

Read More

Learn SAS | Programming Tips

Rick Wicklin | March 4, 2020

Store pre-computed matrices in a list

Suppose that a data set contains a set of parameter values. For each row of parameters, you need to perform some computation. A recent discussion on the SAS Support Communities mentions an important point: if there are duplicate rows in the data, a program might repeat the same computation several

Read More

Analytics | Data Visualization

Rick Wicklin | February 26, 2020

The binormal model for ROC curves

The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is

Read More
