In simulation studies, sometimes you need to simulate outliers. For example, in a simulation study of regression techniques, you might want to generate outliers in the explanatory variables to see how the technique handles high-leverage points. This article shows how to generate outliers in multivariate normal data that are a

# Author

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate

An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical

It's time to celebrate Pi Day! Every year on March 14th (written 3/14 in the US), math-loving folks celebrate "all things pi-related" because 3.14 is the three-decimal approximation to the mathematical constant, π. Although children learn that pi is approximately 3.14159..., the actual definition of π is the ratio of

A SAS programmer posted an interesting question on a SAS discussion forum. The programmer wanted to iterate over hundreds of SAS data sets, read in all the character variables, and then do some analysis. However, not every data set contains character variables, and SAS complains when you ask it to

A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many

Standardized tests like the SAT and ACT can cause stress for both high school students and their parents, but according to a Wall Street Journal article, the SAT and ACT "provide an invaluable measure of how students are likely to perform in college and beyond." Naturally, students wonder how their

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or

When I run a bootstrap analysis, I create graphs to visualize the distribution of the bootstrap statistics. For example, in my article about how to bootstrap the difference of means in a two-sample t test, I included a histogram of the bootstrap distribution and added reference lines to indicate a

Last year I published a series of blogs posts about how to create a calibration plot in SAS. A calibration plot is a way to assess the goodness of fit for a logistic model. It is a diagnostic graph that enables you to qualitatively compare a model's predicted probability of

Maybe if we think and wish and hope and pray It might come true. Oh, wouldn't it be nice? The Beach Boys Months ago, I wrote about how to use the EFFECT statement in SAS to perform regression with restricted cubic splines. This is the modern way to use splines

When you use maximum likelihood estimation (MLE) to find the parameter estimates in a generalized linear regression model, the Hessian matrix at the optimal solution is very important. The Hessian matrix indicates the local shape of the log-likelihood surface near the optimal value. You can use the Hessian to estimate

Have you ever run a regression model in SAS but later realize that you forgot to specify an important option or run some statistical test? Or maybe you intended to generate a graph that visualizes the model, but you forgot? Years ago, your only option was to modify your program

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This

I previously discussed how you can use validation data to choose between a set of competing regression models. In that article, I manually evaluated seven models for a continuous response on the training data and manually chose the model that gave the best predictions for the validation data. Fortunately, SAS

Machine learning differs from classical statistics in the way it assesses and compares competing models. In classical statistics, you use all the data to fit each model. You choose between models by using a statistic (such as AIC, AICC, SBC, ...) that measures both the goodness of fit and the

This article shows how to use SAS to simulate data that fits a linear regression model that has categorical regressors (also called explanatory or CLASS variables). Simulating data is a useful skill for both researchers and statistical programmers. You can use simulation for answering research questions, but you can also

Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC

Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing. Training data is used to fit each model. Validation data is a random sample that is used for model selection. These data are used to select

A quantile-quantile plot (Q-Q plot) is a graphical tool that compares a data distribution and a specified probability distribution. If the points in a Q-Q plot appear to fall on a straight line, that is evidence that the data can be approximately modeled by the target distribution. Although it is

When you overlay two series in PROC SGPLOT, you can either plot both series on the same axis or you can assign one series to the main axis (Y) and another to a secondary axis (Y2). If you use the Y and Y2 axes, they are scaled independently by default,

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about

Deming regression (also called errors-in-variables regression) is a total regression method that fits a regression line when the measurements of both the explanatory variable (X) and the response variable (Y) are assumed to be subject to normally distributed errors. Recall that in ordinary least squares regression, the explanatory variable (X)

*The DO Loop*in 2018

Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category. Data Visualization Visualize repetition in song lyrics: In one

I regularly see questions on a SAS discussion forum about how to visualize the predicted values for a mixed model that has at least one continuous variable, a categorical variable, and possibly an interaction term. SAS procedures such as GLM, GENMOD, and LOGISTIC can automatically produce plots of the predicted

Many data analysts use a quantile-quantile plot (Q-Q plot) to graphically assess whether data can be modeled by a probability distribution such as the normal, lognormal, or gamma distribution. You can use the QQPLOT statement in PROC UNIVARIATE to create a Q-Q plot for about a dozen common distributions. However,

This article describes best practices and techniques that every data analyst should know before bootstrapping in SAS. The bootstrap method is a powerful statistical technique, but it can be a challenge to implement it efficiently. An inefficient bootstrap program can take hours to run, whereas a well-written program can give

The best way to spread Christmas cheer is singing loud for all to hear! -Buddy in Elf In the Christmas movie Elf (2003), Jovie (played by Zooey Deschanel) must "spread Christmas cheer" to help Santa. She chooses to sing "Santa Claus is coming to town," and soon all of New