Correlation is a fundamental statistical concept that measures the linear association between two variables. There are multiple ways to think about correlation: geometrically, algebraically, with matrices, with vectors, with regression, and more. To paraphrase the great songwriter Paul Simon, there must be 50 ways to view your correlation! But don't

## Tag: **Data Analysis**

A previous article discussed the mathematical properties of the singular value decomposition (SVD) and showed how to use the SVD subroutine in SAS/IML software. This article uses the SVD to construct a low-rank approximation to an image. Applications include image compression and denoising an image. Construct a grayscale image The

Visualizing the correlations between variables often provides insight into the relationships between variables. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables. Recently a SAS programmer asked how

When someone refers to the correlation between two variables, they are probably referring to the Pearson correlation, which is the standard statistic that is taught in elementary statistics courses. Elementary courses do not usually mention that there are other measures of correlation. Why would anyone want a different estimate of

Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates

Last week I blogged about the broken-stick problem in probability, which reminded me that the broken-stick model is one of the many techniques that have been proposed for choosing the number of principal components to retain during a principal component analysis. Recall that for a principal component analysis (PCA) of

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data. Moment-based statistics are sensitive to

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement

Suppose you roll six identical six-sided dice. Chance are that you will see at least one repeated number. The probability that you will see six unique numbers is very small: only 6! / 6^6 ≈ 0.015. This example can be generalized. If you draw a random sample with replacement from

In a previous article, I showed two ways to define a log-likelihood function in SAS. This article shows two ways to compute maximum likelihood estimates (MLEs) in SAS: the nonlinear optimization subroutines in SAS/IML and the NLMIXED procedure in SAS/STAT. To illustrate these methods, I will use the same data

Maximum likelihood estimation (MLE) is a powerful statistical technique that uses optimization techniques to fit parametric models. The technique finds the parameters that are "most likely" to have produced the observed data. SAS provides many tools for nonlinear optimization, so often the hardest part of maximum likelihood is writing down

If you toss a coin 28 times, you would not be surprised to see three heads in a row, such as ...THHHTH.... But what about eight heads in a row? Would a sequence such as THHHHHHHHTH... be a rare event? This question popped into my head last weekend as I

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the the 90th percentile that is reported in Significance magazine. That is not unusual. The data only had

The April 2017 issue of Significance magazine features a cover story by Robert Langkjaer-Bain about the Flint (Michigan) water crisis. For those who don't know, the Flint water crisis started in 2014 when the impoverished city began using the Flint River as a source of city water. The water was

Quick! What is the next term in the numerical sequence 1, 2, 1, 2, 3, 4, 5, 4, 3, 4, ...? If you said '3', then you must be an American history expert, because that sequence represents the number of living US presidents beginning with Washington's inauguration on 30APR1789 and

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance,

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program. A

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how are they used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to

There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data. This article shows a simple trick that you can use to

Restricted cubic splines are a powerful technique for modeling nonlinear relationships by using linear regression models. I have attended multiple SAS Global Forum presentations that show how to use restricted cubic splines in SAS regression procedures. However, the presenters have all used the %RCSPLINE macro (Frank Harrell, 1988) to generate

It is time for Pi Day, 2017! Every year on March 14th (written 3/14 in the US), geeky mathematicians and their friends celebrate "all things pi-related" because 3.14 is the three-decimal approximation to pi. This year I use SAS software to show an amazing fact: you can find your birthday

Sometimes SAS programmers ask about how to analyze quantiles with SAS. Common questions include: How can I compute 95% confidence intervals for a median in SAS? How can I test whether the medians of two independent samples are significantly different? How can I repeat the previous analyses with other percentiles,

Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes,

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to

Last week I wrote about the 10 most popular articles from The DO Loop in 2016. The popular articles tend to be about elementary topics that appeal to a wide range of SAS programmers. Today I present an "editor's choice" list of technical articles that describe more advanced statistical methods

“La Quinta” is Spanish for “next to Denny’s.” -- Mitch Hedberg, comedian Mitch Hedberg's joke resonates with travelers who drive on the US interstate system because many highway exits feature both a La Quinta Inn™ and a Denny's® restaurant within a short distance of each other. But does a

*The DO Loop*in 2016

I wrote 105 posts for The DO Loop blog in 2016. My most popular articles were about data analysis, SAS programming tips, and elementary statistics. Without further ado, here are the most popular articles from 2016. Data Analysis and Visualization Start with a juicy set of data and an interesting

Data governance seems to be the hottest topic at data-related conferences this year, and the question I get asked most often is, “where do we start?” Followed closely by how do we do it, what skills do we need, how do we convince the rest of the organisation to get