From the early days of probability and statistics, researchers have tried to organize and categorize parametric probability distributions. For example, Pearson (1895, 1901, and 1916) developed a system of seven distributions, which was later called the Pearson system. The main idea behind a "system" of distributions is that for each

## Tag: **Data Analysis**

Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal. Advanced topics

*The DO Loop*in 2019

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category. SAS programming tips Create training, testing, and validation data

A 2-D "bin plot" counts the number of observations in each cell in a regular 2-D grid. The 2-D bin plot is essentially a 2-D version of a histogram: it provides an estimate for the density of a 2-D distribution. As I discuss in the article, "The essential guide to

Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of an animal are

This is a second article about analyzing longitudinal data, which features measurements that are repeatedly taken on subjects at several points in time. The previous article discusses a response-profile analysis, which uses an ANOVA method to determine differences between the means of an experimental group and a placebo group. The

Longitudinal data are used in many health-related studies in which individuals are measured at multiple points in time to monitor changes in a response variable, such as weight, cholesterol, or blood pressure. There are many excellent articles and books that describe the advantages of a mixed model for analyzing longitudinal

In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data, However, for generalized linear models, there is a potential source of confusion. Recall that a

Biplots are two-dimensional plots that help to visualize relationships in high dimensional data. A previous article discusses how to interpret biplots for continuous variables. The biplot projects observations and variables onto the span of the first two principal components. The observations are plotted as markers; the variables are plotted as

In grade school, students learn how to round numbers to the nearest integer. In later years, students learn variations, such as rounding up and rounding down by using the greatest integer function and least integer function, respectively. My sister, who is an engineer, learned a rounding method that rounds half-integers

Principal component analysis (PCA) is an important tool for understanding relationships in continuous multivariate data. When the first two principal components (PCs) explain a significant portion of the variance in the data, you can visualize the data by projecting the observations onto the span of the first two PCs. In

Understanding multivariate statistics requires mastery of high-dimensional geometry and concepts in linear algebra such as matrix factorizations, basis vectors, and linear subspaces. Graphs can help to summarize what a multivariate analysis is telling us about the data. This article looks at four graphs that are often part of a principal

Computing rates and proportions is a common task in data analysis. When you are computing several proportions, it is helpful to visualize how the rates vary among subgroups of the population. Examples of proportions that depend on subgroups include: Mortality rates for various types of cancers Incarceration rates by race

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates

I recently wrote about how to use PROC TTEST in SAS/STAT software to compute the geometric mean and related statistics. This prompted a SAS programmer to ask a related question. Suppose you have dozens (or hundreds) of variables and you want to compute the geometric mean of each. What is

In a recent video blog, I discuss forecast accuracy as a parameter for measuring the ability to forecast and plan demand. I further argue for the use of causal data as a key input to understanding historical demand and forecasting/planning future demand. Forecast accuracy is often claimed NOT to be

In a previous article, I mentioned that the VLINE statement in PROC SGPLOT is an easy way to graph the mean response at a set of discrete time points. I mentioned that you can choose three options for the length of the "error bars": the standard deviation of the data,

I frequently see questions on SAS discussion forums about how to compute the geometric mean and related quantities in SAS. Unfortunately, the answers to these questions are sometimes confusing or even wrong. In addition, some published papers and web sites that claim to show how to calculate the geometric mean

A moving average is a statistical technique that is used to smooth a time series. My colleague, Cindy Wang, wrote an article about the Hull moving average (HMA), which is a time series smoother that is sometimes used as a technical indicator by stock market traders. Cindy showed how to

When you order an item online, the website often recommends other items based on your purchase. In fact, these kinds of "recommendation engines" contributed to the early success of companies like Amazon and Netflix. SAS uses a recommender engine to suggest articles on the SAS Support Communities. Although recommender engines

An important application of the dot product (inner product) of two vectors is to determine the angle between the vectors. If u and v are two vectors, then cos(θ) = (u ⋅ v) / (|u| |v|) You could apply the inverse cosine function if you wanted to find θ in

Most SAS programmers know how to use PROC APPEND or the SET statement in DATA step to unconditionally append new observations to an existing data set. However, sometimes you need to scan the data to determine whether or not to append observations. In this situation, many SAS programmers choose one

An important application of nonlinear optimization is finding parameters of a model that fit data. For some models, the parameters are constrained by the data. A canonical example is the maximum likelihood estimation of a so-called "threshold parameter" for the three-parameter lognormal distribution. For this distribution, the objective function is

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates by "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques

Binning transforms a continuous numerical variable into a discrete variable with a small number of values. When you bin univariate data, you define cut point that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name

I recently showed how to create an annotation data set that will overlay cell counts or percentages on a mosaic plot. A mosaic plot is a visual representation of a cross-tabulation of observed frequencies for two categorical variables. The mosaic plot with cell counts is shown to the right. The

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the