A remarkable result in probability theory is the "three-sigma rule," which is a generic name for theorems that bound the probability that a univariate random variable will appear near the center of its distribution. This article discusses the familiar three-sigma rule for the normal distribution, a less-familiar rule for unimodal
Tag: Statistical Thinking
As announced and demonstrated at SAS Innovate 2024, SAS plans to include a generative AI assistant called SAS Viya Copilot in the forthcoming SAS Viya Workbench. You can submit a text prompt (by putting it in a comment string) and the Copilot will generate SAS code for you. My colleagues
This article discusses how to scale a probability density curve so that it fits appropriately on a histogram, as shown in the graph to the right. By definition, a probability density curve is scaled so that the area under the curve equals 1. However, a histogram might show counts or
After writing a program that simulates data, it is important to check that the statistical properties of the simulated (synthetic) data match the properties of the model. As a first step, you can generate a large random sample from the model distribution and compare the sample statistics to the expected
A SAS statistical programmer recently asked a theoretical question about statistics. "I've read that 'p-values are uniformly distributed under the null hypothesis,'" he began, "but what does that mean in practice? Is it important?" I think data simulation is a great way to discuss the conditions for which p-values are
At a recent conference in Las Vegas, a presenter simulated the sum of two dice and used it to simulate the game of craps. I write a lot of simulations, so I'd like to discuss two related topics: How to simulate the sum of two dice in SAS. This is
A statistical analyst used the GENMOD procedure in SAS to fit a linear regression model. He noticed that the table of parameter estimates has an extra row (labeled "Scale") that is not a regression coefficient. The "scale parameter" is not part of the parameter estimates table produced by PROC REG
A previous article explains the Spearman rank correlation, which is a robust cousin to the more familiar Pearson correlation. I've also discussed why you might want to use rank correlation, and how to interpret the strength of a rank correlation. This article gives a short example that helps you to
A lot of programmers have been impressed by the ability of ChatGPT, GPT-4, and Bing Chat to write computer programs. Recently, I wrote an article that discusses an elementary programming assignment, called FizzBuzz, which is sometimes used as part of a hiring process to assess a candidate's basic knowledge of
The "Teacher’s Corner" of The American Statistician enables statisticians to discuss topics that are relevant to teaching and learning statistics. Sometimes, the articles have practical relevance, too. Andersson (2023) "The Wald Confidence Interval for a Binomial p as an Illuminating 'Bad' Example," is intended for professors and masters-level students in
A journal article listed the mean, median, and size for subgroups of the data, but did not report the overall mean or median. A SAS programmer wondered what, if any, inferences could be made about the overall mean and median for the data. The answer is that you can calculate
The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library
The moments of a continuous probability distribution are often used to describe the shape of the probability density function (PDF). The first four moments (if they exist) are well known because they correspond to familiar descriptive statistics: The first raw moment is the mean of a distribution. For a random
Many modern statistical techniques incorporate randomness: simulation, bootstrapping, random forests, and so forth. To use the technique, you need to specify a seed value, which determines pseudorandom numbers that are used in the algorithm. Consequently, the seed value also determines the results of the algorithm. In theory, if you know
Here's a fun problem to think about: Suppose that you have two different valid ways to test a statistical hypothesis. For a given sample, will both tests reject or fail to reject the hypothesis? Or might one test reject it whereas the other does not? The answer is that two
Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the
A reader asked whether it is possible to find a bootstrap sample that has some desirable properties. I am using the term "bootstrap sample" to refer to the result of randomly resampling with replacement from a data set. Specifically, he wanted to find a bootstrap sample that has a specific
People love rankings. You've probably seen articles about the best places to live, the best colleges to attend, the best pizza to order, and so on. Each of these is an example of a ranking that is based on multiple characteristics. For example, a list of the best places to
This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is
The skewness of a distribution indicates whether a distribution is symmetric or not. The Wikipedia article about skewness discusses two common definitions for the sample skewness, including the definition used by SAS. In the middle of the article, you will discover the following sentence: In general, the [estimators] are both
A previous article about standardizing data in groups shows how to simulate data from two groups. One sample (with n1=20 observations) is simulated from an N(15, 5) distribution whereas a second (with n2=30 observations) is simulated from an N(16, 5) distribution. The sample means of the two groups are close
Testing people for coronavirus is a public health measure that reduces the spread of coronavirus. Dr. Anthony Fauci, a US infectious disease expert, recently mentioned the concept of "pool testing." The verb "to pool" means "to combine from different sources." In a USA Today article, Dr. Deborah Birx, the coordinator
The first time I saw a formula for the pooled variance, I was quite confused. It looked like Frankenstein's monster, assembled from bits and pieces of other quantities and brought to life by a madman. However, the pooled variance does not have to be a confusing monstrosity. The verb "to
Every day we face risks. If we drive to work, we risk a fatal auto accident. If we eat red meat and fatty foods, we risk a heart attack. If we go out in public during a pandemic, we risk contracting a disease. A logical response to risk is to
During an outbreak of a disease, such as the coronavirus (COVID-19) pandemic, the media shows daily graphs that convey the spread of the disease. The following two graphs appear frequently: New cases for each day (or week). This information is usually shown as a histogram or needle plot. The graph
Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a
The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is
The purpose of this article is to show how to use SAS to create a graph that illustrates a basic idea in a binary classification analysis, such as discriminant analysis and logistic regression. The graph, shown at right, shows two populations. Subjects in the "negative" population do not have some
In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to
In my book Simulating Data with SAS, I show how to use a graphical tool, called the moment-ratio diagram, to characterize and compare continuous probability distributions based on their skewness and kurtosis (Wicklin, 2013, Chapter 16). The idea behind the moment-ratio diagram is that skewness and kurtosis are essential for