SAS supports many ways to compute the rank of a numeric variable and to handle tied values. However, sometimes I need to rank the values in a character categorical variable. For example, the values {"Male", "Female", "Male"} have ranks {2, 1, 2} because, in alphabetical order, "Female" is the first-ranked

# Author

A previous article defines the silhouette statistic (Rousseeuw, 1987) and shows how to use it to identify observations in a cluster analysis that are potentially misclassified. The article provides many graphs, including the silhouette plot, which is a bar chart or histogram that displays the distribution of the silhouette statistic

Assigning observations into clusters can be challenging. One challenge is deciding how many clusters are in the data. Another is identifying which observations are potentially misclassified because they are on the boundary between two different clusters. Ralph Abbey's 2019 paper ("How to Evaluate Different Clustering Results") is a good way

A lot of programmers have been impressed by the ability of ChatGPT, GPT-4, and Bing Chat to write computer programs. Recently, I wrote an article that discusses an elementary programming assignment, called FizzBuzz, which is sometimes used as part of a hiring process to assess a candidate's basic knowledge of

Recently, I learned about an elementary programming assignment called the FizzBuzz program. Some companies use this assignment for the first round of interviews with potential programmers. A competent programmer can write FizzBuzz in 5-10 minutes, which leaves plenty of time to discuss other topics. If an applicant can't complete the

In SAS, you can approximate the exponential of a matrix by using the EXPMATRIX function in SAS IML software. This article discusses the exponential of a matrix: what it is, how to compute it, why it is useful, and why you should think of it as a linear map that

In a previous article, I showed how to overlay a density estimate on a histogram by using the Graph Template Language (GTL). However, a SAS programmer asked how to overlay a curve on a histogram when the curve is not a density estimate. In this case, the vertical axis for

When the SAS statistical graphics (SG) procedures were designed in the early 2000s, a goal was to create a comprehensive Graph Template Language (GTL) and leverage the GTL by using SG procedures that perform common tasks easily without having to write any GTL. This project was hugely successful, and "ODS

A previous article discusses how to compute the union, intersection, and other subsets of a pair of sets. In that article, I displayed a simple Venn diagram (reproduced to the right) that illustrates the intersection and difference between two sets. The diagram uses a red disk for one set, a

The fundamental operations on sets are union, intersection, and set difference, all of which are supported directly in the SAS IML language. While studying another programming language, I noticed that the language supports an additional operation, namely the symmetric difference between two sets. The language also supports query functions to

The "Teacherâ€™s Corner" of The American Statistician enables statisticians to discuss topics that are relevant to teaching and learning statistics. Sometimes, the articles have practical relevance, too. Andersson (2023) "The Wald Confidence Interval for a Binomial p as an Illuminating 'Bad' Example," is intended for professors and masters-level students in

A journal article listed the mean, median, and size for subgroups of the data, but did not report the overall mean or median. A SAS programmer wondered what, if any, inferences could be made about the overall mean and median for the data. The answer is that you can calculate

A SAS user asked how to interpret a rank-based correlation such as a Spearman correlation or a Kendall correlation. These are alternative measures to the usual Pearson product-moment correlation, which is widely used. The programmer knew that words like "weak," "moderate," and "strong" are sometimes used to describe the Pearson

A previous article discusses rank correlation and lists some advantages of using rank correlation. However, the article does not show examples where an analyst might prefer to report the rank correlation instead of the traditional Pearson product-moment correlation. This article provides three examples where the rank correlation is a better

I recently discussed introductory programming with a colleague who teaches Python at a university. He told me about the following introductory programming assignment: Let N be an integer parameter in the range [1, 9]. For each value of N, find all pairs of one-digit positive integers d1 and d2 that

A previous article discusses the issue of a confounding variable and uses correlation to give an example. The example shows that the correlation between two variables might be affected by a third variable, which is called a confounding variable. The article mentions that you can use the PARTIAL statement in

A data analyst wanted to estimate the correlation between two variables, but he was concerned about the influence of a confounding variable that is correlated with them. The correlation might affect the apparent relationship between main two variables in the study. A common confounding variable is age because young people

In a previous article about Markov transition matrices, I mentioned that you can estimate a Markov transition matrix by using historical data that are collected over a certain length of time. A SAS programmer asked how you can estimate a transition matrix in SAS. The answer is that you can

Most homeowners know that large home improvement projects can take longer than you expect. Whether it's remodeling a kitchen, adding a deck, or landscaping a yard, big projects are expensive and subject to a lot of uncertainty. Factors such as weather, the availability of labor, and the supply of materials,

A previous article describes the metalog distribution (Keelin, 2016). The metalog distribution is a flexible family of distributions that can model a wide range of shapes for data distributions. The metalog system can model bounded, semibounded, and unbounded continuous distributions. This article shows how to use the metalog distribution in

Happy Pi Day! Every year on March 14th (written 3/14 in the US), people in the mathematical sciences celebrate "all things pi-related" because 3.14 is the three-decimal approximation to π ≈ 3.14159265358979.... Modern computer methods and algorithms enable us to calculate 100 trillion digits of π. However, I think it

Undergraduate textbooks on probability and statistics typically prove theorems that show how the variance of a sum of random variables is related to the variance of the original variables and the covariance between them. For example, the Wikipedia article on Variance contains an equation for the sum of two random

A SAS programmer wanted to compute the distribution of X-Y, where X and Y are two beta-distributed random variables. Pham-Gia and Turkkan (1993) derive a formula for the PDF of this distribution. Unfortunately, the PDF involves evaluating a two-dimensional generalized hypergeometric function, which is not available in all programming languages.

In applied mathematics, there is a large collection of "special functions." These function appera in certain applications, often as the solution to a differential equation, but also in the definition of probability distributions. For example, I have written about Bessel functions, the complete and incomplete gamma function, and the complete

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library

A SAS programmer asked for help to simulate data from a distribution that has certain properties. The distribution must be supported on the interval [a, b] and have a specified mean, μ, where a < μ < b. It turns out that there are infinitely many distributions that satisfy these

You can use a Markov transition matrix to model the transition of an entity between a set of discrete states. A transition matrix is also called a stochastic matrix. A previous article describes how to use transition matrices for stochastic modeling. You can estimate a Markov transition matrix by using

SAS programmers love to make special graphs for Valentine's Day. In fact, there is a long history of heart-shaped graphs and love-inspired programs written in SAS! Last year, I added to the collection by showing how a ball bounces on a heart-shaped billiards table. This year, I create a similar

This article is about how to use Git to share SAS programs, specifically how to share libraries of SAS IML functions. Some IML programmers might remember an earlier way to share libraries of functions: SAS/IML released "packages" in SAS 9.4m3 (2015), which enable you to create, document, share, and use

SAS supports the ColorBrewer system of color palettes from the ColorBrewer website (Brewer and Harrower, 2002). The ColorBrewer color ramps are available in SAS by using the PALETTE function in SAS IML software. The PALETTE function supports all ColorBrewer palettes, but some palettes are not interpretable by people with color