A previous article shows an example of a Markov chain model and computes the probability that the system ends up in a terminal state (called an absorbing state). As explained previously, you can often compute exact probabilities for questions about Markov chains. Nevertheless, it can be useful to know how
A previous article shows how to model the probabilities in a discrete-time Markov chain by using a Markov transition matrix. A Markov chain is a discrete-time stochastic process for which the current state of the system determines the probability of the next state. In this process, the probabilities for transitioning
Given a set of N points in k-dimensional space, can you find the location that minimizes the sum of the distances to the points? The location that minimizes the distances is called the geometric median of the points. For univariate data, the "points" are merely a set of numbers $${p_1,
While writing an article about labeling a polygon by using the centroid, I almost made a false claim about the centroid. I almost claimed that the centroid is the point in a polygon that minimizes the sum of the distances to the vertices. It is not. The point that
A colleague asked how to compute the barycentric coordinates of a point inside a triangle. Given a triangle in the plane with vertices p1, p2, and p3, every point in the triangle can be represented as a convex combination of the vertices: c1*p1 + c2*p2 + c3*p3, where c1,c2,c3 ≥
Part of the power of the SAS ODS system is the ability to visualize data by using ODS templates. An ODS template describes how to render data as a table or as a graph. A lot of papers and documentation have been written about how to define a custom template
While writing an article about Toeplitz matrices, I saw an interesting fact about the eigenvalues of tridiagonal Toeplitz matrices on Nick Higham's blog. Recall that a Toeplitz matrix is a banded matrix that is constant along each diagonal. A tridiagonal Toeplitz matrix is zero except for the main diagonal, the
A Toeplitz matrix is a banded matrix. You can construct it by specifying the parameters that are constant along each diagonal, including sub- and super-diagonals. For a square N x N matrix, there is one main diagonal, N-1 sub-diagonals, and N-1 super-diagonals, for a total of 2N-1 parameters. In statistics and applied
A previous article explains the Spearman rank correlation, which is a robust cousin to the more familiar Pearson correlation. I've also discussed why you might want to use rank correlation, and how to interpret the strength of a rank correlation. This article gives a short example that helps you to
Since the COVID-19 pandemic began, video presentations and webcasts have become a regular routine for many of us. On days that I will be using my webcam, I wear a solid-color shirt. If I don't plan to be on camera, I can wear a pinstripe Oxford shirt. Why the difference?
Real-world data often exhibit extreme skewness. It is not unusual to have data span many orders of magnitude. Classic examples are the distributions of incomes (impoverished and billionaires) and population sizes (small countries and populous nations). The readership of books and blog posts shows a similar distribution, which is sometimes
Labeling objects in graphs can be difficult. SAS has a long history of providing support for labeling markers in scatter plots and for labeling regions on a map. This article discusses how the SGPLOT procedure decides where to put a label for a polygon. It discusses the advantages and disadvantages
SAS supports many ways to compute the rank of a numeric variable and to handle tied values. However, sometimes I need to rank the values in a character categorical variable. For example, the values {"Male", "Female", "Male"} have ranks {2, 1, 2} because, in alphabetical order, "Female" is the first-ranked
A previous article defines the silhouette statistic (Rousseeuw, 1987) and shows how to use it to identify observations in a cluster analysis that are potentially misclassified. The article provides many graphs, including the silhouette plot, which is a bar chart or histogram that displays the distribution of the silhouette statistic
Assigning observations into clusters can be challenging. One challenge is deciding how many clusters are in the data. Another is identifying which observations are potentially misclassified because they are on the boundary between two different clusters. Ralph Abbey's 2019 paper ("How to Evaluate Different Clustering Results") is a good way
A lot of programmers have been impressed by the ability of ChatGPT, GPT-4, and Bing Chat to write computer programs. Recently, I wrote an article that discusses an elementary programming assignment, called FizzBuzz, which is sometimes used as part of a hiring process to assess a candidate's basic knowledge of
Recently, I learned about an elementary programming assignment called the FizzBuzz program. Some companies use this assignment for the first round of interviews with potential programmers. A competent programmer can write FizzBuzz in 5-10 minutes, which leaves plenty of time to discuss other topics. If an applicant can't complete the
In SAS, you can approximate the exponential of a matrix by using the EXPMATRIX function in SAS IML software. This article discusses the exponential of a matrix: what it is, how to compute it, why it is useful, and why you should think of it as a linear map that
In a previous article, I showed how to overlay a density estimate on a histogram by using the Graph Template Language (GTL). However, a SAS programmer asked how to overlay a curve on a histogram when the curve is not a density estimate. In this case, the vertical axis for
When the SAS statistical graphics (SG) procedures were designed in the early 2000s, a goal was to create a comprehensive Graph Template Language (GTL) and leverage the GTL by using SG procedures that perform common tasks easily without having to write any GTL. This project was hugely successful, and "ODS
A previous article discusses how to compute the union, intersection, and other subsets of a pair of sets. In that article, I displayed a simple Venn diagram (reproduced to the right) that illustrates the intersection and difference between two sets. The diagram uses a red disk for one set, a
The fundamental operations on sets are union, intersection, and set difference, all of which are supported directly in the SAS IML language. While studying another programming language, I noticed that the language supports an additional operation, namely the symmetric difference between two sets. The language also supports query functions to
The "Teacher’s Corner" of The American Statistician enables statisticians to discuss topics that are relevant to teaching and learning statistics. Sometimes, the articles have practical relevance, too. Andersson (2023), "The Wald Confidence Interval for a Binomial p as an Illuminating 'Bad' Example," is intended for professors and masters-level students in
A journal article listed the mean, median, and size for subgroups of the data, but did not report the overall mean or median. A SAS programmer wondered what, if any, inferences could be made about the overall mean and median for the data. The answer is that you can calculate
A SAS user asked how to interpret a rank-based correlation such as a Spearman correlation or a Kendall correlation. These are alternative measures to the usual Pearson product-moment correlation, which is widely used. The programmer knew that words like "weak," "moderate," and "strong" are sometimes used to describe the Pearson
A previous article discusses rank correlation and lists some advantages of using rank correlation. However, the article does not show examples where an analyst might prefer to report the rank correlation instead of the traditional Pearson product-moment correlation. This article provides three examples where the rank correlation is a better
I recently discussed introductory programming with a colleague who teaches Python at a university. He told me about the following introductory programming assignment: Let N be an integer parameter in the range [1, 9]. For each value of N, find all pairs of one-digit positive integers d1 and d2 that
A previous article discusses the issue of a confounding variable and uses correlation to give an example. The example shows that the correlation between two variables might be affected by a third variable, which is called a confounding variable. The article mentions that you can use the PARTIAL statement in
A data analyst wanted to estimate the correlation between two variables, but he was concerned about the influence of a confounding variable that is correlated with them. The correlation might affect the apparent relationship between the two main variables in the study. A common confounding variable is age because young people
In a previous article about Markov transition matrices, I mentioned that you can estimate a Markov transition matrix by using historical data that are collected over a certain length of time. A SAS programmer asked how you can estimate a transition matrix in SAS. The answer is that you can