Welcome to my annual Pi Day post. Every year on March 14th (written 3/14 in the US), geeky mathematicians and their friends celebrate "all things pi-related" because 3.14 is the three-decimal approximation to pi. Pi is a mathematical constant that never changes. Pi is the same value today as it
Author
Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution
Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a
This article shows how to use SAS to solve a system of nonlinear equations. When there are n unknowns and n equations, this problem is equivalent to finding a multivariate root of a vector-valued function F(x) = 0 because you can always write the system as f1(x1, x2, ..., xn)
My article about the difference between CLASS variables and BY variables in SAS focused on SAS analytical procedures. However, the BY statement is also useful in the SAS DATA step where it is used to merge data sets and to analyze data at the group level. When you use the
This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.
Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to
When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements. A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS
Did you know that SAS can combine or "merge" a symbol and a line pattern into a single legend item, as shown below? This kind of legend is useful when you are overlaying a group of curves onto a scatter plot. It enables the reader to quickly associate values of
If N random people are in a room, the classical birthday problem provides the probability that at least two people share a birthday. The birthday problem does not consider how many birthdays are in common. However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of
This article simulates the birthday-matching problem in SAS. The birthday-matching problem (also called the birthday problem or birthday paradox) answers the following question: "if there are N people in a room, what is the probability that at least two people share a birthday?" The birthday problem is famous because the
This article shows how to construct a "stacked band plot" in SAS, as shown to the right. (Click to enlarge.) You are probably familiar with a stacked bar chart in which the cumulative amount of some quantity is displayed by stacking the contributions of several groups. A canonical example is
What is a random number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT function to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG,
A popular way to use lists in the SAS/IML language is to pack together several related matrices into a single data structure that can be passed to a function. Imagine that you have written an algorithm that requires a dozen different parameters. Historically, you would have to pass those parameters
SAS/IML 14.3 (SAS 9.4M5) introduced a new syntax for creating lists and for assigning and extracting item in a list. Lists (introduced in SAS/IML 14.2) are data structures that are convenient for holding heterogeneous data. A single list can hold character matrices, numeric matrices, scalar values, and other lists, as
Money magazine (Jan/Feb 2018) contains an article about how much it costs to give birth in the US. The costs, which are based on insurance data, include prenatal care and hospital delivery but exclude infant care. The data are compiled for each state (including Washington, DC) and by type of
Last week I got the following message: Dear Rick: How can I create a normal distribution within a specified range (min and max)? I need to simulate a normal distribution that fits within a specified range. I realize that a normal distribution is by definition infinite... Are there any alternatives,
Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not
A SAS programmer asked how to label multiple regression lines that are overlaid on a single scatter plot. Specifically, he asked to label the curves that are produced by using the REG statement with the GROUP= option in PROC SGPLOT. He wanted the labels to be the slope and intercept
I wrote more than 100 posts for The DO Loop blog in 2017. The most popular articles were about SAS programming tips, statistical data analysis, and simulation and bootstrap methods. Here are the most popular articles from 2017 in each category. General SAS programming techniques INTCK and INTNX: Do you
I previously showed an easy way to visualize a regression model that has several continuous explanatory variables: use the SLICEFIT option in the EFFECTPLOT statement in SAS to create a sliced fit plot. The EFFECTPLOT statement is directly supported by the syntax of the GENMOD, LOGISTIC, and ORTHOREG procedures in
Slice, slice, baby! You've got to slice, slice, baby! When you fit a regression model that has multiple explanatory variables, it is a challenge to effectively visualize the predicted values. This article describes how to visualize the regression model by slicing the explanatory variables. In SAS, you can use the
The SAS language is large. Even after 20+ years of using SAS, there are many features that I have never used. Recently it became necessary for me to learn about DICTIONARY tables in PROC SQL (and the associated SASHELP views) because I needed to programmatically obtain the text for the
Happy holidays to all my readers! My greeting-card to you is an image of a self-similar Christmas tree. The image (click to enlarge) was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics. Self-similarity in Kronecker products I have previously shown
In a previous article, I showed how to use SAS to perform mean imputation. However, there are three problems with using mean-imputed variables in statistical analyses: Mean imputation reduces the variance of the imputed variables. Mean imputation shrinks standard errors, which invalidates most hypothesis tests and the calculation of confidence
Imputing missing data is the act of replacing missing data by nonmissing values. Mean imputation replaces missing data in a numerical variable by the mean value of the nonmissing values. This article shows how to perform mean imputation in SAS. It also presents three statistical drawbacks of mean imputation. How
Missing values present challenges for the statistical analyst and data scientist. Many modeling techniques (such as regression) exclude observations that contain missing values, which can reduce the sample size and reduce the power of a statistical analysis. Before you try to deal with missing values in an analysis (for example,
When you run an optimization, it is often not clear how to provide the optimization algorithm with an initial guess for the parameters. A good guess converges quickly to the optimal solution whereas a bad guess might diverge or require many iterations to converge. Many people use a default value
A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article
This article shows how to simulate beta-binomial data in SAS and how to compute the density function (PDF). The beta-binomial distribution is a discrete compound distribution. The "binomial" part of the name means that the discrete random variable X follows a binomial distribution with parameters N (number of trials) and