In SAS, the reserved keyword _NULL_ specifies a SAS data set that has no observations and no variables. When you specify _NULL_ as the name of an output data set, the output is not written. The _NULL_ data set is often used when you want to execute DATA step code
I recently recorded a short video about the new syntax for specifying and manipulating lists in SAS/IML 14.3. This is a video of my Super Demo at SAS Global Forum 2018. The new syntax supports dynamic arrays, associative arrays ("named lists"), and hierarchical data structures such as lists of lists.
The SURVEYSELECT procedure in SAS 9.4M5 supports the OUTRANDOM option, which causes the selected items in a simple random sample to be randomly permuted after they are selected. This article describes several statistical tasks that benefit from this option, including simulating card games, randomly permuting observations in a DATA step,
Years ago, I wrote an article about how to create a Top 10 table and bar chart. The program can be trivially modified to create a "Top N" table and plot, such as Top 5, Top 20, or even Top 100. Not long after the article was written, the developer
A previous article showed how to use a calibration plot to visualize the goodness-of-fit for a logistic regression model. It is common to overlay a scatter plot of the binary response on a predicted probability plot (below, left) and on a calibration plot (below, right): The SAS program that creates
The SAS language provides syntax that enables you to quickly specify a list of variables. SAS statements that accept variable lists include the KEEP and DROP statements, the ARRAY statement, and the OF operator for comma-separated arguments to some functions. You can also use variable lists on the VAR statements
This article shows how to construct a butterfly plot in SAS. A butterfly plot (also called a butterfly chart) is a comparative bar chart or histogram that displays the distribution of a variable for two subpopulations. A butterfly plot for the cholesterol readings of 5,057 patients in a medical study
In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of
A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is
In a previous blog post, I discussed ways to produce statistically independent samples from a random number generator (RNG). The best way is to generate all samples from one stream. However, if your program uses two or more SAS DATA steps to simulate the data, you cannot use the same
Simulation studies require both randomness and reproducibility, two qualities that are sometimes at odds with each other. A Monte Carlo simulation might need to generate millions of random samples, where each sample contains dozens of continuous variables and many thousands of observations. In simulation studies, the researcher wants each sample
Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other. Alphabetical order
Some say that opposites attract. Others say that birds of a feather flock together. Which is it? Phillip N. Cohen, a professor of sociology at the University of Maryland, recently posted an interesting visualization that indicates that married couples who are college graduates tend to be birds of a feather.
SAS programmers on SAS discussion forums sometimes ask how to run thousands of regressions of the form Y = B0 + B1*X_i, where i=1,2,.... A similar question asks how to solve thousands of regressions of the form Y_i = B0 + B1*X for thousands of response variables. I have previously
You've probably heard about the "80-20 Rule," which describes many natural and manmade phenomena. This rule is sometimes called the "Pareto Principle" because it was discovered by Vilfredo Pareto (1848–1923) who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held
The sweep operator performs elementary row operations on a system of linear equations. The sweep operator enables you to build regression models by "sweeping in" or "sweeping out" particular rows of the X`X matrix. As you do so, the estimates for the regression coefficients, the error sum of squares, and
A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the
Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix. The main idea is to subtract one row from another. Start with the first row and subtract it
When we breathe, we breathe in and breathe out. If we choose only one or the other, the results are disastrous. The same principle applies to professional growth and development. Whether we are programmers, statisticians, teachers, students, or writers, we benefit from taking in and giving back. We "take in"
Correlation is a statistic that measures how closely two variables are related to each other. The most popular definition of correlation is the Pearson product-moment correlation, which is a measurement of the linear relationship between two variables. Many textbooks stress the linear nature of the Pearson correlation and emphasize that
As a general rule, when SAS programmers want to manipulate data row by row, they reach for the SAS DATA step. When the computation requires column statistics, the SQL procedure is also useful. When both row and column operations are required, the SAS/IML language is a powerful addition to a
Suppose you want to find observations in multivariate data that are closest to a numerical target value. For example, for the students in the Sashelp.Class data set, you might want to find the students whose (Age, Height, Weight) values are closest to the triplet (13, 62, 100). The way to
Simulation studies are used for many purposes, one of which is to examine how distributional assumptions affect the coverage probability of a confidence interval. This article describes the "zipper plot," which enables you to compare the coverage probability of a confidence interval when the data do or do not follow
I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the
About once a month I see a question on the SAS Support Communities that involves what I like to call "computations with combinations." A typical question asks how to find k values (from a set of p values) that maximize or minimize some function, such as "I have 5 variables,
One of my favorite magazines, Significance, printed an intriguing image of a symmetric matrix that shows repetition in a song's lyrics. The image was created by Colin Morris, who has created many similar images. When I saw these images, I knew that I wanted to duplicate the analysis in SAS!
Welcome to my annual Pi Day post. Every year on March 14th (written 3/14 in the US), geeky mathematicians and their friends celebrate "all things pi-related" because 3.14 is the three-decimal approximation to pi. Pi is a mathematical constant that never changes. Pi is the same value today as it
Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution
Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a