A colleague recently posted an article about how to use SAS Visual Analytics to create a circular graph that displays a year's worth of temperature data. Specifically, the graph shows the air temperature for each day in a year relative to some baseline temperature, such as 65F (18C). Days warmer
The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is
The purpose of this article is to show how to use SAS to create a graph that illustrates a basic idea in a binary classification analysis, such as discriminant analysis and logistic regression. The graph, shown at right, shows two populations. Subjects in the "negative" population do not have some
Are you a statistical programmer whose company has adopted SAS Viya? If so, you probably know that the DATA step can run in parallel in SAS Cloud Analytic Services (CAS). As Sekosky (2017) says, "running in a single thread in SAS is different from running in many threads in CAS."
A previous article shows how to interpret the collinearity diagnostics that are produced by PROC REG in SAS. The process involves scanning down numbers in a table in order to find extreme values. This can be a tedious and error-prone process. Friendly and Kwan (2009) compare this task to a
The Johnson system (Johnson, 1949) contains a family of four distributions: the normal distribution, the lognormal distribution, the SB distribution, and the SU distribution. Previous articles explain why the Johnson system is useful and show how to use PROC UNIVARIATE in SAS to estimate parameters for the Johnson SB distribution
You can represent every number as a nearby integer plus a decimal. For example, 1.3 = 1 + 0.3. The integer is called the integer part of x, whereas the decimal is called the fractional part of x (or sometimes the decimal part of x). This representation is not unique.
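To illustrate why the representation is not unique, here is a minimal Python sketch (not the article's own code, which is presumably in SAS). The helper names `split_floor` and `split_trunc` are hypothetical. They split a number by using two common conventions: the integer at or below x, and the integer toward zero. The two conventions agree for positive numbers but differ for negative ones.

```python
import math

def split_floor(x):
    """Split x as n + f, where n = floor(x), so 0 <= f < 1."""
    n = math.floor(x)
    return n, x - n

def split_trunc(x):
    """Split x as n + f, where n is x truncated toward zero, so f has the sign of x."""
    n = math.trunc(x)
    return n, x - n

# For a positive number, both conventions give the same integer part.
print(split_floor(1.3), split_trunc(1.3))

# For a negative number, they differ: -1.3 = -2 + 0.7 and -1.3 = -1 + (-0.3).
print(split_floor(-1.3), split_trunc(-1.3))
```

Because of floating-point arithmetic, the printed fractional parts are only approximately 0.3, 0.7, and -0.3.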
A SAS programmer wanted to create a graph that illustrates how Deming regression differs from ordinary least squares regression. The main idea is shown in the panel of graphs below. The first graph shows the geometry of least squares regression when we regress Y onto X. ("Regress Y onto X"
Recently someone on social media asked, "how can I compute the required sample size for a binomial test?" I assume from the question that the researcher was designing an experiment to test the proportions between two groups, such as a control group and a treatment/intervention group. They wanted to know
In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to
The Johnson system (Johnson, 1949) contains a family of four distributions: the normal distribution, the lognormal distribution, the SB distribution (which models bounded distributions), and the SU distribution (which models unbounded distributions). Note that 'B' stands for 'bounded' and 'U' stands for 'unbounded.' A previous article explains the purpose of
I was recently asked about how to interpret the output from the COLLIN (or COLLINOINT) option on the MODEL statement in PROC REG in SAS. The example in the documentation for PROC REG is correct but is somewhat terse regarding how to use the output to diagnose collinearity and how
From the early days of probability and statistics, researchers have tried to organize and categorize parametric probability distributions. For example, Pearson (1895, 1901, and 1916) developed a system of seven distributions, which was later called the Pearson system. The main idea behind a "system" of distributions is that for each
In my book Simulating Data with SAS, I show how to use a graphical tool, called the moment-ratio diagram, to characterize and compare continuous probability distributions based on their skewness and kurtosis (Wicklin, 2013, Chapter 16). The idea behind the moment-ratio diagram is that skewness and kurtosis are essential for
Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal. Advanced topics
Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable
Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category. SAS programming tips: Create training, testing, and validation data
A 2-D "bin plot" counts the number of observations in each cell in a regular 2-D grid. The 2-D bin plot is essentially a 2-D version of a histogram: it provides an estimate for the density of a 2-D distribution. As I discuss in the article, "The essential guide to
"Rockin' around the Christmas tree / At the Christmas party hop." – Brenda Lee. Last Christmas, I saw a fun blog post that used optimization methods to de-noise an image of a Christmas tree. Although there are specialized algorithms that remove random noise from an image, I am not going to
Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of an animal are
Recently I showed how to visualize and analyze longitudinal data in which subjects are measured at multiple time points. A very common situation is that the data are collected at two time points. For example, in medicine it is very common to measure some quantity (blood pressure, cholesterol, white-blood cell
This is a second article about analyzing longitudinal data, which features measurements that are repeatedly taken on subjects at several points in time. The previous article discusses a response-profile analysis, which uses an ANOVA method to determine differences between the means of an experimental group and a placebo group. The
Longitudinal data are used in many health-related studies in which individuals are measured at multiple points in time to monitor changes in a response variable, such as weight, cholesterol, or blood pressure. There are many excellent articles and books that describe the advantages of a mixed model for analyzing longitudinal
This article discusses how to restrict a multivariate function to a linear subspace. This is a useful technique in many situations, including visualizing an objective function that is constrained by linear equalities. For example, the graph to the right is from a previous article about how to evaluate quadratic polynomials.
What is an efficient way to evaluate a multivariate quadratic polynomial in p variables? The answer is to use matrix computations! A multivariate quadratic polynomial can be written as the sum of a purely quadratic term (degree 2), a purely linear term (degree 1), and a constant term (degree 0).
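The decomposition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's SAS code: it assumes the form f(x) = x'Ax + b'x + c with no 1/2 factor on the quadratic term (the article may use a different convention), and the function name `quad_eval` is hypothetical. Plain lists keep the sketch self-contained; in practice a matrix library would do the products.

```python
def quad_eval(A, b, c, x):
    """Evaluate f(x) = x'Ax + b'x + c for a p x p matrix A, p-vector b, scalar c."""
    # Matrix-vector product A*x
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    quadratic = sum(x_i * ax_i for x_i, ax_i in zip(x, Ax))  # x'Ax (degree 2)
    linear = sum(b_i * x_i for b_i, x_i in zip(b, x))        # b'x  (degree 1)
    return quadratic + linear + c                            # plus constant (degree 0)

# Example: f(x1, x2) = x1^2 + x1*x2 + 2*x2^2 - x1 + 3*x2 + 4
A = [[1.0, 0.5],
     [0.5, 2.0]]   # symmetric; each off-diagonal entry carries half the x1*x2 coefficient
b = [-1.0, 3.0]
c = 4.0
print(quad_eval(A, b, c, [2.0, 1.0]))   # 4 + 2 + 2 - 2 + 3 + 4 = 13.0
```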
In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data. However, for generalized linear models, there is a potential source of confusion. Recall that a
My colleague, Mike Drutar, recently showed how to create a "strip plot" that shows the distribution of temperatures for each calendar month at a particular location. Mike created the strip plot in SAS Visual Analytics by using a point-and-click interface. This article shows how to create a similar graph by
Biplots are two-dimensional plots that help to visualize relationships in high-dimensional data. A previous article discusses how to interpret biplots for continuous variables. The biplot projects observations and variables onto the span of the first two principal components. The observations are plotted as markers; the variables are plotted as
In grade school, students learn how to round numbers to the nearest integer. In later years, students learn variations, such as rounding up and rounding down by using the greatest integer function and least integer function, respectively. My sister, who is an engineer, learned a rounding method that rounds half-integers
Principal component analysis (PCA) is an important tool for understanding relationships in continuous multivariate data. When the first two principal components (PCs) explain a significant portion of the variance in the data, you can visualize the data by projecting the observations onto the span of the first two PCs. In