On Kaiser Fung's Junk Charts blog, he showed a bar chart that was "published by Teach for America, touting its diversity." Kaiser objected to the chart because the bar lengths did not accurately depict the proportions of the Teach for America corps members. The chart bothers me for another reason:

## Tag: **Data Analysis**

In my last blog post I described how to implement a "runs test" in the SAS/IML language. The runs test determines whether a sequence of two values (for example, heads and tails) is likely to have been generated by random chance. This article describes two applications of the runs test.

While walking in the woods, a statistician named Goldilocks wanders into a cottage and discovers three bears. The bears, being hungry, threaten to eat the young lady, but Goldilocks begs them to give her a chance to win her freedom. The bears agree. While Mama Bear and Papa Bear block

"It's a floor wax, and a dessert topping" - this pretty much describes SAS/Graph! (bonus points if you know where this quote came from!) Some people think of SAS as just a quality control tool. Others think of it as just a sales & marketing tool. And yet others think

SAS has supported calling R from the SAS/IML language since 2009. The interface to R is part of the SAS/IML language. However, there have been so many versions of SAS and R since 2009, that it is hard to remember which SAS release supports which versions of R. The following

A common visualization is to compare characteristics of two groups. This article emphasizes two tips that will help make the comparison clear. First, consider graphing the differences between the groups. Second, in any plot that has a categorical axis, sort the categories by a meaningful quantity. This article is motivated

Earlier this week I posted a "guest blog" in which my 8th grade son described a visualization of data for the 2013 ASA Poster Competition. The purpose of today's blog post is to present a higher-level statistical analysis of the same data. I will use a t test and a

Editor's Note: My 8th grade son, David, created a poster that he submitted to the 2013 ASA Poster Competition. The competition encourages students to display "two or more related graphics that summarize a set of data, look at the data from different points of view, and answer specific questions about

Do you have dozens (or even hundreds) of SAS data sets that you want to read into SAS/IML matrices? In a previous blog post, I showed how to iterate over a series of data sets and analyze each one. Inside the loop, I read each data set into a matrix

In a previous article I discussed how to bin univariate observations by using the BIN function, which was added to the SAS/IML language in SAS/IML 9.3. You can generalize that example and bin bivariate or multivariate data. Over two years ago I wrote a blog post on 2D binning in

It is often useful to partition observations for a continuous variable into a small number of intervals, called bins. This familiar process occurs every time that you create a histogram, such as the one on the left. In SAS you can create this histogram by calling the UNIVARIATE procedure. Optionally,

“How can we begin to make sense of the unstructured data, when we still don’t make the most of our structured data?” said the exasperated senior manager from a large retail firm. One of the great pleasures of my job is the relationship with students that continues after class has

The CLUSTER procedure in SAS/STAT software creates a dendrogram automatically. The black-and-white dendrogram is nice, but plain. A SAS customer wanted to know whether it is possible to add color to the dendrogram to emphasize certain clusters. For example, the plot at the left emphasizes a four-cluster scenario for clustering

A regular reader noticed my post on initializing vectors by using repetition factors and asked whether that technique would be useful to expand data that are given in value-frequency pairs. The short answer is "no." Repetition factors are useful for defining (static) matrix literals. However, if you want to expand

In a previous blog post, I described how to use a spread plot to compare the distributions of several variables. Each spread plot is a graph of centered data values plotted against the estimated cumulative probability. Thus, spread plots are similar to a (rotated) plot of the empirical cumulative distribution

Suppose that you have several data distributions that you want to compare. Questions you might ask include "Which variable has the largest spread?" and "Which variables exhibit skewness?" More generally, you might be interested in visualizing how the distribution of one variable differs from the distribution of other variables. The

Has anyone noticed that the REG procedure in SAS/STAT 12.1 produces heat maps instead of scatter plots for fit plots and residual plots when the regression involves more than 5,000 observations? I wasn't aware of the change until a colleague informed me, although the change is discussed in the "Details"

I've conducted a lot of univariate analyses in SAS, yet I'm always surprised when the best way to carry out the analysis uses a SAS regression procedure. I always think, "This is a univariate analysis! Why am I using a regression procedure? Doesn't a regression require at least two variables?"

At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article

PROC UNIVARIATE has provided confidence intervals for standard percentiles (quartiles) for eons. However, in SAS 9.3M2 (featuring the 12.1 analytical procedures) you can use a new feature in PROC UNIVARIATE to compute confidence intervals for a specified list of percentiles. To be clear, percentiles and quantiles are essentially the same

I often see variations of the following question posted on statistical discussion forums: I want to bin the X variable into a small number of values. For each bin, I want to draw the quartiles of the Y variable for that bin. Then I want to connect the corresponding quartile

I was recently asked how to compute the difference between two density estimates in SAS. The person who asked the question sent me a link to a paper from The Review of Economics and Statistics that contains several examples of this technique (for example, see Figure 3 on p. 16

In statistics, distances between observations are used to form clusters, to identify outliers, and to estimate distributions. Distances are used in spatial statistics and in other application areas. There are many ways to define the distance between observations. I have previously written an article that explains Mahalanobis distance, which is

A big part of "winning" these days (be it sports or a business) is performing analytics better than your competition. This is demonstrated in awe-inspiring fashion in the book (and movie) "Moneyball." And on that topic, I'd like to show you a few ways SAS can be used to analyze sports data

Someone recently asked a question on the SAS Support Communities about estimating parameters in ridge regression. I answered the question by pointing to a matrix formula in the SAS documentation. One of the advantages of the SAS/IML language is that you can implement matrix formulas in a natural way. The

Argh! I've just spilled coffee on output that shows the least squares coefficients for a regression model that I was investigating. Now the parameter estimate for the intercept is completely obscured, although I can still see the parameter estimates for the coefficients of the continuous explanatory variable. What can I

There is something for everyone at SAS Global Forum 2013. I like to attend presentations in the Statistics and Data Analysis track and talk with SAS customers in the SAS Support and Demo Area. But one activity that I enjoy the most is to stroll through the poster area and

A SAS user asked an interesting question on the SAS/GRAPH and ODS Graphics Support Forum. The question is: Does PROC SGPLOT support a way to display the slope of the regression line that is computed by the REG statement? Recall that the REG statement in PROC SGPLOT fits and displays

Last week the SAS Training Post blog posted a short article on an easy way to find variables in common to two data sets. The article used PROC CONTENTS (with the SHORT option) to print out the names of variables in SAS data sets so that you can visually determine

When a categorical variable has dozens or hundreds of categories, it is often impractical and undesirable to create a bar chart that shows the counts for all categories. Two alternatives are popular: Display only the Top 10 or Top 20 categories. As I showed last week, to do this in