It is "well known" that the pairwise deletion of missing values and the resulting computation of correlations can lead to problems in statistical computing. I have previously written about this phenomenon in my article "When is a correlation matrix not a correlation matrix." Specifically, consider the symmetric array whose elements

## Tag: **Data Analysis**

"Yesterday, December 7, 1941, a date which will live in infamy..." - Franklin D. Roosevelt. Today is the 70th anniversary of the Japanese attack on Pearl Harbor. The very next day, America declared war. During a visit to the Smithsonian National Museum of American History, I discovered the results of

Recently the "SAS Sample of the Day" was a Knowledge Base article with an impressively long title: Sample 42165: Using a stored process to eliminate duplicate values caused by multiple group memberships when creating a group-based, identity-driven filter in SAS® Information Map Studio "Wow," I thought. "This is the longest

Halloween night was rainy, so many fewer kids knocked on the door than usual. Consequently, I'm left with a big bucket of undistributed candy. One evening as I treated myself to a mouthful of tooth decay, I chanced to open a package of Wonka® Bottle Caps. The package contained three

Being able to reshape data is a useful skill in data analysis. Most of the time you can use the TRANSPOSE procedure or the SAS DATA step to reshape your data. But the SAS/IML language can be handy, too. I only use PROC TRANSPOSE a few times per year, so

"I think that my data are exponentially distributed, but how can I check?" I get asked that question a lot. Well, not specifically that question. Sometimes the question is about the normal, lognormal, or gamma distribution. A related question is "Which distribution does my data have," which was recently discussed

When I learn a new statistical technique, one of the first things I do is to understand the limitations of the technique. This blog post shares some thoughts on fitting finite mixture models with the FMM procedure. What is a reasonable task for FMM? When are you asking too much? I

A popular use of SAS/IML software is to optimize functions of several variables. One statistical application of optimization is estimating parameters that maximize the likelihood function. This post gives a simple example of maximum likelihood estimation (MLE): fitting a parametric density estimate to data. Which density curve fits the

When you misspell a word on your mobile device or in a word-processing program, the software might "autocorrect" your mistake. This can lead to some funny mistakes, such as the following: I hate Twitter's autocorrect, although changing "extreme couponing" to "extreme coupling" did make THAT tweet more interesting. [@AnnMariaStat] When

In my previous post, I blogged about how to sample from a finite mixture distribution. I showed how to simulate variables from populations that are composed of two or more subpopulations. Modeling a response variable as a mixture distribution is an active area of statistics, as judged by many talks

Last week I showed a graph of the number of US births for each day in 2002, which shows a strong day-of-the-week effect. The graph also shows that the number of births on a given day is affected by US holidays. This blog post looks more closely at the holiday effect.

My friend Chris posted an analysis of the distribution of birthdays for 236 of his Facebook friends. He noted that more of his friends have birthdays in April than in September. The numbers were 28 for April, but only 25 for September. As I reported in my post on "the

My elderly mother enjoys playing Scrabble®. The only problem is that my father and most of my siblings won't play with her because she beats them all the time! Consequently, my mother is always excited when I visit because I'll play a few Scrabble games with her. During a recent

Exploring correlation between variables is an important part of exploratory data analysis. Before you start to model data, it is a good idea to visualize how variables are related to one another. Zach Mayer, on his Modern Toolmaking blog, posted code that shows how to display and visualize correlations in R.

This article describes the SAS/IML CHOOSE function: how it works, how it doesn't work, and how to use it to make your SAS/IML programs more compact. In particular, the CHOOSE function has a potential "gotcha!" that you need to understand if you want your program to perform as expected. What

In a previous blog post, I showed how to use the LOGISTIC procedure to construct a receiver operating characteristic (ROC) curve in SAS. That same day, Charlie H. blogged about how to use the DATA step to construct an ROC curve from basic principles. It has been a long time

"Dad? How many times do I have to roll a die until all six sides appear?" I stopped what I was doing to consider my son's question. Although I could figure out the answer mathematically, sometimes experiments are more powerful than math equations for showing how probability works. "Why don't

One of the joys of statistics is that you can often use different methods to estimate the same quantity. Last week I described how to compute a parametric density estimate for univariate data, and how to use the parameter estimates to compute the area under the probability density function (PDF). This article

If you create a scatter plot of highly correlated data, you will see little more than a thin cloud of points. Small-scale relationships in the data might be masked by the correlation. For example, Luke Miller recently posted a scatter plot that compares the body temperature of snails when they

In a previous article, I discussed random jittering as a technique to reduce overplotting in scatter plots. The example used data that are rounded to the nearest unit, although the idea applies equally well to ordinal data in general. The act of jittering (adding random noise to data) is a

Jittering. To a statistician, it is more than what happens when you drink too much coffee. Jittering is the act of adding random noise to data in order to prevent overplotting in statistical graphs. Overplotting can occur when a continuous measurement is rounded to some convenient unit. This has the

The area under a density estimate curve gives information about the probability that an event occurs. The simplest density estimate is a histogram, and last week I described a few ways to compute empirical estimates of probabilities from histograms and from the data themselves, including how to construct the empirical

A reader commented to me that he wants to use the HISTOGRAM statement of the SGPLOT procedure to overlay two histograms on a single plot. He could do it, but unfortunately SAS was choosing a large bin width for one of the variables and a small bin width for the

Each Sunday, my local paper has a starburst image on the front page that proclaims "Up to $169 in Coupons!" (The value changes from week to week.) One day I looked at the image and thought, "Does the paper hire someone to count the coupons? Is this claim a good

In a previous blog post, I presented a short SAS/IML function module that implements the trapezoidal rule. The trapezoidal rule is a numerical integration scheme that gives the integral of a piecewise linear function that passes through a given set of points. This article demonstrates an application of using the

Many people know that the SGPLOT procedure in SAS 9.2 can create a large number of interesting graphs. Some people also know how to create a panel of graphs (all of the same type) by using the SGPANEL procedure. But did you know that you can also create a panel

A fundamental operation in data analysis is finding data that satisfy some criterion. How many people are older than 85? What are the phone numbers of the voters who are registered Democrats? These questions are examples of locating data with certain properties or characteristics. The SAS DATA step has a

For years I've been making presentations about SAS/IML software at conferences. Since 2008, I've always mentioned to SAS customers that they can call R from within SAS/IML software. (This feature was introduced in SAS/IML Studio 3.2 and was added to the IML procedure in SAS/IML 9.22.) I also included a

When Charlie H. posted an interesting article titled "Top 10 most powerful functions for PROC SQL," there was one item on his list that was unfamiliar: the COALESCE function. (Edit: Charlie's blog no longer exists. The article used to be available at http://www.sasanalysis.com/2011/01/top-10-most-powerful-functions-for-proc.html) Ever since I posted my first response,

Last week the Flowing Data blog published an excellent visualization of the flight patterns of major US airlines. On Friday, I sent the link to Robert Allison, my partner in the 2009 ASA Data Expo, which explored airline data. Robert had written a SAS program for the Expo that plots