## Tag: Data Analysis

0
Fitting a Poisson distribution to data in SAS

Over at the SAS Discussion Forums, someone asked how to use SAS to fit a Poisson distribution to data. The questioner asked how to fit the distribution but also how to overlay the fitted density on the data and to create a quantile-quantile (Q-Q) plot. The questioner mentioned that the

0
Count missing values in observations

Locating missing values is important in statistical data analysis. I've previously written about how to count the number of missing values for each variable in a data set. In Base SAS, I showed how to use the MEANS or FREQ procedures to count missing values. In the SAS/IML language, I

0
Linear interpolation in SAS/IML

A recent discussion on the SAS-L discussion forum concerned how to implement linear interpolation in SAS. Some people suggested using PROC EXPAND in SAS/ETS software, whereas others proposed a DATA step solution. For me, the SAS/IML language provides a natural programming environment to implement an interpolation scheme. It also provides

0
Compute sample quantiles by using the QNTL call

SAS provides several ways to compute sample quantiles of data. The UNIVARIATE procedure can compute quantiles (also called percentiles), but you can also compute them in the SAS/IML language. Prior to SAS/IML 9.22 (released in 2010) statistical programmers could call a SAS/IML module that computes sample quantiles. With the release

0
Quantiles of discrete distributions

I work with continuous distributions more often than with discrete distributions. Consequently, I am used to thinking of the quantile function as being an inverse cumulative distribution function (CDF). (These functions are described in my article, "Four essential functions for statistical programmers.") For discrete distributions, they are not. To quote

Learn SAS
0
Functions to know: The MEAN, VAR, and STD functions

As a SAS developer, I am always looking ahead to the next release of SAS. However, many SAS customer sites migrate to new releases slowly and are just now adopting versions of SAS that were released in 2010 or 2011. Consequently, I want to write a few articles that discuss

0
Testing data for multivariate normality

I've blogged several times about multivariate normality, including how to generate random values from a multivariate normal distribution. But given a set of multivariate data, how can you determine if it is likely to have come from a multivariate normal distribution? The answer, of course, is to run a goodness-of-fit

0
What is Mahalanobis distance?

I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance. A subsequent article will describe how you can compute Mahalanobis distance. Distance in standard units In statistics, we sometimes measure "nearness" or "farness" in terms of the

0
Detecting outliers in SAS: Part 3: Multivariate location and scatter

In two previous blog posts I worked through examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Robust estimates of location in a univariate setting are well-known, with the median statistic being the classical example. Robust estimates of scale are less well-known, with

0
Detecting outliers in SAS: Part 2: Estimating scale

In a previous blog post on robust estimation of location, I worked through some of the examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. I showed that SAS/IML software and PROC UNIVARIATE both support the robust estimators of location that are mentioned

0
Detecting outliers in SAS: Part 1: Estimating location

I encountered a wonderful survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Not only are the authors major contributors to the field of robust estimation, but the article is short and very readable. This blog post walks through the examples in the paper and shows

0
Overlay density estimates on a plot

A recent question on a SAS Discussion Forum was "how can you overlay multiple kernel density estimates on a single plot?" There are three ways to do this, depending on your goals and objectives. Overlay different estimates of the same variable Sometimes you have a single variable and want to

0
Missing values and pairwise correlations: A cautionary example

It is "well known" that the pairwise deletion of missing values and the resulting computation of correlations can lead to problems in statistical computing. I have previously written about this phenomenon in my article "When is a correlation matrix not a correlation matrix." Specifically, consider the symmetric array whose elements

0
American pre-WW2 attitudes about Germany and Allies

Yesterday, December 7, 1941, a date which will live in infamy... - Franklin D. Roosevelt Today is the 70th anniversary of the Japanese attack on Pearl Harbor. The very next day, America declared war. During a visit to the Smithsonian National Museum of American History, I discovered the results of

0
The distribution of flavors in Halloween candies

Halloween night was rainy, so many fewer kids knocked on the door than usual. Consequently, I'm left with a big bucket of undistributed candy. One evening as I treated myself to a mouthful of tooth decay, I chanced to open a package of Wonka® Bottle Caps. The package contained three

0
Reshape data so that each category becomes a new variable

Being able to reshape data is a useful skill in data analysis. Most of the time you can use the TRANSPOSE procedure or the SAS DATA step to reshape your data. But the SAS/IML language can be handy, too. I only use PROC TRANSPOSE a few times per year, so

0
Modeling the distribution of data? Create a Q-Q plot

"I think that my data are exponentially distributed, but how can I check?" I get asked that question a lot. Well, not specifically that question. Sometimes the question is about the normal, lognormal, or gamma distribution. A related question is "Which distribution does my data have," which was recently discussed

0
The "power" of finite mixture models

When I learn a new statistical technique, one of first things I do is to understand the limitations of the technique. This blog post shares some thoughts on modeling finite mixture models with the FMM procedure. What is a reasonable task for FMM? When are you asking too much? I

0
Maximum likelihood estimation in SAS/IML

A popular use of SAS/IML software is to optimize functions of several variables. One statistical application of optimization is estimating parameters that optimize the maximum likelihood function. This post gives a simple example for maximum likelihood estimation (MLE): fitting a parametric density estimate to data. Which density curve fits the

0
Distances between words

When you misspell a word on your mobile device or in a word-processing program, the software might "autocorrect" your mistake. This can lead to some funny mistakes, such as the following: I hate Twitter's autocorrect, although changing "extreme couponing" to "extreme coupling" did make THAT tweet more interesting. [@AnnMariaStat] When

0
Modeling Finite Mixtures with the FMM Procedure

In my previous post, I blogged about how to sample from a finite mixture distribution. I showed how to simulate variables from populations that are composed of two or more subpopulations. Modeling a response variable as a mixture distribution is an active area of statistics, as judged by many talks

0
The effect of holidays on US births

Last week I showed a graph of the number of US births for each day in 2002, which shows a strong day-of-the-week effect. The graph also shows that the number of births on a given day is affected by US holidays. This blog post looks closer at the holiday effect.

0
Visualizing Scrabble games

My elderly mother enjoys playing Scrabble®. The only problem is that my father and most of my siblings won't play with her because she beats them all the time! Consequently, my mother is always excited when I visit because I'll play a few Scrabble games with her. During a recent

0
Visualizing correlations between variables in SAS

Exploring correlation between variables is an important part of exploratory data analysis. Before you start to model data, it is a good idea to visualize how variables related to one another. Zach Mayer, on his Modern Toolmaking blog, posted code that shows how to display and visualize correlations in R.

0
Complex assignment statements: CHOOSE wisely

This article describes the SAS/IML CHOOSE function: how it works, how it doesn't work, and how to use it to make your SAS/IML programs more compact. In particular, the CHOOSE function has a potential "gotcha!" that you need to understand if you want your program to perform as expected. What

0
Computing an ROC curve from basic principles

In a previous blog post, I showed how to use the LOGISTIC procedure to construct a receiver operator characteristic (ROC) curve in SAS. That same day, Charlie H. blogged about how to use the DATA step to construct an ROC curve from basic principles. It has been a long time