Blogs

Blogs

Tag: Data Analysis

Rick WicklinSeptember 16, 2015 0

Error distributions and exponential regression models

Last week I discussed ordinary least squares (OLS) regression models and showed how to illustrate the assumptions about the conditional distribution of the response variable. For a single continuous explanatory variable, the illustration is a scatter plot with a regression line and several normal probability distributions along the line. The

Read More

Rick WicklinAugust 21, 2015 0

She wants to be an airborne ranger

I wanna be an airborne ranger, Live the life of guts and danger.* If you are an 80's movie buff, you might remember the scene in The Breakfast Club where Bender, the juvenile delinquent played by Judd Nelson, distracts the principal by running through the school singing this song. Recently,

Read More

Rick WicklinAugust 19, 2015 0

Correlations between groups of variables

Typically a correlation analysis reports the correlations between all pairs of variables, including the variables with themselves. The resulting correlation matrix is square, symmetric, and has 1s on the main diagonal. But suppose you are interested in only specific combinations of variables. Perhaps you want the pairwise correlations between one

Read More

Rick WicklinAugust 5, 2015 0

Regression coefficient plots in SAS

Last week's post about odds ratio plots in SAS made me think about a similar plot that visualizes the parameter estimates for a regression analysis. The so-called regression coefficient plot is a scatter plot of the estimates for each effect in the model, with lines that indicate the width of

Read More

Rick WicklinJuly 29, 2015 0

Odds ratio plots with a logarithmic scale in SAS

I recently read an argument by Andrew Wheeler for using a logarithmic axis for plotting odds ratios. I found his argument convincing. Accordingly, this blog post shows how to create an odds ratio plot in SAS where the ratio axis is displayed on a log scale. Thanks to Bob Derr

Read More

Rick WicklinJuly 26, 2015 0

Wealth and winning in NC high school athletics

The Raleigh News & Observer published a front-page article about the effect of wealth and poverty on high school athletics in North Carolina. In particular, the article concluded that "high schools with a high percentage of poor students rarely win titles in the so-called country club sports—tennis, golf and swimming—and

Read More

Rick WicklinJuly 24, 2015 0

The relationship between toothlessness and income

My colleague Robert Allison finds the most interesting data sets to visualize! Yesterday he posted a visualization of toothless seniors in the US. More precisely, he created graphs that show the estimated prevalence of adults (65 years or older) who have had all their natural teeth extracted. The dental profession

Read More

Rick WicklinJuly 17, 2015 0

Visualizing the distribution of ACT scores

My son is in high school and plans to take the ACT, a standardized test to assess college aptitude and readiness. My wife asked, "What is a good score for the ACT?" I didn't know, but I did a quick internet search and discovered a tabulation of scores for the

Read More

Rick WicklinJuly 15, 2015 0

How to Winsorize data in SAS

Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location. The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC

Read More

Rick WicklinJuly 1, 2015 0

Merge observed outcomes into a list of all outcomes

When you count the outcomes of an experiment, you do not always observe all of the possible outcomes. For example, if you roll a six-sided die 10 times, it might be that the "1" face does not appear in those 10 rolls. Obviously, this situation occurs more frequently with small

Read More

Rick WicklinJune 11, 2015 0

The spiral of splatter

"Daddy, help! Help me! Come quick!" I heard my daughter's screams from the upstairs bathroom and bounded up the stairs two at a time. Was she hurt? Bleeding? Was the toilet overflowing? When I arrived in the doorway, she pointed at the wall and at the floor. The wall was

Read More

Rick WicklinJune 3, 2015 0

The mystery of the density curve that was too short

I was reading a statistics book when I encountered a histogram that caught my eye. The histogram looked similar to the one at the left. It contained a normal density estimate overlaid on a histogram, but the height of the density curve seemed too short when compared to the heights

Read More

Learn SAS

Rick WicklinJune 1, 2015 0

SAS/IML functions that operate on columns of a matrix

A SAS programmer asked for a list of SAS/IML functions that operate on the columns of an n x p matrix and return a 1 x p row vector of results. The functions that behave this way tend to compute univariate descriptive statistics such as the mean, median, standard deviation, and quantiles. The following

Read More

Rick WicklinMay 14, 2015 0

No major hurricanes have hit the US coast recently. Lucky us!

Perhaps you saw the headlines earlier this week about the fact that it has been nine years since the last major hurricane (category 3, 4, or 5) hit the US coast. According to a post on the GeoSpace blog, which is published by the American Geophysical Union (AGU), researchers ran

Read More

Rick WicklinApril 29, 2015 0

Create and use a permutation matrix in SAS

Suppose that you compute the correlation matrix (call it R1) for a set of variables x1, x2, ..., x8. For some reason, you later want to compute the correlation matrix for the variables in a different order, maybe x2, x1, x7,..., x6. Do you need to go back to the

Read More

Rick WicklinMarch 18, 2015 0

Finding observations that match a target value

Imagine that you have one million rows of numerical data and you want to determine if a particular "target" value occurs. How might you find where the value occurs? For univariate data, this is an easy problem. In the SAS DATA step you can use a WHERE clause or a

Read More

Rick WicklinMarch 12, 2015 0

Analyzing the first 10 million digits of pi: Randomness within structure

Saturday, March 14, 2015, is Pi Day, and this year is a super-special Pi Day! This is your once-in-a-lifetime chance to celebrate the first 10 digits of pi (π) by doing something special on 3/14/15 at 9:26:53. Apologies to my European friends, but Pi Day requires that you represent dates

Read More

Rick WicklinFebruary 27, 2015 0

Plotting multiple time series in SAS/IML (Wide to Long, Part 2)

I recently wrote about how to overlay multiple curves on a single graph by reshaping wide data (with many variables) into long data (with a grouping variable). The implementation used PROC TRANSPOSE, which is a procedure in Base SAS. When you program in the SAS/IML language, you might encounter data

Read More

Rick WicklinFebruary 25, 2015 0

Plotting multiple series: Transforming data from wide to long

Data. To a statistician, data are the observed values. To a SAS programmer, analyzing data requires knowledge of the values and how the data are arranged in a data set. Sometimes the data are in a "wide form" in which there are many variables. However, to perform a certain analysis

Read More

Rick WicklinJanuary 2, 2015 0

Popular posts from The DO Loop in 2014

I published 118 blog posts in 2014. This article presents my most popular posts from 2014 and late 2013. 2014 will always be a special year for me because it was the year that the SAS University Edition was launched. The University Edition means that SAS/IML is available to all

Read More

Rick WicklinNovember 21, 2014 0

Resampling and permutation tests in SAS

My colleagues at the SAS & R blog recently posted an example of how to program a permutation test in SAS and R. Their SAS implementation used Base SAS and was "relatively cumbersome" (their words) when compared with the R code. In today's post I implement the permutation test in

Read More

Rick WicklinNovember 7, 2014 0

The distribution of blood types by country

My colleague Robert Allison has a knack for finding fascinating data. Last week he did it again by locating data about how blood types and Rh factors vary among countries. He produced a series of eight world maps, each showing the prevalence of a blood type (A+, A-, B+, B-,

Read More

Rick WicklinNovember 5, 2014 0

Binning data by quantiles? Beware of rounded data

In my article about how to create a quantile plot, I chose not to discuss a theoretical issue that occasionally occurs. The issue is that for discrete data (which includes rounded values), it might be impossible to use quantile values to split the data into k groups where each group

Read More

Rick WicklinOctober 22, 2014 0

Does this kurtosis make my tail look fat?

What is kurtosis? What does negative or positive kurtosis mean, and why should you care? How do you compute kurtosis in SAS software? It is not clear from the definition of kurtosis what (if anything) kurtosis tells us about the shape of a distribution, or why kurtosis is relevant to

Read More

Rick WicklinOctober 10, 2014 0

The frequency of double-letters in Cryptoquotes

It usually takes more than three weeks to prepare a good impromptu speech. --Mark Twain In the popular Cryptoquote puzzle, you are presented with an enciphered version of a quote by a famous person. One of the appeals of the puzzle for me is reading the deciphered quote, such

Read More

Rick WicklinOctober 3, 2014 0

Which double letters appear most frequently in English text?

Double, double toil and trouble; Fire burn, and caldron bubble. Macbeth, Act IV, Scene I For the cyptanalyst or recreational puzzle solver, "double double" does not lead to toil or trouble. Just the opposite: The occurrence of a double-letter bigram in an enciphered word puzzle is quite fortunate. Certain double

Read More

Rick WicklinSeptember 29, 2014 0

Create discrete heat maps in SAS/IML

In a previous article I introduced the HEATMAPCONT subroutine in SAS/IML 13.1, which makes it easy to visualize matrices by using heat maps with continuous color ramps. This article introduces a companion subroutine. The HEATMAPDISC subroutine, which also requires SAS/IML 13.1, is designed to visualize matrices that have a small

Read More

Rick WicklinSeptember 26, 2014 0

The frequency of bigrams in an English corpus

In last week's article about the distribution of letters in an English corpus, I presented research results by Peter Norvig who used Google's digitized library and tabulated the frequency of each letter. Norvig also tabulated the frequency of bigrams, which are pairs of letters that appear consecutively within a word.

Read More

Rick WicklinSeptember 24, 2014 0

Designing a quantile bin plot

While at JSM 2014 in Boston, a statistician asked me whether it was possible to create a "customized bin plot" in SAS. When I asked for more information, she told me that she has a large data set. She wants to visualize the data, but a scatter plot is not

Read More

Rick WicklinSeptember 22, 2014 0

Skew this

The skewness of a distribution indicates whether a distribution is symmetric or not. A distribution that is symmetric about its mean has zero skewness. In contrast, if the right tail of a unimodal distribution has more mass than the left tail, then the distribution is said to be "right skewed"

Read More