Double, double toil and trouble; Fire burn, and caldron bubble. Macbeth, Act IV, Scene I For the cyptanalyst or recreational puzzle solver, "double double" does not lead to toil or trouble. Just the opposite: The occurrence of a double-letter bigram in an enciphered word puzzle is quite fortunate. Certain double

## Tag: **Data Analysis**

In a previous article I introduced the HEATMAPCONT subroutine in SAS/IML 13.1, which makes it easy to visualize matrices by using heat maps with continuous color ramps. This article introduces a companion subroutine. The HEATMAPDISC subroutine, which also requires SAS/IML 13.1, is designed to visualize matrices that have a small

In last week's article about the distribution of letters in an English corpus, I presented research results by Peter Norvig who used Google's digitized library and tabulated the frequency of each letter. Norvig also tabulated the frequency of bigrams, which are pairs of letters that appear consecutively within a word.

While at JSM 2014 in Boston, a statistician asked me whether it was possible to create a "customized bin plot" in SAS. When I asked for more information, she told me that she has a large data set. She wants to visualize the data, but a scatter plot is not

The skewness of a distribution indicates whether a distribution is symmetric or not. A distribution that is symmetric about its mean has zero skewness. In contrast, if the right tail of a unimodal distribution has more mass than the left tail, then the distribution is said to be "right skewed"

It's time for another blog post about ciphers. As I indicated in my previous blog post about substitution ciphers, the classical substitution cipher is no longer used to encrypt ultra-secret messages because the enciphered text is prone to a type of statistical attack known as frequency analysis. At the root

In a previous blog post I showed how to order a set of variables by a statistic. After reshaping data, you can create a graph that contains box plots for many variables. Ordering the variables by some statistic (mean, median, variance,...) helps to differentiate and distinguish the variables. You can

While I was working on my recent blog post about two-dimensional binning, a colleague asked whether I would be discussing "the new hexagonal binning method that was added to the SURVEYREG procedure in SAS/STAT 13.2." I was intrigued: I was not aware that hexagonal binning had been added to a

Last Monday I discussed how to choose the bin width and location for a histogram in SAS. The height of each histogram bar shows the number of observations in each bin. Although my recent article didn't mention it, you can also use the IML procedure to count the number of

When you create a histogram with statistical software, the software uses the data (including the sample size) to automatically choose the width and location of the histogram bins. The resulting histogram is an attempt to balance statistical considerations, such as estimating the underlying density, and "human considerations," such as choosing

My wife got one of those electronic activity trackers a few months ago and has been diligently walking every day since then. At the end of the day she sometimes reads off how many steps she walked, as measured by her activity tracker. I am always impressed at how many

In a previous blog post, I showed how to use the graph template language (GTL) in SAS to create heat maps with a continuous color ramp. SAS/IML 13.1 includes the HEATMAPCONT subroutine, which makes it easy to create heat maps with continuous color ramps from SAS/IML matrices. Typical usage includes

In a previous blog post, I showed how to overlay a prediction ellipse on a scatter plot in SAS by using the ELLIPSE statement in PROC SGPLOT. The ELLIPSE statement draws the ellipse by using a standard technique that assumes the sample is bivariate normal. Today's article describes the technique

It is common in statistical graphics to overlay a prediction ellipse on a scatter plot. This article describes two easy ways to overlay prediction ellipses on a scatter plot by using SAS software. It also describes how to overlay multiple prediction ellipses for subpopulations. What is a prediction ellipse? A

An empty matrix is a matrix that has zero rows and zero columns. At first "empty matrix" sounds like an oxymoron, but when programming in a matrix language such as SAS/IML, empty matrices arise surprisingly often. Sometimes empty matrices occur because of a typographical error in your program. If you

In my four years of blogging, the post that has generated the most comments is "How to handle negative values in log transformations." Many people have written to describe data that contain negative values and to ask for advice about how to log-transform the data. Today I describe a transformation

In my previous blog post, I showed how to use log axes on a scatter plot in SAS to better visualize data that range over several orders of magnitude. Because the data contained counts (some of which were zero), I used a custom transformation x → log10(x+1) to visualize the

If you are trying to visualize numerical data that range over several magnitudes, conventional wisdom says that a log transformation of the data can often result in a better visualization. This article shows several ways to create a scatter plot with logarithmic axes in SAS and discusses some of the

A few years ago I blogged about how to expand a data set by using a frequency variable. The DATA step in the article was simple, but the SAS/IML function was somewhat complicated and used a DO loop to expand the data. (Although a reader later showed how to avoid

Last week I showed how to use the SUBMIT and ENDSUBMIT statements in the SAS/IML language to call the SGPLOT procedure to create ODS graphs of data that are in SAS/IML vectors and matrices. I also showed how to create a SAS/IML module that hides the details and enables you

As you develop a program in the SAS/IML language, it is often useful to create graphs to visualize intermediate results. The language supports basic statistical graphics such as bar charts, histograms, scatter plots, and so on. However, you can create more advanced graphics without leaving PROC IML by using the

A colleague asked me an interesting question: I have a journal article that includes sample quantiles for a variable. Given a new data value, I want to approximate its quantile. I also want to simulate data from the distribution of the published data. Is that possible? This situation is common.

If you’re an analyst, you know discovery in a complicated data set is one of the toughest problems to solve. But did you know the Business Knowledge Series course, Exploratory Analysis for Large and Complex Problems Using SAS Enterprise Miner, can help you solve those issues by tackling real-world problems?

A colleague sent me an interesting question: What is the best way to abort a SAS/IML program? For example, you might want to abort a program if the data is singular or does not contain a sufficient number of observations or variables. As a first attempt would be to try

My last blog post described three ways to add a smoothing spline to a scatter plot in SAS. I ended the post with a cautionary note: From a statistical point of view, the smoothing spline is less than ideal because the smoothing parameter must be chosen manually by the user.

In 2013 I published 110 blog posts. Some of these articles were more popular than others, often because they were linked to from a SAS newsletter such as the SAS Statistics and Operations Research News. In no particular order, here are some of my most popular posts from 2013, organized

The mosaic plot is a graphical visualization of a frequency table. In a previous post, I showed how to use the FREQ procedure to create a mosaic plot. This article shows how to create a mosaic plot by using the MOSAICPARM statement in the graph template language (GTL). (The MOSAICPARM

Mosaic plots (Hartigan and Kleiner, 1981; Friendly, 1994, JASA) are used for exploratory data analysis of categorical data. Mosaic plots have been available for decades in SAS products such as JMP, SAS/INSIGHT, and SAS/IML Studio. However, not all SAS customers have access to these specialized products, so I am pleased

If you've ever tried to use PROC FREQ to create a frequency table of two character variables, you know that by default the categories for each variable are displayed in alphabetical order. A different order is sometimes more useful. For example, consider the following two-way table for the smoking status

A challenge for statistical programmers is getting data into the right form for analysis. For graphing or analyzing data, sometimes the "wide format" (each subject is represented by one row and many variables) is required, but other times the "long format" (observations for each subject span multiple rows) is more