Missing data can be informative. Sometimes missing values in one variable are related to missing values in another variable. Other times missing values in one variable are independent of missing values in other variables. As part of the exploratory phase of data analysis, you should investigate whether there are patterns
![HTvsHH2](https://blogs.sas.com/content/iml/files/2016/04/HTvsHH2.png)
I saw an interesting mathematical result in Wired magazine. The original article was about mathematical research into prime numbers, but the article included the following tantalizing fact: If Alice tosses a fair coin until she sees a head followed by a tail, and Bob tosses a coin until he sees two
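The coin-tossing experiment described in this excerpt is easy to explore by simulation. The following is a minimal Python sketch (not the SAS code from the original article); the helper names `tosses_until` and `mean_tosses`, the seed, and the number of trials are all hypothetical choices made for illustration.

```python
import random

def tosses_until(pattern, rng):
    """Count fair-coin tosses until the two-symbol pattern (e.g. 'HT') first appears."""
    prev = None
    count = 0
    while True:
        cur = rng.choice("HT")          # one fair toss: 'H' or 'T'
        count += 1
        if prev is not None and prev + cur == pattern:
            return count
        prev = cur

def mean_tosses(pattern, trials, seed=None):
    """Average number of tosses needed to see the pattern, over many trials."""
    rng = random.Random(seed)
    return sum(tosses_until(pattern, rng) for _ in range(trials)) / trials

# Alice waits for a head followed by a tail; Bob waits for two heads in a row.
mean_ht = mean_tosses("HT", trials=20000, seed=1)
mean_hh = mean_tosses("HH", trials=20000, seed=2)
print(mean_ht, mean_hh)
```

Comparing the two printed averages shows whether Alice and Bob can expect to toss the same number of times.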
![ODS graphics that share marker attributes](https://blogs.sas.com/content/iml/files/2016/04/odsstyleattrs4.png)
The SG procedures in SAS use aesthetically pleasing default colors, shapes, and styles, but sometimes it is necessary to override the default attributes. The MARKERATTRS= option enables you to override the default colors, symbols, and sizes of markers in scatter plots and other graphs. Similarly, the LINEATTRS= option enables you
![Generate random points uniformly in a ball](https://blogs.sas.com/content/iml/files/2016/03/Sim3D.gif)
Last week I showed how to generate random points uniformly inside a 2-D circular region. That article showed that the distance of a point to the circle's center cannot be distributed uniformly. Instead, you should use the square root of a uniform variate to generate 2-D distances to the origin.
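The square-root idea in this excerpt can be sketched in a few lines. The following is a minimal Python illustration, not the SAS/IML program from the original article; the function name `uniform_disk` and its parameters are hypothetical.

```python
import math
import random

def uniform_disk(n, radius=1.0, seed=None):
    """Return n points (x, y) distributed uniformly in a disk centered at the origin."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        theta = 2.0 * math.pi * rng.random()   # angle is uniform on [0, 2*pi)
        r = radius * math.sqrt(rng.random())   # sqrt compensates for area growing like r^2
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

pts = uniform_disk(10000, radius=1.0, seed=1)

# Sanity check: the disk of half the radius has one quarter of the area,
# so roughly one quarter of the points should land inside it.
inner_fraction = sum(1 for x, y in pts if math.hypot(x, y) <= 0.5) / len(pts)
print(inner_fraction)
```

If the radius were drawn as a plain uniform variate instead, about half of the points would land in the inner disk, which is the clustering near the center that the excerpt warns about.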
In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements. On the USE statement, the WHERE clause acts as a global filter. The
![](https://blogs.sas.com/content/iml/files/2016/03/simball2.png)
It is easy to generate random points that are uniformly distributed inside a rectangle. You simply generate independent random uniform values for each coordinate. However, nonrectangular regions are more complicated. An instructive example is to simulate points uniformly inside the ball with a given radius. The two-dimensional case is to
![](https://blogs.sas.com/content/iml/files/2016/03/descstatuni1.png)
Descriptive univariate statistics are the foundation of data analysis. Before you create a statistical model for new data, you should examine descriptive univariate statistics such as the mean, standard deviation, quantiles, and the number of nonmissing observations. In SAS, there is an easy way to create a data set that
![High school ranking and wrestling champions](https://blogs.sas.com/content/iml/files/2016/03/wrestlingrank1.png)
Last weekend was the 2016 NCAA Division I wrestling tournament. In collegiate wrestling there are ten weight classes. The top eight wrestlers in each weight class are awarded the title "All-American" to acknowledge that they are the best wrestlers in the country. I saw a blog post on the InterMat
![](https://blogs.sas.com/content/iml/files/2016/03/gampl1.png)
My previous blog post shows how to use PROC LOGISTIC and spline effects to predict the probability that an NBA player scores from various locations on a court. The LOGISTIC procedure fits parametric models, which means that the procedure estimates parameters for every explanatory effect in the model. Spline bases
![A statistical analysis of Stephen Curry's shooting](https://blogs.sas.com/content/iml/files/2016/03/curry5.png)
Last week Robert Allison showed how to download NBA data into SAS and create graphs such as the location where Stephen Curry took shots in the 2015-16 season to date. The graph at left shows the kind of graphs that Robert created. I've reversed the colors from Robert's version, so
There are several ways to simulate multinomial data in SAS. In the SAS/IML matrix language, you can use the RANDMULTINOMIAL function to generate samples from the multinomial distribution. If you don't have a SAS/IML license, I have previously written about how to use the SAS DATA step or PROC SURVEYSELECT
![piMCest1](https://blogs.sas.com/content/iml/files/2016/03/piMCest1.png)
Today is March 14th, which is annually celebrated as Pi Day. Today's date, written as 3/14/16, represents the best five-digit approximation of pi. On Pi Day, many people blog about how to approximate pi. This article uses a Monte Carlo simulation to estimate pi, in spite of the fact that
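One common Monte Carlo estimate of pi draws random points in the unit square and counts how many fall inside the quarter circle of radius 1. The sketch below shows that approach in Python; it is an assumption that this matches the method in the original article, which may use a different estimator, and the name `estimate_pi` is hypothetical.

```python
import random

def estimate_pi(n, seed=None):
    """Estimate pi from the fraction of random points in the unit square
    that fall inside the quarter circle x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    # The quarter circle has area pi/4, so hits/n approximates pi/4.
    return 4.0 * hits / n

pi_hat = estimate_pi(200000, seed=314)
print(pi_hat)
```

The standard error of this estimate shrinks only like one over the square root of the number of points, so many points are needed for even a few correct digits.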
![](https://blogs.sas.com/content/iml/files/2016/03/overlayhist4.png)
You can use histograms to visualize the distribution of data. A comparative histogram enables you to compare two or more distributions, which usually represent subpopulations in the data. Common subpopulations include males versus females or a control group versus an experimental group. There are two common ways to construct a
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Most SAS regression procedures support the "stars and bars" operators, which enable you to create models that include main effects and all higher-order interaction effects. You can also easily create models that include all n-way interactions up to a specified value of n. However, it can be a challenge to
![](https://blogs.sas.com/content/iml/files/2016/02/t_designIML1.png)
Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS
![](https://blogs.sas.com/content/iml/files/2016/02/t_deflib1.png)
One of the first things SAS programmers learn is that SAS data sets can be specified in two ways. You can use a two-level name such as "sashelp.class" which uses a SAS libref (SASHELP) and a member name (CLASS) to specify the location of the data set. Alternatively, you can
![](https://blogs.sas.com/content/iml/files/2016/02/t_design2.png)
SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification
![](https://blogs.sas.com/content/iml/files/2016/02/t_dummyvar1.png)
A dummy variable (also known as an indicator variable) is a numeric variable that indicates the presence or absence of some level of a categorical variable. The word "dummy" does not imply that these variables are not smart. Rather, dummy variables serve as a substitute or a proxy for a categorical
![](https://blogs.sas.com/content/iml/files/2016/02/legendcategories2.png)
Last week Sanjay Matange wrote about a new SAS 9.4m3 option that enables you to show all categories in a graph legend, even when the data do not contain all the categories. Sanjay's example was a chart that showed medical conditions classified according to the scale "Mild," "Moderate," and "Severe."
![](https://blogs.sas.com/content/iml/files/2016/02/sampletable.png)
Many simulation and resampling tasks use one of four sampling methods. When you draw a random sample from a population, you can sample with or without replacement. At the same time, all individuals in the population might have equal probability of being selected, or some individuals might be more likely
![](https://blogs.sas.com/content/iml/files/2016/02/samplepps.png)
How do you sample with replacement in SAS when the probability of choosing each observation varies? I was asked this question recently. The programmer thought he could use PROC SURVEYSELECT to generate the samples, but he wasn't sure which sampling technique he should use to sample with unequal probability. This
![](https://blogs.sas.com/content/iml/files/2016/02/readdata1-138x150.png)
In the SAS/IML language, you can read data from a SAS data set into a set of vectors (each with their own name) or into a single matrix. Beginning programmers might wonder about the advantages of each approach. When should you read data into vectors? When should you read data
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed
![](https://blogs.sas.com/content/iml/files/2016/01/notsorted1-103x150.png)
Novice SAS programmers quickly learn the advantages of using PROC SORT to sort data, followed by a BY-group analysis of the sorted data. A typical example is to analyze demographic data by state or by ZIP code. A BY statement enables you to produce multiple analyses from a single procedure
![](https://blogs.sas.com/content/iml/files/2016/01/movingaverage1.png)
A common question on SAS discussion forums is how to compute a moving average in SAS. This article shows how to use PROC EXPAND and contains links to articles that use the DATA step or macros to compute moving averages in SAS. In a previous post, I explained how to
![](https://blogs.sas.com/content/iml/files/2016/01/movingaverage1.png)
A moving average (also called a rolling average) is a statistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling
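As a rough illustration of the definition, here is a minimal Python sketch of a simple (unweighted, trailing) moving average. It is not the SAS code from the original article; the function name `simple_moving_average` and the choice to return `None` until a full window is available are assumptions made for this example.

```python
def simple_moving_average(values, window):
    """Trailing simple moving average: the mean of the current value and the
    previous (window - 1) values. Positions without a full window get None."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    result = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        result.append(running / window if i >= window - 1 else None)
    return result

sma = simple_moving_average([1, 2, 3, 4, 5, 6], window=3)
print(sma)   # [None, None, 2.0, 3.0, 4.0, 5.0]
```

Keeping a running sum makes each step constant time, which matters when the series is long or the window is wide.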
![](https://blogs.sas.com/content/iml/files/2016/01/aspecttimeseries1.png)
In SAS, the aspect ratio of a graph is the physical height of the graph divided by the physical width. Recently I demonstrated how to set the aspect ratio of graphs in SAS by using the ASPECT= option in PROC SGPLOT or by using the OVERLAYEQUATED statement in the Graph
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Parameters in SAS procedures are specified as a list of values that you manually type into the procedure syntax. For example, if you want to specify a list of percentile values in PROC UNIVARIATE, you need to type the values into the PCTLPTS= option as follows: proc univariate data=sashelp.cars noprint; var
![](https://blogs.sas.com/content/iml/files/2016/01/centroid.png)
Recently I blogged about how to compute a weighted mean and showed that you can use a weighted mean to compute the center of mass for a system of N point masses in the plane. That led me to think about a related problem: computing the center of mass (called
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
I began 2016 by compiling a list of popular articles from my blog in 2015. This "People's Choice" list contains many interesting articles, but some of my personal favorites did not make the list. Today I present the "Editor's Choice" list of articles that deserve a second look. I've grouped