Missing data can be informative. Sometimes missing values in one variable are related to missing values in another variable. Other times missing values in one variable are independent of missing values in other variables. As part of the exploratory phase of data analysis, you should investigate whether there are patterns
Tag: Data Analysis
In SAS procedures, the WHERE clause is a useful way to filter observations so that the procedure receives only a subset of the data to analyze. The IML procedure supports the WHERE clause in two separate statements. On the USE statement, the WHERE clause acts as a global filter. The
![](https://blogs.sas.com/content/iml/files/2016/03/descstatuni1.png)
Descriptive univariate statistics are the foundation of data analysis. Before you create a statistical model for new data, you should examine descriptive univariate statistics such as the mean, standard deviation, quantiles, and the number of nonmissing observations. In SAS, there is an easy way to create a data set that
![High school ranking and wrestling champions](https://blogs.sas.com/content/iml/files/2016/03/wrestlingrank1.png)
Last weekend was the 2016 NCAA Division I wrestling tournament. In collegiate wrestling there are ten weight classes. The top eight wrestlers in each weight class are awarded the title "All-American" to acknowledge that they are the best wrestlers in the country. I saw a blog post on the InterMat
![](https://blogs.sas.com/content/iml/files/2016/03/gampl1.png)
My previous blog post shows how to use PROC LOGISTIC and spline effects to predict the probability that an NBA player scores from various locations on a court. The LOGISTIC procedure fits parametric models, which means that the procedure estimates parameters for every explanatory effect in the model. Spline bases
![A statistical analysis of Stephen Curry's shooting](https://blogs.sas.com/content/iml/files/2016/03/curry5.png)
Last week Robert Allison showed how to download NBA data into SAS and create graphs such as the location where Stephen Curry took shots in the 2015-16 season to date. The graph at left shows the kind of graphs that Robert created. I've reversed the colors from Robert's version, so
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Most SAS regression procedures support the "stars and bars" operators, which enable you to create models that include main effects and all higher-order interaction effects. You can also easily create models that include all n-way interactions up to a specified value of n. However, it can be a challenge to
![](https://blogs.sas.com/content/iml/files/2016/02/t_designIML1.png)
Last week I showed how to create dummy variables in SAS by using the GLMMOD procedure. The procedure enables you to create design matrices that encode continuous variables, categorical variables, and their interactions. You can use dummy variables to replace categorical variables in procedures that do not support a CLASS
![](https://blogs.sas.com/content/iml/files/2016/02/t_design2.png)
SAS programmers sometimes ask, "How do I create a design matrix in SAS?" A design matrix is a numerical matrix that represents the explanatory variables in regression models. In simple models, the design matrix contains one column for each continuous variable and multiple columns (called dummy variables) for each classification
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed
![](https://blogs.sas.com/content/iml/files/2016/01/movingaverage1.png)
A common question on SAS discussion forums is how to compute a moving average in SAS. This article shows how to use PROC EXPAND and contains links to articles that use the DATA step or macros to compute moving averages in SAS. In a previous post, I explained how to
![](https://blogs.sas.com/content/iml/files/2016/01/movingaverage1.png)
A moving average (also called a rolling average) is a statistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling
![](https://blogs.sas.com/content/iml/files/2015/12/weightedmean2.png)
Weighted averages are all around us. Teachers use weighted averages to assign a test more weight than a quiz. Schools use weighted averages to compute grade-point averages. Financial companies compute the return on a portfolio as a weighted average of the component assets. Financial charts show (linearly) weighted moving averages
![](https://blogs.sas.com/content/iml/files/2017/02/AdvancedAnalytics-3.png)
I wrote 114 posts for The DO Loop blog in 2015. Which were the most popular with readers? In general, highly technical articles appeal to only a small group of readers, whereas less technical articles appeal to a larger audience. Consequently, many of my popular articles were related to data
![](https://blogs.sas.com/content/iml/files/2015/12/chi2fit.png)
A SAS customer asked: Why isn't the chi-square distribution supported in PROC UNIVARIATE? That is an excellent question. I remember asking a similar question when I first started learning SAS. In addition to the chi-square distribution, I wondered why the UNIVARIATE procedure does not support the F distribution. These are
![](https://blogs.sas.com/content/iml/files/2017/01/AdvancedAnalytics-2.png)
Did you know that the FREQ procedure in SAS can compute exact p-values for more than 20 statistical tests and statistics that are associated with contingency table? Mamma mia! That's a veritable smorgasbord of options! Some of the tests are specifically for one-way tables or 2 x 2 tables, but many apply
![](https://blogs.sas.com/content/iml/files/2015/10/t_tabulatelevels.png)
Suppose that you are tabulating the eye colors of students in a small class (following Friendly, 1992). Depending upon the ethnic groups of these students, you might not observe any green-eyed students. How do you put a 0 into the table that summarizes the number of students who have each
![](https://blogs.sas.com/content/iml/files/2015/09/GLM_normal_identity.png)
Last week I discussed ordinary least squares (OLS) regression models and showed how to illustrate the assumptions about the conditional distribution of the response variable. For a single continuous explanatory variable, the illustration is a scatter plot with a regression line and several normal probability distributions along the line. The
![](https://blogs.sas.com/content/iml/files/2015/08/t_ranger1.png)
I wanna be an airborne ranger, Live the life of guts and danger.* If you are an 80's movie buff, you might remember the scene in The Breakfast Club where Bender, the juvenile delinquent played by Judd Nelson, distracts the principal by running through the school singing this song. Recently,
![](https://blogs.sas.com/content/iml/files/2015/08/t_corrwith1.png)
Typically a correlation analysis reports the correlations between all pairs of variables, including the variables with themselves. The resulting correlation matrix is square, symmetric, and has 1s on the main diagonal. But suppose you are interested in only specific combinations of variables. Perhaps you want the pairwise correlations between one
![](https://blogs.sas.com/content/iml/files/2015/07/coefplot1.png)
Last week's post about odds ratio plots in SAS made me think about a similar plot that visualizes the parameter estimates for a regression analysis. The so-called regression coefficient plot is a scatter plot of the estimates for each effect in the model, with lines that indicate the width of
![](https://blogs.sas.com/content/iml/files/2015/07/oddsratio1.png)
I recently read an argument by Andrew Wheeler for using a logarithmic axis for plotting odds ratios. I found his argument convincing. Accordingly, this blog post shows how to create an odds ratio plot in SAS where the ratio axis is displayed on a log scale. Thanks to Bob Derr
![](https://blogs.sas.com/content/iml/files/2015/07/nchssports.png)
The Raleigh News & Observer published a front-page article about the effect of wealth and poverty on high school athletics in North Carolina. In particular, the article concluded that "high schools with a high percentage of poor students rarely win titles in the so-called country club sports—tennis, golf and swimming—and
![](https://blogs.sas.com/content/iml/files/2015/07/teeth1.png)
My colleague Robert Allison finds the most interesting data sets to visualize! Yesterday he posted a visualization of toothless seniors in the US. More precisely, he created graphs that show the estimated prevalence of adults (65 years or older) who have had all their natural teeth extracted. The dental profession
![](https://blogs.sas.com/content/iml/files/2015/07/act3.png)
My son is in high school and plans to take the ACT, a standardized test to assess college aptitude and readiness. My wife asked, "What is a good score for the ACT?" I didn't know, but I did a quick internet search and discovered a tabulation of scores for the
![](https://blogs.sas.com/content/iml/files/2015/06/t_winsorize-150x82.png)
Recently a SAS customer asked how to Winsorize data in SAS. Winsorization is best known as a way to construct robust univariate statistics. The Winsorized mean is a robust estimate of location. The Winsorized mean is similar to the trimmed mean, and both are described in the documentation for PROC
![](https://blogs.sas.com/content/iml/files/2015/07/t_mergecounts1-150x85.png)
When you count the outcomes of an experiment, you do not always observe all of the possible outcomes. For example, if you roll a six-sided die 10 times, it might be that the "1" face does not appear in those 10 rolls. Obviously, this situation occurs more frequently with small
![](https://blogs.sas.com/content/iml/files/2015/06/polishspiral.jpg)
"Daddy, help! Help me! Come quick!" I heard my daughter's screams from the upstairs bathroom and bounded up the stairs two at a time. Was she hurt? Bleeding? Was the toilet overflowing? When I arrived in the doorway, she pointed at the wall and at the floor. The wall was
![](https://blogs.sas.com/content/iml/files/2015/05/bindensity.png)
I was reading a statistics book when I encountered a histogram that caught my eye. The histogram looked similar to the one at the left. It contained a normal density estimate overlaid on a histogram, but the height of the density curve seemed too short when compared to the heights
![](https://blogs.sas.com/content/iml/files/2017/02/AdvancedAnalytics-3.png)
A SAS programmer asked for a list of SAS/IML functions that operate on the columns of an n x p matrix and return a 1 x p row vector of results. The functions that behave this way tend to compute univariate descriptive statistics such as the mean, median, standard deviation, and quantiles. The following