I am thankful to be a statistical programmer.
When I wake up in the morning, I am eager to start my day. I love statistics, programming, and working at SAS, and I write my blog to share that joy. This is a Golden Age for statistical programmers because theoretical ideas and computational power have converged during the last 30 years so that it is now possible to attack problems that were previously intractable.
Here, then, is my list of ten areas of computational statistics that make me thankful to be a statistical programmer:
- Bootstrapping and resampling techniques: I remember reading a Scientific American article on the bootstrap in 1983 when I was in high school. It seemed like magic. Now, more than 25 years later, it still seems like magic, but it is magic that I can easily program and use.
- Nonparametric modeling: From kernel density estimation to multivariate regression splines, nonparametric techniques not only fit a wide variety of data, but also reduce the modeling biases of parametric models. Two of my favorite tools are PROC UNIVARIATE for density estimation and PROC LOESS for scatter plot smoothing, but I have used all of the SAS/STAT nonparametric regression procedures and I physically salivate if you mention the new EFFECT statement available in SAS/STAT 9.22.
- Variable selection techniques: If you have a response variable and p explanatory variables, there are 2^p possible subsets that you can use to model the relationship between the response and the explanatory variables. Which subset to use? Some people want the data to decide. Variable selection techniques find a subset that optimizes some fit statistic. Programming an algorithm that finds the subset quickly requires a range of skills from computer science, statistics, and optimization. Fortunately, SAS has developers who excel in these areas, and PROC GLMSELECT (which includes popular techniques such as LAR and LASSO) is rapidly becoming a favorite tool with SAS customers because of its speed and flexibility.
- Robust estimation: Outliers in data can influence estimates of location, scale, regression coefficients, and so on. Robust techniques are often computationally more intensive than their classical counterparts, but in many cases they produce superior estimates. I enjoy using the ROBUSTREG procedure for robust regression and the MCD subroutine in PROC IML for robust estimation of covariance matrices.
- Bayesian computations: Alan Gelfand writes that "Markov chain Monte Carlo (MCMC) has revolutionized the way that statistical models are fitted" [Statistics in the 21st Century, p. 341]. I couldn't agree more. SAS customers flock to every presentation and paper on PROC MCMC and other SAS procedures that support Bayesian analysis. The enthusiasm does not seem to be waning.
- Imputing missing values: Missing values are a way of life, but new techniques are helping statisticians to impute missing values under certain conditions. Before I started studying statistics, I had never heard of a missing value. And I certainly was not aware that you can use multiple imputation techniques to make valid statistical inferences in the presence of missing values. In SAS/STAT software, the MI and MIANALYZE procedures do exactly that.
- Maximum likelihood estimation: Not only is MLE useful for the practicing data analyst, but it can be fun and challenging for the statistical programmer because it requires understanding statistics, numerical analysis, and optimization. There have been hundreds of papers written just on MLE for the three-parameter Weibull distribution! For a statistical programmer, MLE means "job security," and who isn't thankful for that?
- Simulation: Although simulation is hardly new, languages such as SAS/IML make it easy to investigate characteristics of a statistical technique by applying it to simulated data with known properties. In fact, I'll be describing how to do this at my "data simulation" seminar at SAS Global Forum in Las Vegas, April 4–7, 2011.
- Statistical graphics: I am a visual learner. I graph my data to discover relationships between variables and to examine unusual observations. I graph after I build a statistical model in order to assess the model fit. I'm thankful that most SAS/STAT procedures automatically create suitable graphs by using ODS Statistical Graphics. When I need to create specialized graphs, I can use the SGPLOT procedure or program custom, dynamically linked graphs in SAS/IML Studio.
- High-level computational languages: In my youth, there was FORTRAN. Powerful? Yes, but programming in FORTRAN is like eating overcooked Brussels sprouts: it leaves me nauseous and with a bad taste in my mouth. For me, statistical programming with SAS/IML software is the cure for my Brussels sprouts blues. Whether you program with SAS/STAT procedures and the DATA step, or whether you dive into matrix languages such as the SAS/IML language, R, or MATLAB, statistical programming is more fun and more productive when you use a high-level language.
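To make the bootstrap item at the top of the list concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean. It is written in Python with only the standard library rather than in SAS, and the function name `bootstrap_ci`, its parameters, the seed, and the data values are all made up for illustration. Treat it as one simple variant of the idea, not as the method any SAS procedure uses.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=12345):
    """Percentile bootstrap confidence interval for stat(data).

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Draw n observations with replacement, recompute the statistic,
    # and repeat n_resamples times.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 6.0]  # made-up data
lower, upper = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"95% percentile bootstrap CI: ({lower:.2f}, {upper:.2f})")
```

Passing a different function as `stat` (for example, `statistics.median`) reuses the same resampling loop for another statistic, which is a large part of why the technique feels like magic.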
Did I omit the statistical area that excites you the most? What makes you thankful to be a statistical programmer?