Getting started performing a Bayesian analysis with missing data in SAS

Every data scientist knows about them, and every data scientist has had to deal with them. I’m talking about incomplete data sets. You might already be familiar with the tools SAS/STAT offers for a principled treatment of missing data in a frequentist paradigm. But did you know that with just a few changes in your code, you can seamlessly transition to a Bayesian analysis?

Missing data and multiple imputation

When faced with missing data, ad hoc methods like mean imputation or dropping incomplete observations enable the analysis to proceed. However, these methods do not directly address the underlying problem. In the best case, you lose precision. More often, you also introduce bias into your inference. Using PROC MI, you can instead generate a multiply imputed data set for your analysis. In short, PROC MI generates multiple copies of your data set in which the missing values are filled in by using various imputation techniques. This enables you to propagate the uncertainty from your missing values through your analysis.

Recapping the frequentist approach to the multiple imputation workflow

The data set generated by PROC MI includes your input variables and a new variable, `_IMPUTATION_`, that indexes the imputation number. You then analyze your data as usual but use SAS’ BY-group processing with the `_IMPUTATION_` variable. The resulting parameter estimates are saved to a single data set. Next, you use PROC MIANALYZE to pool the results and report parameter estimates for your original problem that reflect the uncertainty introduced by the missing values.

This PROC MIANALYZE Getting Started example demonstrates this workflow. The input data set Fitness1 is multiply imputed, with the imputations saved to the data set OUTMI. PROC REG is used to fit the linear regression model `Oxygen=RunTime RunPulse`. This creates a data set OUTREG with parameter estimates that are pooled by using MIANALYZE. Next, we will create a simple Bayesian version of this model.

Before we generate the Bayesian model, run the Getting Started example in your SAS session. You could do this by copying and pasting the example from the User’s Guide. Or you can download the example from GitHub:

   filename mianags url 'https://raw.githubusercontent.com/sassoftware/doc-supplement-statug/refs/heads/main/Examples/m-n/mianags.sas';
   %include mianags;

You should now see the pooled estimates of the linear regression model as combined by MIANALYZE in your Results tab in SAS Studio and have three data sets in your WORK library.
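If you want to confirm the setup before moving on, a quick check like the following can help. It is only a sketch. It assumes that the example created its data sets in the WORK library under the names mentioned above (Fitness1, OUTMI, and OUTREG) and that the imputation index variable is `_Imputation_`.

```sas
/* List the members of WORK; Fitness1, outmi, and outreg should appear. */
proc datasets library=work;
run;
quit;

/* Count the observations in each imputed copy of the data. */
proc freq data=outmi;
    tables _Imputation_;
run;
```

Each value of `_Imputation_` should show the same number of observations, because every imputation is a completed copy of the original data set.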

A Bayesian approach to the multiple imputation workflow

The frequentist approach described above focuses on deriving point estimates and confidence intervals, valid under repeated sampling, for fixed but unknown parameters. The Bayesian approach views the unknown parameters as random variables with a distribution that we can approximate by sampling. SAS makes running a Bayesian analysis of your data super simple. You can switch out the PROC REG call for PROC GENMOD, which has a very similar syntax.

Aside from being able to fit a much larger class of models than PROC REG, PROC GENMOD offers a second major benefit: the BAYES statement. By simply adding this one statement, you can switch from a frequentist method to a Bayesian analysis. GENMOD implements sensible defaults to get you started, so the statement works even without any options. Once you understand how to use MI with Bayesian methods by using GENMOD, you can apply the same technique to SAS’ other Bayesian procedures, such as BGLIMM.

At the modeling stage, the Bayesian workflow with multiple imputations is similar to the frequentist example. As before, you will fit your model to each imputation separately by using BY-group processing. However, the stage after the model fitting will be different.

There is one thing that makes the output of a Bayesian method different from a frequentist method. In addition to point estimates of parameters of interest (in this case, the regression parameters), we also have access to random draws from the parameters’ posterior distribution. This enables you to make probability statements regarding a parameter’s value. Although you could pool the Bayesian point estimates by using MIANALYZE, provided your posterior distribution is approximately normal, we want to keep this advantage of having access to the parameter distributions.

So, instead of pooling point estimates, we pool the random draws that the PROC GENMOD MCMC sampler generates for each imputed data set directly, as suggested in Bayesian Data Analysis. The pooled draws approximate the posterior distribution for the data as a whole. In effect, that distribution is treated as a mixture of the posterior distributions of the individual imputed data sets. It sounds complicated, but in practice, it’s easy to do. Let’s see this at work:

   proc genmod data=outmi;
       model Oxygen=RunTime RunPulse;
       by _Imputation_;
       /* this is the only statement you need to add to switch to a Bayesian method: */
       bayes outpost=outbayes seed=457;
   run;

Here, I have added two options to the BAYES statement. The `OUTPOST` option saves the generated parameter draws to the `OUTBAYES` data set, and setting the seed makes the simulation reproducible. By default, GENMOD will produce plenty of diagnostic output. It also summarizes the Bayesian parameter estimates for each imputed data set.

Because of the volume of output, I recommend that on a first run you limit yourself to just a handful of imputed data sets, like this:

   proc genmod data=outmi;
       model Oxygen=RunTime RunPulse;
       by _Imputation_;
       where _Imputation_ < 5; /* pick how many data sets you want to look at */
       bayes outpost=outbayes seed=457;
   run;

Here’s the table of parameter summaries and credible intervals GENMOD produces for the first imputed data set as an example:

Table 1: Summary statistics of the Bayesian parameter estimates for the first imputed data set

These summaries are produced for each of the multiply imputed data sets, but you will not be working with them directly. An important piece of diagnostic output is this plot that is produced for each parameter:

Figure 1: Plots of MCMC sampler convergence diagnostics for the intercept parameter of the first imputed data set

The top plot is a trace plot of the MCMC sampler. The plot on the lower left-hand side is an autocorrelation plot. These two plots and the other statistics printed by GENMOD help you assess whether the Markov chain has converged. It’s important to check convergence for a few of the imputed data sets to make sure that your model fits and that the sampler can reach all regions of interest. When you fit all of the imputed data sets by dropping the WHERE statement from the example code above, you might want to suppress the output from PROC GENMOD. Because GENMOD doesn’t support the NOPRINT option, you can do so with an ODS statement instead.
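One way to suppress the output is sketched below. It uses the general-purpose `ODS EXCLUDE` and `ODS GRAPHICS` statements rather than anything specific to GENMOD, and it assumes the OUTMI data set from the earlier steps. The draws are still written to OUTBAYES because the OUTPOST= data set is not part of the displayed output.

```sas
ods graphics off;     /* skip the diagnostic plots */
ods exclude all;      /* suppress all displayed tables */

proc genmod data=outmi;
    model Oxygen=RunTime RunPulse;
    by _Imputation_;
    bayes outpost=outbayes seed=457;
run;

ods exclude none;     /* restore normal output for later steps */
ods graphics on;
```

Remember to restore the output afterward. Otherwise, later procedures in the same session will also appear to produce nothing.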

So now that we have our OUTBAYES data set containing draws for the parameters in our model, what do we do with it? A common next step is to visualize the distribution of the parameters, for example, with a kernel density plot of the parameter draws produced by PROC SGPLOT. You can also get summary statistics that describe the distribution by using PROC MEANS or PROC UNIVARIATE.
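Here is a minimal sketch of both steps. It assumes that the columns of OUTBAYES that hold the draws are named after the model parameters (`Intercept`, `RunTime`, and `RunPulse`); check the data set in your session if the names differ. Because the draws from all imputed data sets are stacked in one data set, running these procedures without a BY statement summarizes the pooled draws.

```sas
/* Kernel density estimate of the pooled draws for the RunTime coefficient. */
proc sgplot data=outbayes;
    density RunTime / type=kernel;
    xaxis label="RunTime coefficient (pooled posterior draws)";
run;

/* Posterior mean, standard deviation, median, and an equal-tail 90% interval. */
proc means data=outbayes mean std p5 median p95;
    var Intercept RunTime RunPulse;
run;
```

The 5th and 95th percentiles are used here simply because PROC MEANS provides them as keywords. PROC UNIVARIATE is an option if you need other percentiles.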

Bayesian analysis with missing data summary

You have seen how easy it is to start trying out Bayesian methods in SAS, even with missing data problems. By using PROC GENMOD together with PROC MI, you can switch between a Bayesian and frequentist approach to modeling missing data by simply adding or removing a single line of code. This enables you to explore new methods. It also adds more tools to your statistical toolbox. All this from within the software environment you’re already used to.

What are some next steps you can take after running this example? By default, GENMOD places a noninformative constant (flat) prior on the regression coefficients. The posterior distribution is proportional to the product of the likelihood and the prior, so a constant prior makes our analysis somewhat like a maximum likelihood analysis. Instead, we could use additional information to specify a more informative prior, for example, a normal distribution with a predefined mean and variance. In this example, a constant prior treats a very large RunTime parameter value, say 1000, as a priori just as reasonable as a small value of 0.001. In the data sets you work with, you will probably have external information that tells you whether that is a reasonable assumption, and you can formalize that information through the prior.
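As a rough sketch of what changing the prior can look like, the following uses the `COEFFPRIOR=NORMAL` option of the BAYES statement with its `VAR=` suboption. This places independent normal priors with mean 0 and a common variance on all regression coefficients, including the intercept. The variance of 10000 and the output data set name `outbayes_norm` are arbitrary choices for illustration, not recommendations.

```sas
proc genmod data=outmi;
    model Oxygen=RunTime RunPulse;
    by _Imputation_;
    where _Imputation_ < 5;   /* a few imputations are enough for a first look */
    /* Independent N(0, 10000) priors on the coefficients (illustrative variance). */
    bayes coeffprior=normal(var=10000) outpost=outbayes_norm seed=457;
run;
```

Comparing the summaries from this run with those from the default run shows how much the prior influences the estimates. To specify a nonzero prior mean, consult the `INPUT=` suboption of `COEFFPRIOR=NORMAL` in the GENMOD documentation.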

When doing an analysis that accounts for missing data, we are most interested in the impact of sampling variation on posterior estimates. By default, MI generates 25 imputed data sets, and GENMOD produces 10,000 parameter samples for each. However, the GENMOD sampler is pretty efficient. You can often get a good handle on the posterior distribution for one particular data set with far fewer samples, sometimes as few as 500. You could experiment by reducing the number of samples GENMOD generates with the NMC= option in the BAYES statement. You can then increase the number of imputations by setting a higher NIMPUTE= value in the PROC MI statement.
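The following sketch shows where those two settings go. The values (100 imputations, 500 samples per imputation) and the PROC MI seed are illustrative assumptions, and the VAR statement assumes that Fitness1 contains the three analysis variables used throughout this post.

```sas
/* Regenerate the imputations, asking for more of them. */
proc mi data=Fitness1 seed=20240 nimpute=100 noprint out=outmi;
    var Oxygen RunTime RunPulse;
run;

ods exclude all;   /* 100 BY groups produce a lot of output */
proc genmod data=outmi;
    model Oxygen=RunTime RunPulse;
    by _Imputation_;
    /* Keep 500 draws per imputed data set instead of the default. */
    bayes nmc=500 outpost=outbayes seed=457;
run;
ods exclude none;
```

With these settings, OUTBAYES holds 50,000 pooled draws (100 imputations times 500 draws), compared with 250,000 under the defaults.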

Go ahead and try the example code above online by using SAS OnDemand for Academics or your local SAS installation. Try changing the prior or the number of samples and see how they influence the parameter estimates!