Abstract
Modern regularization and variable selection methods such as the lasso and Bayesian variable selection are important tools for psychological researchers: they reduce the risk of overfitting, improve prediction in future samples, and increase model interpretability. Although missing data are common in psychological research, it is not straightforward to combine principled methods for addressing missing data with these modern variable selection methods. This challenge is well illustrated in a recent paper by Gunn and colleagues (2022), which compares three approaches for combining the lasso with multiple imputation to address missing data. The three approaches yield markedly different sets of selected predictors, a finding that underscores limitations of the lasso for variable selection. In this paper we show how to implement a Bayesian variable selection method, Stochastic Search Variable Selection (SSVS), with multiply imputed data. SSVS is a principled and consistent method for variable selection, and we demonstrate its advantages relative to the lasso in an example dataset and a simulation study. It is straightforward to apply an impute-then-combine strategy for SSVS using existing software.
Psychological researchers often analyze many potential predictors, which increases the risk of identifying relationships that do not replicate in future samples. Methods such as the lasso are widely used to reduce this risk by selecting a smaller set of important predictors. However, psychological data also frequently contain missing values, and combining variable-selection methods with approaches for handling missing data is challenging. Recent work comparing ways to combine the lasso with multiple imputation, a common method for addressing missing data, shows that different approaches can lead to very different conclusions about which predictors are important.
In this paper, we demonstrate how to apply a Bayesian variable-selection method, Stochastic Search Variable Selection (SSVS), with multiply imputed data. Using both an example dataset and simulations, we show that this approach provides a principled and consistent way to identify important predictors and can offer advantages over the lasso, while remaining straightforward to implement with existing software.