Abstract
Annals of Applied Statistics 2011, Vol. 5, No. 3, 1780-1815
We consider applying Bayesian Variable Selection Regression, or BVSR, to
genome-wide association studies and similar large-scale regression problems.
Currently, typical genome-wide association studies measure hundreds of
thousands, or millions, of genetic variants (SNPs), in thousands or tens of
thousands of individuals, and attempt to identify regions harboring SNPs that
affect some phenotype or outcome of interest. This goal can naturally be cast
as a variable selection regression problem, with the SNPs as the covariates in
the regression. Characteristic features of genome-wide association studies
include the following: (i) a primary focus on identifying relevant variables,
rather than on prediction; and (ii) the possibility that many relevant
covariates have tiny effects, which makes it effectively impossible to
confidently identify the complete
"correct" subset of variables. Taken together, these factors put a premium on
having interpretable measures of confidence for individual covariates being
included in the model, which we argue is a strength of BVSR compared with
alternatives such as penalized regression methods. Here we focus primarily on
analysis of quantitative phenotypes, and on appropriate prior specification for
BVSR in this setting, emphasizing the idea of considering what the priors imply
about the total proportion of variance in outcome explained by relevant
covariates. We also emphasize the potential for BVSR to estimate this
proportion of variance explained, and hence shed light on the issue of "missing
heritability" in genome-wide association studies.