1 Introduction

The availability of large-scale data sets has led to interest in variable selection for regression models with a large number of regressors. These are typically called “large p, small n” variable selection problems. Standard approaches include penalized maximum likelihood methods (Hastie et al. 2015) and Bayesian variable selection (Chipman et al. 2001; O’Hara and Sillanpää 2009; García-Donato and Martínez-Beneito 2013). Linear regression is the most widely studied problem in the literature, but other generalized linear models play an important role in answering challenging scientific questions. For example, Sanyal et al. (2017) use an iterative Bayesian procedure to analyse a case–control study with 55,000 SNPs and Nikooienejad et al. (2020) consider Bayesian variable selection applied to two cancer survival data sets with 13,267 and 13,335 genes, respectively.

In Bayesian variable selection, a joint prior distribution is placed on models defined by each possible subset of the potential regressors and the parameters of each of these models (the regression coefficients and other parameters such as dispersion parameters). This defines a posterior distribution on the models and their parameters which can be used to investigate the importance of different variables or to make predictions for future observations. Working with this posterior distribution is challenging since (1) different models are defined on parameter spaces of different sizes and (2) there are \(2^p\) possible models for p potential regressors which leads to a vast model space if p is large. The first challenge can be circumvented in the linear regression model by working with the marginal likelihood of the models (which is available analytically for commonly used prior distributions), but this is not possible in other generalized linear models. As discussed by García-Donato and Martínez-Beneito (2013), these issues can be addressed by sampling from the posterior distribution using Markov chain Monte Carlo (MCMC) algorithms which provide unbiased estimates of quantities of interest such as the posterior inclusion probability (PIP) for the jth variable, which is the marginal posterior probability that the jth variable is included in the model, or Bayesian model-averaged predictions. Designing MCMC algorithms for Bayesian variable selection which mix well is a computationally challenging task if p is large and a large literature has developed around different approaches (see e.g. Schäfer and Chopin 2013; Titsias and Yau 2017; Shin et al. 2018; Zanella and Roberts 2019).

In this paper, we will concentrate on computational methods for Bayesian variable selection in logistic regression and accelerated failure time models. In these models, the marginal likelihood of each model is not available analytically. There are three main approaches to addressing this issue. Firstly, the marginal likelihood can be approximated, usually with a Laplace approximation, to define an approximate posterior which is sampled using MCMC. Secondly, data augmentation methods introduce latent variables which allow the marginal likelihood, conditional on the model and the latent variables, to be calculated analytically. This allows an MCMC scheme to be used where the latent variables are updated from their full conditional distribution and the model is updated using methods for linear regression models. Thirdly, reversible jump MCMC can be used to work directly on the joint posterior distribution of the model and the parameters, moving between models by proposing a new set of regression coefficients for the proposed model.

For binary data, there has been extensive work on variable selection in probit regression (see e.g. Sha et al. 2003, 2004) using the data augmentation approach of Albert and Chib (1993), which was extended to logistic regression by Holmes and Held (2006). Nikooienejad et al. (2016) use a non-local prior and a Laplace approximation to define a Gibbs sampler. For time-to-event data, work has been divided between accelerated failure time (AFT) models and Cox regression models. Sha et al. (2006) initially demonstrated how MCMC with data augmentation could be used for Bayesian variable selection when the AFT model has a normal or t distribution. Newcombe et al. (2017) consider an AFT model with a Weibull distribution and propose a reversible jump MCMC sampler. Zhang et al. (2018) use a Dirichlet process mixture of normals for the error distribution and use a Bayesian lasso to induce sparsity in the regression coefficients. Held et al. (2016) review previous work on Bayesian variable selection in Cox regression models and develop a method based on test-based Bayes factors to derive a posterior distribution on models. Many authors have concentrated on finding high probability models. Nikooienejad et al. (2020) use a non-local prior for the regression coefficients and a Laplace approximation with the S5 algorithm (Shin et al. 2018). Annest et al. (2009) use an iterative screening algorithm. Duan et al. (2018) derive an EM algorithm to find high probability models using the framework of EMVS (Rockova and George 2014).

This paper makes three main contributions. Firstly, we extend the Adaptively Scaled Individual (ASI) adaptive MCMC algorithm of Griffin et al. (2020) to binary response (using logistic regression models) and time-to-event data (using accelerated failure time models). Secondly, we propose a correlated pseudo-marginal scheme for GLMs. Thirdly, we compare the performance of the adaptive MCMC algorithm to the Add–Delete–Swap Metropolis–Hastings sampler using correlated pseudo-marginal methods, data augmentation and the Laplace approximation in the large p setting for both logistic regression and accelerated failure time models.

The paper is organized in the following way. Section 2 reviews the Bayesian approach to variable selection in generalized linear models. Section 3 describes different computational strategies for Bayesian variable selection. Section 4 compares the performance of the methods on different real data sets with many regressors and few observations. Section 5 discusses the work and offers some guidelines for the use of the algorithm.

2 Generalized linear models and Bayesian variable selection

We assume that there are p variables available (for example, a list of SNPs or gene expression measurements) and that we wish to find a subset of these variables which explains the variation in a response. We define \(\gamma =(\gamma _1,\dots ,\gamma _p)\in \Gamma \) to be a vector of indicator variables with \(\gamma _j = 1\) if the jth variable is included in the model (and \(\gamma _j = 0\) otherwise). Let y be an \((n\times 1)\)-dimensional vector of responses which is modelled by

$$\begin{aligned} y_i \sim F(\mu _i, \phi ) \end{aligned}$$
(1)

where \(F(\mu , \phi )\) is a distribution in the exponential family with mean \(\mu \) and dispersion parameter \(\phi \). We define the linear predictor for the ith observation to be

$$\begin{aligned} \eta _i= z_i\alpha + x_i^{(\gamma )}\beta _{\gamma } \end{aligned}$$
(2)

where \(z_i\) is a set of regressors which always appear in the model (and which will usually include an intercept and could include a treatment effect variable or other important variables), \(x_i^{(\gamma )}\) is a vector which contains the value for the ith observation of the variables included in the model (i.e. for which \(\gamma _j=1\)), and \(\alpha \) and \(\beta _{\gamma }\) are vectors of regression coefficients. We will write X for the matrix whose ith row contains the values of all variables for the ith observation, and Z and \(X^{(\gamma )}\) for the matrices whose ith rows are \(z_i\) and \(x_i^{(\gamma )}\), respectively. The linear predictor \(\eta _i\) is linked to the mean \(\mu _i\) by \(\eta _i = g(\mu _i)\) where g is a link function. We define \(p_{\alpha }\) to be the dimension of \(\alpha \) and \(p_{\gamma }=\sum _{j=1}^p \gamma _j\).

The logistic regression model can be used to link a proportion of successes to the linear predictor. We let \(y_i\) be the proportion of successes and assume that \(n_iy_i \sim \text{ Bin }(n_i, p_i)\). The logistic regression model assumes a linear relationship between the covariates and the probability of success (measured on the log-odds scale),

$$\begin{aligned} \eta _i= \log \bigg ( \frac{p_i}{1- p_i} \bigg ) = z_i\alpha + x_i^{(\gamma )} \beta _{\gamma }, \quad i = 1, \dots , n. \end{aligned}$$

The accelerated failure time (AFT) model can be used to model time-to-event data which may be censored. We assume that the time-to-event for the ith individual is \(t_i\) but that this time is only observed if \(t_i < c_i\) for some (right) censoring time \(c_i\) and, otherwise, we observe \(c_i\). We will define \(\delta _i = \text{ I }(t_i > c_i)\) which is 1 if the ith observation is censored. The AFT model (on the log-scale) can be written as

$$\begin{aligned} y_i = \log t_i = z_i\alpha + x^{(\gamma )}_i \beta _{\gamma } + \sigma \epsilon _i, \qquad i = 1, \dots , n \end{aligned}$$
(3)

where the errors \(\epsilon _i{\mathop {\sim }\limits ^{i.i.d.}} G\) for a standardized distribution G, such as the standard normal or t distribution. This is a parametric survival model that assumes that the individual survival time \(t_i\) depends on the multiplicative effect of an unknown function of covariates over a baseline survival time \(\alpha \). We will follow Sha et al. (2006) by using this model for Bayesian variable selection with censored outcomes.

In Bayesian variable selection, a prior distribution is assumed for the parameters of the model (\(\alpha \), \(\beta _{\gamma }\), \(\phi \) and \(\gamma \)) which defines a posterior distribution

$$\begin{aligned} p (\alpha , \beta _{\gamma }, \phi , \gamma \vert X, Z, y) \propto p(y\vert Z, X^{(\gamma )}, \alpha , \beta _{\gamma }, \phi , \gamma )\, p(\alpha , \beta _{\gamma }, \phi , \gamma ). \end{aligned}$$

In some probability models, such as the binomial distribution, the dispersion parameter is known and so \(\phi \) does not appear in the prior or posterior distribution leading to

$$\begin{aligned} p (\alpha , \beta _{\gamma }, \gamma \vert X, Z, y) \propto p(y\vert Z, X^{(\gamma )}, \alpha , \beta _{\gamma }, \gamma ) p(\alpha , \beta _{\gamma }, \gamma ). \end{aligned}$$

We will assume the commonly used prior structure

$$\begin{aligned} p(\alpha , \beta _{\gamma }, \gamma \vert \phi )\propto p(\beta _{\gamma }\mid \phi , \gamma )\, p(\gamma ) \end{aligned}$$
(4)

with \(\beta _{\gamma } \mid \phi ,\gamma \sim \text{ N }(0,\phi V_{\gamma })\), and \(p(\gamma ) = h^{p_{\gamma }}(1-h)^{p-p_{\gamma }}\). If the dispersion parameter is unknown, an additional prior distribution is placed on \(\phi \). The hyperparameter \(0<h<1\) is the prior probability that a particular variable is included in the model and can be chosen by defining a prior expected model size, \(p_0\), through \(h=\frac{p_0}{p}\). The prior can be further extended with a hyperprior to define a heavier tailed prior distribution for \(p_{\gamma }\) by assuming that \(h\sim \text{ Be }(1,\frac{p-p_0}{p_0})\) (Ley and Steel 2009). The scaled covariance matrix \(V_{\gamma }\) is often chosen to be proportional to the identity matrix (implying conditional prior independence between the regression coefficients), to \((X^{(\gamma )\,T} X^{(\gamma )})^{-1}\) (a g-prior) or to mixtures of g-priors (Liang et al. 2008; Li and Clyde 2018). Many computational methods, and all methods described in this paper, can be easily adjusted to work with any of these priors.
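As a concrete illustration of this prior structure, the sketch below evaluates \(\log p(\gamma )\) either for a fixed inclusion probability h or for the beta-binomial prior obtained by integrating out \(h\sim \text{ Be }(a,b)\); it is a minimal sketch and the function name is ours rather than part of any existing package.

```python
import numpy as np
from scipy.special import betaln

def log_model_prior(gamma, h=None, a=None, b=None):
    """log p(gamma) under (4): either a fixed inclusion probability h,
    or the beta-binomial prior obtained by integrating out h ~ Be(a, b)."""
    p = len(gamma)
    p_gamma = int(np.sum(gamma))
    if h is not None:
        # p(gamma) = h^{p_gamma} (1 - h)^{p - p_gamma}
        return p_gamma * np.log(h) + (p - p_gamma) * np.log(1.0 - h)
    # integrating out h gives p(gamma) = B(a + p_gamma, b + p - p_gamma) / B(a, b)
    return betaln(a + p_gamma, b + p - p_gamma) - betaln(a, b)
```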

3 Computational approaches

In this section, we will concentrate on computational methods for Bayesian variable selection in two generalized linear models: logistic regression (for binary and some ordinal responses) and accelerated failure time (AFT) models (for time-to-event responses). We will write \(\theta _{\gamma }\) for the parameters of model \(\gamma \in \Gamma \) (which will be \(\alpha \), \(\beta _{\gamma }\) and \(\phi \) if the dispersion is unknown and \(\alpha \) and \(\beta _{\gamma }\) otherwise). In the linear regression model with normal errors, the marginal likelihood \(p(y\vert \gamma )\) is analytically available. This leads to a simple Metropolis–Hastings updating step in which a proposed model \(\gamma '\) is sampled from a proposal distribution with probability \(q(\gamma , \gamma ')\) of proposing \(\gamma '\) given the current value \(\gamma \). The proposed model is accepted with probability

$$\begin{aligned} \alpha = \min \left\{ 1, \frac{p(y\vert \gamma ')p(\gamma ')q(\gamma ',\gamma )}{p(y\vert \gamma )p(\gamma )q(\gamma , \gamma ')} \right\} . \end{aligned}$$
(5)
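
To make the updating step concrete, the following sketch implements one Metropolis–Hastings step of the form (5) with a simple Add–Delete–Swap proposal; the move-type probabilities and the handling of edge cases are illustrative assumptions rather than the exact scheme used in our experiments, and log_post is assumed to return \(\log p(y\vert \gamma ) + \log p(\gamma )\) (exact or approximated).

```python
import numpy as np

def ads_mh_step(gamma, log_post, rng):
    """One Metropolis-Hastings step of the form (5) with an Add-Delete-Swap proposal.

    gamma is a 0/1 numpy array and log_post(gamma) returns
    log p(y | gamma) + log p(gamma) (exact or approximated)."""
    incl = np.flatnonzero(gamma == 1)
    excl = np.flatnonzero(gamma == 0)
    move = rng.choice(["add", "delete", "swap"])
    gamma_new = gamma.copy()
    if move == "add" and len(excl) > 0:
        gamma_new[rng.choice(excl)] = 1
        # q(gamma, gamma') = 1/|excl|; the reverse move is a delete with prob 1/(|incl|+1)
        log_q_ratio = np.log(len(excl)) - np.log(len(incl) + 1)
    elif move == "delete" and len(incl) > 0:
        gamma_new[rng.choice(incl)] = 0
        log_q_ratio = np.log(len(incl)) - np.log(len(excl) + 1)
    elif move == "swap" and len(incl) > 0 and len(excl) > 0:
        gamma_new[rng.choice(incl)] = 0
        gamma_new[rng.choice(excl)] = 1
        log_q_ratio = 0.0  # the swap proposal is symmetric here
    else:
        return gamma, False  # no valid move of the chosen type
    log_alpha = log_post(gamma_new) - log_post(gamma) + log_q_ratio
    accept = np.log(rng.random()) < log_alpha
    return (gamma_new, True) if accept else (gamma, False)
```

In practice, log_post for the current model would be cached rather than recomputed at every iteration.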

In GLMs, the marginal likelihood is not available analytically. We will describe three approaches: data augmentation methods, where latent variables \(\omega \) are introduced that allow \(p(y\vert \omega , \gamma )\) to be calculated analytically; the Laplace approximation, where \(p(y\vert \gamma )\) is approximated, leading to approximation error in the posterior; and correlated pseudo-marginal methods, where \(p(y\vert \gamma )\) is estimated unbiasedly, leading to no approximation error in the posterior. These methods can be used with any proposal on model space. We will then describe how the ASI proposal (Griffin et al. 2020) can be used with these approaches.

3.1 Algorithms

3.1.1 Data augmentation

Data augmentation (Tanner and Wong 1987) approaches introduce latent variables to make an MCMC sampler simpler to implement. In Bayesian variable selection, these schemes introduce a fixed-dimension latent variable \(\omega \) which allows \(p(y\vert \gamma , \omega )\) to be calculated analytically. This leads to a simple MCMC scheme where \(\gamma \) can be updated conditional on \(\omega \) using a Metropolis–Hastings sampler with a standard proposal distribution on model space and \(\omega \) is updated conditional on \(\gamma \) (often by first simulating the parameters \(\theta _{\gamma }\)).

In the logistic regression model, the Pólya-gamma data augmentation method (Polson et al. 2013) can be used (see e.g. Griffin et al. 2019). This exploits the following identity

$$\begin{aligned} \frac{(\mathrm{e}^{\psi })^a}{(1+\mathrm{e}^{\psi })^b}= 2^{-b} \exp \{\kappa \psi \} \int _0^{\infty } \exp \{-\omega \psi ^2/2\}\,p(\omega )\mathrm{d}\omega \end{aligned}$$

where \(\kappa = a - b/2\) and \(p(\omega )\) is the density of a \(\text{ PG }(b,0)\) random variable, where \(\text{ PG }\) represents a Pólya-gamma distribution, which is defined in Polson et al. (2013). An extended likelihood with additional latent variables \(\omega =(\omega _1,\dots ,\omega _n)\) has the form

$$\begin{aligned}&p(y, \omega \vert \theta _{\gamma },\gamma ) \propto \prod _{i=1}^n \left[ 2^{-n_i} \exp \left\{ \kappa _i\left( z_i\,\alpha +x^{(\gamma )}_{i}\beta _{\gamma }\right) \right\} \right. \\&\left. \quad \times \exp \left\{ -\omega _i (z_i\,\alpha +x^{(\gamma )}_{i}\beta _{\gamma })^2/2\right\} p(\omega _i)\right] \end{aligned}$$

where \(\kappa _i = n_iy_i - n_i/2\) and \(p(\omega _i)\) is the \(\text{ PG }(n_i, 0)\) density. The identity can be used to show that marginalizing this distribution over \(\omega \) leads to the likelihood of the logistic regression model. If we let

$$\begin{aligned} J^{(\gamma )} = \left( \begin{array}{cc} Z&{}X^{(\gamma )} \end{array}\right) \end{aligned}$$

and assume that \(p(\alpha , \beta _{\gamma })\sim \text{ N }(\mu _{\gamma }, V_{\gamma })\), the marginal likelihood is

$$\begin{aligned}&p(y\vert \gamma , \omega ) \propto |V_{\gamma }|^{-1/2} \left| {{\tilde{J}}^{(\gamma )\,T}} {\tilde{J}}^{(\gamma )} + V_{\gamma }^{-1}\right| ^{-1/2} \\&\quad \times \exp \left\{ -\frac{1}{2} \mu _{\gamma }^T V_{\gamma }^{-1} \mu _{\gamma } + \frac{1}{2}A^T B^{-1} A \right\} \end{aligned}$$

where \(A = J^{(\gamma )\,T}K + V_{\gamma }^{-1}\mu _{\gamma }\), \( B = {\tilde{J}}^{(\gamma )\,T}{\tilde{J}}^{(\gamma )} + V_{\gamma }^{-1}\), \({\tilde{J}}^{(\gamma )}\) is an \((n\times (p_{\alpha } + p_{\gamma }))\)-dimensional matrix with entries \({{\tilde{J}}}^{(\gamma )}_{i,j} = \sqrt{\omega _i}\, J^{(\gamma )}_{i, j}\) and K is an \((n\times 1)\)-dimensional vector with entries \(K_i = \kappa _i\). The latent variables \(\omega =(\omega _1,\dots ,\omega _n)\) can be updated by first sampling \(\alpha ,\beta _{\gamma }\) according to

$$\begin{aligned}&(\alpha , \beta _{\gamma }) \sim \text{ N }\left( ({\tilde{J}}^{(\gamma )\,T}{\tilde{J}}^{(\gamma )} + V_{\gamma }^{-1})^{-1} (J^{(\gamma )\,T}K + V_{\gamma }^{-1}\mu _{\gamma }), \right. \\&\quad \left. ({\tilde{J}}^{(\gamma )\,T}{\tilde{J}}^{(\gamma )} + V_{\gamma }^{-1})^{-1} \right) \end{aligned}$$

and then sampling \(\omega _i\sim \text{ PG }(n_i, z_i\,\alpha + x^{(\gamma )}_{i}\beta _{\gamma })\). Polson et al. (2013) describe efficient algorithms for the generation of Pólya-gamma distributed random variables.
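
The sketch below shows one sweep of this data augmentation scheme for a fixed model \(\gamma \) (update \((\alpha ,\beta _{\gamma })\) from its Gaussian full conditional and then refresh \(\omega \)); it assumes a Pólya-gamma generator such as random_polyagamma from the polyagamma package, and the function and argument names are ours.

```python
import numpy as np
from numpy.linalg import cholesky, inv, solve
from polyagamma import random_polyagamma  # assumed PG(b, c) sampler

def pg_da_sweep(y_counts, n_trials, J_gamma, mu_gamma, V_gamma, omega, rng):
    """One Polya-gamma data-augmentation sweep for logistic regression.

    J_gamma = [Z, X^(gamma)] contains the always-included and selected regressors;
    y_counts holds the numbers of successes n_i y_i."""
    kappa = y_counts - n_trials / 2.0                 # K_i = kappa_i
    J_tilde = J_gamma * np.sqrt(omega)[:, None]       # tilde{J}^(gamma)
    V_inv = inv(V_gamma)
    B = J_tilde.T @ J_tilde + V_inv                   # full-conditional precision
    A = J_gamma.T @ kappa + V_inv @ mu_gamma
    mean = solve(B, A)                                # B^{-1} A
    coef = mean + cholesky(inv(B)) @ rng.standard_normal(len(mean))
    psi = J_gamma @ coef                              # linear predictor
    omega_new = random_polyagamma(n_trials, psi, random_state=rng)
    return coef, omega_new                            # (alpha, beta_gamma), omega
```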

In the accelerated failure time model, we introduce the variables \(\omega _i = y_i\). These are latent if \(\delta _i = 1\), i.e. \(\omega _i\) is the missing log survival time if the ith observation is censored. Conditional on the data and \(\omega =(\omega _1,\dots ,\omega _n)\), the accelerated failure time model in (3) is a linear regression. If \(\epsilon _i\sim \text{ N }(0, 1)\), we have a linear regression with normal errors. If we define W to be an \((n\times 1)\)-dimensional vector with ith entry \(\omega _i\) and assume that \(p(\alpha , \beta _{\gamma })\sim \text{ N }(\mu _{\gamma }, \sigma ^2\, V_{\gamma })\), the marginal likelihood is

$$\begin{aligned}&p(y\vert \gamma , \omega _1,\dots ,\omega _n) = |V_{\gamma }|^{-1/2} \left| {J^{(\gamma )\,T}} J^{(\gamma )} + V_{\gamma }^{-1}\right| ^{-1/2} \\&\quad \times \exp \left\{ -\frac{1}{2} \mu _{\gamma }^T V_{\gamma }^{-1} \mu _{\gamma } + \frac{1}{2} A^T B^{-1} A \right\} \end{aligned}$$

where \(A = J^{(\gamma )\,T}W + V_{\gamma }^{-1}\mu _{\gamma }\) and \(B = J^{(\gamma )\,T}J^{(\gamma )} + V_{\gamma }^{-1}\).

The latent variables \(\omega _i\) for \(\delta _i=1\) can be updated by first sampling \((\alpha , \beta _{\gamma }, \sigma ^2)\). If we assume that \(\sigma ^{-2}\sim \text{ Ga }(a/2,b/2)\), then

$$\begin{aligned}&\sigma ^{-2}\sim \text{ Ga }\left( (a+n+p_{\alpha }+p_{\gamma })/2, \left( b+ \mu _{\gamma }^T V_{\gamma }^{-1} \mu _{\gamma }\right. \right. \\&\left. \left. \quad + W^T W - A^T B^{-1} A\right) /2\right) \end{aligned}$$

and

$$\begin{aligned} (\alpha , \beta _{\gamma }) \sim \text{ N }\left( B^{-1} A, \sigma ^2 B^{-1} \right) . \end{aligned}$$

The latent variable \(\omega _i\) for \(\delta _i=1\) can be sampled from its full conditional distribution which is \(\text{ TN}_{[\log c_i, \infty )}(z_i\,\alpha + x^{(\gamma )}_i\beta _{\gamma }, \sigma ^2)\) where \(\text{ TN}_{A}(\mu , \sigma ^2)\) represents a normal distribution with mean \(\mu \) and variance \(\sigma ^2\) truncated to the set \(A\subset {\mathbb {R}}\).
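
A minimal sketch of the corresponding Gibbs sweep for the normal-error AFT model is given below, under the priors \(p(\alpha ,\beta _{\gamma })\sim \text{ N }(\mu _{\gamma },\sigma ^2 V_{\gamma })\) and \(\sigma ^{-2}\sim \text{ Ga }(a/2,b/2)\) stated above; the function and argument names are ours.

```python
import numpy as np
from numpy.linalg import inv
from scipy.stats import truncnorm

def aft_da_sweep(W, delta, log_c, J_gamma, mu_gamma, V_gamma, a0, b0, rng):
    """One data-augmentation sweep for the normal-error AFT model given gamma.

    W holds the completed log-times (observed log t_i where delta_i = 0 and the
    current imputations where delta_i = 1); log_c holds the log censoring times."""
    n, d = J_gamma.shape                       # d = p_alpha + p_gamma
    V_inv = inv(V_gamma)
    B = J_gamma.T @ J_gamma + V_inv
    A = J_gamma.T @ W + V_inv @ mu_gamma
    B_inv = inv(B)
    # sigma^{-2} ~ Ga((a + n + d)/2, (b + mu'V^{-1}mu + W'W - A'B^{-1}A)/2)
    shape = 0.5 * (a0 + n + d)
    rate = 0.5 * (b0 + mu_gamma @ V_inv @ mu_gamma + W @ W - A @ B_inv @ A)
    sigma2 = 1.0 / rng.gamma(shape, scale=1.0 / rate)
    # (alpha, beta_gamma) ~ N(B^{-1}A, sigma^2 B^{-1})
    coef = rng.multivariate_normal(B_inv @ A, sigma2 * B_inv)
    # impute censored log-times from TN_{[log c_i, inf)}(eta_i, sigma^2)
    eta, sd = J_gamma @ coef, np.sqrt(sigma2)
    cens = delta.astype(bool)
    lower = (log_c[cens] - eta[cens]) / sd     # standardized lower bounds
    W_new = W.copy()
    W_new[cens] = truncnorm.rvs(lower, np.inf, loc=eta[cens], scale=sd,
                                random_state=rng)
    return coef, sigma2, W_new
```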

3.1.2 Laplace approximation

The Laplace approximation has been widely used for variable selection in generalized linear models. The marginal likelihood is approximated by

$$\begin{aligned} p(y\vert \gamma ) \approx p\left( y\left| {{\hat{\theta }}}_{\gamma }, \gamma \right. \right) p\left( {{\hat{\theta }}}_{\gamma }\right) \vert -H_{\gamma }\vert ^{-1/2}(2\pi )^{d/2} \end{aligned}$$

where d is the dimension of \(\theta _{\gamma }\), \({{\hat{\theta }}}_{\gamma }\) is the posterior mode of \(\theta _{\gamma }\) and \(H_{\gamma }\) is the Hessian of \(\log p\left( y\vert \theta _{\gamma }\right) +\log p\left( \theta _{\gamma }\right) \) evaluated at \({\hat{\theta }}_{\gamma }\). The approximation follows from assuming that the posterior distribution of \(\theta _{\gamma }\) can be approximated by

$$\begin{aligned} p_\mathrm{Laplace}\left( \theta _{\gamma }\right) = \text{ N }\left( {\hat{\theta }}_{\gamma }, \Sigma ^{(\gamma )}_\mathrm{Laplace}\right) \end{aligned}$$
(6)

where \(\Sigma ^{(\gamma )}_\mathrm{Laplace} = \left( -H_{\gamma }\right) ^{-1}\).
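
For logistic regression, the approximation can be computed by a short numerical optimization; the sketch below is one possible implementation under the prior \(p(\alpha ,\beta _{\gamma })\sim \text{ N }(\mu _{\gamma },V_{\gamma })\), with the binomial coefficient (which is constant across models) omitted, and the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_marginal(y_counts, n_trials, J_gamma, mu_gamma, V_gamma):
    """Laplace approximation to log p(y | gamma) for the logistic regression model."""
    V_inv = np.linalg.inv(V_gamma)
    d = J_gamma.shape[1]

    def neg_log_post(theta):                 # -(log lik + log prior), up to constants
        psi = J_gamma @ theta
        loglik = y_counts @ psi - n_trials @ np.logaddexp(0.0, psi)
        quad = (theta - mu_gamma) @ V_inv @ (theta - mu_gamma)
        return -(loglik - 0.5 * quad)

    def neg_hessian(theta):                  # -H_gamma = J' W J + V^{-1}
        prob = 1.0 / (1.0 + np.exp(-(J_gamma @ theta)))
        w = n_trials * prob * (1.0 - prob)
        return J_gamma.T @ (J_gamma * w[:, None]) + V_inv

    opt = minimize(neg_log_post, np.zeros(d), method="BFGS")
    theta_hat = opt.x
    # log p(y|gamma) ~ log p(y|theta_hat) + log p(theta_hat) + (d/2)log(2*pi) - 0.5 log|-H|;
    # the (2*pi)^{d/2} factors from the prior constant and the Gaussian integral cancel.
    logdet_V = np.linalg.slogdet(V_gamma)[1]
    logdet_negH = np.linalg.slogdet(neg_hessian(theta_hat))[1]
    return -opt.fun - 0.5 * logdet_V - 0.5 * logdet_negH
```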

3.1.3 Correlated pseudo-marginal sampler

The pseudo-marginal sampler (Andrieu and Roberts 2009) targets a distribution where the prior is multiplied by a Monte Carlo approximation \({\hat{p}}(y\vert \gamma )\) of the intractable marginal likelihood \(p(y\vert \gamma )\) and then runs a Metropolis–Hastings sampler on this target distribution, i.e. the acceptance rate of the Metropolis–Hastings sampler is

$$\begin{aligned} \alpha = \min \left\{ 1, \frac{{\hat{p}}(y\vert \gamma ')\,p(\gamma ')\,q(\gamma ',\gamma )}{{\hat{p}}(y\vert \gamma )\,p(\gamma )\,q(\gamma ,\gamma ')} \right\} . \end{aligned}$$

This would be the usual acceptance rate if \({\hat{p}}(y\vert \gamma ) = p(y\vert \gamma )\), but Andrieu and Roberts (2009) show that the method samples from the correct target distribution under the weaker condition that \(\text{ E }[{\hat{p}}(y\vert \gamma )] = p(y\vert \gamma )\) (unbiased estimation). The method has been extended to the correlated pseudo-marginal method (Deligiannidis et al. 2018) where the random numbers used to calculate \({\hat{p}}(y\vert \gamma ')\) are positively correlated with those used to calculate \({\hat{p}}(y\vert \gamma )\). They show that this can reduce the variance of the ratio \({\hat{p}}(y\vert \gamma ')/{\hat{p}}(y\vert \gamma )\) and help the mixing of the Markov chain.

In Bayesian variable selection for GLMs, it is natural to use the Laplace approximation of \(p(\theta _{\gamma }\vert \gamma , y)\) in (6) as the importance distribution in an importance sampling estimate \({\hat{p}}(y\vert \gamma )\) of \(p(y\vert \gamma )\). The approximation is

$$\begin{aligned} {\hat{p}}(y\vert \gamma )=\frac{1}{N}\sum _{i=1}^N \frac{p\left( y\left| \theta _{\gamma }^{(i)}, \gamma \right. \right) \,p\left( \theta _{\gamma }^{(i)}\right) }{p_\mathrm{Laplace}\left( \theta ^{(i)}_\gamma \right) } \end{aligned}$$

where \(\theta _{\gamma }^{(i)} {\mathop {\sim }\limits ^{i.i.d.}} p_\mathrm{Laplace}\). The samples \(\theta _{\gamma }^{(i)}\) can be written as \( \theta _{\gamma }^{(i)} = {\hat{\theta }}_{\gamma } + C_{\gamma }\Gamma \nu _i\) where \(\Gamma \) is a \(p_\gamma \times p\)-dimensional matrix with \(\Gamma _{i,j}=1\) if the ith variable included in the model has index j and \(\Gamma _{i,j}=0\) otherwise, \(C_{\gamma }\) is the Cholesky decomposition of \(\Sigma ^{(\gamma )}_\mathrm{Laplace}\) and \(\nu _i{\mathop {\sim }\limits ^{i.i.d.}}\text{ N }\left( 0, I_{p}\right) \). A correlated pseudo-marginal sampler can be implemented in the following way. Suppose that \(\nu _1,\dots ,\nu _N\) are the current values of the random variates before making the proposal; then propose \(\nu _1',\dots ,\nu _N'\) componentwise by

$$\begin{aligned} \nu '_{i,k} = \rho \nu _{i,k} + \sqrt{1 - \rho ^2} \lambda _{i,k} \end{aligned}$$

where \(\lambda _{i,k}{\mathop {\sim }\limits ^{i.i.d.}}\text{ N }(0, 1)\). This implies that each component \(\nu _{i,k}\) follows a lag-1 autoregressive process with AR parameter \(\rho \) and a standard normal stationary distribution. In practice, \(\nu '_{i,k}\) only needs to be simulated if \(\gamma _k'=1\) and, if \(\gamma _k=0\) and \(\gamma _k'=1\), we can simulate \(\nu '_{i,k}\sim \text{ N }(0,1)\). Lamnisos et al. (2009) suggest using the automatic generic sampler (Green 2003) for the logistic regression model with their adaptive proposal. The correlated pseudo-marginal sampler is equivalent to this sampler when \(N=1\) and \(\rho =1\) (complete dependence between successive \(\nu _i\)) if an extra Gibbs step is introduced where \(\nu _i\) is updated from its full conditional distribution. We use this additional step for all values of \(\rho \) and so this correlated pseudo-marginal sampler provides a generalization of the automatic generic sampler for GLMs.
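
A sketch of the two ingredients, the importance sampling estimate built from standard normal draws \(\nu \) and the correlated proposal for \(\nu \), is given below (working on the log scale for numerical stability); the function names are ours, and log_target and log_laplace are assumed to return \(\log \{p(y\vert \theta _{\gamma },\gamma )p(\theta _{\gamma })\}\) and the log density of the Laplace approximation, respectively.

```python
import numpy as np

def log_cpm_estimate(nu, theta_hat, C_gamma, log_target, log_laplace):
    """log of the importance-sampling estimate hat{p}(y | gamma) from draws nu (N x d)."""
    log_w = np.array([log_target(theta_hat + C_gamma @ nu_i)
                      - log_laplace(theta_hat + C_gamma @ nu_i) for nu_i in nu])
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # log-mean-exp of the weights

def propose_nu(nu, rho, rng):
    """Correlated proposal: each component of nu follows a stationary AR(1)
    process with parameter rho and N(0, 1) marginals."""
    return rho * nu + np.sqrt(1.0 - rho**2) * rng.standard_normal(nu.shape)
```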

3.2 Adaptively scaled individual proposal

Griffin et al. (2020) develop a parameterized proposal distribution for Bayesian variable selection in linear regression models and a method for adaptively tuning the parameters during the MCMC run. They demonstrate that the method can mix substantially faster than the standard Add–Delete–Swap sampler. They also show that the method can accurately estimate the PIP of each variable for a range of problems with up to thousands of regressors in a reasonable amount of time. For example, they run their algorithm on two large data sets: one with 22,575 variables, for which accurate results were obtained in 25 min, and one with 79,748 variables, for which accurate results were obtained in 2.5 h.

The proposal on model space, \(\Gamma \), is

$$\begin{aligned} q_{\eta }(\gamma ,\gamma ')=\prod _{j=1}^p q_{\eta ,j}(\gamma _j,\gamma '_j) \end{aligned}$$

where \(\eta = (A,D) = (A_1,\dots , A_p, D_1, \dots , D_p)\), \(q_{\eta ,j}(\gamma _j=0, \gamma _j'=1)=A_j\) and \(q_{\eta ,j}(\gamma _j=1,\gamma _j'=0)=D_j\). Each dimension of \(\gamma '\) is independently proposed conditional on \(\gamma \). The probability of proposing to add the jth variable if it is currently excluded from the model is \(A_j\), and the probability of proposing to delete the jth variable if it is currently included in the model is \(D_j\). They show that an effective choice for (AD) is

$$\begin{aligned} A_j = \zeta \min \left\{ 1, \frac{\pi _j}{1-\pi _j}\right\} , \qquad D_j = \zeta \min \left\{ 1, \frac{1- \pi _j}{\pi _j}\right\} ,\nonumber \\ \end{aligned}$$
(7)

where \(\pi _j\) is the PIP of the jth variable and \(0<\zeta <1\) is a tuning parameter. They suggest estimating \(\pi _j\) within the Gibbs sampler using a Rao–Blackwellised estimate and tuning \(\zeta \) using a simple adaptation scheme. The proposal is

$$\begin{aligned}&A^{(i)}_j = \zeta ^{(i)} \min \left\{ 1, \frac{\pi ^{(i)}_j}{1-\pi ^{(i)}_j}\right\} , \qquad \\&D^{(i)}_j = \zeta ^{(i)}\min \left\{ 1, \frac{1- \pi ^{(i)}_j}{\pi ^{(i)}_j}\right\} , \end{aligned}$$

where \(\pi _j^{(i)}\) is a Rao–Blackwellised estimate of the PIP of the jth variable calculated using the first i samples and \(\zeta ^{(i)}\) is updated using

$$\begin{aligned} {\text{ logit }}_{\epsilon } \left( \zeta ^{(i+1)}\right) = {\text{ logit }}_{\epsilon }\left( \zeta ^{(i)}\right) + \phi _i \left( a_{\eta ^{(i)}}\left( \gamma ^{(i)},\gamma '\right) - \tau \right) ,\nonumber \\ \end{aligned}$$
(8)

where \( {\text{ logit }}_{\epsilon }(x) = \log (x-\epsilon ) - \log (1 - x - \epsilon ) \), \(\phi _i=O(i^{-\lambda })\) for some constant \(1/2<\lambda \le 1\), \(a_{\eta ^{(i)}}(\gamma ^{(i)},\gamma ')\) is the Metropolis–Hastings acceptance probability at the ith iteration, and \(\tau \) is a targeted acceptance rate. The algorithm is summarized in Algorithm 1.

[Algorithm 1: The adaptively scaled individual (ASI) adaptive sampler]
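
A sketch of the ASI proposal in (7) and the adaptation of \(\zeta \) in (8) is given below; the default values of \(\tau \), \(\lambda \) and \(\epsilon \) are illustrative choices rather than those used in our experiments, and the function names are ours.

```python
import numpy as np

def asi_proposal(gamma, pip, zeta, rng):
    """Draw gamma' from the ASI proposal (7) and return
    (gamma', log q(gamma', gamma) - log q(gamma, gamma'))."""
    pip = np.clip(pip, 1e-12, 1.0 - 1e-12)
    A = zeta * np.minimum(1.0, pip / (1.0 - pip))        # add probabilities A_j
    D = zeta * np.minimum(1.0, (1.0 - pip) / pip)        # delete probabilities D_j
    p_flip = np.where(gamma == 0, A, D)                  # P(propose flipping gamma_j)
    flip = rng.random(gamma.shape) < p_flip
    gamma_new = np.where(flip, 1 - gamma, gamma)
    p_flip_rev = np.where(gamma_new == 0, A, D)          # reverse-move flip probabilities
    # unflipped coordinates cancel in the proposal ratio
    log_q_ratio = np.sum(np.log(p_flip_rev[flip]) - np.log(p_flip[flip]))
    return gamma_new, log_q_ratio

def adapt_zeta(zeta, accept_prob, i, tau=0.234, lam=0.7, eps=1e-3):
    """Robbins-Monro update (8) of zeta on the eps-truncated logit scale."""
    logit = np.log(zeta - eps) - np.log(1.0 - zeta - eps)
    logit += i ** (-lam) * (accept_prob - tau)
    return eps + (1.0 - 2.0 * eps) / (1.0 + np.exp(-logit))  # inverse of logit_eps
```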

The ASI algorithm uses Rao–Blackwellised estimates of the PIPs \(\pi _1,\dots ,\pi _p\) calculated during the run of the algorithm. Griffin et al. (2020) show how the Rao–Blackwellised estimate can be updated in O(p) operations in the linear regression model. This can be directly extended to the data augmentation approach. We assume that \(V_{\gamma } = \left( \begin{array}{cc} V_{\alpha } &{} \mathbf{0} \\ \mathbf{0}^T &{} g I_{p_{\gamma }}\end{array}\right) \), where \(V_{\alpha }\) is a \((p_\alpha \times p_\alpha )\)-dimensional matrix, \(I_q\) is the \(q\times q\) identity matrix and \(\mathbf{0}\) is a \((p_{\alpha } \times p_{\gamma })\)-dimensional matrix of zeros. After N posterior samples, \(\gamma ^{(1)},\dots ,\gamma ^{(N)}\), the Rao–Blackwellised estimate of \(\pi _j=p(\gamma _j=1\vert y)\) is

$$\begin{aligned} {\hat{\pi _j}} = \frac{1}{N} \sum _{k=1}^N \frac{{\tilde{h}}_j^{(k)}\,\text{ BF}_j\left( \gamma _{-j}^{(k)}\right) }{1-{\tilde{h}}_j^{(k)}+{\tilde{h}}_j^{(k)}\,\text{ BF}_j\left( \gamma _{-j}^{(k)}\right) } \end{aligned}$$

where \({\tilde{h}}_j^{(k)} = h\) if h is fixed or \({\tilde{h}}_j^{(k)} = \frac{\#\gamma ^{(k)}_{-j}+1+a}{p+a+b}\) if \(h\sim \text{ Be }(a,b)\) and \(B = {\tilde{J}}_{\gamma }^T{\tilde{J}}_{\gamma } + V^{-1}_{\gamma }\). If \(\gamma _j=0\),

$$\begin{aligned} \text{ BF}_j(\gamma _{-j})={d_j^{\uparrow }}^{-1/2}g^{-1/2} \exp \left\{ \frac{1}{d_j^\uparrow } (\kappa ^T x_j - \kappa ^T J_{\gamma } B^{-1} {\tilde{J}}_{\gamma }^T {\tilde{x}}_j)^2\right\} \end{aligned}$$

with \( d_j^\uparrow = {\tilde{x}}_j^T {\tilde{x}}_j + g^{-1} - ({\tilde{x}}_j^T {\tilde{J}}_{\gamma })B^{-1}({\tilde{J}}_{\gamma }^T {\tilde{x}}_j)\) and \({\tilde{x}}_j\) is an \((n\times 1)\)-dimensional vector with ith entry \({\tilde{x}}_{i,j} = \sqrt{\omega _i}\, x_{i, j}\). In the case \(\gamma _j=1\), it is useful to define \(q_j\) to be the ordered position of the included variables (\(q_j=1\) if j is the first included variable, etc.); then,

$$\begin{aligned} \text{ BF}_{j}(\gamma _{-j})={d_j^{\downarrow }}^{-1/2}g^{-1/2} \exp \left\{ -\frac{1}{2} {d_j^\downarrow } (\kappa ^T J_{\gamma } (B^{-1})_{\cdot , q_j+p_\alpha })^2\right\} \end{aligned}$$

where \( d_j^\downarrow = 1/(B^{-1})_{q_j+p_{\alpha },q_j+p_{\alpha }}\). Calculating the Rao–Blackwellised estimate using the Laplace approximation is time-consuming since this would involve an optimization step to find a posterior mode for each potential variable at each iteration of the sampler. Therefore, in both the Laplace approximation and correlated pseudo-marginal approaches, the ASI algorithm with data augmentation is run for \(N_0\) iterations to calculate Rao–Blackwellised estimates of the PIPs. After this initial phase has been run, \(\pi ^{(N_0)}_1, \dots , \pi ^{(N_0)}_p\) are used as the estimated PIPs at every iteration and only \(\zeta \) is adaptively updated in the sampler (and often for only a finite number of iterations of the sampler).
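
Given the per-iteration Bayes factors \(\text{ BF}_j(\gamma ^{(k)}_{-j})\) and prior inclusion probabilities \({\tilde{h}}_j^{(k)}\), the Rao–Blackwellised estimate above reduces to a simple average; a minimal sketch (with names of our choosing) is:

```python
import numpy as np

def rao_blackwell_pip(bf, h_tilde):
    """Rao-Blackwellised PIP estimates: bf[k, j] holds BF_j(gamma_{-j}^{(k)}) and
    h_tilde[k, j] the corresponding prior inclusion probability tilde{h}_j^{(k)}."""
    cond = h_tilde * bf / (1.0 - h_tilde + h_tilde * bf)   # P(gamma_j = 1 | gamma_{-j}, y)
    return cond.mean(axis=0)                               # average over the N samples
```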

Adaptive MCMC algorithms do not necessarily lead to ergodic Markov chains and so care is needed in their design. Griffin et al. (2020) show that the ASI algorithm leads to an ergodic chain in the linear regression model. In our algorithms, adaptation only occurs for a fixed number of samples during the burn-in phase and so the algorithms are ergodic by design. However, it is interesting to consider whether these samplers are ergodic if adaptation is allowed to continue indefinitely. Roberts and Rosenthal (2007) set out two conditions for the ergodicity of adaptive MCMC algorithms. The first is diminishing adaptation, that is, the amount by which the parameters are adapted tends to decrease as the sampler runs. This property is established for the ASI algorithm in Griffin et al. (2020). The second is containment, which means that the transition kernel (for any value of the adaptive parameter) reaches stationarity in bounded time. Containment can be easily established if the MCMC algorithm is uniformly ergodic, for example by restricting the state space of the adaptive MCMC sampler to a (large) bounded subset of \({\mathbb {R}}^{p_{\gamma }}\), say by bounding the regression coefficients in the Pólya-gamma sampler or the sampled values in the importance sampler in the correlated pseudo-marginal sampler. The property can also be easily established if the sampler for Bayesian variable selection is run on the posterior distribution of the model \(\gamma \) only, since the finite state space implies uniform ergodicity. If the marginal likelihood \(p(y\vert \gamma )\) is approximated using an importance sampler, then the pseudo-marginal sampler is also uniformly ergodic (Theorem 8, Andrieu and Roberts 2009) provided the weights in the importance sampler are bounded. The Pólya-gamma sampling scheme is also uniformly ergodic (Choi and Hobert 2013), which implies that containment holds for the corresponding adaptive scheme.

4 Comparison of computational algorithms

The effective sample size of N MCMC draws of a parameter \(\theta \) is defined to be

$$\begin{aligned} \text{ ESS}_{\theta } = \frac{N}{1 + 2\sum _{k=1}^{t} {\hat{\rho _k}}} \end{aligned}$$

where \({\hat{\rho _k}}\) is the estimated lag k autocorrelation of the chain for \(\theta \) and t is a suitably chosen threshold (see e.g. Liu 2001). The ergodic average calculated using the N MCMC samples has approximately the same Monte Carlo error as an independent sampler with \(\text{ ESS}_{\theta }\) samples and so larger values of \(\text{ ESS}_{\theta }\) for fixed N imply a better mixing chain. The use of the ESS in this adaptive context is justified since \(N_0\) is chosen to be smaller than the burn-in phase and \(\zeta \) is only adapted during the burn-in phase. We measure the performance of each algorithm using

$$\begin{aligned} \text{ ESS } = \text{ median}_{j=1,\dots ,p}(\text{ ESS}_{\gamma _j}). \end{aligned}$$

This gives an overall measure of the mixing of the MCMC chain but does not account for differences in the time taken by different algorithms. To include time in the performance measure, we define the time-normalized effective sample size of algorithm A to be \(\text{ ESS}_A / T_A\) where \(\text{ ESS}_A\) and \(T_A\) are the effective sample size and time taken to generate the MCMC samples, respectively.
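
The sketch below computes \(\text{ ESS}_{\gamma _j}\) and the median summary above; truncating the autocorrelation sum at the first non-positive estimate is one common choice of the threshold t, not necessarily the one used in our experiments.

```python
import numpy as np

def ess(chain):
    """Effective sample size of a single MCMC trace, truncating the
    autocorrelation sum at the first non-positive estimated lag."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x @ x / n
    if var == 0.0:                       # e.g. a variable that is never included
        return float(n)
    rho_sum = 0.0
    for k in range(1, n // 2):
        rho_k = (x[:-k] @ x[k:]) / (n * var)
        if rho_k <= 0.0:
            break
        rho_sum += rho_k
    return n / (1.0 + 2.0 * rho_sum)

def median_ess(gamma_draws):
    """Median ESS over the p inclusion indicators (columns of gamma_draws)."""
    return np.median([ess(gamma_draws[:, j]) for j in range(gamma_draws.shape[1])])
```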

The results compare various algorithms:

  • ASI-DA—ASI proposal with data augmentation sampling.

  • ADS-DA—Add–Delete–Swap proposal with data augmentation sampling.

  • ASI-Laplace—ASI proposal with a Laplace approximation of the marginal likelihood used directly (the Rao–Blackwellised estimates in the ASI proposal are calculated using the Pólya-gamma sampling scheme).

  • ADS-Laplace—Add–Delete–Swap proposal with a Laplace approximation of the marginal likelihood used directly.

  • ASI-CPM—ASI proposal with the correlated pseudo-marginal sampler with a multivariate normal importance distribution.

  • ADS-CPM—ADS proposal with the correlated pseudo-marginal sampler with a multivariate normal importance distribution.

The ASI-DA, ADS-DA, ASI-CPM and ADS-CPM samplers provide samples from the true posterior, whereas the ASI-Laplace and ADS-Laplace samplers sample from the posterior defined by the Laplace approximation. The best ESS/Time among the methods which sample from the true posterior distribution is achieved by ASI-DA (6.6), with similar performance from the correlated pseudo-marginal sampler with normal importance distribution, \(\rho =1\) and \(N=5\). These methods perform about 3 times better than their ADS counterparts. The ASI-Laplace provides the best ESS/Time overall, but only by a small amount.

The correlated pseudo-marginal method should become more efficient than the PG sampling method as the sample size grows since: (1) the ESS of PG sampling will decrease relative to the correlated pseudo-marginal method (due to the introduction of more latent variables, one for every observation) and (2) the sampling cost of PG sampling will increase relative to the correlated pseudo-marginal method (again, due to the introduction of more latent variables).

4.1 Logistic regression

We begin by considering a simulation study which converts the linear regression simulation study of Yang et al. (2016) to logistic regression. They assume a linear predictor \(\eta = X\beta ^{\star }\) and generate observations with normal errors. In this simulation study, we use a logistic link and generate \(y_i\sim \text{ Bernoulli }(p_i)\) with \(p_i = \exp \{\eta _i\}/(1+ \exp \{\eta _i\})\). Only the first 10 regression coefficients are nonzero with values

$$\begin{aligned} \beta ^{\star }=\text{ SNR } (2, -3, 2, 2, -3, 3, -2, 3, -2, 3, 0, \dots , 0)^T\in {\mathbb {R}}^p \end{aligned}$$

where \(\text{ SNR }\) controls the signal-to-noise ratio. The ith vector of regressors is generated as \(x_i\sim \text{ N }(0,\Sigma )\) where \(\Sigma _{jk}=\rho ^{\vert j - k\vert }\). In our examples, we use the value \(\rho =0.6\) which represents a relatively large correlation between the regressors. We ran a simulation study to compare the performance of the exact samplers for all combinations of \(n = 500, 1000\) and \(p = 500, 5000\). Each sampler was run for 105,000 iterations with a burn-in of 5000. The adaptive parameters were only adapted during the burn-in phase. The results are presented in Table 1. The CPM results correspond to \(N=1\) and \(\rho =1\) which provided the largest time-normalized effective sample sizes. The ASI algorithm outperforms the ADS algorithm for both the DA and CPM updating schemes for all simulated data sets apart from the data set with \(n = 500\), \(p = 5000\) and \(\text{ SNR } = 1\). This is a low information data set since it combines a small sample size, large number of regressors and a low signal-to-noise ratio (the ASI only marginally outperforms ADS in the lower signal-to-noise case where \(\text{ SNR } = 0.5\)). The CPM methods always outperform the DA methods. The difference between the ASI method and the ADS method with CPM updating tends to be larger for more informative data sets (those with larger sample size, smaller numbers of regressors and larger signal-to-noise ratios). The posterior distribution in these cases is better able to distinguish between informative and uninformative variables in the regression and the ASI method is better able to exploit this information than the ADS sampler.
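
For reference, the simulated data sets can be generated as in the sketch below (the function name is ours); the AR(1) recursion gives regressors with the stated correlation structure \(\Sigma _{jk}=\rho ^{\vert j-k\vert }\).

```python
import numpy as np

def simulate_logistic_data(n, p, snr, rho=0.6, seed=0):
    """Simulate one data set from the design described above."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:10] = snr * np.array([2, -3, 2, 2, -3, 3, -2, 3, -2, 3])
    # x_i ~ N(0, Sigma) with Sigma_jk = rho^|j-k|, built column by column via AR(1)
    X = np.empty((n, p))
    X[:, 0] = rng.standard_normal(n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, prob)
    return X, y, beta
```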

Table 1 Time-normalized effective sample size for various computational methods for the simulated logistic regression data sets

We consider four data sets which have been previously analysed in the literature on computational methods for Bayesian variable selection in logistic regression models with a large number of regressors (see e.g. Lamnisos et al. 2009): Arthritis (Sha et al. 2003), Colon Tumor (Alon et al. 1999), Leukemia (Armstrong et al. 2002) and Prostate (Singh et al. 2002). The sample size and number of regressors for each data set are shown in Table 2. All variables were scaled to have a sample standard deviation of one. The prior distribution has the form \(\alpha \sim \text{ N }(0, 100)\), \(p(\beta _{\gamma }\vert \gamma )\sim \text{ N }(0, I)\), \(\gamma _j{\mathop {\sim }\limits ^{i.i.d.}} \text{ Bernoulli }(h)\) for \(j=1, \dots , p\) where \(h\sim \text{ Be }\left( 1, \frac{p-5}{p}\right) \). The samplers were run for 105,000 iterations with a burn-in period of 5000 iterations. All samples after the burn-in were used in the inference.

Table 2 Sample size n, number of regressors p and time-normalized effective sample size for various computational methods for the four logistic regression data sets

The time-normalized effective sample size for each method and each data set is shown in Table 2. In the correlated pseudo-marginal sampler, we find that \(\rho =1\) leads to the largest time-normalized effective sample size for both the ASI and ADS samplers for all data sets and we choose the number of samples in the importance sampler, N, which gives the largest value for each data set (with that value of N shown in Table 2). The results show some clear patterns. The ASI correlated pseudo-marginal sampler is no worse than the ADS correlated pseudo-marginal sampler for all data sets. The Pólya-gamma ASI sampler outperforms the Pólya-gamma ADS sampler for the smaller data sets (Arthritis, Colon Tumor and Leukemia) but not the largest data set (Prostate). The correlated pseudo-marginal ASI sampler outperforms the Pólya-gamma ADS sampler for all data sets. The Laplace approximation methods tend to have larger time-normalized ESS than the exact methods (for all data sets apart from Leukemia).

Table 3 Effective sample size for various computational methods on the four logistic regression data sets
Table 4 Sample size n, number of regressors p, the percentage of censored observations (%) and the time-normalized effective sample size for various computational methods on the two survival data sets

To understand these effects, it is also useful to look at the effective sample sizes shown in Table 3. The relative ESS of the ASI method over the ADS method is much larger for the correlated pseudo-marginal sampler than for the Pólya-gamma sampler. This reflects the effect of introducing latent variables in the data augmentation, which limits the ASI method's ability to make large jumps in model space. The effective sample sizes for the correlated pseudo-marginal and Laplace methods are very similar for both ASI and ADS for all data sets. The differences in the time-normalized ESS reflect the additional overhead of updating the regression coefficients and calculating the importance sampling approximation.

4.2 Accelerated failure time modelling

Bayesian variable selection and associated computational methods for survival analysis are less developed in the literature than approaches for logistic regression. Consequently, there is no commonly used set of benchmark data sets from previous work. We consider two data sets. The first looks at survival following chemotherapy for diffuse large B-cell lymphoma (Rosenwald et al. 2002) (DLBCL). The second considers breast cancer (van’t Veer et al. 2002). The sample size, number of regressors and the percentage of censored observations are shown in Table 4. In both cases, there are a large number of variables and a large amount of censoring. The prior distribution has the form \(p(\alpha , \sigma ^{-2})\propto 1\), \(\gamma _j{\mathop {\sim }\limits ^{i.i.d.}} \text{ Bernoulli }(h)\) for \(j=1, \dots , p\) where \(h\sim \text{ Be }\left( 1, \frac{p-5}{p}\right) \). The data sets were not standardized and the following priors were used for the regression coefficients: \(p(\beta _{\gamma }\vert \gamma )\sim \text{ N }(0, 4\,I)\) for the DLBCL data and \(p(\beta _{\gamma }\vert \gamma )\sim \text{ N }(0, 0.25\,I)\) for the breast cancer data. The samplers were run for 105,000 iterations with a burn-in period of 5000 iterations. All samples after the burn-in were used in the inference.

The time-normalized effective sample size for each method and each data set is shown in Table 4. Unlike the logistic regression examples, in the correlated pseudo-marginal sampler we find that the optimal value of \(\rho \) differs between data sets and the optimal value of N differs between both data sets and algorithms. The results show that the ASI methods outperform the corresponding ADS methods for DA, CPM and Laplace in both data sets (with a substantial improvement for the DLBCL data set). The ASI-CPM and ASI-Laplace methods perform better than the ASI-DA method for the DLBCL data set but perform worse for the Breast Cancer data set. Although we only have two data sets, the results illustrate an important trade-off between the data augmentation methods and the CPM and Laplace methods. The data augmentation scheme scales with the number of censored observations. The CPM and Laplace methods effectively scale with the typical model size since most time is spent finding the posterior mode for each update. The mixing of the CPM and Laplace methods should be better than that of the data augmentation schemes since no latent variables are introduced. Since the performance is largely affected by the typical model size (which is often not known before the analysis), it is hard to provide criteria for deciding when to use which algorithm. Our recommendation is to use the DA method in problems with sample sizes in the hundreds since this works well in both cases. The performance of the DA method will deteriorate relative to the CPM/Laplace methods for large values of n.

5 Discussion

In this paper, we have looked at several different schemes for Bayesian variable selection in logistic regression and accelerated failure time models. We find that the adaptive Metropolis–Hastings sampler provides better mixing chains than the standard Add–Delete–Swap method in both models. The choice between data augmentation and the CPM or Laplace method is less clear-cut. In the logistic regression model, the CPM method outperforms the data augmentation scheme in 3 out of 4 data sets. The CPM method is exact and performs similarly to the (approximate) Laplace approximation method. Therefore, we suggest using the CPM with the ASI method for logistic regression. The picture in the accelerated failure time model is less clear. We find that the performance of the CPM and Laplace methods is strongly affected by the typical posterior model size. We recommend using the data augmentation method with ASI for data sets with sample sizes in the hundreds.