Adaptive MCMC for Bayesian Variable Selection in Generalised Linear Models and Survival Models

Developing an efficient computational scheme for high-dimensional Bayesian variable selection in generalised linear models and survival models has always been a challenging problem due to the absence of closed-form solutions to the marginal likelihood. The Reversible Jump Markov Chain Monte Carlo (RJMCMC) approach can be employed to jointly sample models and coefficients, but the effective design of the trans-dimensional jumps of RJMCMC can be challenging, making it hard to implement. Alternatively, the marginal likelihood can be derived conditional on latent variables using a data-augmentation scheme (e.g., Pólya-gamma data augmentation for logistic regression) or using other estimation methods. However, suitable data-augmentation schemes are not available for every generalised linear model and survival model, and estimating the marginal likelihood using a Laplace approximation or a correlated pseudo-marginal method can be computationally expensive. In this paper, three main contributions are presented. Firstly, we present an extended Point-wise implementation of Adaptive Random Neighbourhood Informed proposal (PARNI) to efficiently sample models directly from the marginal posterior distributions of generalised linear models and survival models. Secondly, in light of the recently proposed approximate Laplace approximation, we describe an efficient and accurate estimation method for marginal likelihood that involves adaptive parameters. Additionally, we describe a new method to adapt the algorithmic tuning parameters of the PARNI proposal by replacing Rao-Blackwellised estimates with the combination of a warm-start estimate and the ergodic average. We present numerous numerical results from simulated data and eight high-dimensional genetic mapping data-sets to showcase the efficiency of the novel PARNI proposal compared with the baseline add–delete–swap proposal.


Introduction
Variable selection is an automatic method for finding a small subset of covariates that explain most of the variation in the response of interest.In addition to identifying the most predictive covariates, there is a growing interest in exploring the low-rank structure between the covariates and the response, especially in genetic mapping problems where the objective is to find the expressed genes that are associated with a specific disease the most.In the frequentist framework, model selection is based on maximising the penalised log-likelihood [1] or minimising information criteria such as AIC [2] and BIC [3].Other approaches, such as the deviance information criterion (DIC) [4] and widely applicable information criterion (WAIC) [5], which are generalisations of the AIC, are also popular in model selection.
A natural alternative to these frequentist approaches is Bayesian variable selection (BVS).In the Bayesian approach, a prior is imposed on all candidate models, and the resulting posterior distribution naturally captures model uncertainty.In this work, we consider a spike-and-slab prior [6,7], which introduces indicator variables denoting the inclusion or exclusion of every covariate.Therefore, the spike-and-slab prior leads to a model posterior distribution that lies in a lattice with the same dimension as the number of covariates.We can understand the dependency between the importance of covariates and response using natural measures of the posterior distribution such as posterior model probability (PMP) and marginal posterior inclusion probability (PIP).The computation of the exact posterior distribution requires a full search over the whole model space, which is computationally infeasible when a high-dimensional data-set is analysed.In these settings, Markov Chain Monte Carlo (MCMC) algorithms are often used to explore the model space and estimate the posterior distribution.For "large n, large p" data-sets, which are now often encountered in some problems in genetics/genomics (such as genetic mapping studies), such algorithms must be carefully designed.In this work, we mainly consider Bayesian variable selection in generalised linear models and survival models and focus on three popular models: the logistic regression model [8,9], the Cox proportional hazards model with partial likelihood [10,11,12,13,14] and the Weibull regression model [15].In each case, we illustrate how carefully designed algorithms can facilitate effective posterior computation.
A natural challenge of Bayesian variable selection methods in the above settings is that the marginal likelihood (or the integrated likelihood in [16]) is not analytically available.One set of solutions are Reversible Jump MCMC schemes (RJMCMC) [17], which sample from the joint space of models and regression coefficients by jointly proposing moves between models and regression coefficients.But it is often difficult to construct efficient proposals for these trans-dimensional jumps and design an MCMC scheme that mixes well [18].For some specific models, data-augmentation methods [19] are available and result in closed-form marginal likelihood conditioned on latent variables, for instance, Pólya-gamma data augmentation [20] for logistic regression.For other models where no suitable data-augmentation scheme exists, the most popular approaches are the Laplace approximation and the correlated pseudo-marginal method [21], which rely on finding the maximum a posteriori (MAP) estimate of the regression coefficients.A novel scalable estimation method for marginal likelihood, approximate Laplace approximation (ALA), is introduced in [16] and relies on defining an initial value for the coefficient parameters.ALA can save computational time during the optimisation process of finding the MAP estimate, but it does not yield an asymptotically consistent estimate.A detailed discussion of these approaches will be given in Section 3. Assuming that the marginal likelihood has been estimated, several MCMC algorithms can be used for simulation starting from the posterior distribution of BVS.The widely used add-delete-swap proposal [22] can be employed here.The add-delete-swap proposal generates a new model by randomly selecting one of three possible moves: addition/deletion of a covariate into/from the model or swapping one covariate that is included with another that is not.Although it has been proved in [23] that the add-delete-swap proposal can produce a rapidly mixing Markov Chain, the chains may still converge slowly, particularly when dealing with large-p problems.Adaptive MCMC schemes [24], which involve updating tuning parameters on the fly, are found to be valuable in addressing the issue of poor convergence.Lamnisos et al. [25] describe an adaptive add-delete-swap proposal that allows for simultaneous changes to multiple variables at a time.Griffin et al. [26] introduce the Adaptively Scaled Individual adaptation proposal (ASI), which simulates a new model with probability proportional to the product of PIPs.Wan and Griffin [27] extend the ASI proposal to logistic regression and accelerated failure time models.Other popular MCMC approaches include the Hamming ball sampler (HBS) [28], which proposes a new model within a Hamming neighbourhood using the PMPs as proposal weights, and the tempered Gibbs sampler [29,30], which uses tempering to efficiently sample from the multi-modal posteriors that commonly arise due to highly correlated covariates.
The recent work of [31] provides useful insights into the design of efficient MCMC schemes in discrete spaces.The work introduces the locally informed proposal, which re-weights a given non-informed base kernel with a function of the PMPs.It is shown in [31] that the locally informed proposal constructed with a balancing function that satisfies certain functional properties is asymptotically optimal compared with other choices of function in terms of Peskun ordering.Building upon the idea of locally informed proposal, Zhou et al. [32] show that the locally informed and thresholded (LIT) proposal can achieve dimension-free mixing times under conditions similar to those mentioned in [23] for BVS in linear regression models.Recent work in [33] introduces a Point-wise implementation of the Adaptive Random Neighbourhood Informed proposal (PARNI), which combines the advantages of both adaptive schemes and locally informed proposals.The PARNI proposal outperforms other state-of-the-art algorithms in a wide range of high-dimensional data-sets for BVS in linear regression models.
Other computational approaches are also available for estimating the BVS posterior distribution.Hans et al. [34] introduce a novel Shotgun Stochastic Search (SSS) approach that also explores the "local neighbourhood" idea and targets very high-dimensional model spaces to find high-probability regions.The integrated nested Laplace approximations (INLAs) [35] can solve latent Gaussian models including generalised linear models and approximate the posterior marginals obtained from the continuous priors [36].Sara et al. [37] view survival models as latent Gaussian models and also approximate the posterior marginals using INLAs.The posterior distribution can also be approximated using Variational Bayes (VB) [38].Ray et al. [39] describe a scalable mean-field variational family to approximate the posterior distribution of BVS in linear regression and extended this VB approximation to the logistic regression model in [40].Komodromos et al. [41] apply the Sparse Variational Bayes (SVB) method to approximate the posterior of proportional hazards models with partial likelihood.Other works develop a sampling strategy based on simulating piece-wise deterministic Markov processes (PDMPs) [42,43], which directly target the posterior distribution obtained from a spike-and-slab prior.
In this paper, we extend the PARNI proposal to sampling from the BVS posterior distribution in generalised linear models and survival models.To avoid the overwhelming computational costs of approximating the marginal likelihood in the locally informed proposals and motivated by ALA [16], we introduce an ALA estimate of the marginal likelihood with a novel initial value.In contrast to the suggestion in [16], which initialises ALA at origin, the novel initial value is adaptively updated on the fly using previously sampled models.The new method is computationally less complex than the Laplace approximation or correlated pseudo-marginal scheme as a result of avoiding iterative optimisation and provides a more accurate estimate than the original ALA approach initialised at the origin.We also consider new approaches to adapt the tuning parameters in the PARNI proposal.The new adaptation scheme replaces the Rao-Blackwellised estimates of PIPs using the combination of a warm-start estimate and the ergodic average calculated using previously sampled models.
To illustrate the performance of the new PARNI scheme in real-life high-dimensional problems, we perform BVS on eight genetic mapping data-sets (four for the logistic regression model and four for survival analysis) and compare the output of the PARNI proposal with the add-delete-swap proposal as a baseline.For the logistic model with binary outcome, we consider the problem of finding expressed genes that are related the most to the presence of Systemic Lupus Erythematosus in a case-control study with 10,995 observations and various numbers of SNPs, from 5771 to 42,430, on four different chromosomes.In survival analysis, we consider four cancer-related data-sets (two for breast cancer and two for lung cancer), containing patients ranging from 130 to 1904 and genetic covariates varying from 662 to 54,675.
This paper is organised as follows: In Section 2, we review the model setup and prior specification for BVS in generalised linear models, Cox proportional hazards and Weibull survival models.In Section 3, we introduce four computational methods to estimate the marginal likelihood.Section 4 describes the PARNI proposal, highlighting the novelties in the adaption of algorithmic tuning parameters and the calculation of the accurate and efficient marginal likelihood estimates.We implement these MCMC algorithms in Section 5 and compare their performance with the add-delete-swap proposal on several real data-sets.We include a discussion in Section 6, highlighting some possible future research directions.
2 Bayesian variable selection for generalised linear models and survival models

Generic model setting
Suppose that p covariates are available in the data.Let X = (x1, . . ., xn) T ∈ R n×p be the full data matrix that contains n observations with rows xi = (Xi1, . . ., Xip) and let Z = (z1, . . ., zn) T ∈ R n×q be the full data matrix that contains q variables that must be included in every model.Let binary vector γ = (γ1, . . ., γp) ∈ Γ = {0, 1} p be a model indicator, where γj = 1 if the j-th variable is included in model Mγ and γj = 0 otherwise.Let y = (y1, . . ., yn) be the vector of responses.The generalised linear model associated with model Mγ can be specified as yi ∼ F(µγ,i, ϕ) where F(µ, ϕ) is a distribution that belongs to the exponential family with mean µ and dispersion ϕ.Linear predictor ηγ,i is defined as where xγ,i contains those variables j for which γj = 1.In addition, we define the size of model Mγ as pγ = p j=1 γj.Linear predictor ηγ,i is mapped to mean µγ,i using link function g as ηγ,i = g(µγ,i).
We consider the following setup of survival models: For the i-th patient, given hazard function hi(t) at time t, the probability that an event occurs at time Ti before a certain time ti can be written as where ST i is called the survival function and is defined by Data often involve censoring where the true time to event is not observed.Let t be the vector of the observed times, where each ti denotes the minimum of censoring time Ci and survival time Ti.In the case of "right-censored" data, we define an n-dimensional event indicator vector d to denote, for each patient i, whether the event was observed during their follow-up (di = 1) or was censored (di = 0).In the case where the event was observed for patient i (di = 1), then ti denotes their time to event; otherwise, we observed the length of their follow-up.
Given a model Mγ associated with linear predictor ηγ,i as in (2), we consider the exponential hazard, λγ,i = exp(ηγ,i), and assume that the hazard function conditioned on model Mγ has the form where k is an additional shape parameter if needed.We can conclude the following log-likelihood on y = (t, d): where Hi is the cumulative hazard function for the i-th patient and is defined by

Prior Elicitation
Recalling model indicator γ ∈ {0, 1} p , we consider the prior structure p(α, βγ, ϕ, γ) ∝ p(α)p(βγ|ϕ, γ)p(ϕ|γ)p(γ). ( For generalised linear models in which the dispersion parameter is known (e.g., the logistic regression model where ϕ = 1) or some survival models that do not involve a dispersion parameter, the prior specification becomes p(α, βγ, γ) ∝ p(α)p(βγ|γ)p(γ), (6) which is equivalent to treating ϕ as a fixed parameter.In this work, we focus on the prior structure described in (6), and we assume that there is no additional dispersion parameter in the model.We specify the following prior distribution for the coefficient parameters: where g is a positive scale parameter, σ 2 α is the prior variance on the coefficients of the fixed covariates, Ip denotes a p × p identity matrix and pγ = p j=1 γj is the size of model Mγ.We consider the choice of model prior where hyper-parameter h denotes the prior inclusion probability for each variable.It is possible to construct a fully Bayesian hierarchical model based on the prior specifications described above.We can impose the following hyper-priors on hyper-parameters g and h: where C + (0, 1) denotes the standard half-Cauchy distribution and Beta(a, b) denotes the Beta distribution with parameters a > 0 and b > 0. The half-Cauchy hyper-prior is a generalisation of the horseshoe prior [44,45,46], employed on the global-scale parameter of a continuous mixture of normal priors, to BVS problems.Liang et al. [47] note that fixing g can lead to several paradoxes and problems of model mis-specification.For other possible choices of hyper-priors on g, see [47,48].In the context of prior inclusion probability h, Ley et al. [49] advise against using a fixed h in the absence of strong prior knowledge about the number of important variables.Kohn et al. [50] propose a Beta-binomial model prior in which hyper-parameter h can be integrated out analytically, leading to where B(•, •) denotes the Beta function.

Logistic regression
Assume that yi ∈ {0, 1}, with yi = 1 indicating the success of an event and yi = 0 indicating failure.Logistic regression links the proportion of successes to the linear predictor with a logistic link function g as and the response variable is modelled as yi ∼ Bern(µγ,i) under model Mγ.

Cox's proportional hazards (PH) with partial likelihood
Starting with the exponential hazard function λγ,i = exp(ηγ,i) associated with model Mγ, in Cox's proportional hazard function, the hazards are assumed to have the form where h0 is some baseline hazard function.In this proportional model, all covariate effects are assumed to be multiplicative.The full likelihood is then given as where H0(t) = t 0 h0(u)du is the cumulative baseline hazard function.If we model the prior of H0 through a prior process p(H0) on the cumulative hazard function, the resulting posterior distribution of α and βγ is p(α, βγ|y, γ) ∝ L(α, βγ, H0|y, γ) × p(α)p(βγ|γ)p(H0)dH0. ( Alternatively we can take the partial likelihood of Cox, which is given by PL(α, βγ|y, γ) where R(t) = {i : ti ≥ t} is the set of patients at risk at time t.Unlike the full likelihood formulated in (11), the partial likelihood does not rely on the specification and estimation of the baseline hazard function h0.The partial likelihood and its variants are therefore popular alternatives to the full likelihood in many survival studies [11,51,14].It is highlighted in [52,53] that the partial likelihood can be obtained by integrating out the baseline hazard function using a Gamma process prior.Bayesian inference with the partial likelihood ( 13) relies on the approximate posterior pPL(β|Y ), which can be expressed as where the baseline hazard function h0 is eliminated.

Weibull regression
In addition to the semi-parametric approach of Cox's PH with partial likelihood, we consider another commonly used parametric model for survival analysis, namely the Weibull model.A Weibull model is obtained by extending the exponential model by raising the survival rate to a positive power k, giving The parameter k is the shape parameter of a Weibull random variable.When k < 1, the hazard rate decreases over time.Conversely, when k > 1, the hazard rate increases over time.It is possible to recover the exponential survival model when k = 1 and it represents a constant hazard rate over time.We can derive the hazard function and the log-likelihood for parameters α, βγ and k as It should be noted that the Weibull distribution does not belong to the exponential family, unless the shape parameter k is assumed to be fixed.In the Bayesian framework, we consider the prior p(log(k)) = N (0, σ 2 k ) for some σ 2 k > 0 as in [15].To perform MCMC we alternatively update γ|k using the PARNI proposal and k|γ through an adaptive random walk proposal.
Assuming that an estimate of marginal likelihood p(y|γ) can be obtained, we consider MCMC algorithms with random neighbourhood proposals as described in [33], which is a sub-class of Metropolis-Hastings (MH) schemes [54,55].The random neighbourhood proposal consists of the following three stages: 1. Around the current model, γ, randomly generate a neighbourhood N ∼ p(•|γ).
3. Accept the new proposal, γ ′ , with the MH acceptance probability where N ′ is the neighbourhood used in the reverse move of the MH scheme.
In this section, we will describe four methods commonly used to estimate marginal likelihood p(y|γ): data augmentation, Laplace approximation, correlated pseudo-marginal and approximate Laplace approximation.Before introducing these methods, it is necessary to define the following terms for convenience.Let Jγ be a n × (q + pγ) matrix which contains all necessary covariates for model Mγ and is given by Jγ = (Z Xγ), and let Vγ be the variance-covariance matrix of the prior distribution of θγ, defined by

Data Augmentation
The data-augmentation scheme [19] introduces latent variables ω into the model such that the posterior distribution of variables of interest becomes analytically tractable given ω.The Pólya-gamma dataaugmentation scheme [20] can be utilised for the logistic regression model to evaluate the marginal likelihood.Given real numbers ψ ∈ R, a > 0, b > 0 and a set of latent variables ω = (ω1, . . ., ωn), in which each individual ωi follows a Pólya-gamma distribution PG(b, 0), the application of Pólya-gamma data augmentation exploits the following identity: where κ = a − b/2.The above identity implies that the posterior distribution of the coefficients can be represented as a multivariate normal distribution: where ξ = J T γ κ, Λγ = J T γ W Jγ + V −1 γ , κ is an n-dimensional vector with entries κi = yi − 1/2 and W is a diagonal matrix with ω appearing along its diagonal.By integrating out coefficient θγ, analytically conditioned on Pólya-gamma random variables ω, we obtain the conditional marginal likelihood In each iteration of the MCMC algorithm, we update γ and ω alternatively.To refresh ω, we can perform a simulation directly from its posterior distribution, which also follows a Pólya-gamma distribution given by ωi ∼ PG(1, ηγ,i). ( where linear predictor ηγ,i involves coefficient θγ simulated from (22).Efficient samples from the Pólyagamma random variables can be simulated using the R package pgdraw (version 1.1) [56].In addition, Zens et al. described the ultimate Pólya-gamma sampler [57] to address the slow mixing rate for categorical imbalanced data, as illustrated in [58].
In general, the data-augmentation schemes may not be applicable to all generalised linear models and survival models.Specifically, for the Cox proportional hazards with partial likelihood or the Weibull model, there is currently no suitable data augmentation to directly yield a parametric posterior distribution for the regression coefficients.

Laplace Approximation
Assuming a unimodal posterior distribution of the regression coefficients, the Laplace approximation estimates the marginal likelihood with a second-order Taylor approximation.This method leads to a Gaussian integral, with the solution of the marginal likelihood being given by where θγ is the posterior mode of θγ and Hγ is the negated Hessian of log p(y|θγ, γ) + log p(θγ|γ) evaluated at mode θγ.Additionally, the Laplace approximation provides a normal approximation to the posterior distribution of coefficient θγ as To incorporate the Laplace approximation in MH sampling, we replace marginal likelihood p(y|γ) in (19) with the approximate pLA(y|γ) as described above.
Laplace approximation has been shown to be asymptotically consistent for estimating Bayes factors [59] and Bayesian variable selection on generalised linear models [60].In finite-sample problems, however, Laplace approximation introduces biases, so pLA(y|γ) is not an unbiased estimate of true marginal likelihood p(y|γ).The resulting MCMC scheme, which involves the step of Laplace approximation, targets a different distribution compared with the true posterior π(γ).Instead, it targets the distribution πLA(γ) ∝ pLA(y|γ)p(γ).
It is possible to extend a pseudo-marginal method to a correlated pseudo-marginal method [21], with the aim of reducing the estimation variance of the ratio of estimated marginal likelihoods p(y|γ ′ )/p(y|γ).The correlated pseudo-marginal method is applied to Bayesian variable selection for the logistic regression model in [27], which provides an implementation that we also adopt in this work.

Approximate Laplace Approximation
The above Laplace approximation and correlated pseudo-marginal methods are computationally intensive due to the optimisation process required to obtain the normal approximation in (26), especially for dealing with large-n data.To avoid the overwhelming computational cost associated with the optimisation process, Rossell et al. [16] introduce the approximate Laplace approximation method (ALA), which is more computationally tractable for large-n problems.In this work, we consider the alternative formula described in supplementary material S.1.of [16], as it offers better computational stability when inverting the Hessian under the independent prior in (7).
In ALA, a Taylor expansion of log-posterior density log p(y|θγ, γ) + log p(θγ|γ) is performed at initial value θ 0 γ .Solving the resulting Gaussian integral leads to where g 0 γ and H 0 γ are the gradient and Hessian of the negative log-posterior density evaluated at θ 0 γ , respectively.It is suggested in [16] to set initial value θ 0 γ to θ 0 γ = 0 for convenience.By applying ALA to the MH acceptance probability in (19), we obtain an MCMC algorithm that targets the ALA posterior distribution πALA(γ) ∝ pALA(y|γ)p(γ) as the equilibrium distribution.Although Ref. [16] shows that ALA can recover the optimal model with respect to a mean squared loss, it is important to note that ALA is not consistent with respect to the marginal likelihood (in contrast to the classical Laplace approximation) and pALA(y|γ) is not an unbiased estimator of the true marginal likelihood, p(y|γ).
4 Point-wise implementation of Adaptive Random neighborhood Informed proposal

The PARNI proposal
The PARNI proposal belongs to the class of random neighbourhood informed proposals, which typically involve the following two steps: (i) sampling a neighbourhood N ∼ p(•|γ) and then (ii) proposing a model γ ′ within this neighbourhood N according to the informed proposal of [31].In the PARNI proposal, we assume that the randomness in neighbourhood generation is characterised by an auxiliary variable k ∈ K, with conditional distribution p(k|γ), which leads to neighbourhood N = N (γ, k), such that p(k|γ) = p(N |γ).
By defining K = {0, 1} p , the value of k indicates whether the change in the corresponding position in γ is included in neighbourhood N .Specifically, for those positions j such that kj = 1, the neighbourhood consists of models obtained by varying some or all of these positions in the current model, γ.
The conditional distribution of k takes the product form p(k|γ) = p j=1 p(kj|γj), where each kj depends on the corresponding component γj in γ.This probability distribution is driven by a set of tuning parameters (A1, . . ., Ap, D1, . . ., Dp), where Aj, Dj ∈ (ϵ, 1 − ϵ) for a small value of ϵ ∈ (0, 1/2).The probabilities of event kj = 1 are then defined by and the consequent neighbourhood is constructed as Neighbourhood N (γ, k) contains 2 p k , models where p k denotes the number of 1s in k.Performing a full enumeration over the entire neighbourhood is, therefore, computationally expensive when p k is large.In fact, it becomes computationally infeasible to explore the whole neighbourhood when p k is beyond 30.Liang et al. [33], therefore, consider a point-wise approximate implementation of this algorithm that dramatically reduces the number of model probability evaluations from O(2 p k ) to O(2p k ).
The point-wise implementation proceeds by constructing a sequence of smaller neighbourhoods {Nr} such that each Nr is a subset of N .A proposed model γ ′ is sequentially simulated from these neighbourhoods {Nr} according to locally informed proposals qN r .This procedure requires us to define a sequence of intermediate models We collect positions j such that kj = 1 and define them as j1, . . ., jp k (the order is random).Small neighbourhood Nr is then defined as follows: Each small neighbourhood Nr only consists of two models, γ(r − 1) and γ * , which only differ with γ(r − 1) at position jr.The resulting proposal mass function is where qN is the locally informed proposal over neighbourhood N and is defined by Tuning parameter ζ ∈ (ϵ, 1 − ϵ) denotes the non-informative jumping probability.Two different methods for adapting ζ are provided in [33].One of the key factors influencing the performance of the informed proposal in (33) is the choice of weighting function g.Given a positive real number x > 0, a balancing function is defined as a function g that satisfies the condition g(x) = xg(1/x).The locally informed proposal constructed with a balancing function is locally optimal in terms of Peskun ordering under mild conditions [31].For the comparisons between different balancing functions, see Supplement B.1.3 of [31].
In this work, we exclusively focus on the Hastings' choice of balancing function given by gH (x) = min {1, x}, as gH has demonstrated better empirical performance in many problems (e.g., [33]).
To construct a π-reversible chain in the MH scheme, we define a collection of neighbourhoods {N ′ r } for the reverse moves, where {N ′ r } are identical to {Nr} but with reverse order.For a more detailed explanation of the PARNI proposal, we refer to Section 4.2.1 of [33].The MH acceptance probability of the PARNI proposal is given by Remark 1.The concept of a neighbourhood is also used in other schemes designed to estimate discrete posterior distributions, including the Shotgun Stochastic Search (SSS) approach [34] and Hamming ball sampler (HBS) [28].The SSS method works on the same neighbourhood as that constructed with the add-delete-swap proposal [22].Given the current model, γ, SSS constructs a neighbourhood N (γ) that comprises three disjoint sub-neighbourhoods: Na(γ), N d (γ) and Ns(γ).The "addition" neighbourhood, Na(γ), is formed by adding a covariate into the model, and similarly, the "deletion" neighbourhood, N d (γ), is formed by deleting a covariate from the model.Lastly, Ns(γ) is obtained by swapping an included covariate with an excluded one.On the contrary, the HBS constructs neighbourhoods based on the Hamming ball, H d (γ), consisting of models that differ from γ by at most d positions.The typical example is the 1-Hamming ball, denoted by H1(γ).It is worth mentioning that the SSS and HBS approaches construct neighbourhoods with sizes of (pγ + 1)p and p, respectively.By contrast, the PARNI proposal constructs neighbourhoods that are typically approximately of size pγ * , where γ * denotes the true underlying model.Assuming that the size of the true underlying model is much smaller than p, as is typical in many applications, pγ * ≪ p, and PARNI exhibits a higher level of scalability in handling the large-p data in comparison to SSS and HBS.
In the remaining parts of this section, we will describe a novel scheme to estimate tuning parameters A and D, and a new method for efficiently estimating the marginal likelihood in the locally informed proposal of (33).

New adaptation scheme on algorithmic tuning parameters
The performance of the PARNI proposal is largely dictated by the choice of algorithmic tuning parameters A and D. Griffin et al. [26] consider the informed proposal of the form where πj denotes the PIP for the j-th covariate and is defined by πj = π(γj = 1).In their ASI scheme for BVS in the linear regression model, tuning parameters π are adaptively updated based on a Rao-Blackwellised estimate of the PIP given in Equation ( 10) of [26].Wan and Griffin [27] extend ASI to the logistic regression model, in which they derive Rao-Blackwellised estimates of PIPs conditioned on the Pólya-gamma latent variables.Generalising this method to other generalised linear models and survival models that lack a suitable data-augmentation scheme is challenging.As the analytic marginal likelihood is inaccessible, it becomes intractable to derive Rao-Blackwellised estimates of PIPs.As an alternative, a simple Monte Carlo average over the output {γ (l) } L l=1 can be taken, where L is the current iteration number.This ergodic average is calculated as The ergodic average tends to be broad and biased, and it often downweights the importance of highly correlated covariates.Using the ergodic average directly in the PARNI proposal, however, results in a feedback effect, wherein a poor ergodic average leads to inadequate exploration over the sample space, leading to a subsequent bad ergodic average.To combat this phenomenon, we consider the following composition of two measures: a "warm-start" approximation, π(0) , and the ergodic average, π(L) , obtained from the first L samples.This composite estimate is adaptively updated using the formula where {ϕ l } L l=1 is a set of weights that control the trade-off between the warm-start approximation and the ergodic average.
We consider the ALA in (28) initialised at the origin to estimate the intractable Bayes factor, p(y|γ j↑ )/p(y|γ j↓ ), for including the j-th covariate.Let ηi be the i-th linear predictor, η 0 i be the i-th linear predictor evaluated at the origin (i.e., η 0 i = 0), Xj denote the j-th column of data matrix X, ỹ be a vector with i-th component equal to ∂p(y|θγ, γ)/∂ηi evaluated at ηi = η 0 i and W be a matrix such that W il = −∂ 2 p(y|θγ, γ)/∂ηi∂η l evaluated at ηi = η 0 i and η l = η 0 l .Thanks to the Schur complement, we can facilitate the computation of p Bayes factors as in [26,27]: when where where d ↓ j = 1/(Λ −1 γ )q+p j ,q+p j and pj is the ordered position of the j-th variable.When working with data that involve a high level of collinearity, it is also possible to increase the number of the ALA Rao-Blackwellised estimates.
The last building block of adapting π(L) j is defining weight ϕ l .We employ a straightforward construction of ϕ l given by where N b denotes the length of the burn-in period.This choice results in a weight that exceeds 1/2 during the period of burn-in and drops below 1/2 afterwards.Consequently, the PARNI proposal initially relies on the warm-start approximation to explore the model space.As the chain converges to the high-probability region and the ergodic average stabilises, the PARNI proposal gradually uses more information from the ergodic average.After running for a longer time, the PARNI proposal completely relies on the ergodic average.

The adaptive ALA informed proposal
In each iteration of the PARNI proposal, the locally informed proposal in (33) relies on computing the posterior model probabilities.Using the estimates from the Laplace approximation or correlated pseudomarginal scheme in PARNI can be computationally intractable in "large-p, large-n" situations due to the use of an optimisation algorithm that most run many times in one iteration.It should be noted, however, that the model probabilities in the locally informed proposal do not need to precisely match the true posterior model probabilities, and the PARNI proposal can still generate samples that preserve π-reversibility as long as the correct (or proper estimate) π is used in the MH acceptance probability of (34).In the locally informed proposal, one can incorporate the approximate Laplace approximation initialised at the origin to design the proposal distribution.In the MH acceptance probability, we can then use the estimates obtained from the Laplace approximation or correlated pseudo-marginal method.Based on empirical observations, however, this ALA informed proposal may not always mix well.One reason for this is the phenomenon of downweighting the model probabilities of non-null models in favour of the null model, resulting in an informed proposal that is less informative than the true likelihood.The simulated chain is, therefore, more likely to get stuck and becomes less effective in exploring the model space.
Alternatively, we can note that the ALA estimate coincides with the Laplace approximation when the initial value is chosen to be posterior mode θγ under model Mγ.Therefore, the accuracy of the ALA estimate is crucially influenced by the choice of initial value θ 0 γ .We employ the ALA informed proposal with an adaptive initial value for ALA (adaptive ALA), which aims to reduce the estimation errors and thus improve the overall performance of the MCMC algorithm.
For each model in the neighbourhood, the adaptive ALA starts with an initial guess of linear predictor η and proceeds with the following steps: 1. Calculate a "guess" estimate from linear predictor η: 2. Perform one step of Newton's method and obtain an updated estimate of the coefficient: where g 0 γ and H 0 γ are the gradient and Hessian of the negated log-posterior density evaluated at θ 0 γ , respectively.
3. Use the ALA estimate in (28) with θγ as the initial value to estimate marginal likelihood p(y|γ).
Practically speaking, we can skip step 1 with the matrix inverse operation in (41) and obtain θγ directly from the initial guess of linear predictor η.This simplification is followed by [63] and is given in Appendix C.This approach leads to a coherent computational scheme that is easy to implement.We adaptively update the initial-guess η according to where ηγ,i = Jγ θγ is the "optimal" i-th linear predictor obtained from MAP estimate θγ under model Mγ.
By storing the MAP estimate obtained from the Laplace approximation or correlated pseudo-marginal scheme, we can compute the linear predictor without introducing additional computational costs.
In addition, we experimented with adapting coefficients β from the posterior samples and using the coefficients of the covariates selected by γ to navigate the ALA.This approach did not work well, however, because the posterior distribution of β differs significantly from the posterior distribution of β conditioned on model γ.In contrast, the linear predictors offer more stability, in the sense that they do not vary as much across different models.
Combining all of the above components, we have the PARNI proposal.The complete algorithm is outlined in Algorithm 1.

Simulated dataset with adaptive ALA informed proposal
In this subsection, we study the mixing behaviour of different versions of the PARNI proposal for the logistic regression model, Cox PHs and Weibull survival models.We simulate two data-sets with 500 covariates and 500 observations as described in Appendix D and compare the following four algorithms: • PARNI-adaptiveALA: The PARNI proposal with adaptive approximate Laplace approximation in the informed proposal and the correlated pseudo-marginal scheme in the MH acceptance probability.
• PARNI-LA: The PARNI proposal with Laplace approximation in the informed proposal and the correlated pseudo-marginal scheme in the MH acceptance probability.
• PARNI-ALA: The PARNI proposal with approximate Laplace approximation in the informed proposal and the correlated pseudo-marginal scheme in the MH acceptance probability.
• ADS (thinned): The PARNI proposal with approximate Laplace approximation in the informed proposal and the correlated pseudo-marginal scheme in the MH acceptance probability.
The first three algorithms were run for 10,000 iterations, with the first 2000 iterations being discarded as burn-in, whereas the ADS (thinned) proposal was run for a CPU time similar to that of PARNI-adaptiveALA and PARNI-ALA, with all collected samples being thinned to 10,000 values.
Figure 1 presents trace plots of the log-posterior model probability and bar plots of CPU time for the PARNI-adaptiveALA, PARNI-LA, PARNI-ALA and ADS (thinned) proposals in the logistic model, and the Cox PHs and Weibull models.In all three models, the PARNI-adaptiveALA proposal mixes as well as the PARNI-LA proposal and performs much better than the PARNI-ALA proposal.The result of the ADS (thinned) proposal provides the benchmark performance of a simple add-delete-swap MCMC scheme on these data-sets for comparison purposes.As illustrated in Figure 1, the adaptive ALA informed proposal is computationally much cheaper than the LA informed proposal.In comparison to the ALA informed proposal, the adaptive ALA informed proposal is also computationally competitive, and it only introduces the additional computational costs of updating linear predictor η(L) and computing initial value θγ from the estimate of linear predictor η(L) as in (42).In addition, the PARNI-adaptiveALA proposal demonstrates improved mixing behaviour in comparison to the baseline add-delete-swap proposal in all three models with similar CPU time.Therefore, we conclude that the PARNI-adaptiveALA proposal is more computationally efficient than the informed proposals constructed using Laplace approximation or ALA initialised at the origin.

Logistic regression: Gene fine mapping for Systemic Lupus Erythematosus
Genetic mapping is a process of locating a specific gene or genetic variant within a particular genomic region and has the objective to find the precise genetic elements responsible for a particular trait or disease   We consider a problem of identifying the SNPs that play a crucial role in predicting Systemic Lupus Erythematosus using a case/control study.It consists of genotypes from a genome-wide genetic case/control association study involving 4035 cases and 6959 controls, where the cases are SLE patients and the controls are from a public repository of European ancestry.These data were previously studied in [64] using step-wise logistic regression in a meta-analysis.In Chapter 5 of [65], Griffin and Steel apply Bayesian variable selection to analyse these SLE data but only focus on exploring the relationship between disease and SNPs on Chromosome 1.In addition to their work, we extend the study by including a total of four chromosomes.We consider a different number of SNPs for each chromosome, with Chromosome 1 having 5771 SNPs, Chromosome 3 having 42,430 SNPs, Chromosome 11 having 32,290 SNPs and Chromosome 21 having 9306 SNPs.We consider the prior specification in Section 2.2 with hyper-parameter g = 1/4, σ 2 α = 1 and assume the hyper-prior of h ∼ Beta(1, (p − 5)/5), where p denotes the number of SNPs.The full details of the data-set are provided in Table 1, including the five fixed covariates (gender and top four We implement the following four algorithms: • PARNI-DA: PARNI proposal with Pólya-gamma data augmentation in both the informed proposal and the MH acceptance probability.
• PARNI-CPM: PARNI proposal with adaptive ALA informed proposal and correlated pseudomarginal method in the MH acceptance probability.
• ADS-DA: Add-delete-swap proposal with Pólya-gamma data augmentation in the MH acceptance probability.
• ADS-CPM: Add-delete-swap proposal with the correlated pseudo-marginal method in the MH acceptance probability.
These MCMC algorithms all simulate samples from the exact posterior distribution, π.We treat the ADS-DA proposal as the baseline to showcase the rapid mixing of the PARNI proposals.Each algorithm was run for 1 h with 10 repetitions, and we recorded the estimates of PIPs.Firstly, we calculated the mean squared errors of the estimates of p PIPs compared with the "gold standard" estimates taken from the PARNI-CPM proposal, which was run for roughly 12 h.Then, we took the average over p mean squared errors to obtain the average mean squared error (average MSE).To compare the computational efficiency of the PARNI proposals with the baseline ADS-DA proposal, we provide the relative efficiency (in brackets) as the ratio of their average MSE.
The average MSEs and relative efficiency values are presented in Table 2.The PARNI proposals consistently outperform the ADS proposals in terms of the average MSE.The PARNI proposals show at least twofold improvements over the add-delete-swap proposal and lead to much larger improvements in most cases, such as in Chromosome 21, where the PARNI proposals perform 78 times better than the add-delete-swap proposal.On the other hand, both the PARNI-DA and ADS-DA proposals consistently result in smaller average MSEs compared with the PARNI-CPM and ADS-CPM proposals due to their computational advantages.Firstly, data augmentation can evaluate the conditional marginal likelihood without finding the posterior mode using iteratively re-weighted least squares.Secondly, the Pólya-gamma latent variables are drawn using the R package pgdraw (version 1.1) [56] implemented using Rcpp (version 1.0.10)[66].

Survival analysis: variable selection for 5 large cancer-related gene expression data sets
We consider a total of four cancer-related real data-sets, where the first two data-sets are for breast cancer and the remaining data-sets are for lung cancer.NKI Breast Cancer Data ( https://data.world/deviramanan2016/nki-breast-cancer-data, accessed on 10 September 2023) contain patient info, treatment, survival time and the 1554 most varying genes of 272 breast cancer patients.These data were analysed in [67,68] with the aim of reducing the mortality rates from this disease.The METABRIC breast cancer data-set is derived from the Molecular Taxonomy of Breast Cancer International Consortium  (METABRIC) database.The METABRIC data-set was analysed in [69,70] and is publicly available in [71] (https://www.cbioportal.org/study/summary?id=brca_metabric, accessed on 10 September 2023).The data contain 1907 patients with the gene expression for 331 genes and mutations for 175 genes.Gene mutation variables are encoded as 1 if a mutation exists and 0 otherwise.For both data-sets, we include some clinical covariates, including the age of the patients and the stage of the cancer, as suggested by [72].We also consider the treatment variables (such as chemotherapy and surgery type), which also influence survival time.The last two lung cancer data-sets, "GSE31210" and "GSE4573", were previously studied in [72], and they are publicly available in the Gene Expression Omnibus repository [73].See Figure 1 in [72] for the estimated survival functions of these three data-sets.We provide the full details of these four real data-sets in Table 3.
We consider two computational algorithms used in previous studies for logistic models, the PARNI-CPM and ADS-CPM proposals, as a data augmentation scheme is not available for the Cox PHs or Weibull model.We consider the hyper-prior of h ∼ Beta(1, (p − 5)/5), where p denotes the number of genetic covariates, and impose a half-Cauchy hyper-prior on √ g, where a Gibbs update is taken on g conditioned on the model (see Appendix A for more details).In addition, we assume σ 2 α = 10 5 and σ 2 k = 10 5 (only for the Weibull model).
The average MSEs and relative efficiency values of the PARNI-CPM and ADS-CPM proposals on these four survival data-sets are shown in Table 4.For the Weibull model, the PARNI proposal consistently exhibits better computational efficiency compared with add-delete-swap on all four data-sets.In the case of the NKI and METABRIC data-sets, which have a relatively small number of covariates, the PARNI-CPM proposal produces PIP estimates that are seven times more accurate compared with ADS-CPM.For high-dimensional data-sets, we can obtain PIP estimates from the PARNI-CPM proposal that are two times better than ADS-CPM.The lesser improvement observed in the high-dimensional examples can be attributed to the increasing number of unimportant covariates, where both algorithms are good at excluding these unimportant covariates from the models.
The Bayesian variable selection in the Cox PHs model with partial likelihood is generally more challenging compared with the Weibull model.The primary reason is that the inclusion of the non-parametric setup introduces additional complexities in evaluating the log-likelihood functions and its Hessian matrices.The PARNI-CPM proposal provides roughly two times better estimates on the NKI and GSE4573 data-sets compared with ADS-CPM.However, the ADS-CPM proposal shows better performance on the remaining two data-sets.In the "small-p, large-n" METABRIC data, the add-delete-swap proposal shows greater computational efficiency compared with the PARNI proposal, as the informed proposal needs to evaluate many computationally expensive Hessian matrices.In fact, the computation of evaluating the Hessian matrix scales with the order of O(n 2 ), in contrast with parametric models, where the computation of the Hessian matrix scales linearly with n.In the GSE31210 data with few strong signals, the posterior distribution on model space is relatively flat, and both algorithms have smaller average MSEs compared with the other data-sets.In particular, the add-delete-swap proposal can run for more iterations; it is, therefore, more computationally efficient compared with the PARNI-CPM proposal.
For the survival model, we take the same construction on data design matrix X, coefficient β and linear predictor η = Xβ as in the logistic model.The survival time of each individual is simulated from a flexible generalised gamma parametric survival model [77] as suggested in [15].The generalised gamma parametric survival model encompasses four commonly used survival models, the exponential, Weibull, log-normal and gamma survival models, as special cases.We adopt a similar hyper-parameter specification for the generalised gamma parametric survival model as presented in [15] with σ = 0.8 and q = −2.In addition, we consider hyper-parameters σ 2 α = 100, g = 1 and h = 10/500 for all these three models.

Figure 1 :
Figure 1: Left four columns: Trace plots of the log-posterior model probability from runs of the PARNI-adaptiveALA, PARNI-LA, PARNI-ALA and ADS (thinned) algorithms on simulated data-sets.Right column: Bar plots of the CPU time of simulating 10,000 samples on simulated data-sets with the PARNI-adaptiveALA, PARNI-LA, PARNI-ALA and ADS (thinned) algorithms.

Table 1 :
Details of Systemic Lupus Erythematosus data on Chromosomes 1, 3, 11 and 21. .One common application is to study whether an individual has a particular disease.In this scenario, one can use a logistic regression model with the response variable based on the case/control design and explanatory variables consisting of single-nucleotide polymorphisms (SNPs). phenotype

Table 3 :
4 real datasets for survival analysis.