An Auxiliary Variable Method for Markov Chain Monte Carlo Algorithms in High Dimension

In this paper, we are interested in Bayesian inverse problems where either the data fidelity term or the prior distribution is Gaussian or driven from a hierarchical Gaussian model. Generally, Markov chain Monte Carlo (MCMC) algorithms allow us to generate sets of samples that are employed to infer some relevant parameters of the underlying distributions. However, when the parameter space is high-dimensional, the performance of stochastic sampling algorithms is very sensitive to existing dependencies between parameters. In particular, this problem arises when one aims to sample from a high-dimensional Gaussian distribution whose covariance matrix does not present a simple structure. Another challenge is the design of Metropolis–Hastings proposals that make use of information about the local geometry of the target density in order to speed up the convergence and improve mixing properties in the parameter space, while not being too computationally expensive. These two contexts are mainly related to the presence of two heterogeneous sources of dependencies stemming either from the prior or the likelihood in the sense that the related covariance matrices cannot be diagonalized in the same basis. In this work, we address these two issues. Our contribution consists of adding auxiliary variables to the model in order to dissociate the two sources of dependencies. In the new augmented space, only one source of correlation remains directly related to the target parameters, the other sources of correlations being captured by the auxiliary variables. Experiments are conducted on two practical image restoration problems—namely the recovery of multichannel blurred images embedded in Gaussian noise and the recovery of signal corrupted by a mixed Gaussian noise. Experimental results indicate that adding the proposed auxiliary variables makes the sampling problem simpler since the new conditional distribution no longer contains highly heterogeneous correlations. Thus, the computational cost of each iteration of the Gibbs sampler is significantly reduced while ensuring good mixing properties.


Introduction
In a wide range of applicative areas, we do not have access to the signal of interest x ∈ R Q , but only to some observations z ∈ R N related to x through the following model: where H ∈ R N×Q is the observation matrix that may express a blur or a projection and D is the noise model representing measurement errors. In this paper, we are interested in finding an estimatorx of x from the observations z. This inverse problem arises in several signal processing applications, such as denoising, deblurring, and tomography reconstruction [1,2]. The common Bayesian procedure for signal estimation consists of deriving estimators from the posterior distribution that captures all information inferred about the target signal from the collected data. Given the observation model (1), the minus logarithm of the density posterior distribution reads: (∀x ∈ R Q ) J (x) = − log p(x|z) = Φ(Hx; z) + Ψ(Vx). ( Hereabove, Φ is the neg-log likelihood that may take various forms depending on the noise statistical model D. In particular, if D models an additive Gaussian noise with covariance Λ −1 , it reduces (up to an additive constant) to the least squares function Φ(Hx; z) = 1 2 Hx − z 2 Λ . Other common choices can be found for instance in [3,4]. Moreover, Ψ(V·) is related to some prior knowledge one can have about x, and V ∈ R M×N is a linear transform that can describe, for example, a frame analysis [5] or a discrete gradient operator [6]. Within a Bayesian framework, it is related to a prior distribution of density p(x) whose logarithm is given by log p(x) = −Ψ(Vx).
Monte Carlo inference approaches allow us to have a good description of the target space from a set of samples drawn from a distribution [7][8][9][10][11][12]. In particular, these samples can be used to infer useful statistics such as the mean and the variance. In the context of Bayesian estimation, these techniques appear useful to compute, for example, the minimum mean square error (MMSE) estimator, which is equivalent to the posterior mean. In this case, the MMSE estimator is approximated using the empirical average over the generated samples from the posterior distribution. When the exact expression of the posterior density is intractable, Markov chain Monte Carlo (MCMC) algorithms have been widely used to approximate it [13]. These techniques are random variable generators that allow us to draw samples from complicated distributions. Perhaps the most commonly used MCMC algorithm is the Metropolis-Hastings (MH), which operates as follows [14]: from a given proposal distribution, we construct an irreducible Markov chain whose stationary distribution is the sought posterior law (i.e., samples generated by the algorithm after a suitable burn-in period are distributed according to desired posterior law). At each iteration t, a decision rule is applied to accept or reject the proposed sample given by the following acceptance probability: α(x (t) ,x (t) ) = min 1, p(x (t) |z)g(x (t) |x (t) ) p(x (t) |z)g(x (t) |x (t) ) , wherex (t) is the proposed sample at iteration t, generated from a proposal distribution with density g(.|x (t) ) that may depend on the current state x (t) . Note that when more than one unknown variable needs to be estimated (e.g., acquisition parameters or prior hyperparameters), one can iteratively draw samples from the conditional posterior distribution for each variable given the remaining ones using an MH iteration. This is known as the hybrid Gibbs sampler [15]. High-dimensional models-often encountered in inverse problems (e.g., in multispectral remote sensing applications [16])-constitute a challenging task for Bayesian inference problems. While many popular sampling algorithms have been widely used to fit complex multivariable models in small-dimensional spaces [17][18][19][20][21][22], they generally fail to explore the target distribution efficiently when applied to large-scale problems, especially when the variables are highly correlated. This may be due to the poor mixing properties of the Markov chain or to the high computational cost of each iteration [17].
In this work, we propose a novel approach based on a data augmentation strategy [23] which aims at overcoming the limitations of standard Bayesian sampling algorithms when facing large-scale problems. The remainder of this paper is organized as follows. In Section 2, we discuss the main difficulties encountered in standard sampling methods for large-scale problems. We show how the addition of auxiliary variables to the model can improve their robustness with respect to these issues.
The core of our contribution is detailed in Section 3. We first give a complete description of the proposed approach in the case of Gaussian noise, and we study its extension to scale mixtures of Gaussian models. Furthermore, we demonstrate how the proposed approach can facilitate sampling from Gaussian distributions in Gibbs algorithms. Then, some computational issues arising in the proposed Bayesian approach are discussed. Sections 4 and 5 are devoted to the experimental validation of our method. In Section 4, we show the advantages of the proposed approach in dealing with high-dimensional models involving highly correlated variables over a dataset of multispectral images affected by blur and additive Gaussian noise. In Section 5, we test the performance of our method in sampling from large-scale Gaussian distributions through an application to image recovery under two-term mixed Gaussian noise. Finally, we give some conclusions and perspectives in Section 6.

Sampling Issues in High-Dimensional Space
MCMC sampling methods may face two main difficulties when applied to large-scale inverse problems. First, except for particular cases (e.g., circulant observation matrix), the structure of the observation model that links the unknown signal to the observations usually makes the estimation of the parameters of the posterior distribution quite involved. Second, even with simple models, the posterior distribution may still be difficult to sample from directly or to explore efficiently using standard sampling algorithms. As a specific case, this problem arises for Gaussian distributions if the problem dimension is too high [24]. It can also arise in MH algorithms when sophisticated proposal rules are employed with the aim of coping with both the high dimensionality and the strong correlation existing between the target parameters [22]. In what follows, we will give more details about these two contexts.

Sampling from High-Dimensional Gaussian Distribution
Let us focus on the problem of sampling from a multivariate Gaussian distribution with a given precision matrix G ∈ R Q×Q . This problem emerges in many applications, such as linear inverse problems involving Gaussian or hierarchical Gaussian models. More precisely, let us consider the following linear model: where w is R N -valued, and let us assume that conditionally to some latent variables, w and x are drawn from Gaussian distributions N (0 N , Λ −1 ), and N (m x , G −1 x ), respectively, where m x ∈ R Q , Λ ∈ R N×N , and G x ∈ R Q×Q are positive semi-definite matrices. In the following, when not mentioned, the Gaussian law can be degenerated; that is, the precision matrix is semi-definite positive but not with full rank. In this case, (· · ·) −1 denotes the generalized inverse. The parameters of these Gaussian distributions may be either fixed or unknown (i.e., involving some unknown hyperparameters such as regularization or acquisition parameters). It follows that the posterior distribution of x is Gaussian, with mean m ∈ R Q and precision matrix G ∈ R N×N defined as follows: A common solution to sample from N (m, G −1 ) is to use the Cholesky factorization of the covariance or the precision matrix G [25]. However, when implemented through a Gibbs sampler, this method is of limited interest. First, the precision matrix G may depend on the unknown parameters of the model and may thus take different values along the algorithm. Thereby, spending such high computational time at each iteration of the Gibbs sampler to compute the Cholesky decomposition of the updated matrix may be detrimental to the convergence speed of the Gibbs sampler. Another concern is that when dealing with high dimensional problems, we generally have to face not only computational complexity issues but also memory limitations. Such problems can be alleviated when the matrix presents some specific structures (e.g., circulant [26,27] or sparse [28]). However, for more complicated structures, the problem remains critical, especially when H ΛH and G x cannot be diagonalized in the same basis. Other recently proposed algorithms for sampling Gaussian distributions in high dimension follow a two-step perturbation-optimization approach [24,[29][30][31][32][33], which can be summarized as follows: • Perturbation: Draw a Gaussian random vector n 1 ∼ N (0 Q , G).

•
Optimization: Solve the linear system The solution to the above linear system can be approximated using iterative methods such as conjugate gradient algorithms, leading to an approximate sample of the sought distribution [30,31]. This issue has been considered in [32] by adding a Metropolis step in the sampling algorithm. In [24,33], the authors propose to reduce the computational cost by sampling along mutually conjugate directions instead of the initial high-dimensional space.

Designing Efficient Proposals in MH Algorithms
Non-Gaussian models arise in numerous applications in inverse problems [34][35][36][37]. In this context, the posterior distribution is non-Gaussian and does not generally follow a standard probability model. In this respect, MH algorithms are good tools for exploring such posteriors, and hence for drawing inferences about models and parameters. However, the challenge for MH algorithms is constructing a proposal density that provides a good approximation of the target density while being inexpensive to manipulate. Typically, in large-scale problems, the proposal distribution takes the form of a random walk (RW); that is, in each iteration, the proposal density g(.|x (t) ) in (3) is a Gaussian law centered at the current state x (t) and with covariance matrix ε 2 Q(x (t) ). Moreover, ε is a positive constant whose value is adjusted so that the acceptance probability in (3) is bounded away from zero at convergence [17]. Other sampling algorithms incorporate information about the derivative of the logarithm of the target distribution to guide the Markov chain toward the target space where samples should be mostly concentrated. For instance, when the target density is differentiable, one can use Langevin-based algorithms where the mean of the Gaussian proposal density is replaced with one iteration of a preconditioned gradient descent algorithm as follows [20,22,[38][39][40][41]: In (7), ∇J is the gradient of J , ε is a positive constant, and Q is a symmetric definite positive matrix that captures possible correlations between the coefficients of the signal. Note that some advanced versions of Langevin-based algorithms have been proposed to address problems with non-smooth laws [42,43]. It is worth noting that the choice of the scale matrices Q(x (t) ) t may deeply affect the efficiency of the aforementioned algorithms [22]. In fact, an inappropriate choice of Q may alter the quality of the Markov chain, leading to very correlated samples and thereby biased estimates. Moreover, computationally cheap matrices are also preferable, especially in high-dimensional spaces. In the case of low-dimensional problems and when the coefficients of the signal are not highly correlated, the standard RW and Metropolis-adapted Langevin algorithm (MALA) obtained for Q ≡ I Q achieve overall good results. For instance, in the context of denoising problems with uncorrelated Gaussian noise, when the coefficients of the signal are assumed to be statistically independent in the prior, they can either be sampled independently using RW or jointly by resorting to MALA. However, these algorithms may be inaccurate for large-scale problems, especially when the coefficients of the signal exhibit high correlations [22]. In this case, the design of a good proposal often requires consideration of the curvature of the target distribution. More sophisticated (and thus more computationally expensive) scale matrices should be chosen to drive the chain in the directions that reflect the dependence structure. Optimally, the curvature matrix should be chosen such that it adequately captures two kinds of dependencies: correlation over the observations specified by the observation model, and correlation between different coefficients of the target signal specified by the prior law. For instance, Q can be set to the Hessian matrix of the minus logarithm of the posterior density in the current state [20,21], or to the Fisher matrix (especially when the Hessian matrix is not definite positive [22,41]) or to the empirical covariance matrix computed according to the previous states of the Markov chain [44]. When the minus-log of the target density can be expressed as in (2), good candidates of the curvature matrix take the following form: where Λ and Ω are semi-definite positive matrices. Feasible numerical factorization of Q can be ensured if H ΛH and V ΩV are diagonalizable in the same basis. Otherwise, the use of the full matrix (8) in the scheme (7) remains generally of limited interest, especially for large-scale problems where the manipulation of the resulting proposal generally induces a high computational complexity altering the convergence speed. Alternatively, under mild conditions on the posterior density, the Majorize-Minimize strategy offers a high flexibility for building curvature matrices with a lower computational cost (e.g., diagonal matrices, bloc-diagonal matrices, circulant, etc.) [40]. However, it should be pointed out that MH algorithms with too-simple preconditioning matrices resulting from rough approximations of the posterior density may fail to explore the target space efficiently. Therefore, the scale matrix Q should be adjusted to achieve a good tradeoff between the computational complexity induced in the algorithm and the accuracy/closeness of the proposal to the true distribution.

Auxiliary Variables and Data Augmentation Strategies
It is clear that the main difficulty arising in the aforementioned sampling problems is due to the intricate form of the target covariance matrix making difficult the direct sampling or the construction of a good MH proposal that mimics the local geometry of the target law. More specifically, there are generally heterogeneous types of dependencies between the coefficients of the signal, coming either from the likelihood or from the prior information. For instance, the observation matrix H in the likelihood may bring high dependencies between distant coefficients, even if the latter are assumed to be statistically dependent in the prior law. One solution is to address the problem in another domain where H can be easily diagonalized (i.e., the coefficients of the signal become uncorrelated in the likelihood). However, if one also considers the prior dependencies, this strategy may become inefficient, especially when the prior covariance matrix cannot be diagonalized in the same basis as H, which is the case in most real problems. One should therefore process these two sources of correlations separately.
To improve the mixing of sampling algorithms, many works have proposed the elimination of one of these sources of correlation directly related to x by adding some auxiliary variables to the initial model, associated with a given conditional distribution such that simulation can be performed in a simpler way in the new larger space. Instead of simulating directly from the initial distribution, a Markov chain is constructed by alternately drawing samples from the conditional distribution of each variable, which reduces to a Gibbs sampler in the new space. This technique has been used in two different statistical literatures: data augmentation [45] and auxiliary variables strategies [46]. It is worth noting that the two methods are equivalent in their general formulation, and the main difference is often related to the statistical interpretation of the auxiliary variable (unobserved data or latent variable) [23]. In the following, we will use the term data augmentation (DA) to refer to any method that constructs sampling algorithms by introducing auxiliary variables. Some DA algorithms have been proposed in [47][48][49][50][51][52][53]. Particular attention has been focused on the Hamiltonian MCMC (HMC) approach [22,54], which defines auxiliary variables based on physically-inspired dynamics.
In the following, we propose to alleviate the problem of heterogeneous dependencies by resorting to a DA strategy. More specifically, we propose to add some auxiliary variables u ∈ R J with predefined conditional distribution of density p(u|x, z) = p(u|x) so that the minus logarithm of the joint distribution density p(x, u|z) can be written as follows: where J (u|x) = − log p(u|x) up to an additive constant. Two conditions should be satisfied by p(x, u|z) for the DA strategy to be valid: where p(u|z) should define a valid probability density function (i.e., nonnegative and with integral with respect to u equal to 1). In fact, the importance of Condition (C 1 ) is obvious, because the latent variable is only introduced for computational purposes and should not alter the considered initial model. The need for the second requirement (C 2 ) stems from the fact that p(x, u|z) should define the density of a proper distribution. Note that • the first condition is satisfied thanks to the definition of the joint distribution in (9), provided that p(u|x, z) is a density of a proper distribution; • for the second condition, it can be noticed that if the first condition is met, Fubini-Tonelli's theorem allows us to claim that This shows that p(u|z) as defined in (C 2 ) is a valid probability density function.
Instead of simulating directly from P x|z , we now alternatively draw (in an arbitrary order) samples from the conditional distributions of the two variables x and u of respective densities P x|u,z and P u|x,z . This simply reduces to a special case of a hybrid Gibbs sampler algorithm with two variables, where each iteration t is composed of two sampling steps which can be expressed as follows: • Sample u (t+1) from P u|x (t) ,z ; • Sample x (t+1) from P x|u (t+1) ,z .
Under mild technical assumptions [9,55], the constructed chain can be proved to have a stationary distribution P x,u|z . The usefulness of the DA strategy is mainly related to the fact that with an appropriate choice of p(u|x, z), drawing samples from the new conditional distributions P x|u,z and P u|x,z is much easier than sampling directly from the initial distribution P x|z . Let us emphasize that, for the sake of efficiency, the manipulation of p(u|x, z) must not induce a high computation cost in the algorithm. In this work, we propose the addition of auxiliary variables u to the model such that the dependencies resulting from the likelihood and the prior are separated; that is, J (u|x) is chosen in such a way that only one source of correlations remains related directly to x in p(x, u|z), the other sources of correlations only intervening through the auxiliary variables u and z. Note that the advantage of introducing auxiliary variables in optimization or sampling algorithms has also been illustrated in several works in the image processing literature, related to half quadratic approaches [26,[56][57][58][59][60]. This technique has also been considered in [61] in order to simplify the sampling task by using a basic MH algorithm in a maximum likelihood estimation problem. Finally, in [62], a half-quadratic formulation was used to replace the prior distribution, leading to a new posterior distribution from which inference results are deduced. The contribution of our work is the proposal of an extended formulation of the data augmentation method that was introduced in [60] in the context of variational image restoration under uncorrelated Gaussian noise. Our proposal leads to a novel acceleration strategy for sampling algorithms in large-scale problems.

Proposed Approach
In this section, we discuss various scenarios typically arising in inverse problems and we explain how our approach applies in these contexts.

Correlated Gaussian Noise
Let us consider the linear observation model (4) when the noise term w is assumed to be Gaussian, additive, and independent from the signal that is w ∼ N (0 N , Λ −1 ), with Λ ∈ R N×N a symmetric semi-definite positive precision matrix that is assumed to be known. In this context, the minus logarithm of the posterior density takes the following form: Simulating directly from this distribution is generally not possible, and standard MCMC methods may fail to explore it efficiently due to the dependencies between signal coefficients [22]. In particular, the coupling induced by the matrix H ΛH may hinder the construction of suitable proposals when using MH algorithms. For example, when V = I Q and Ψ(x) = ∑ Q i=1 ψ i (x i ), RW and standard MALA algorithms may behave poorly, as they do not account for data fidelity dependencies, while a preconditioned MALA approach with full curvature matrices may exhibit high computational load due to the presence of heterogeneous dependencies [39].
In the following, we propose the elimination of the coupling induced by the linear operators (H, Λ) by adding auxiliary variables. Since the data fidelity term is Gaussian, a natural choice is to define p(u|x, z) as a Gaussian distribution with mean Ax and covariance matrix C: where C ∈ R J×J is a symmetric positive definite covariance matrix and A ∈ R J×Q . Then, the joint distribution satisfies the two conditions (C 1 ) and (C 2 ) defined in Section 2, and its minus logarithm has the following expression: with The expression in (12) yields the sampling scheme: with n (t) ∼ N (0 J , I J ). The efficiency of the DA strategy is thus highly related to the choice of the matrices A and C. Under the requirement that C is positive definite, the choice of (A, C) is subjective and is related to specifying the source of heterogeneous dependencies that one wants to eliminate in the target distribution based on the properties of H, Λ, V, and Ψ. More specifically, one should identify if the main difficulty stems from the structure of matrix H ΛH or only from the non-trivial form of the precision matrix Λ. In what follows, we will elaborate different solutions according to the type of encountered difficulty.

Alternative I: Eliminate the Coupling Induced by Λ
Let us first consider the problem of eliminating the coupling induced by matrix Λ. This problem is encountered for example for Model (5) with circulant matrices H and G x and with Λ = I N , which induces further correlation when passing to the Fourier domain. In this context, we propose the elimination of the correlations induced by Λ by setting where µ > 0 is such that µ Λ S < 1, where · S denotes the spectral norm. This is equivalent to choosing A and C such that Note that the condition over µ allows toguarantees that C is positive definite. Under (16), the minus logarithm of the conditional distribution of x given z and u reads, up to an additive constant: Let us discuss the application of the hybrid Gibbs sampling algorithm from Section 2 to this particular decomposition. The sampling scheme (15) yields: where n (t) ∼ N (0 J , I J ). Since A and C satisfy (17), this leads to: We can remark that for every t ∈ N, A C −1/2 n (t) follows the centered Gaussian distribution with and Γ = 1 µ I N − Λ is definite positive by construction. Then, the resulting algorithm can be viewed as a hybrid Gibbs sampler, associated to the minus logarithm of the conditional distribution of x given z and a new auxiliary variable v ∼ N (ΓHx, Γ): The main steps of the proposed Gibbs sampling algorithm are given in Algorithm 1. The appealing advantage of this algorithm with respect to a Gibbs sampler which would be applied directly to Model (5) when H and G x are diagonalizable in the same domain is that it allows easy handling of the case when Λ is not equal to a diagonal matrix having identical diagonal elements.

Algorithm 1 Gibbs sampler with auxiliary variables in order to eliminate the coupling induced by Λ.
Initialize: Note that minimizing (23) can be seen as a restoration problem with an uncorrelated noise of variance µ. It can be expected that Step 3 in Algorithm 1 can be more easily implemented in the transform domain where H and V are diagonalized, when this is possible (see Section 5 for an example)

Alternative II: Eliminate the Coupling Induced by H ΛH
In a large class of regularized models, H and V have different properties. While H almost reflects a blur, a projection, or a decimation matrix, V may model a wavelet transform or a discrete gradient operator. Such difference in their properties induces a complicated structure of the posterior covariance matrix. To address such cases, we propose the elimination of the source of correlations related to x where µ > 0 is such that µ H ΛH S < 1, so that C is positive definite. It follows that the minus logarithm of the conditional distribution of x given z and u is defined up to an additive constant as Let us make the following change of variables within the Gibbs sampling method: According to (15) and (24), we obtain where follows a zero-mean Gaussian distribution with covariance matrix Γ, then and the new target conditional distribution reads The proposed Gibbs sampling algorithm is then summarized by Algorithm 2.

Algorithm 2 Gibbs sampler with auxiliary variables in order to eliminate the coupling induced by H ΛH.
Initialize: It can be seen that heterogeneous dependencies initially existing in (11), carried by the likelihood and the prior operators, are now dissociated in the new target distribution (28). Likelihood-related correlations are no longer attached directly to the target signal. They intervene in the conditional law only through the auxiliary variable v and the observation z. In other words, the original problem reduces to solving a denoising problem where the variance of the Gaussian noise is µ. Thereby, the new target distribution (28) is generally easier to sample from compared with the initial one. In particular, one can sample the components independently when the coefficients of the signal are independent in the prior. Otherwise, if Ψ is a smooth function, one can use a Langevin-based MCMC algorithm. For instance, it may be possible to construct an efficient curvature matrix that accounts for the prior correlation and that can be easily manipulated. Table 1 summarizes the two different cases we have presented here. We would like to emphasize that the approach we propose for adding auxiliary variables according to the structure of the matrix H and Λ is sufficiently generic so that it covers a wide diversity of applications.
It is worth noting that the auxiliary variable could be introduced in the data fidelity term as well as in the prior information. The derivation of the proposed method in (13) allows us to identify classes of models for which our approach can be extended. Obviously, the key requirement is that the term which should be simplified can be written as a quadratic function with respect to some variables. Hence, without completely relaxing the Gaussian requirement, we can extend the proposed method to Gaussian models in which some hidden variables control the mean and/or the variance. This includes, for example, scale mixture of Gaussian models [63,64] such as the alpha-stable family (including the Cauchy distribution), the Bernoulli Gaussian model and the generalized Gaussian distributions, and also Gaussian Markov random fields [55]. In Section 3.2, we will investigate the case of the scale mixture of Gaussian models. When both the likelihood and the prior distribution are Gaussian conditionally to some parameters, the proposed method can be applied to each term as explained in Section 3.3.
Another point to pay attention to is the sampling of the auxiliary variable v. In particular, in Algorithm 2, we should be able to sample from the Gaussian distribution whose covariance matrix is of the form 1 µ I Q − H ΛH , which is possible for a large class of observation models as discussed in Section 3.4.

Problem Formulation
Let us consider the following observation model: such that for every i ∈ {1, . . . , N}, where (σ 1 , . . . , σ N ) are independent random variables distributed on R + according to P σ . Different forms of the mixing distribution P σ lead to different noise statistics. In particular, the Cauchy noise is obtained when σ 2 1 , . . . , σ 2 N are random variables following an inverse Gamma distribution. Let σ = [σ 1 , . . . , σ N ] . By assuming that x and σ are independent, the joint posterior distribution of x and σ is given by: In such a Bayesian estimation context, a Gibbs sampling algorithm is generally adopted to sample alternatively from the distributions P x|σ,z and P σ|x,z .
In the following, we assume that the set S 0 = {σ 1 = σ 2 = . . . = σ N = 0 } has a zero probability given the vector of observations z. Note that by imposing such rule, we ensure that at each iteration t of the Gibbs algorithm, σ (t) = 0 N almost surely.
Since sampling from P x|σ,z is supposed to be intractable, we propose the addition of auxiliary variables v ∈ R J that may depend on the variables of interest x and σ according to a given conditional distribution density p(v|x, σ, z) = p(v|x, σ) which satisfies the following conditions: where p(v|z) should be a valid probability density function.
Using the same arguments as in Section 2.2, these two properties are satisfied provided that p(v|x, σ, z) defines a proper probability density function. It follows that the initial two-step Gibbs iteration is replaced by the following three sampling steps. First, sample v (t+1) from P v|x (t) ,σ (t) ,z then sample x (t+1) from P x|σ (t) ,v (t+1) ,z , and finally sample σ (t+1) from P σ|x (t+1) ,v (t+1) ,z .

Proposed Algorithms
Let D(σ) be the diagonal matrix whose diagonal elements are given by Note that, since S 0 has zero probability, we almost surely have • Suppose first that there exists a constant ν > 0 such that Then, results in Section 3.1 with a Gaussian noise can be extended to scale mixture of Gaussian noise by substituting-at each iteration t-D (t) for Λ, and by choosing µ < ν 2 in Algorithm 1 and µ H 2 S < ν 2 in Algorithm 2. The only difference is that an additional step must be added to the Gibbs algorithm to draw samples of the mixing variables σ 1 , . . . , σ N from their conditional distributions given x, v, and z.

•
Otherwise, when ν > 0 satisfying (34) does not exist, results in Section 3.1 remain also valid when, at each iteration t, for a given value of σ (t) , we replace Λ by D(σ (t) ). However, there is a main difference with respect to the case when ν > 0, which is that µ depends on the value of the mixing variable σ (t) and hence can take different values along the iterations. Subsequently, µ(σ) will denote the chosen value of µ for a given value of σ. Here again, two strategies can be distinguished for setting µ(σ (t) ) t∈N , depending on the dependencies one wants to eliminate through the DA strategy.

Alternative I: Eliminate the Coupling Induced by D(σ (t) )
A first option is to choose, at each iteration t, µ(σ (t) ) positive such that with ∈]0, 1[ and The auxiliary variable is then drawn as follows: where Γ(σ (t) ) = 1 µ(σ (t) ) I N − D(σ (t) ) is positive definite by construction. The minus logarithm of the posterior density p(x|σ, v, z) is given by Alternative II: Eliminate the Coupling Induced by H D(σ (t) )H Similarly, in order to eliminate the coupling induced by the full matrix H D(σ (t) )H, µ(σ (t) ) can be chosen at each iteration t ∈ N so as to satisfy with ∈]0, 1[ and I (t) is given by (36). Then, the auxiliary variable is drawn as where Γ(σ (t) ) = 1 H is positive definite. The minus logarithm of the posterior density p(x|σ, v, z) then reads It is worth noting that σ and v are two dependent random variables conditionally to both x and z.
The resulting Gibbs samplers, corresponding to Alternatives I and II, respectively, are summarized in Algorithms 3 and 4.

Algorithm 3
Gibbs sampler with auxiliary variables in order to eliminate the coupling induced by D(σ) in the case of a scale mixture of Gaussian noise.

Partially Collapsed Gibbs Sampling
It can be noted that it is generally complicated to sample from P σ|x,v,z due to the presence of µ(σ) and D(σ) in the conditional distribution of v. One can replace this step by sampling from P σ|x,z ; that is, directly sampling σ from its marginal posterior distribution with respect to v and conditionally to x and z. In this case, we say that we are partially collapsing v in the Gibbs sampler. One of the main benefits of doing so is that, conditionally to x and z, σ has independent components. However, as σ is sampled independently from v, the constructed Markov chain t 0 may have a transition kernel with an unknown stationary distribution [65]. This problem can also be encountered when the auxiliary variable v depends on other unknown hyperparameters changing along the algorithm, such as prior covariance matrix or regularization parameter when the auxiliary variable is added to the prior instead of the likelihood. However, there are some rules based on marginalization, permutation, and trimming that allow the conditional distributions in the standard Gibbs sampler to be replaced with conditional distributions marginalized according to some variables while ensuring that the target stationary distribution of the Markov chain is maintained. The resulting algorithm is known as the Partially Collapsed Gibbs Sampler (PCGS) [65]. Although this strategy can significantly decrease the complexity of the sampling process, it must be implemented with care to guarantee that the desired stationary distribution is preserved. Applications of PCGS algorithms can be found in [66][67][68].
Assume that, in addition to x, σ, v, we have a vector Θ ∈ R P of unknown parameters to be sampled. Note that p(x, σ, Θ, v|z) should be integrable with respect to all the variables. Following [65], we propose the use of a PCGS algorithm that allows us to replace the full conditional distribution P σ|x,v,Θ,z with its conditional distribution P σ|x,Θ,z without affecting the convergence of the algorithm to the target stationary law. Algorithm 5 shows the main steps of the proposed sampler. It should be noted that, unlike the standard Gibbs algorithm, permuting the steps of this sampler may result in a Markov chain with an unknown stationary distribution.

High-Dimensional Gaussian Distribution
The proposed DA approach can also be applied to the problem of drawing random variables from a high-dimensional Gaussian distribution with parameters m and G as defined in (5) and (6). The introduction of auxiliary variables can be especially useful in facilitating the sampling process in a number of problems that we discuss below. In order to make our presentation clearer, an additional index will be added to the variables v and µ, introduced in Section 2.

•
If the prior precision matrix G x and the observation matrix H can be diagonalized in the same basis, it can be of interest to add the auxiliary variable v 1 in the data fidelity term. Following Algorithm 1, let µ 1 > 0 such that µ 1 Λ S < 1 and v 1 ∼ N 1 The resulting conditional distribution of the target signal x given the auxiliary variable v 1 and the vector of observation z is a Gaussian distribution with the following parameters: Similarly, if it is possible to write G x = V ΩV, such that H and V can be diagonalized in the same basis, we suggest the introduction of an extra auxiliary variable v 2 independent of v 1 in the prior term to eliminate the coupling introduced by Ω when passing to the transform domain. Let µ 2 > 0 be such that µ 2 Ω S < 1 and let the distribution of v 2 conditionally to x be given by The joint distribution of the unknown parameters is given by It follows that the minus logarithm of the conditional distribution of x given z, v 1 , and v 2 is Gaussian with parameters: and • If G x and H are not diagonalizable in the same basis, the introduction of an auxiliary variable either in the data fidelity term or the prior allows us to eliminate the coupling between these two heterogeneous operators. Let µ 1 > 0 such that µ 1 H ΛH S < 1 and Then, the parameters of the Gaussian posterior distribution of x given v 1 read: Note that if G x has some simple structure (e.g,. diagonal, block diagonal, sparse, circulant, etc.), the precision matrix (50) will inherit this simple structure.
Otherwise, if G x does not present any specific structure, one could apply the proposed DA method to both data fidelity and prior terms. It suffices to introduce an extra auxiliary variable v 2 in the prior law, additionally to the auxiliary variable v 1 in (49). Let µ 2 > 0 be such that µ 2 G x S < 1 and v 2 ∼ N 1 Then, the posterior distribution of x given v 1 and v 2 is Gaussian with the following parameters: and where

Sampling the Auxiliary Variable
It is clear that the main issue in the implementation of all the proposed Gibbs algorithms arises in the sampling of the auxiliary variable v. The aim of this section is to propose efficient strategies for implementing this step at a limited computational cost, in the context of large-scale problems.
For the sake of generality, we will consider that v follows a multivariate Gaussian distribution with a covariance matrix of the form Γ = 1 Our first suggestion is to set µ such that with β > 0. For example, one can set µ H 2 S Λ S and β = √ Λ S , where 0 < < 1. This allows us to verify the requirement µ H ΛH S < 1. Moreover, it leads to Thus, the sampling step of the auxiliary variable at iteration t ∈ N can be replaced by the three following steps: Compute Hereabove, y (t+1) and n (t+1) are independent random variables. One can notice that the sampling problem of the auxiliary variables is now separated into two independent subproblems of sampling from large-scale Gaussian distributions. The first sampling step can usually be performed efficiently. For instance, if Λ is diagonal (e.g., when the model is a scale mixture of Gaussian variables), coefficients  • Suppose that H satisfies HH = νI N with ν > 0, which is the case, for example, of tight frame synthesis operators or decimation matrices. Note that νλ √ < 1. We then have: It follows that a sample from the Gaussian distribution with covariance matrix 1 λ I Q − H H can be obtained as follows: where y (t+1) 1 ∈ R Q and y (t+1) 2 ∈ R N are independent Gaussian random vectors with covariance matrices equal to I Q and I N , respectively.

•
Suppose that H = MP with M ∈ R N×K and P ∈ R K×Q . Hence, one can set λ > 0 and λ > 0 such that It appears that if it is possible to draw merely random vectors y

Application to Multichannel Image Recovery in the Presence of Gaussian Noise
We now discuss the performance of the proposed DA strategies in the context of restoration of multichannel images (MCIs). Such images are widely used in many application areas, such as medical imaging and remote sensing [69][70][71]. Several medical modalities provide color images, including cervicography, dermoscopy, and gastrointestinal endoscopy [72]. Moreover, in the field of brain exploration with neuro-imaging tools, multichannel magnetic resonance images are widely used for multiple sclerosis lesion segmentation [73]. Indeed, the multicomponent images correspond to different magnetic resonance intensities (e.g., T1, T2, FLAIR). They contain different information on the underlying tissue classes that enable discrimination of the lesions from the background. Multiple channel components typically result from imaging a single scene by sensors operating in different spectral ranges. For instance, about a dozen radiometers may be on-board remote sensing satellites. Most of the time, MCIs are corrupted with noise and blur arising from the acquisition process and transmission steps. Therefore, restoring MCIs is of primary importance as a preliminary step before addressing analysis tasks such as classification, segmentation, or object recognition [74]. Several works dedicated to MCI processing rely on wavelet-based approaches [70,75]. In this section, we propose the adoption of a Bayesian framework for recovering the wavelet coefficients of deteriorated MCI, with the aim of analyzing the performance of the aforementioned hybrid Gibbs samplers.

Problem Formulation
Let us consider the problem of recovering a multicomponent image with B componentsȳ 1 , . . . ,ȳ B in R R (the images being columnwise reshaped) from some observations z 1 , . . . , z B which have been degraded by spatially-invariant blurring operators B 1 , . . . , B B and corrupted by independent zero-mean additive white Gaussian noises having the same known variance σ 2 . As already stated, here we propose addressing the restoration problem in a transform domain where the target images are assumed to have a sparse representation. Let us introduce a set of tight frame synthesis operators F * 1 , . . . , F * B [76] such that (∀b ∈ {1, . . . , where for every b ∈ {1, . . . , B}, F * b is a linear operator from R K to R R with K R andx b is the vector of frame coefficients of the imageȳ b . Each frame transform operator decomposes the image into M oriented subbands at multiple scales with sizes K m , m ∈ {1, . . . , M}, such that ∑ M m=1 K m = K: Then, the problem can be formulated as (4), that is: where w ∼ N (0 N , and We propose exploitation of the cross-component similarities by jointly estimating the frame coefficients at a specific orientation and scale through all the B components. In this respect, for every m ∈ {1, . . . , M}, for every k ∈ {1, . . . , K m }, let x m,k = (x b,m,k ) 1 b B ∈ R B be the vector of frame coefficients for a given wavelet subband m at a spatial position k through all the B components. Note that this vector can be easily obtained through x m,k = P m,k x, where P m,k ∈ R B×Q is a sparse matrix containing B lines of a suitable permutation matrix. To promote the sparsity of the wavelet coefficients and the inter-component dependency, following [70], we assume that for every m ∈ {1, . . . , M}, the vectors x m,1 , . . ., x m,K m are realizations of a random vector following a generalized multivariate exponential power (GME P) distribution with scale matrix Σ m , shape parameter β m , and smoothing parameter δ m . Thus, the minus-log of the prior likelihood is given up to an additive constant by where for every m ∈ {1, . . . , M}, a m ∈ R B , and for all t ∈ R, ψ m (t) = 1 2 (t 2 + δ m ) β m . Our goal is to compute the posterior mean estimate of the target image as well as the unknown regularization parameters using MCMC sampling algorithms accelerated thanks to our proposed DA strategies. In the following, we will denote by Θ the vector of unknown regularization parameters to be estimated jointly with x in the Gibbs sampling algorithm.

Sampling from the Posterior Distribution of the Wavelet Coefficients
One can expect that the standard sampling algorithms fail to efficiently explore the target posterior not only because of the high dimensionality of the problem, but also because of the anisotropic nature of the wavelet coefficients. In fact, the coefficients belonging to different scales are assumed to follow GME P priors with different shapes β m , m ∈ {1, . . . , M}. For instance, coefficients belonging to the low-resolution subband are generally assumed to be driven from a Gaussian distribution (i.e., β m = 1), while GME P priors with very small shape parameter (i.e., β m 1 2 ) are generally assigned to high-resolution subbands at the first level of decomposition in order to promote sparsity. Therein, one can better explore the directions of interest separately by using different amplitudes than sampling them jointly. However, the observation matrix causes high spatial dependencies between the coefficients, and thus hinders processing the different wavelet subbands independently. The DA approaches we introduced in Section 3 allow this preconditioning problem to be tackled by adding auxiliary variables to the data fidelity term. More specifically, following Algorithm 2, we propose the introduction of an auxiliary variable v ∈ R Q such that: where µ B 2 S F 2 S < 1. Since the set of hyperparameters Θ is independent of the auxiliary variable v when conditioned to x, each iteration t ∈ N of the proposed Gibbs sampling algorithm contains the following steps: Sample v (t+1) from P v|x (t) ,z .
If B is circulant (by assuming periodic boundary conditions of the blur kernel), the first sampling step can be easily done by passing to the Fourier domain. In particular, if F is orthonormal (that is, FF * = F * F = I Q ), samples of the auxiliary variables can be obtained by first drawing Gaussian random variables in the Fourier domain and then passing to the wavelet domain. Otherwise, if F is a non-orthonormal transform, sampling can be performed using our results stated in (59) and (62).
Note that in the new augmented space, the restoration problem reduces to a denoising problem with zero-mean Gaussian noise of variance µ, and the posterior density reads: where It follows that we can draw samples of vectors x m,k , m ∈ {1, . . . , M}, k ∈ {1, . . . , K m } in an independent manner. Thus, the resolution of the initial high-dimensional problem of size Q = KB reduces to the resolution of K parallel subproblems of size B. This is particularly interesting in the case of MCIs where we generally have K B. Instead of processing all the different wavelet coefficients at the same time, the proposed method allows each subproblem to be dealt with independently. This avoids sampling problems related to the heterogeneous prior distribution. Different sampling algorithms may be chosen according to the properties of the target distribution in each subproblem. Specifically, for each sampling subproblem, we propose to use either RW or MALA algorithms [17,77].
In the following, we will discuss the practical implementation of the third step of the Gibbs algorithm; namely, sampling from the posterior distribution of Θ.

Separation Strategy
For every m ∈ {1, . . . , M}, β m controls the shape of the GME P distribution, allowing for heavier tails than the Laplace distribution (β m < 0.5) and approaching the normal distribution when β m tends to 1. In this work, we assume that for every m ∈ {1, . . . , M}, β m and δ m are fixed. Actually, the shape parameter is set to different values with respect to the resolution level, spanning from very small values (β m < 0.5) in order to enforce sparsity in the detail subbands at the first levels of decomposition to relatively higher values (0.5 < β m < 1) for detail subband at higher resolution levels, whereas a Gaussian distribution is generally assigned to the low-frequency subband. Furthermore, we set δ m to a positive small value, ensuring that (78) is differentiable [70]. As already mentioned, the scale matrices (Σ m ) 1≤m≤M will be estimated. Let us define P Σ m the prior distribution of the scale matrix for each subband m ∈ {1, . . . , M} and p(Σ m ) its related density. The associated posterior density reads When β m = 1, the GME P prior reduces to a Gaussian distribution. In this case, a common choice of P Σ m is an inverse Wishart distribution and (72) is also an inverse Wishart distribution [78]. However, when 0 < β m < 1, (72) does not belong to classical families of matrix distributions. In that respect, rather than estimating the scale matrices directly, we resort to a separation strategy. More specifically, we propose the independent estimation of the standard deviations and the correlation terms. Let us decompose the scale matrix for each subband m ∈ {1, . . . , M} as follows [79]: where R m ∈ R B×B is the correlation matrix (whose diagonal elements are equal to 1 and the remaining ones define the correlation between the coefficients and have absolute value smaller than 1), s m ∈ R B is a vector formed by the square root of the precision parameters (the inverse of standard deviations), and C β m ,δ m is a multiplicative constant that depends on β m and δ m [70]. The advantage of this factorization can be explained by the fact that the estimation of the correlation matrix will not alter the estimation of the variances. For every m ∈ {1, . . . , M}, we decompose the precision vector as follows: where γ m is positive and n m ∈ R B is a vector of positive coefficients whose sum is equal to 1. Then, n m can be seen as the vector containing positive normalized weights of all the B components in the subband m. For simplicity, let us assume that the different components of the image have the same correlation and weights in all subbands; i.e., R = R m and n m = n for every m ∈ {1, . . . , M}. Furthermore, let us suppose that n is known. We then have

Prior and Posterior Distribution for the Hyperparameters
One can construct the correlation matrix R by sampling from an inverse Wishart distribution. Specifically, let C ∼ IW (A, c) where A is an appropriate positive definite matrix of R B×B and c > 0. Then, we can write R = ∆C∆, where ∆ is the diagonal matrix whose elements are given by , for every i ∈ {1, . . . , B}. Following [79], we can show that the prior density of R reads: In the following, we will use the notation R ∼ SS(A, c) to denote this prior. In particular, when A = I B , individual correlations have the marginal density p(ρ i, for every (i, j) ∈ {1, . . . , B} 2 such that i = j, which can be seen as a rectangular Beta distribution on the interval [−1, 1] with both parameters equal to (c − B + 1)/2. For c = B + 1, we obtain marginally uniformly distributed correlations, whereas by setting B c < B + 1 (or B + 1 < c), we get marginal priors with heavier (or lighter) tails than the uniform distribution-that is, distributions that promote high correlation values around the extremity of the intervals (or near-zero values), respectively [79]. Thus, the posterior distribution of R is given by where Here we propose to sample from (77) at each iteration t ∈ N using an MH algorithm with proposal SS(Ã,c), whereÃ is set to the current value of R at iteration t andc is chosen to achieve reasonable acceptance probabilities.
For every m ∈ {1, . . . , M}, we assume a Gamma prior for γ m ; that is, γ m ∼ G(a γ m , b γ m ), where a γ m > 0 and b γ m > 0 [80]. Then, the posterior distribution of γ m is given by: Note that if δ m = 0, then (79) reduces to a Gamma distribution with parameters: When δ m > 0, sampling from (79) will be performed using an independent MH algorithm with a Gamma proposal of parameters (80) and (81).

Initialization
We propose to set the prior distributions of R, γ 1 , . . . , γ M using empirical estimators from the degraded image. In particular, a rough estimator of R can be computed from the subband containing the low-resolution wavelet coefficients at the highest level of decomposition. In the case when F is orthonormal, the variance of wavelet coefficients of the original image are approximately related to those of the degraded image through: where [.] m designates the wavelet coefficients belonging to the subband m and α m is a positive constant which depends on the subband index m and on the blur matrix. Expression (82) is derived from the considered observation model (65) by assuming a constant approximation of the impulse response of the blur filter in each wavelet subband. Note that α m can be calculated beforehand as follows. Given noise-free data, we compute the original empirical variance for each wavelet subband. Then, we calculate again the new variances of the subbands when the data is blurred using matrix B.
The coefficients α m are finally estimated for each wavelet subband by computing the ratio of the two variances by a linear regression. When α m is not too small with respect to 1, estimators of var([x b ] m ) can be reliably computed from α m and var([F b z b ] m ) using (82). We propose the use of this method to compute estimators of the variances in subbands at the highest levels of decomposition and then deduce the variances of the remaining subbands by using some properties of multiresolution wavelet decompositions. Note that each detail subband m corresponds to a given orientation l (horizontal, vertical, diagonal) and a given scale j (related resolution level). Actually, the variances of the detail subbands can be assumed to follow a power law with respect to the scale of the subband, which can be expressed as follows [81]: where l and l are constants depending on the orientation l of the subband m. Once the variances of subbands in the two highest levels of decomposition have been computed using (82), we can calculate l and l for each orientation l using the slope and the intercept of these variances from a log plot with respect to the scale j. The remaining variances are then estimated by using (83). We then deduce from these variances an empirical estimator of n, and set the parameters of the prior distributions of γ 1 , . . . , γ M .

Experimental Results
In these experiments, we consider the Hydice hyperspectral (https://engineering.purdue.edu/ biehl/MultiSpec/hyperspectral.html) data composed of 191 components in the 0.4 to 2.4 µm region of the visible and infrared spectrum. The test image was constructed by taking only a portion of size 256 × 256 and B = 6 components of Hydice using the channels 52, 67, 82, 97, 112, and 127. Hence, the problem dimension was N = 393, 216. The original image was artificially degraded by a uniform blur of size 5 × 5 and an additive zero-mean white Gaussian noise with variance σ 2 = 9 so that the initial signal-to-noise ratio (SNR) was 11.16 dB. We performed an orthonormal wavelet decomposition using the Symlet wavelet of order 3, carried out over three resolution levels, hence M = 10 and Q = N. For the subband corresponding to the approximation coefficients (m = 10), we chose a Gaussian prior (i.e., β m = 2, δ m = 0). For the remaining subbands (m ∈ {1, . . . , M − 1}), we set δ m = 10 −4 . Moreover, we set β m = 0.2 for the detail subbands corresponding to the lowest level of decomposition, β m = 0.4 for the second level of decomposition, and β m = 0.5 for the third level of decomposition.
We ran the Gibbs sampling Algorithm 2 with a sufficient number of iterations to reach stability. The obtained samples of the wavelet coefficients after the burn-in period were then used to compute the empirical MMSE estimator for the original image. Table 2 reports the results obtained for the different components in terms of SNR, PSNR (peak signal-to-noise ratio), BSNR (blurred signal to noise ratio), and SSIM (structural similarity). It can be noticed that the values of both the objective metrics and the perceptual ones were significantly improved by our method for all the spectral components. For instance, the PSNR values were increased on average by around 4.15 dB, and the SSIM by around 0.23. The achieved gains indicate that the MMSE estimator yielded good numerical results. This can also be corroborated by Figure 1, showing the visual improvements for the different components of the multichannel image. One can observe that all the recovered images were correctly deblurred. Furthermore, small objects were enhanced in all the displayed components. Table 2. Restoration results. SNR: signal-to-noise ratio; BSNR: blurred SNR; PSNR: peak SNR; MMSE: minimum mean square error; SSIM: structural similarity.  We propose to compare the performance of the Gibbs sampler with auxiliary variables when the posterior law of the wavelet coefficients is explored using either RW or MALA [17,77] algorithms. We also compared the speed of our proposed approaches with standard RW and MALA without the use of auxiliary variables. Figure 2 shows the evolution-with respect to the computational time-of the scale parameter γ m in the horizontal subband for the first level of decomposition using the various algorithms. The results associated with the proposed algorithms appear in solid lines, while those associated with standard algorithms without use of auxiliary variables are in dashed lines. It can be observed that the proposed algorithms reached stability much faster than the standard methods. Indeed, since the problem dimension is large, the stepsize ε in the standard algorithms was constrained to take very small values to allow appropriate acceptance probabilities, whereas in the new augmented space the subproblems dimension was smaller allowing large moves to be accepted with high probability values. Note that the MALA algorithm with auxiliary variables exhibited the best performance in terms of convergence speed. We summarize the obtained samples using the proposed algorithms by showing the marginal means and standard deviations of the hyperparameters in Table 3. It can be noted that the two proposed algorithms provided similar estimation results.  It is worth noting that for larger-dimensional problems (i.e., larger values of B), one could further improve the efficiency of the proposed algorithm by exploiting the parallel structure of the sampling tasks.

Problem Formulation
In this second experiment, we consider the observation problem defined in (29), where H corresponds to a spatially invariant blur with periodic boundary conditions and the noise is a two-term mixed Gaussian variable; i.e., for every i ∈ {1, . . . , N}, w i ∼ N (0, σ 2 i ) such that where κ 1 , κ 2 are positive, 0 < β < 1 is the probability that the variance of the noise σ i equals κ 2 , and δ κ 1 and δ κ 2 denote the discrete measures concentrated at the values κ 1 and κ 2 , respectively. Model (84) can approximate, for example, mixed impulse Gaussian noise arising in radar, acoustic, and mobile radio applications [82,83]. In this case, the impulse noise is approximated with a Gaussian one with a large variance κ 2 κ 1 , and β represents the probability of occurrence of the impulse noise. In the following, we assume without loss of generality that κ 2 κ 1 . We address the problem of estimating x, σ, β, κ 1 , and κ 2 from the observations z.

Prior Distributions
We propose to use conjugate priors for the unknown variances, namely inverse Gamma distributions; i.e., κ 2 i ∼ IG(a i , b i ), i ∈ {1, 2}, where a i and b i are positive constants. Here, a 1 , a 2 , b 1 , and b 2 are set in practice to small values to ensure weakly informative priors. For the occurrence probability β, we chose a uniform prior distribution (i.e., β ∼ U (0, 1)). Furthermore, the target image was assumed to follow a zero-mean Gaussian prior with a covariance matrix G −1 known up to a precision parameter γ > 0; i.e., Different covariance matrices may be chosen depending on which properties one wants to impose on the estimated image. In this example, we propose to enforce smoothness by setting L = δI Q − ∇ 2 , where ∇ 2 is the circulant convolution matrix associated with a Laplacian filter and δ > 0 is a small constant that aims to ensure the positive definiteness of the prior covariance matrix. We further assume that the regularization parameter γ follows an inverse Gamma prior with parameters a γ > 0 and b γ > 0. The resulting hierarchical model is displayed in Figure 3.

Sampling from the Posterior Distribution of x
In the Gibbs algorithm, we need to draw samples from the multivariate Gaussian distribution of parameters (86) and (87) changing along the sampling iterations. In particular, even if H and L are circulant matrices, sampling cannot be done in the Fourier domain because of the presence of D. In the sequel, we will use the method proposed in Section 3.3 to sample from this multivariate Gaussian distribution. More specifically, we exploit the flexibility of the proposed approach by resorting to two variants. In the first variant, we take advantage of the fact that L and H are diagonalizable in the Fourier domain, and we propose to add the auxiliary variable to the data fidelity term in order to get rid of the coupling caused by D when passing to the Fourier domain. In the second variant, we introduce auxiliary variables for both the data fidelity and the prior terms in order to eliminate the coupling effects induced by all linear operators in the posterior distribution of the target image.

First Variant
We introduce the variable v whose conditional distribution-given the set of main parameters of the problem-is the Gaussian distribution of mean 1 µ I N − D Hx and covariance matrix 1 µ I N − D , where µ = D −1 S with 0 < < 1. In practice, we set = 0.99. It follows that the new conditional distribution of the target signal is where m and G are defined as follows: It is worth noting that the auxiliary variable v depends on x, and also on σ through µ and D, but does not depend on β, κ 1 , κ 2 , γ when conditioned to x, σ, and z. Thus, we propose to use the partially collapsed Gibbs sampling algorithm in order to collapse the auxiliary variables in the sampling step of σ. At each iteration t ∈ N, the proposed algorithm goes through the following steps in an ordered manner:

Second Variant
Another strategy is to introduce two independent auxiliary variables v 1 and v 2 in R Q following Gaussian distributions of means Γ 1 x and Γ 2 x and covariance matrices Γ 1 and Γ 2 , respectively, where and In practice, we set µ 1 = H −2 S D −1 S and µ 2 = L −2 S , where = 0.99. Then, the posterior distribution of x conditioned to σ, β, κ 2 1 , κ 2 2 , γ, v 1 , v 2 , and z is Gaussian with mean m and precision matrix G defined as and The auxiliary variable v 1 depends implicitly on σ through D and µ, but does not depend on the remaining parameters when conditioned to x, σ, and z. Similarly, v 2 does not depend on σ, β, κ 2 1 , κ 2 2 , v 1 , γ when conditioned to x and z. We propose a PCGS algorithm that allows to collapse v 1 in the sampling step of σ. Each iteration t ∈ N of the proposed PCGS algorithm is composed of the following arranged sampling steps.
Note that since H and L are circulant matrices and D is diagonal, sampling the auxiliary variables in the proposed methods can be easily performed following Section 3.4.

Experimental Results
We consider a set of three test images denoted byx 1 ,x 2 , andx 3 , of size 512 × 512. These images were artificially degraded by a spatially-invariant blur with point spread function h and further corrupted with mixed Gaussian noise. The Gibbs algorithms were run for 6000 iterations and a burn-in period of 4000 iterations was considered. Estimators of the unknown parameters were then computed using the empirical mean over the 2000 obtained samples. Visual results are displayed in Figure 4 as well as estimates of hyper-parameters using AuxV1. We focus now on imagex 1 in order to compare the two variants of our proposed method with the Reversible Jump Perturbation Optimization (RJPO) algorithm [32]. For this method, we used the conjugate gradient algorithm as a linear solver at each iteration whose maximal number of iterations and tolerance were adjusted to correspond to an acceptance probability close to 0.9. We used the same initialization for all compared algorithms. Figures 5-8 display samples of hyperparameters as a function of iteration or time. By visually examining the trace plots, we can notice that all algorithms were stabilized after an appropriate burn-in period. In particular, RJPO and AuxV1 showed approximately the same iterative behavior, while AuxV2 required about 3000 iterations to reach iconvergence. This corresponds to twice the burn-in length of RJPO and AuxV1. However, each iteration of the RJPO is time consuming since an iterative algorithm is run until convergence at each iteration. Adding auxiliary variables to the model allows the signal to be sampled in a computationally efficient way in the enlarged state space, so that the computational cost of each iteration was highly reduced for both proposed algorithms, and the total time needed to converge was noticeably shortened compared with RJPO. Regarding the stabilization phase, we consider samples generated after the burn-in period (namely, the last 2000 samples for each algorithm). First, we aimed to study the accuracy of estimators of the unknown variables from these samples. More specifically, we computed empirical estimators of the marginal posterior mean and standard deviation of the target parameters as well as those of a randomly chosen pixel x i . Table 4 reports the obtained results. It can be noted that parameters β, κ 1 , and κ 2 were correctly estimated by all the algorithms, while the remaining parameters had similar estimated values. Second, in order to evaluate the mixing properties of the chains at convergence, we computed an empirical estimation of the mean square jump in stationary state from the obtained samples. This indicator can be seen as an estimation of the average distance between two successive samples in the parameter space. It was computed after the burn-in period t 0 = 5000 using P = 2000 last samples as follows: Note that maximizing the mean square jump is equivalent to minimizing a weighted sum of the 1-lag autocorrelations. In Table 5, we show estimates of the mean square jump per second in stationary state, which is defined as the ratio of the mean square jump and the computational time per iteration. This can be seen as an estimation of the average speed of the algorithm for exploring the parameter space at convergence. We also compared the statistical efficiency of the different samplers with respect to RJPO, defined as the mean square jump per second of each sampler over the mean square jump per second of RJPO. We can notice that the speed improvement of the proposed algorithms came at the expense of a deterioration of the quality of the generated samples. In fact, both proposed algorithms yielded lower values of mean square jump than the RJPO algorithm, which indicates that correlation between successive samples was increased. Furthermore, AuxV1 appeared to have better mixing properties compared with AuxV2. However, the generation of every sample in RJPO is very costly, so its efficiency remained globally poorer compared with AuxV1 and AuxV2. The best trade-off between convergence speed and mixing properties of the chain was achieved by the proposed AuxV1 algorithm.

Conclusions
In this paper, we have proposed an approach for sampling from probability distributions in large-scale problems. By adding some auxiliary variables to the model, we succeeded in separately addressing the different sources of correlations in the target posterior density. We have illustrated the usefulness of the proposed Gibbs sampling algorithms in two application examples. In the first application, we proposed a wavelet-based Bayesian method to restore multichannel images degraded by blur and Gaussian noise. We adopted a multivariate prior model that takes advantage of the cross-component correlation. Moreover, a separation strategy has been applied to construct prior models of the related prior hyperparameters. We then employed the proposed Gibbs algorithm with auxiliary variables to derive optimal estimators for both the image and the unknown hyperparameters. In the new augmented space, the resulting model makes sampling much easier since the coefficients of the target image are no longer updated jointly, but in a parallel manner. Experiments carried out on a set of multispectral satellite images showed the good performance of the proposed approach with respect to standard algorithms. Several issues could be investigated as future work, such as the ability of the proposed algorithm to deal with inter-scale dependencies in addition to the cross-channel ones. In the second application, we have applied the proposed method to the recovery of signals corrupted with mixed Gaussian noise. When compared to a state-of-the-art method for sampling from high dimensional scale Gaussian distributions, the proposed algorithms achieve a good tradeoff between the convergence speed and the mixing properties of the Markov chain, even if the generated samples are not independent. Note that the proposed method can be applied to a wide class of applications in inverse problems-in particular, those including conditional Gaussian models either for the noise or the target signal.