On Default Priors for Robust Bayesian Estimation with Divergences

This paper presents objective priors for divergence-based robust Bayesian estimation against outliers. The minimum γ-divergence estimator is well known to perform well under heavy contamination. Robust Bayesian methods that use quasi-posterior distributions based on divergences have also been proposed in recent years. In the objective Bayesian framework, the selection of default prior distributions under such quasi-posterior distributions is an important problem. In this study, we provide some properties of reference and moment matching priors under the quasi-posterior distribution based on the γ-divergence. In particular, we show that the proposed priors are approximately robust under a condition on the contamination distribution, without assuming any condition on the contamination ratio. Some simulation studies are also presented.


Introduction
The problem of robust parameter estimation against outliers has a long history. For example, Huber and Ronchetti [1] provided an excellent review of classical robust estimation theory. It is well known that the maximum likelihood estimator (MLE) is not robust against outliers because it is obtained by minimizing the Kullback-Leibler (KL) divergence between the true and empirical distributions. To overcome this problem, we may use other (robust) divergences instead of the KL divergence. Robust parameter estimation based on divergences has been one of the central topics in modern robust statistics (e.g., [2]). Such a method was first proposed by [3], who referred to it as the minimum density power divergence estimator. Reference [4] also proposed the "type 0 divergence", which is a modified version of the density power divergence, and Reference [5] showed that it has good robustness properties. The type 0 divergence is also known as the γ-divergence, and statistical methods based on the γ-divergence have been presented by many authors (e.g., [6][7][8]).
In Bayesian statistics, robustness against outliers is also an important issue, and divergence-based Bayesian methods have been proposed in recent years. Such methods are known as quasi-Bayes (or general Bayes) methods in some studies, and the corresponding posterior distributions are called quasi-posterior (or general posterior) distributions. To overcome the model misspecification problem (see [9]), the quasi-posterior distributions are based on a general loss function rather than the usual log-likelihood function. In general, such loss functions need not depend on an assumed statistical model. However, in this study, we use loss functions that depend on the assumed model because we are interested in robust estimation against outliers; that is, the model itself is not misspecified, but the data generating distribution is contaminated. In other words, we use divergences or scoring rules as the loss function for the quasi-posterior distribution (see also [10][11][12][13][14]). For example, Reference [10] used the Hellinger divergence, Reference [11] used the density power divergence, and References [12,14] used the γ-divergence.
In particular, the quasi-posterior distribution based on the γ-divergence was referred to as the γ-posterior in [12], and they showed that the γ-posterior has good robustness properties to overcome the problems in [11].
Although the selection of priors is an important issue in Bayesian statistics, we often have no prior information in practical situations. In such cases, we may use default or objective priors, and we should select an appropriate objective prior for the given context. In particular, we consider the reference and moment matching priors in this paper. The reference prior was first proposed by [15], and the moment matching prior was proposed by [16]. However, such objective priors generally depend on the unknown data generating distribution when we cannot assume that the contamination ratio is approximately zero. For example, if we assume the ε-contamination model (see, e.g., [1]) as the data generating distribution, many objective priors depend on the unknown contamination ratio and the unknown contamination distribution because these priors involve expectations under the data generating distribution. Although [17] derived some kinds of reference priors under quasi-posterior distributions based on some kinds of scoring rules, they only discussed the robustness of such reference priors when the contamination ratio ε is approximately zero. Furthermore, their simulation studies relied heavily on this assumption about the contamination ratio; in other words, they implicitly assumed that ε is approximately zero. The current study derives moment matching priors under the quasi-posterior distribution in a similar way to [16], and we show that the reference and moment matching priors based on the γ-divergence are approximately free of such unknown quantities under a certain assumption on the contamination distribution, even if the contamination ratio is not small.
The rest of this paper is organized as follows. In Section 2, we review robust Bayesian estimation based on divergences referring to some previous studies. We derive moment matching priors based on the quasi-posterior distribution using an asymptotic expansion of the quasi-posterior distribution given by [17] in Section 3. Furthermore, we show that the reference and moment matching priors based on the γ-posterior do not depend on the contamination ratio and the contamination distribution. In Section 4, we compare the empirical bias and mean squared error of posterior means through some simulation studies. Some discussion about the selection of tuning parameters is also provided.

Robust Bayesian Estimation Using Divergences
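In this section, we review the framework of robust estimation in the seminal paper by Fujisawa and Eguchi [5], and we introduce robust Bayesian estimation using divergences. Let $X_1, \dots, X_n$ be independent and identically distributed (iid) random variables according to a distribution $G$ with probability density function $g$ on $\Omega$, and let $\mathbf{X}_n = (X_1, \dots, X_n)$. We assume the parametric model $f_\theta = f(x, \theta)$ ($\theta \in \Theta \subset \mathbb{R}^p$) and consider the estimation problem for θ.
Then, the γ-divergence between two probability densities $g$ and $f$ is defined by (see [5]):
$$D_\gamma(g, f) = \frac{1}{\gamma(1+\gamma)} \log \int_\Omega g(x)^{1+\gamma}\,dx - \frac{1}{\gamma} \log \int_\Omega g(x) f(x)^{\gamma}\,dx + \frac{1}{1+\gamma} \log \int_\Omega f(x)^{1+\gamma}\,dx, \qquad \gamma > 0.$$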

Framework of Robustness
Fujisawa and Eguchi [5] introduced a new framework of robustness, which is different from the classical one. When some of the data values are regarded as outliers, we need a robust estimation procedure. Typically, an observation that takes a large value is regarded as an outlier. Under this convention, many robust parameter estimation procedures have been proposed to reduce the bias caused by outliers. The influence function is one tool for measuring the sensitivity of an estimator to outliers. It is known that the bias of an estimator is approximately proportional to the influence function when the contamination ratio ε is small. However, when ε is not small, this approximation no longer holds. Reference [5] showed that the likelihood function based on the γ-divergence gives a sufficiently small bias under heavy contamination. Suppose that observations are generated from a mixture distribution
$$g(x) = (1 - \varepsilon) f(x) + \varepsilon \delta(x),$$
where $f(x)$ is the underlying density, $\delta(x)$ is another density function, and ε is the contamination ratio. When $x_0$ is generated from $\delta(x)$, we call $x_0$ an outlier. In Section 3, we assume that the condition
$$\nu_f := \left\{ \int_\Omega \delta(x) f(x)^{\gamma_0}\,dx \right\}^{1/\gamma_0} \approx 0 \tag{1}$$
holds for a constant $\gamma_0 > 0$ (see [5]). This condition means that the contamination distribution $\delta(x)$ mostly lies on the tail of the underlying density $f(x)$. In other words, for an outlier $x_0$, it holds that $f(x_0) \approx 0$. We note that we do not assume that the contamination ratio ε is sufficiently small. We also note that the condition (1) is a basis to prove the robustness against outliers of the minimum γ-divergence estimator in [5]. Furthermore, Reference [18] provided some theoretical results for the γ-divergence, and related works in the frequentist setting have also been developed (e.g., [6][7][8]).
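To get a feel for the tail condition on the contamination distribution, the following sketch evaluates the quantity ν_f numerically. It assumes the form used in [5], ν_f = {∫ δ(x) f(x)^γ0 dx}^(1/γ0), with f the standard normal density and δ the N(ν, 1) density. The function names, the integration grid, and the value γ0 = 1.5 are illustrative choices and are not taken from the text.

```python
import numpy as np

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2)."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def nu_f(shift, gamma0, half_width=12.0, num=200_001):
    """Assumed form nu_f = (int delta(x) f(x)^gamma0 dx)^(1/gamma0),
    with f = N(0, 1) and delta = N(shift, 1), via the trapezoidal rule."""
    x = np.linspace(-half_width, shift + half_width, num)
    y = normal_pdf(x, mean=shift) * normal_pdf(x) ** gamma0
    integral = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
    return integral ** (1.0 / gamma0)

def nu_f_closed_form(shift, gamma0):
    """Closed form of the same Gaussian integral, used only as a cross-check."""
    integral = ((2.0 * np.pi) ** (-0.5 * gamma0) / np.sqrt(1.0 + gamma0)
                * np.exp(-gamma0 * shift**2 / (2.0 * (1.0 + gamma0))))
    return integral ** (1.0 / gamma0)

if __name__ == "__main__":
    # nu_f shrinks quickly as the contamination moves into the tail of f.
    for shift in (0.0, 3.0, 6.0, 10.0):
        print(shift, nu_f(shift, gamma0=1.5))
```

Under these assumptions, ν_f decays rapidly as the contamination mean moves away from the bulk of f, which is the sense in which the condition restricts where the outliers lie rather than how many there are.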
In the rest of this section, we give a brief review of the general Bayesian updating and introduce some previous works that are closely related to this paper.

General Bayesian Updating
We consider the same framework as [9,13]. We are interested in $\theta = \theta(G)$ ($\theta \in \Theta \subseteq \mathbb{R}^p$), and we define a loss function $\ell_\theta(x) := \ell(\theta, x)$. Further, let $\theta^* = \arg\min_{\theta \in \Theta} E_G[\ell_\theta(X)]$ be the target parameter. We define the risk function by $E_G[\ell_\theta(X)]$, and the empirical risk is defined by $R_n(\theta) = (1/n) \sum_{i=1}^n \ell_\theta(X_i)$. For the prior distribution π(θ), the quasi-posterior density is defined by:
$$\pi(\theta \mid \mathbf{X}_n) \propto \pi(\theta) \exp\{-\omega n R_n(\theta)\},$$
where ω > 0 is a tuning parameter called the learning rate. We note that the quasi-posterior is also called the general posterior or Gibbs posterior. In this paper, we fix ω = 1 for the same reason as [13]. For example, if we set $\ell_\mu(x) = |x - \mu|$, we can estimate the median of the distribution without assuming a statistical model. However, in this study we consider a model-dependent loss function, which is based on a statistical divergence (or scoring rule) (see also [11][12][13][14]). The unified framework of inference using the quasi-posterior distribution was discussed by [9].
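The median example can be made concrete with a small grid computation. The sketch below assumes the quasi-posterior is proportional to π(θ) exp{−ω Σ_i ℓ_θ(x_i)}, uses a flat prior and ω = 1, and evaluates everything on an equally spaced grid. The function names, the grid, and the toy data are hypothetical choices for illustration only.

```python
import numpy as np

def gibbs_posterior_on_grid(x, grid, loss, log_prior=None, omega=1.0):
    """Normalized quasi-posterior density on an equally spaced grid.

    Assumes pi(theta | x) is proportional to
    pi(theta) * exp(-omega * sum_i loss(theta, x_i)).
    """
    # sum_i loss(theta, x_i) equals n * R_n(theta)
    total_loss = np.array([loss(theta, x).sum() for theta in grid])
    log_post = -omega * total_loss
    if log_prior is not None:
        log_post = log_post + log_prior(grid)
    log_post -= log_post.max()           # stabilize before exponentiating
    dens = np.exp(log_post)
    step = grid[1] - grid[0]
    return dens / (dens.sum() * step)    # Riemann-sum normalization

def absolute_loss(mu, x):
    """Loss whose risk minimizer is the median."""
    return np.abs(x - mu)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 200 clean observations plus 20 gross outliers at 50
    x = np.concatenate([rng.normal(0.0, 1.0, 200), np.full(20, 50.0)])
    grid = np.linspace(-5.0, 5.0, 2001)
    dens = gibbs_posterior_on_grid(x, grid, absolute_loss)
    post_mean = np.sum(grid * dens) * (grid[1] - grid[0])
    print("sample mean:", x.mean())
    print("sample median:", np.median(x))
    print("quasi-posterior mean:", post_mean)
```

With the absolute loss, the quasi-posterior concentrates near the sample median rather than the sample mean, so the gross outliers have little effect on the quasi-posterior mean in this toy setting.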

Assumptions and Previous Works
Let d(·, ·) be a cross-entropy induced by a divergence, and let $\{f_\theta : \theta \in \Theta\}$ be a statistical model. In general, the quasi-posterior distribution based on the cross-entropy is defined by:
$$\pi^{(d)}(\theta \mid \mathbf{X}_n) \propto \pi(\theta) \exp\{-n\, d(\bar g, f_\theta)\},$$
where $d(\bar g, f_\theta)$ is the empirically estimated cross-entropy and $\bar g$ is the empirical density function. In robust statistics based on divergences, we may use the cross-entropy induced by a robust divergence (e.g., [3][4][5]). In this paper, we mainly use the γ-cross-entropy proposed by [4,5]. Recently, Reference [12] proposed the γ-posterior, which is based on a monotone transformation $\tilde d_\gamma(\cdot, \cdot)$ of the γ-cross-entropy. On the other hand, Reference [11] proposed the $R^{(\alpha)}$-posterior, which is based on the density power cross-entropy $d_\alpha(\cdot, \cdot)$. Note that the cross-entropies $d_\alpha(\cdot, \cdot)$ and $\tilde d_\gamma(\cdot, \cdot)$ converge to the negative log-likelihood function as α → 0 and γ → 0, respectively. Hence, they can be regarded as generalizations of the negative log-likelihood function. It is known that the posterior mean based on the $R^{(\alpha)}$-posterior works well for the estimation of a location parameter in the presence of outliers. However, it is known to be unstable for the estimation of a scale parameter (see [12]). Nakagawa and Hashimoto [12] showed in some simulation studies that the posterior mean under the γ-posterior has a small bias under heavy contamination for both location and scale parameters.
Let $\theta_g := \arg\min_{\theta \in \Theta} d(g, f_\theta)$ be the target parameter. We now assume the following regularity conditions on the density function $f_\theta(x) = f(x; \theta)$ ($\theta \in \Theta \subset \mathbb{R}^p$). We use indices to denote derivatives of $\bar D(\theta) = d(\bar g, f_\theta)$ with respect to the components of the parameter θ. For example, $\bar D_{ijk}(\theta) = \partial^3 \bar D(\theta) / \partial\theta_i \partial\theta_j \partial\theta_k$.
(A1) The support of the density function does not depend on the unknown parameter θ, and $f_\theta$ is fifth-order differentiable with respect to θ in a neighborhood $U$ of $\theta_g$.
(A2) The interchange of the order of integration with respect to x and differentiation at $\theta_g$ is justified.
The expectations: are all finite, and M ijk s (x) exists such that: is the expectation of X with respect to a probability density function g.
For any δ > 0, with probability one : for some ε > 0 and for all sufficiently large n. The matrices I (d) (θ) and J (d) (θ) are defined by: respectively. We also assume that I (d) (θ) and J (d) (θ) are positive definite matrices. Under these conditions, References [11,12] discussed several asymptotic properties of the quasiposterior distributions and the corresponding posterior means.
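The γ-posterior reviewed in this subsection can be explored numerically for the normal model. The sketch below is not the authors' implementation: it assumes the transformed γ-cross-entropy has the form −(1/γ) {(1/n) Σ_i f_θ(x_i)^γ} / {∫ f_θ(x)^(1+γ) dx}^(γ/(1+γ)) (additive constants, which do not affect the posterior, are dropped), takes a uniform prior on (µ, σ), and approximates posterior means on a rectangular grid. The grid limits, γ = 0.5, and the toy data are illustrative assumptions.

```python
import numpy as np

def gamma_loss(x, mu, sigma, gamma):
    """Assumed transformed gamma-cross-entropy for the N(mu, sigma^2) model.
    mu and sigma are arrays of equal shape; the result has that shape."""
    mean_f_gamma = np.zeros_like(mu, dtype=float)
    for xi in x:  # accumulate over observations to keep memory small
        z = (xi - mu) / sigma
        mean_f_gamma += np.exp(-0.5 * gamma * z**2)
    mean_f_gamma *= (2.0 * np.pi * sigma**2) ** (-0.5 * gamma) / len(x)
    # closed form of int f^(1 + gamma) dx for the normal density
    integral = (2.0 * np.pi * sigma**2) ** (-0.5 * gamma) / np.sqrt(1.0 + gamma)
    return -mean_f_gamma / (gamma * integral ** (gamma / (1.0 + gamma)))

def negative_mean_loglik(x, mu, sigma):
    """Average negative log-likelihood, giving the ordinary posterior."""
    total = np.zeros_like(mu, dtype=float)
    for xi in x:
        total += 0.5 * ((xi - mu) / sigma) ** 2
    return total / len(x) + np.log(sigma) + 0.5 * np.log(2.0 * np.pi)

def posterior_means(loss_values, n, mu, sigma):
    """Posterior means of (mu, sigma) on a grid under a uniform prior."""
    log_post = -n * loss_values
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    return float((weights * mu).sum()), float((weights * sigma).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 160 draws from N(0, 1) and 40 outliers from N(6, 1)
    x = np.concatenate([rng.normal(0.0, 1.0, 160), rng.normal(6.0, 1.0, 40)])
    mu, sigma = np.meshgrid(np.linspace(-2.0, 8.0, 201),
                            np.linspace(0.3, 4.0, 149), indexing="ij")
    print("gamma-posterior means:",
          posterior_means(gamma_loss(x, mu, sigma, 0.5), len(x), mu, sigma))
    print("ordinary posterior means:",
          posterior_means(negative_mean_loglik(x, mu, sigma), len(x), mu, sigma))
```

A grid is used here only because the parameter is two-dimensional and the computation is easy to inspect; it is not meant to reflect how the posterior means are computed in the simulation studies.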
In terms of the higher order asymptotic theory, Giummolè et al. [17] derived the asymptotic expansion of such quasi-posterior distributions. We now introduce the notation that will be used in the rest of the paper. Reference [17] presented the following theorem.
Theorem 1 (Giummolè et al. [17]). Under the conditions (A1)-(A3), we assume thatθ n is a consistent solution of ∂d(ḡ, f θ ) = 0 andθ (d) n p − → θ g as n → ∞. Then, for any prior density function π(θ) that is third-order differentiable and positive at θ g , it holds that: where π * (d) (t n |X n ) is the quasi-posterior density function of the normalized random variable t n = (t 1 , . . . , n ) given X n , φ(·; A) is the density function of a p-variate normal distribution with a zero mean vector and covariance matrix A,J = J (d) (θ (d) n ),J −1 = (J ij ), and: Proof. The proof is given in the Appendix A of [17].
As previously mentioned, quasi-posterior distributions depend on the cross-entropy induced by a divergence and a prior distribution. If we have some information about unknown parameters θ, we can use a prior distribution that takes such prior information into account. However, in the absence of prior information, we often use prior distributions known as default or objective priors. Reference [17] proposed the reference prior for quasi-posterior distributions, which is a type of objective prior (see [15]). The reference prior π R is obtained by asymptotically maximizing the expected KL divergence between prior and posterior distributions. As a generalization of the reference prior, Reference [19] discussed such priors under a general divergence measure known as the α-divergence (see also [20,21]). The reference prior under the α-divergence is given by asymptotically maximizing the expected α-divergence: where D (α) is the α-divergence defined as: which corresponds to the KL divergence as α → 0, the Hellinger divergence for α = 1/2, and the χ 2 -divergence for α = −1. Reference [17] derived reference priors with the αdivergence under the quasi-posterior based on some kinds of proper scoring rules such as the Tsallis scoring rule and the Hyvärinen scoring rule. We note that the former rule is the same as the density power score of [3] with minor notational modifications. Theorem 2 (Giummolè et al. [17]). When |α| < 1, the reference prior that asymptotically maximizes the expected α-divergence between the quasi-posterior and prior distributions is given by: The result of Theorem 2 is similar to that of [19,20]. Objective priors such as the above theorem are useful because they can be determined by the data generating model. However, such priors do not have a statistical guarantee when the model is misspecified such as Huber's ε-contamination model. 
In other words, the reference prior in Theorem 2 depends on data generating distribution g because of when the contamination ratio ε is not small such as for heavy contamination cases. We now consider some objective priors under the γ-posterior, which is robust against such unknown quantities, in the next section.

Main Results
In this section, we show our main results. Our contributions are as follows. We derive moment matching priors for quasi-posterior distributions (Theorem 3). We prove that the proposed priors are robust under the condition on the tail of the contamination distribution (Theorem 4).

Moment Matching Priors
The moment matching priors proposed by [16] are priors that match the posterior mean and MLE up to the higher order (see also [22]). In this section, we attempt to extend the results of [16] to the context of quasi-posterior distributions. Our goal is to identify a prior such that the difference between the quasi-posterior meanθ . Under the same assumptions as Theorem 1, it holds that: . Furthermore, if we set a prior that satisfies: for all = 1, . . . , p, then it holds that: Hereafter, the prior that satisfies Equation (4) up to the order of o p (n −1 ) for all = 1, . . . , p is referred to as a moment matching prior, and we denote it by π M .
Proof. From the asymptotic expansion of the posterior density (3), we have the asymptotic expansion of the posterior mean for θ as: for = 1, . . . , p. The integral in the above equation is calculated by: From (5) and (6), we have: . . , p. By using the consistency of the estimatorθ n , we then have the following asymptotic difference betweenθ (d) andθ (d) : In general, it is not easy to obtain the moment matching priors explicitly. Two examples are given as follows.
Example 1. When p = 1, the moment matching prior is given by: for a constant C, where g 3 is a third derivation of g. This prior is very similar to that of [16], but the quantities g (d) Example 2. When p = 2, we put: where θ = (θ 1 , θ 2 ) . If u (θ 1 , θ 2 ) only depends on θ for all = 1, 2 and does not depend on other parameters θ k (k = ), we have: Then, we can solve the differential equation given by (4), and the moment matching prior is obtained by

Robustness of Objective Priors
For data that may be heavily contaminated, we cannot assume that the contamination ratio ε is approximately zero. In general, reference and moment matching priors depend on the contamination ratio and distribution. Therefore, we cannot directly use such objective priors for the quasi-posterior distributions because the contamination ratio ε and the contamination distribution δ(x) are unknown. In this subsection, we prove that priors based on the γ-divergence are robust against these unknown quantities. In addition to (1), we assume the following condition of the contamination distribution: for all θ ∈ Θ and an appropriately large constant γ 0 > 0 (see also [5]). Note that the assumption (7) is also a basis to prove the robustness against outliers for the minimum γ-divergence estimator in [5]. Then, we have the following theorem.
Theorem 4. Assume the condition (7). Let: , and let: Then, it holds that: for $\gamma + 1 \le \gamma_0$, where $\nu := \max\{\nu_f, \sup_{\theta \in \Theta} \nu_\theta\}$. The notation $O(\varepsilon \nu^\gamma)$ is used in the same sense as in [5]. Furthermore, from the above results, the reference prior and Equation (4) are approximately given by: where First, from Hölder's inequality and Lyapunov's inequality, it holds that: , for i, j, k = 1, . . . , p. Using (10) and the results in Appendix A, we have: where: Similarly, it also holds that: the proof of (8) is complete. It is also easy to see the result of (9) from (8).
It should be noted that (8) looks like the results for Theorem 5.1 in [5]. However, q (γ) (x; θ), and its derivative functions are different formulae from those of [5], so that the derivative functions and the proof of (8) are given in the Appendix A. Theorem 4 shows that expectations on the right-hand side of J (γ) ij (θ) and g (γ) ijk (θ) only depend on the underlying model f θ , but do not depend on the contamination distribution. Furthermore, reference and moment matching priors for the γ-posterior are obtained by the parametric model f θ , that is, these do not depend on the contamination ratio and the contamination distribution. For example, for a normal distribution N(µ, σ 2 ), reference and moment matching priors are given by: However, reference and moment matching priors under the R (α) -posterior depend on unknown quantities in the data generating distribution unless ε ≈ 0, since J ijk (θ) have the following forms: where: The priors given by (11) can be practically used under the condition (7) even if the contamination ratio ε is not small.

Setting and Results
We present the performance of posterior means under reference and moment matching priors through some simulation studies. In this section, we assume that the parametric model is the normal distribution with mean µ and variance $\sigma^2$ and consider the joint estimation problem for µ and $\sigma^2$. We assume that the true values of µ and $\sigma^2$ are zero and one, respectively. We also assume that the contamination distribution is the normal distribution with mean ν and variance one. In other words, the data generating distribution is expressed by:
$$X_1, \dots, X_n \overset{\text{iid}}{\sim} (1 - \varepsilon)\, N(0, 1) + \varepsilon\, N(\nu, 1),$$
where ε is the contamination ratio and n is the sample size. We compare the performance of estimators in terms of empirical bias and mean squared error (MSE) among three methods: the ordinary KL divergence-based posterior, the $R^{(\alpha)}$-posterior, and the γ-posterior (our proposal). We also employ three prior distributions for (µ, σ), namely (i) the uniform prior, (ii) the reference prior, and (iii) the moment matching prior.
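The data generating mechanism of this simulation setting, a mixture of N(0, 1) with weight 1 − ε and N(ν, 1) with weight ε, can be sketched as follows. The function name and the returned outlier indicator are hypothetical conveniences, not part of the original study.

```python
import numpy as np

def sample_contaminated_normal(n, eps, nu, rng):
    """Draw n observations from (1 - eps) * N(0, 1) + eps * N(nu, 1).

    Returns the sample and a boolean mask marking the contaminated draws.
    """
    is_outlier = rng.random(n) < eps      # each draw is an outlier w.p. eps
    x = rng.normal(0.0, 1.0, size=n)
    x[is_outlier] += nu                   # shift contaminated draws to N(nu, 1)
    return x, is_outlier

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, is_outlier = sample_contaminated_normal(n=100, eps=0.20, nu=6.0, rng=rng)
    print("number of outliers:", int(is_outlier.sum()))
    print("sample mean and standard deviation:", x.mean(), x.std(ddof=1))
```

Because the number of outliers is random under this mechanism, individual replications with small n can contain noticeably more or fewer than εn contaminated points.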
Since exact calculations of posterior means are not easy, we use the importance sampling Monte Carlo algorithm with the proposal distributions $N(\bar x, s^2)$ for µ and $IG(6, 5s)$ for σ (the inverse gamma distribution with parameters a and b is denoted by IG(a, b)), where $\bar x = n^{-1} \sum_{i=1}^n x_i$ and $s^2 = (n - 1)^{-1} \sum_{i=1}^n (x_i - \bar x)^2$ (for the details of importance sampling, see, e.g., [23]). We carry out the importance sampling with 10,000 steps, and we compute the empirical bias and MSE of the posterior means $(\hat\mu, \hat\sigma)$ of (µ, σ) over 10,000 iterations. The simulation results are reported in Tables 1-4. The reference and the moment matching priors for the γ-posterior are given by (11), and those for the $R^{(\alpha)}$-posterior are "formally" given as follows: where $C_M$ is a constant given by: The term "formally" means that, since the reference and the moment matching priors for the $R^{(\alpha)}$-posterior strictly depend on the unknown contamination ratio and contamination distribution, we set ε = 0 in these priors. On the other hand, our proposed objective priors do not need such an assumption; we assume only the condition (7). We note that [17] also used the same formal reference prior in their simulation studies. In these tables, we set ν = 6, ε = 0.00, 0.05, 0.20, and n = 20, 50, 100. We also set the tuning parameters for the $R^{(\alpha)}$- and γ-posteriors to 0.2, 0.3, 0.5, and 0.7. Tables 1 and 3 show the empirical bias and MSE of the posterior means of the mean parameter µ based on the standard posterior and the $R^{(\alpha)}$- and γ-posteriors. The empirical bias and MSE for the two robust methods are smaller than those of the standard posterior mean (denoted by "Bayes" in Tables 1-4) in the presence of outliers for a large sample size.
When there are no outliers (ε = 0), it seems that the three methods are comparable. On the other hand, when ε = 0.05 and ε = 0.20, the standard posterior mean gets worse, while the performances of the posterior means based on the R (α) -posterior and the γ-posterior are comparable for both empirical bias and MSE.
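The importance sampling step used for the posterior means can be sketched as follows, using the proposals stated above, N(x̄, s²) for µ and IG(6, 5s) for σ. The target is a γ-posterior under a uniform prior, with an assumed form of the transformed γ-cross-entropy, namely −(1/γ) {(1/n) Σ_i f_θ(x_i)^γ} / {∫ f_θ(x)^(1+γ) dx}^(γ/(1+γ)). The function names, the self-normalized weighting, and the effective sample size diagnostic are choices of this sketch rather than details confirmed by the text.

```python
import numpy as np

def log_gamma_posterior(x, mu, sigma, gamma):
    """Log quasi-posterior (up to a constant) under a uniform prior, using an
    assumed transformed gamma-cross-entropy for the N(mu, sigma^2) model.
    mu and sigma are 1-D arrays of proposal draws."""
    z = (x[:, None] - mu[None, :]) / sigma[None, :]
    mean_f_gamma = (np.exp(-0.5 * gamma * z**2).mean(axis=0)
                    * (2.0 * np.pi * sigma**2) ** (-0.5 * gamma))
    integral = (2.0 * np.pi * sigma**2) ** (-0.5 * gamma) / np.sqrt(1.0 + gamma)
    loss = -mean_f_gamma / (gamma * integral ** (gamma / (1.0 + gamma)))
    return -len(x) * loss

def importance_sampling_means(x, gamma, num_draws=10_000, rng=None):
    """Self-normalized importance sampling estimates of the posterior means."""
    rng = np.random.default_rng() if rng is None else rng
    xbar, s = x.mean(), x.std(ddof=1)
    a, b = 6.0, 5.0 * s
    mu = rng.normal(xbar, s, size=num_draws)                    # N(xbar, s^2)
    sigma = b / rng.gamma(shape=a, scale=1.0, size=num_draws)   # IG(a, b)
    # log proposal density up to constants, which cancel after normalization
    log_q = (-0.5 * ((mu - xbar) / s) ** 2
             - (a + 1.0) * np.log(sigma) - b / sigma)
    log_w = log_gamma_posterior(x, mu, sigma, gamma) - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    ess = 1.0 / np.sum(w**2)                                    # effective sample size
    return float(w @ mu), float(w @ sigma), float(ess)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(6.0, 1.0, 5)])
    print(importance_sampling_means(x, gamma=0.5, rng=rng))
```

Since the proposals are centered at the non-robust statistics x̄ and s, the effective sample size can become small under heavy contamination, so it is worth monitoring alongside the estimates.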
We also present the results of the estimation of the scale parameter σ in Tables 2 and 4. When there are no outliers, the performance of the robust Bayes estimators under the uniform prior is slightly worse. On the other hand, the reference and moment matching priors provide relatively reasonable results even if the sample size is small and ε = 0. The empirical bias and MSE of the $R^{(\alpha)}$-posterior and γ-posterior means for α, γ = 0.5, 0.7 remain small even if the contamination ratio ε is not small. In particular, the empirical bias and MSE of the γ-posterior means for σ are shown to be drastically smaller than those of the $R^{(\alpha)}$-posterior.
Table 2. Empirical biases of the posterior means for σ.
Table 4. Empirical MSEs of the posterior means for σ.
Figure 1 shows the empirical bias and MSE of the posterior means of µ and σ under the uniform, reference, and moment matching priors when ν = 6 (fixed) and the contamination ratio ε varies from 0.00 to 0.30. In all cases, we find that the standard posterior means (i.e., the cases α, γ = 0) do not work well. For the estimation of the mean parameter µ, the $R^{(\alpha)}$- and γ-posterior means seem to be reasonable for values of ε between 0.00 and 0.20. In particular, the γ-posterior means under the reference and moment matching priors perform better even if ε = 0.30. For the estimation of the scale parameter σ, the $R^{(\alpha)}$-posterior means under the uniform prior have larger bias and MSE than the other methods. The γ-posterior mean with γ = 1.0 may still be better than its competitors for any ε ∈ [0, 0.30]. For α, γ = 0.5, the $R^{(\alpha)}$- and γ-posterior means seem to be comparable. Figure 2 presents the empirical bias and MSE of the posterior means of µ and σ under the same priors as Figure 1 when the contamination ratio is ε = 0.20 (fixed) and ν varies from 0.0 to 10.0.
For the estimation of the mean parameter µ in Figure 2, the empirical bias and MSE of the robust estimators appear to be small regardless of ν, except for the $R^{(\alpha)}$-posterior under the uniform prior. Although some differences appear near ν = 4, the γ-posterior means with γ = 1.0 perform better for the estimation of both the mean µ and the scale σ for all ν ∈ [0, 10].
In these simulation studies, the γ-posterior mean under the reference and moment matching priors seems to perform better for the joint estimation of (µ, σ) in most scenarios. Although we provide results only for the univariate normal distribution, other distributions (including multivariate distributions) should also be considered in the future.

Selection of Tuning Parameters
The selection of a tuning parameter γ (or α) is very challenging, and to the best of our knowledge, there is no optimal choice of γ. The tuning parameter γ controls the degree of robustness, that is, if we set large γ, we obtain higher robustness. However, there is a trade-off between the robustness and efficiency of estimators. One of the solutions for this problem is to use the asymptotic relative efficiency (ARE) (see, e.g., [11]). It should be noted that [11] only dealt with a one parameter case. In general, the asymptotic relative efficiency of the robust posterior meanθ (γ) of p-dimensional parameter θ relative to the usual posterior meanθ is defined by: ARE(θ (γ) ,θ) := det(V(θ)) det V (γ) (θ) 1/p (see, e.g., [24]). This is the ratio of the determinants of the covariance matrices, raised to the power of 1/p, where p is the dimension of the parameter θ. We now calculate the ARE(θ (γ) ,θ) in our simulation setting. After some calculations, the asymptotic relative efficiency is given by: ARE(θ (γ) ,θ) = 2 (1 + γ) 6 (1 + 2γ)(2 + 4γ + 3γ 2 ) for γ > 0. We note that it holds h(γ) → 1 as γ → 0. Hence, we may be able to choose γ to allow for the small inflation of the efficiency. For example, if we require the value of the asymptotic relative efficiency ARE = 0.95, we may choose the value of γ as the solution of the equation h(γ) = 0.95 (see Table 5). The curve of the function h(γ) is also given in Figure 3. Several authors have provided methods for the selection of the tuning parameters (e.g., [25][26][27]). Reference [5] focused on the reduction of the latent bias of the estimator, and they recommended setting γ = 1 for the normal mean-variance estimation problem; however, it seems to be unreasonable in terms of the asymptotic relative efficiency (see Table 5 and Figure 3). To the best of our knowledge, there is no method that is robust and efficient under the heavy contamination setting. 
Hence, other methods that have higher efficiency under heavy contamination should be considered in the future.
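The trade-off between robustness and efficiency can also be examined numerically. The sketch below computes the determinant-based asymptotic relative efficiency for the normal model at (µ, σ) = (0, 1) with p = 2, under the assumption that the asymptotic covariance of the robust estimator is the usual sandwich form J⁻¹ I J⁻ᵀ of the minimum γ-divergence estimating equations evaluated at the uncontaminated model. The estimating functions, the quadrature grid, and the finite-difference step are choices of this sketch; it is meant as an independent numerical check rather than a reproduction of the closed-form expression in the text.

```python
import numpy as np

def psi(x, mu, sigma, gamma):
    """Estimating functions for (mu, sigma) under the normal model,
    up to a positive factor that does not affect the sandwich covariance."""
    z = (x - mu) / sigma
    w = np.exp(-0.5 * gamma * z**2)   # weight proportional to f(x)^gamma
    return np.stack([w * z, w * (z**2 - 1.0 / (1.0 + gamma))])

def are_normal(gamma, h=1e-5):
    """Determinant-based ARE at (mu, sigma) = (0, 1), by numerical quadrature."""
    x = np.linspace(-12.0, 12.0, 48_001)
    dx = x[1] - x[0]
    dens = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    def expect(values):
        # expectation under N(0, 1) via a Riemann sum over the last axis
        return (values * dens).sum(axis=-1) * dx

    p0 = psi(x, 0.0, 1.0, gamma)
    I = expect(p0[:, None, :] * p0[None, :, :])   # E[psi psi^T]
    # J[i, j] = d E[psi_i] / d theta_j by central differences
    J = np.column_stack([
        (expect(psi(x, h, 1.0, gamma)) - expect(psi(x, -h, 1.0, gamma))) / (2.0 * h),
        (expect(psi(x, 0.0, 1.0 + h, gamma))
         - expect(psi(x, 0.0, 1.0 - h, gamma))) / (2.0 * h),
    ])
    J_inv = np.linalg.inv(J)
    V_gamma = J_inv @ I @ J_inv.T                 # sandwich covariance
    V_mle = np.diag([1.0, 0.5])                   # inverse Fisher information
    p = 2
    return (np.linalg.det(V_mle) / np.linalg.det(V_gamma)) ** (1.0 / p)

if __name__ == "__main__":
    for gamma in (0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0):
        print(gamma, are_normal(gamma))
```

A target efficiency such as 0.95 could then be matched by a one-dimensional root search over γ, which mirrors the selection rule described above.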

Concluding Remarks
We considered objective priors for divergence-based robust Bayesian estimation. In particular, we proved that the reference and moment matching priors under the quasi-posterior based on the γ-divergence are robust against unknown quantities in the data generating distribution. The performance of the corresponding posterior means was illustrated through some simulation studies. However, the proposed objective priors are often improper, and showing their posterior propriety remains a topic for future research. Our results should also be extended to other settings. For example, Kanamori and Fujisawa [28] proposed the estimation of the contamination ratio using an unnormalized model. Examining such a problem from the Bayesian perspective is also challenging because of the problem of how to set a prior distribution for an unknown contamination ratio. Furthermore, it would be interesting to consider an optimal data-dependent choice of the tuning parameter γ.