A Parametric Bayesian Approach in Density Ratio Estimation

: This paper is concerned with estimating the ratio of two distributions with different parameters and common supports. We consider a Bayesian approach based on the log–Huber loss function, which is resistant to outliers and useful for ﬁnding robust M-estimators. We propose two different types of Bayesian density ratio estimators and compare their performance in terms of frequentist risk function. Some applications, such as classiﬁcation and divergence function estimation, are addressed.


Introduction
The problem of estimating the ratio of two densities appears in many areas of statistical and computer science. Density ratio estimation (DRE) is widely considered to be the most important factor in machine learning and information theory (Sugiyama et al. [1], and Kanamori et al. [2]). In a series of papers, Sugiyama et al. [3,4] developed DRE in different statistical data analysis problems. Some useful applications of DRE are as follows: non-stationary adaptation (Sugiyama and Müller [5] and Quiñonero-Candela et al. [6]), variable selection (Suzuki et al. [7]), dimension reduction (Suzuki and Sugiyama, [8]), conditional density estimation (Sugiyama et al. [9]), outlier detection (Hido et al. [10]), and posterior density estimation (Thomas et al. [11]), among others. This paper addresses cases where the density belongs to the parametric distributions (e.g., exponential family of distributions). Parametric methods are usually favorable thanks to the existence of closed forms (mostly simple and explicit formulas) and hence help enhance computational efficiency.
Several estimators of the ratio of two densities have recently been proposed. One of the simplest approaches to estimate the density ratio p/q, where p and q are two probability density (or mass) functions (PDFs or PMFs), is called "plug-in", and the ratio of the estimated densities is computed. Alternatively, one can estimate the ratio of two densities directly. Several approaches have recently been explored to estimate the ratio, including the moment matching approach (Gretton et al. [12]) and the density matching approach (Sugiyama et al. [13]). There are other works, such as Nguyen [14] and Deledalle [15], which studied the application of DRE in estimating Kullback-Leibler (KL) divergence (or more generally the α-divergence function, also known as f -divergence) and vice versa. The main objective of this paper is to address Bayesian parametric estimation with some commonly used loss functions for the ratio of p/q and to compare the proposed estimators with other estimators in the literature. The remainder of the paper is organized as follows. In Section 2 , we discuss the DRE methodology and introduce some useful definitions and related examples. Section 3 discusses how to find a Bayesian DRE under several loss functions for any arbitrary prior density on the parameter, and provides some interesting examples from the exponential families. In Section 4 we study some of the DRE applications. Some numerical illustrations for considering the efficiency of the proposed DREs are given in Section 5. Finally, we make some concluding remarks in Section 6.

Density Ratio Estimation for the Exponential Distribution Family
Let X| η and Y| γ be conditionally independent multivariate random variables where η, γ ∈ R d are natural parameters, s(·) is a sufficient statistic, and κ(·) = log c(·) is the log-normalizer, which ensures that the distribution integrates to one, that is, c(a) = h(x) exp{a s(x)} dx, for a = η and γ.
Consider the problem of estimating the density ratio which is obviously proportional to exp (η − γ) s(t) . So, one can merge two natural parameters η and γ into one single parameter β = η − γ (it can be a vector), and write the density ratio in (3) as follows where α = κ(γ) − κ(η), and θ = (α, β ) are parameters of interest. Note that since r(t; θ) itself belongs to the exponential family, the normalization term α can be considered as − log N(β), where N(β) = q(t) exp β s(t) dt, which guarantees q(t | γ)r(t; θ) dt = 1 and hence q(t)r(t; θ) becomes a valid PDF (PMF).
However if our response Z = {0, 1, . . . , m − 1}, we can extend Equation (5) to the multinomial logit model, and we have Analogously, letting P k (t) = P(t | Z = k) and π k = P(Z = k) for k = 1, . . . , m − 1, we have In fact, a conceptually simple solution to the DRE is to separately estimate both densities and calculate the ratio. This can be done by replacing the estimators of the parameters into each density. This approach is known as a plug-in density ratio estimation, defined below: whereη(·) andγ(·) are estimates of parameters of p η (·) and q γ (·) based on a sample of size n p and n q , respectively .

Bayesian DRE
Consider the loss function (θ, δ(t)) and its average under long-term (repeated) use of δ(t). In estimating θ it is called frequentist risk and given by where E T|θ (·) is the expectation with respect to the arbitrary measurable cumulative density function (CDF) T ∼ P(t | θ). Given any prior distribution π(·), with the CDF Π(·), one can define the integrated risk (Bayes risk), which is the frequentist risk averaged over the values of θ according to prior distribution π(θ) and posterior distribution π(θ | t).
Finally, the Bayes estimator is the minimizer of the Bayes risk (10). It can be shown that the minimizer of the above expression also minimizes the posterior risk function E θ|t [ (θ, δ(t))], and hence is the Bayes estimator (see Lehman and Casella [17]).
Next, we will address the log-Huber loss and reason beyond choosing such a loss function, in order to study the efficiency of the DRE.

log-Huber's Robust Loss
One of the weakness of a squared error loss function z 2 , with z = δ − θ, is that it can over-emphasize outliers. Other loss functions can be used to avoid this issue. For instance, one can use the absolute loss function |z|, but since it is not a differentiable function at the origin, Huber [18] proposed the following function instead: which is less sensitive to outliers and more robust and equivalent to the squared error loss (L 2 loss) for errors that are smaller than 1, and absolute error loss (L 1 loss) for larger errors (see Figure 1). This loss borrows the advantages of L 1 and L 2 losses, and does not have their disadvantage. That is, it is not sensitive to outliers (as opposed to L 2 ) and it is differentiable everywhere (as opposed to L 1 ). In practice, optimizing the Huber loss (and consequently, log-Huber) is much faster because it enables us to use standard smooth optimization methods (e.g., quasi-Newton) instead of linear programming. Thanks to its robustness, the Huber loss is widely used in robust regression problems (e.g., Huber and Ronchetti [19], Chapter 7).
The corresponding frequentist risk function to the log-Huber loss function is given by where θ = (η, γ).

Log-L 2 Loss Function
The log-L 2 loss (log z) 2 , with z =ˆr r , puts a small weight whenever the density ratio estimatorr and the true density ratio r are close, and puts proportionally more weight when they are significantly different. Lemma 2 provides the Bayesian DRE of the density ratio r in (3). An interesting representation of Bayesian DRE is the Bayesian DRE that can be expressed in terms of the plug-in DRE in (7). Theorem 1. For any PDF p and q belonging to (1) and (2), respectively, the Bayesian DRE of under log-L 2 loss L(r, r) = (logˆr r ) 2 , and prior distribution π(θ), for θ = (η, γ), is given bŷ wherer plug is obtained by replacing the posterior expectations of unknown parameters for given t: the correction factor is given by Proof. The Bayesian DREr π (t) is the minimizer of the posterior risk and implies logr π (t) = E η|X=t,Y=t log r π , or equivalently, This completes the proof.
In Table 1 we explore the correction factor H(·) in (15) for certain densities which belong to the exponential family. Let the likelihood functions below, with sub indices i = 1, 2 representing the underlying distributions, be drawn from p and q according to (1) and (2), respectively. Note that posing constraints on the hyper-parameters leads to having H(t) as a constant function in t. In fact, H(t) = 1 induces the Bayesian DREr π to coincide with the plug-in DREr plug .
In the exponential distribution family in (1), Bregman divergence associated with a real-valued strictly convex and differentiable function c(·) is defined by where ·, · is the inner product and it can be shown that for all regular exponential families (see Brown [20]), κ(·) is strictly convex, and furthermore B(η, γ) = log c(η) c(γ) − η − γ, E q (s(T)) . (1) is equal to the Bregman divergence between natural parameters. That is:

Lemma 2.
Consider E p (logr(t) − log r(t)) 2 as a loss function where the random variable T follows from p as in (1) and the prior function π(θ) on θ = (η, γ), then the Bayesian DRE of r(t; θ) is given bŷ and in addition for the natural exponential family model (1) it can also be expressed aŝ Proof. For simplicity, assume notationsr π (t) and r(t; η, γ) asr and r respectively. The Bayesian estimatorr in estimating r is the minimizer of E( (r, r) | X = t, Y = t), with (r, r) = E p (logˆr r ) 2 , and therefore Applying Lemma 1 and the fact that ∂ 2 ∂ logr 2 E( E p (logr − log r) 2 | X = t, Y = t) ≥ 0, completes the proof.

Log-L 1 Loss Function
For some larger errors in loss function (12), one needs to consider the log-L 1 loss function (r, r) = 2| log r r | − 2. Letr be the corresponding Bayesian DRE. As in Theorem 1 we can also expressr π (t) in terms of the product of the correction factor and the plug-in DRE. That is, under log-L 1 loss L(r, r) = | logr r |, and prior distribution π(θ), for θ = (η, γ), we havẽ wherer plug (t) and H (t) are obtained in the same fashion in (14) and (15) except applying median Notice that for instance the results in Table 1 hold wherever the posterior densities turn out to be symmetric about their means (or medians). Similar calculation to Lemma 2 , with loss function E p | logr(t) − log r(t)|, suggests Similar to (17) and (18), we can writer π (t) = exp {M(KL(p, q) | X = t, Y = t)} , whose expectations are replaced by medians.

Examples of the Bayesian DRE and Some Applications
For instance, we consider the problem of finding Bayesian DRE under log-L 1 and log-L 2 losses for two different normal densities.
, with the known variances and independent prior distributions θ i ∼ N(ξ i , τ 2 i ) for i = 1, 2. Hence, the posterior densities are given by yielding the Bayesian DRE under either loss functions L(r, r) = (logˆr r ) 2 or L(r, r) = | logr r | 2 , as below: If we assume the multivariate case, that is, X|θ 1 ∼ N p (θ 1 , Σ 1 ) and Y|θ 2 ∼ N p (θ 2 , Σ 2 ), where both covariance matrices are known, since the conjugate prior for the mean is p-variate normal, θ i ∼ N p (ξ i , V i ) for i = 1, 2, hence the posterior is p-variate normal. The results are analogous to the univariate case and Therefore the Bayesian DRE for the ratio of two multivariate normal densities equals to As we saw in the previous section, the key point is to find the KL divergence between two densities p and q (or equivalently, Bregman divergence in the cases of distributions belonging to exponential family), and we have a closed form for the divergence. Table 2 represents the KL divergence between p and q. Example 2. Let X|σ 2 1 ∼ Ray(σ 2 1 ) and Y|σ 2 2 ∼ Ray(σ 2 2 ) be two independent random variables from a Rayleigh distribution with the PDF p(x|σ 2 ) = x σ 2 exp(− x 2 2σ 2 ) for x > 0 and σ 2 > 0. The KL divergence between p(x|σ 2 1 ) and p(y|σ 2 2 ) is given in Table 2. Assuming λ i = 2σ 2 i , and choosing the inverse gamma conjugate prior distribution (IG) with parameters α and β, with the PDF π(λ|α, β) = β α γ(α) λ −(α+1) exp(− β λ ) for λ > 0, yields the posterior distribution as follows: Therefore, the Bayesian DRE for the ratio of two Rayleigh densities is the exponential function of the conditional expectation of their KL loss divergence function. That is, By making use of properties of λ ∼ IG(α, β), such as E(log λ) = log β − ψ(α), where ψ(α) is a "digamma" function, the Bayesian DRE under loss function E p (logr(t) − log r(t)) 2 is given bŷ Since the median of an inverse gamma distribution does not have a closed form, we cannot express an explicit formula for the Bayesian DRE under loss function E p | logr(t) − log r(t)|, and consequently for the log-Huber loss function, and they must be calculated iteratively.
The following remarks are some clarifications related to Table 2 and the obtained Bayes estimators in the previous examples in Section 3.2.

Remark 1.
The Bayes DREr π (t) is connected to samples X and Y via the posterior density η | X = t and γ | Y = t. Table 2 it can be seen that the KL divergence between two log-normal PDFs is the same as in normal distributions, since it is known that KL is invariant under parameter transformations.

Example 3.
Suppose X ∼ Uni f orm(0, θ 1 ) is independent of Y ∼ Uni f orm(0, θ 2 ), with θ 1 < θ 2 . It is easy to check that the Pareto distribution is conjugate to a uniform. Hence, assume that π( Moreover, assuming p and q are the PDFs of X and Y, respectively, KL(p, q) = log θ 2 − log θ 1 (see Table 2). Employing the fact that the transformation log(z/b) has an exponential distribution with mean of a, given Z ∼ Pareto(a, b), and hence E Z log Z = a + log b, after some calculation we have the Bayes DRE associated with loss function E p (logr(t) − log r(t)) 2 :r Similar to Example 2, the Bayes DRE under E p | logr(t) − log r(t)| must be calculated numerically.

Other Applications
Here, we discuss some of other applications of the DRE method. 1

Estimating the α-divergence function between two probability densities:
A discrepancy measure between densities p η and q γ applicable to the class of Ali-Silvey [23] distance, also known as α-divergence (Csiszàr,[24]), is given by where for α = 1.
Note that some of the notable divergence functions, such as Kullback-Leibler, reverse Kullback-Leibler (RKL), and Hellinger divergence functions correspond to α = 1, −1, and 0, respectively, and belong to this class. So, if r(t) is estimated byr π (t) under log-L 2 (orr π (t) under log-L 1 losses), then applying the Monte Carlo approximations method, the α-divergence is also estimated by where t i are drawn from p(· | η). Note that there are other works in order to estimate the α-divergence loss function (e.g., Póczos and Schneider [25]), but our estimator in (23) is based on the Bayesian parametric method based on the DRE. In the next section we show the performance of the proposal estimator d α . 2

Plug-in-type estimation of the density ratio under KL loss function:
Consider the plug-in density estimator, say pη(t), for estimating p η in the exponential family (1), based on the KL loss. We have Next, finding the plug-in density estimator p(t,η(t)) that minimizes KL loss is equivalent to finding the point estimatorη(t) which minimizes the posterior expectation associated with the loss function in (24). Therefore, ∂ ∂η E η|t KL(p(t | η), p(t |η)) = 0 implies c 1 (η) = E η|t c(η) η , and hence the Bayes estimator of η is given byη Similar arguments can be applied to qγ(t) for estimating q γ in the exponential family (2), and we have:γ By setting the case when both densities follow the identical distribution from (1) (e.g., the ratio of two normal or two Poisson, etc.), substituting the Bayes estimators obtained in (25) and (26) into the plug-in estimator of r(t) giveŝ Note thatr(t) can be obtained similarly by replacing posterior mediansη i instead of posterior expectationsη i for i = 1, 2 above.

Numerical Illustrations
We conclude this section with some numerical illustrations of log-Huber risk performance of the Bayesian and plug-in when both p and q are two normal models (belong to model (1) and (2)) with a common location parameter θ. We show that the performance of plug-in and Bayes are quite similar and by selecting the hyper-parameters these two density ratio estimators coincide and hence have the same frequentist risk. We start with comparing risk performance under the log-L 2 first and then extend them to log-Huber loss. Figure 2 exhibits the frequentist risk performance of the plug-in DREr plug and the Bayes DREr π under log-L 2 for all possible values of θ using Theorem 1.  N(θ, 1), q = N(θ, 2) and corresponding hyper-parameters ξ = 1, ξ = 2 = 0, τ 1 = τ 2 = 1 corresponding to Example 1. Figure 3 shows the changing frequentist risk performance of the plug-in DREr plug and the Bayes DREr π under log-L 2 when the ratio of variances σ 2 1 /σ 2 2 varies (means are known) using Theorem 1. It can be seen the performance of plug-in estimation is better for larger ratios of variances.
Finally, Figures 4 and 5 depict the risk functions of the Bayes and the plug-in DRE under the log-Huber loss over possible ranges of the mean and variance, respectively, in Example 1. It is clear from Figure 5, in contrast to Figure 3 (risk functions of Bayes DRE under log-L2 and plug-in DRE estimator), that the Bayes estimator DRE has a lower risk with respect to the plug-in DRE, and hence is more robust under the variation of variances.  2 ) and hyper-parameters ξ = 1, ξ = 2 = 0, τ 1 = τ 2 = 1 in Example 1.   ) and corresponding hyper-parameters ξ = 1, ξ = 2 = 0, τ 1 = τ 2 = 1, for the whole possible range of σ 2 in Example 1.

Concluding Remarks
Estimation pf the ratio of two or more densities has received widespread attention in recent years. Until now, most of the methods have been concentrated on solving this problem via nonparametric approaches. In this paper, we focused on a parametric Bayesian approach, when distributions come from the canonical form of the exponential family. We applied the log-Huber loss function to investigate the utility of the Bayesian and plug-in DREs. Our results confirm that Bayesian DRE along with the plug-in DRE (based on posterior expectations) perform similarly under log-Huber loss functions with the possibility of being exactly equal when the correction factor H = 1. This is a somewhat different result from a non-parametric point of view, which often happens when the plug-in estimators perform poorly as opposed to empirical non-parametric Bayesian methods which typically include stochastic processes such as the Gaussian process and the Dirichlet process. There are instances (see Krnjajic et al. [26] and the references therein) where for certain types of count data, the nonparametric Bayesian methodology provides enhanced flexibility to fit the data, provides rich posterior inferences, and provides finer predictive inference under a set of carefully selected criteria. However, there is an apparent major drawback. These processes have an infinite number of dimensions, and thus naive algorithmic approaches to compute posteriors are generally infeasible. Finally, the application to estimate the α-divergence between two PDFs was discussed.
Author Contributions: Methodology, calculations and writing the draft, A.S.; adding critical and constructive comments as well as editorial advice, Y.P. and C.D.L.
Funding: This research received no external funding.