Variational Inference via Rényi Bound Optimization and Multiple-Source Adaptation

Variational inference provides a way to approximate probability densities through optimization. It does so by optimizing an upper or a lower bound of the likelihood of the observed data (the evidence). The classic variational inference approach suggests maximizing the Evidence Lower Bound (ELBO). Recent studies proposed to optimize the variational Rényi bound (VR) and the χ upper bound. However, these estimates, which are based on the Monte Carlo (MC) approximation, either underestimate the bound or exhibit a high variance. In this work, we introduce a new upper bound, termed the Variational Rényi Log Upper bound (VRLU), which is based on the existing VR bound. In contrast to the existing VR bound, the MC approximation of the VRLU bound maintains the upper bound property. Furthermore, we devise a (sandwiched) upper–lower bound variational inference method, termed the Variational Rényi Sandwich (VRS), to jointly optimize the upper and lower bounds. We present a set of experiments, designed to evaluate the new VRLU bound and to compare the VRS method with the classic Variational Autoencoder (VAE) and the VR methods. Next, we apply the VRS approximation to the Multiple-Source Adaptation problem (MSA). MSA is a real-world scenario where data are collected from multiple sources that differ from one another by their probability distribution over the input space. The main aim is to combine fairly accurate predictive models from these sources and create an accurate model for new, mixed target domains. However, many domain adaptation methods assume prior knowledge of the data distribution in the source domains. In this work, we apply the suggested VRS density estimate to the Multiple-Source Adaptation problem (MSA) and show, both theoretically and empirically, that it provides tighter error bounds and improved performance, compared to leading MSA methods.


Introduction
In numerous practical situations, we encounter probability distributions that are challenging to calculate.This occurs especially when the distribution includes hidden variables.Therefore, it becomes necessary to employ approaches that can estimate or approximate such distributions.Variational inference (VI) is a technique used to accomplish this task.VI is a compelling approach for approximating posterior distributions in latent variable models [1].It can handle intractable and possibly high-dimensional posteriors, and it makes Bayesian inference computationally efficient and scalable to large datasets.To this end, VI defines a simple distribution family, called the variational family, and then finds the optimal member of the variational family that is closest to the true posterior distribution.This transforms the posterior inference into an optimization problem concerning the variational distribution.
One of the most successful applications of VI in the deep neural network realm is the Variational Autoencoder (VAE) [2], which is a deep generative model that implements a probabilistic model and variational Bayesian inference.Many techniques have been suggested to improve the accuracy and efficiency of variational methods (cf.[3][4][5][6][7]).Recent trends in variational inference have focused on the following aspects: • Scalability: includes stochastic approximations.• Generalization: extends the applicability of VI to a large class of otherwise intractable models, such as non-conjugate models.• Accuracy: includes variational models beyond the mean field approximation.• Amortization: implements the inference over local latent variables with inference networks.• Robustness: generating a reliable representation of particular data types in the encoded space when using corrupted training data and detecting anomalies.
There are other methods for improving approximation such as Monte Carlo methods for VI and black-box methods [8].
In this work, we focus on the accuracy of the VAE models.An essential aspect of the VI methodology revolves around selecting an appropriate divergence method.This divergence measure allows us to approximate the true posterior distribution with a simpler variational distribution.Consequently, the selection of the divergence measure can have a notable impact on the accuracy of the approximation.Furthermore, using the selected divergence measure, one can devise lower and upper bounds, and estimate the true posterior.
Accordingly, we propose a new upper bound for the evidence, termed the Variational Rényi Log Upper bound (VRLU), based on the Variational Rényi (VR) bound suggested by Li and Turner [3].Further, we devise a (sandwiched) upper-lower bound variational inference method, termed VRS, to jointly optimize the Rényi upper and lower bounds.The VRS loss function combines the VR lower bound and our new upper bound, thus providing a tighter estimate for the log evidence.
Next, we will demonstrate the practical effectiveness of VRS by applying it to the domain adaptation problem.Through this application, we aim to showcase the tangible benefits and practical relevance of our approach.
Domain adaptation is a scenario that arises when we aim to learn from a source data distribution; a well-performing model on a different (but related) target data distribution.A real-world example of domain adaptation is the common spam filtering problem.This problem consists of adapting a model from one user (the source distribution) to a new user who receives significantly different emails (the target distribution).
In the context of domain adaptation, the terms "source" and "target" domains are used to refer to the training and test sets, respectively.These sets can have distinct feature spaces, which can occur when the statistical properties of a domain change over time or when new samples are collected from various sources, resulting in domain shifts.Multiple-Source Adaptation (MSA) addresses scenarios where there are multiple source domains and one target domain.The central question is whether the learner can effectively combine relatively accurate predictors from each source domain to create an accurate predictor for any new target domain that may consist of a mixture of these sources.
In contrast to the majority of machine learning research, where models are trained and tested on data drawn from the same distribution, domain adaptation involves using data from different distributions for training and testing.When the train and test sets share the same distribution, the uniform convergence theory ensures that a model's empirical training error closely approximates its true error.This assumption is not guaranteed in the MSA problem.
In this work, we have focused on two main ideas: • Improving the estimation of the domain distribution using VAE.

•
Using the improved estimated distributions in the algorithm presented in [9] to solve the MSA problem.
The rest of the paper is organized as follows: Section 2 provides a review of variational inference for probabilistic modeling, and discusses different divergence methods such as KL Divergence, Rényi Divergence, and χ Divergence, for bounding the log evidence.
In Section 3, we present our novel approach, called Variational Rényi Log Upper bound (VRLU), which offers an improved bound for the log evidence.Additionally, we introduce an optimized technique, referred to as the Variational Rényi Sandwich (VRS), that leverages both upper and lower bounds.Section 4 offers a comprehensive overview of the domain adaptation problem and illustrates the application of the approximated distributions in calculating its loss function.Finally, in Section 5, we present a series of experiments conducted to evaluate the effectiveness of our proposed methods, VRLU and VRS, in the context of both log evidence estimation and domain adaptation.

Divergence Methods in Variational Inference for Probabilistic Modeling
In probabilistic modeling, we aim to devise a probabilistic model, p θ , that best explains the data.This is commonly done by maximizing the log-likelihood of the data (also known as log evidence ), with respect to the model's parameter θ, i.e., Maximum Likelihood Estimation (MLE).For a latent model, where we assume that the observed data, x, depend on a latent variable z, the MLE takes the following form: For many latent models, the log evidence integral is unavailable in closed form or it is too complex to compute.A leading approach to handle such intractable cases is variational inference (VI).One of the most successful applications of VI in the deep neural network realm is the Variational Autoencoder (VAE).

Variational Autoencoder and the Kulback-Leibler Divergence
A Variational Autoencoder is a deep generative model that implements a probabilistic model and variational Bayesian inference.Introduced by Kingma and Welling [2], a VAE model is an autoencoder, designed to stochastically encode the input data into a constrained multivariate latent space (encoding), and then to reconstruct it as accurately as possible (decoding).To turn the intractable posterior inference into a solvable problem, we use a parametric inference model q φ (z|x) which is also called an encoder.We optimize the variational parameters φ such that q φ (z|x) ∼ p θ (z|x).The VAE loss function is composed of a "reconstruction term" (to ensure the decoded data are close to the original data) and a "regularisation term".The goal of the regularisation term is to ensure that the distributions returned by the encoder are close to a standard normal distribution.That is expressed as the Kulback-Leibler divergence between the returned distribution and a standard Gaussian.Definition 1. Kulback-Leibler (KL) divergence [10,11].For discrete probability distributions p and q, defined on the same probability space, the KL divergence from q to p is defined to be: Since the true posterior p θ (z|x) is intractable, we aim to approximate it with a Gaussian distribution q φ (z|x), in the KL divergence sense.It follows that: Definition 2. Evidence Lower Bound (ELBO): We note that the KL divergence is non-negative, thus maximizing the ELBO results with the minimization of the KL divergence between q φ (z|x) and the true posterior p θ (z|x).
ELBO optimization is a well-known method that has been studied in depth, and is applicable in many models, especially in VAE [12].Nevertheless, using the ELBO can give rise to some drawbacks.First, the ELBO is not always very tight, and maximizing the bound instead of the actual likelihood can lead to bias.Typically this leads to a simpler model q φ , which approximates the real posterior.Second, the D KL (q φ (z|x)||p θ (z|x)) does not always lead to the best results-it tends to favor approximate distributions q φ that underestimate the entropy of the true posterior ("zero-forcing").Namely, D KL (q φ (z|x)||p θ (z|x)) is infinite when p θ (z|x) = 0 and q φ (z|x) > 0. Therefore, the optimal variational distribution q will be 0 when p θ (z|x) = 0.This "zero-forcing" behavior leads to degenerate solutions during optimization.

Rényi Divergence
One of the core parts of probabilistic models is the selection of the method for estimating the approximation of the distribution.In the previous section, we introduced Kulback-Leibler (KL) divergence.In this section, we will present the Rényi divergence (also known as α divergence), which measures the difference between two distributions p and q, and is defined by: Rényi divergence was initially defined for α ∈ {α > 0 , α = 1}.The definition was extended to α = 0, 1, +∞ by continuity.There are certain α values for which Rényi divergence has a wider application than the others.Of particular interest are the values 0, 1  2 , 1, 2, and ∞, presented in Table 1.We note that for α → 1: lim α→1 D α (p||q) = D KL (p||q), the KL divergence is recovered.

χ Divergence
Similarly to the KL divergence and the Rényi divergence, one can use the χ 2 -divergence (or in general the χ n -divergence) and develop a bound of the log evidence [14].
By monotonicity of log and non-negativity of the χ 2 -divergence, this quantity is an upper bound of the log evidence: Definition 6. χ upper bound (CUBO): Using χ n -divergence for general n, CUBO n provides a family of bounds.We note the strong connection between the CUBO n and the Rényi bound VR α : when n = 1 − α, the VR bound is revealed.∀n ∀n > 1 : CUBO n is a non-decreasing function of the n order χ-divergence.
Using Theorem 5, one can estimate log p θ (x) with both upper and lower bounds, which may provide a better approximation for the log evidence.
The χ upper bound has many advantages: It is a black-box inference algorithm in that it does not need model-specific derivations and it is easy to apply to a wide class of models.In addition, it is useful when the KL divergence is not a good objective, and it is guaranteed to converge [14].

Monte Carlo Approximation
So far, we have discussed KL divergence, Rényi divergence, and χ divergence, and have demonstrated how each of these measurements can be used to construct a bound for the log evidence.However, calculating these bounds is computationally intractable, due to the stochastic nature of the latent space and the exponential number of random variables.In real-world situations, where datasets are typically limited and contain a finite number of data points, empirical estimations become necessary.A popular method for estimating these bounds is the Monte Carlo (MC) approximation [15,16].Typically, the MC method involves random sampling from certain probability distributions.
The Monte Carlo (MC) approximation of the Kullback-Leibler (KL) divergence is unbiased, guaranteeing the convergence of the optimization process for the Evidence Lower Bound (ELBO).However, the MC approximation for the Rényi bound introduces bias, leading to an underestimation of the true expectation.In the case of positive values of α, this implies a relatively looser bound, but it should still be effective.On the other hand, for negative values of α, this becomes a significant issue as it underestimates an upper bound.More precisely, the MC approximation for the Rényi bound is: For this to be unbiased, the expectation should be equal to the true value, By Jensen's inequality: Thus, the approximation is actually an underestimate of the true bound.This characteristic was also discussed in [3], where the authors suggested improving the approximation quality by using more samples and using negative α values to improve the accuracy, at the cost of losing the upper-bound guarantee.
Other papers have suggested different approaches to keep the upper bounding property intact [8,14,17].Of particular interest is the generic χ upper bound, CUBO n , which also suffers from the same problem of biased estimation using MC approximation.In [14], the authors suggested an approach to avoid the biased approximation, by exponentiation: Applying MC approximation to L provides an unbiased upper bound.However, this change affects the variance of the gradients, which may damage the quality of the approximation result.It may result in high variance estimates and requires a large number of samples in order to serve as a reliable upper bound [18].

Variational Rényi Log Upper Bound (VRLU)
We suggest a different approach for estimating the upper bound while preserving the upper bound property.Consider the following inequalities: Where equality holds on both sides if and only if x = 1.

Definition 7. Variational Rényi Log Upper bound (VRLU)
: For negative α, VRLU α is an estimation of the Rényi upper bound, and an upper bound of the log evidence: Note that the inequalities in ( 24) become tighter as the argument of the log is closer to 1.
In the Rényi bound approximation (20), this argument is 1/k ∑(p θ (z, x)/q φ (z|x)) 1−α .Thus, the approximation becomes tighter as the variational distribution, q φ , is getting closer to the true distribution p θ (the lower the divergence, the tighter the approximation), which is exactly the goal of the optimization.
We evaluated the bias of MC approximations for both bounds, VR α and VRLU α , over a range of negative α values.To this end, we fixed the distributions p and q to both be Gaussian: p ∼ N(0, 1), q ∼ N(1.5, 1).The bounds VR α and VRLU α were estimated using the MC approximation (cf.( 26) and ( 20)) and we evaluated the quality of the approximation for different values of MC samples, denoted by K.
Figure 2 shows the empirical results.We can see that the MC approximations for VR α are biased and get better as the sample size K increases.Furthermore, the bias results in an underestimation of VR α for α ≤ 0, which makes it unattractive to be used as an upper bound at the negative α range.On the other hand, the MC approximation for VRLU α preserves the upper bound property and has a relatively low variance.As a result, VRLU α is a more suitable choice as an upper bound for negative α and may be used as an objective for risk minimization.Figure 3 presents the comparison between VR α (p||q) and VRLU α (p||q) over different values of q.To this end, we fixed p ∼ N(1, 2) and set q ∼ N(µ, 2) while varying µ in the range [−5, 10].We can see that as closer q is to p, both VR α (p||q) and VRLU α (p||q) values are decreasing, and for p = q, VR α (p||q) = VRLU α (p||q) = 0 for all α values.Furthermore, as α is farther away from 0, the steeper the graph becomes.In conclusion, we empirically evaluated the VRLU α upper bound and matched it against the VR α upper bound, for varying values of negative α.The divergence curve of the VRLU α upper bound is steeper than the VR α upper bound, and the variance is much lower, suggesting a higher convergence as the variational distribution is getting closer to the true posterior.

Upper-Lower Bound Optimization
Using the new upper bound, VRLU α , we devised VRS α + ,α − ; a (sandwiched) upperlower bound variational inference algorithm for jointly minimizing the Rényi upper and lower bounds.VRS α + ,α − combined both the upper and lower Rényi bounds, where the lower bound VR α is computed as in Equation ( 20) for a constant positive α, and the upper bound VRLU α is computed as in Equation ( 26) for a constant negative α.The overall VRS α + ,α − loss is the average of both terms, i.e.,

VRS α
− loss provides a useful estimate for the log-likelihood of the evidence.

Probability Approximation
Recall that our objective is to develop a probabilistic model, denoted as p θ , that effectively captures and explains the underlying data.In variational inference (VI), we tackle an optimization problem that seeks to find a simpler distribution that closely approximates the original data distribution, also known as the evidence.In this section, we will inspect the approximate distribution, denoted as pθ , that minimizes the divergence d α ( pθ ||p θ ).Our aim is to find an approximation that accurately represents the true data distribution.
We will evaluate two methods of approximating p θ .One using VR bound: (30) and one using our VRS method: We notice that for both estimators, pθ is indeed a probability.Given that both VR α and VRS α + ,α − estimate the log evidence, we will use the exponent of these estimates to approximate p θ .Let us denote α + > 0. Using Equation ( 11), Using both upper and lower bounds we will find that: We will define multiplication factors for both our approximations as follows: Note that e VRS α + ,α − (x) = p θ (x) • VRS MF and e VR α (x) = p θ (x) • VR MF .Thus, our goal is to achieve a multiplication factor as close to one as possible.We examine these values using the fixed distribution p ∼ N(0, 2), and distribution q ∼ N(µ, 2), where −3 < µ < 3. When µ = 0, p = q.We used different α + and α − values.The results are presented in Figure 4. We can see that for every α + , VRS MF is closer to one for all different α − values compared to VR MF with the same α + value.In addition, when α − and α + are symmetric around zero, the multiplication factor of VRS α + ,α − is closest to one.This indicates that the pθ (x) approximation calculated using VRS α + ,α − is more accurate among the two methods.

Multiple-Source Adaptation (MSA)
In statistical learning, there are numerous settings that require an accurate estimation of the data distribution to find effective solutions.One such task is known as domain adaptation.In the preceding section, we introduced VRS as an enhanced method to obtain accurate approximations of the data distribution.In this section, we will apply these estimated distributions to the domain adaptation objective, thus demonstrating the effectiveness and practicality of the VRS method to yield accurate solutions.
Domain adaptation is a scenario where we aim to train a classifier on one dataset (referred to as the source domain) for which labels or annotations are available and achieve good performance on another dataset (referred to as the target domain) for which labels or annotations are not available.A common example of a domain adaptation application is spam filtering, where a model trained on one user's emails (the source domain) is adapted and used to filter spam for a different user who receives distinct emails (the target domain).
In this work, our focus is on the Multi-Source Domain Adaptation (MSA) problem, where there are multiple source domains available in addition to only one target domain.The target domain can be considered as either an exact mixture of the source domains, or it might be well approximated by such a mixture.The goal is to leverage the information provided by the source domains to improve the performance on the target domain, where annotations or labels are not available.
In many real-world scenarios, the learner may not have access to all of the source data at once, due to privacy or storage constraints.Therefore, the learner cannot simply combine all of the source data together to train a predictor.A possible solution to this problem is the Mixture of Experts (MOE) approach.MOE is an ensemble learning technique that involves training multiple experts on different sub-tasks of a predictive modeling problem.Each expert concentrates on a specific part of the modeling problem space.A gating network then combines the outputs of the various experts.In the domain adaptation problem, this concept can be applied by modeling the domain relationship with an MOE approach.
The MSA problem was theoretically analyzed by Mansour, Mohri, and Rostamizadeh in [19].In their paper, the authors presented the domain adaptation problem setup and proved that for any target domain, there exists a hypothesis, referred to as the distribution weighted combining rule, which is capable of achieving a low error rate with respect to the target domain.However, it should be noted that the authors did not provide a method for determining or learning the aforementioned hypothesis.
In the paper by Hoffman, Mohri and Zhang [9], the authors extended the definition of the weighted combination rule to solve probabilistic models as well, using cross-entropy loss.Additionally, the authors introduced an iterative algorithm based on Difference of Convex (DC) programming, that constructs the weighted combination rule.Nonetheless, the algorithm proposed in the paper assumes either prior knowledge of the probabilities associated with the data samples or relies on accurate estimates of these probabilities.The authors evaluated the performance of their model by employing the Rényi divergence, which quantifies the discrepancy between the true distribution and the approximated distribution.As a result, the effectiveness of their model is contingent upon the accuracy of the probability approximations as well.
In order to circumvent the need for good estimates of the data distribution, Cortes et al. [20] proposed a discriminative technique using an estimate of the conditional probabilities p(i|x) for each source domain i ∈ {1, ..., k} (that is, the probability that an instance x belongs to source i).To this end, they had to modify the DC algorithm proposed in [9], in order to adapt to their new distribution calculation.
In this study, we will build upon the algorithm introduced by Hoffman, Mohri, and Zhang [9], and enhance it with a refined approximation of the source distribution via variational inference.

MSA Problem Setup
We refer to a probability model where there is a distribution over the input space X.Each data point x ∈ X has a corresponding label y ∈ Y, where Y denotes the space of labels.Our objective function describes the correspondence between the data point and its label f : X → Y.We will focus on the adaptation problem with k source domains and a single target domain.For each domain i ∈ {1, ..., k}, we have a source distribution p i and corresponding hypotheses h i (x, y) → [0, 1].More precisely, h i returns the probability that f (x) = y.Definition 8. Let L : R → R be a loss function penalizing errors with respect to f .The loss of hypothesis h with respect to the objective function f and a distribution p is denoted by L(h, p, f ) and defined as: For simplicity, we will denote L(h(x, f (x))) as L(h, f ) throughout this paper.We will assume that the following properties hold for the loss function L: For each domain i, the hypothesis h i is a relatively accurate predictor for domain i with the distribution p i ; i.e., there exists > 0 such that: Proposition 2. We will denote the simplex: The distribution of the target domain p T is assumed to be a mixture of the k source distributions p 1 , ..., p k , that is:

Existence of a Good Hypothesis
The goal of solving the MSA problem is to establish a good predictor (a good predictor: a predictor that provides a small error with respect to the target domain) for the target domain, given the source domain's predictors.A common assumption is that there exists some relationship between the target domain and the distributions of the source domains (See Proposition 2).It can be demonstrated that conventional convex combinations of source predictors may yield suboptimal results in certain scenarios.In particular, studies have indicated that even if the source predictors possess zero loss, no convex combination can attain a loss lower than a specific constant for a uniform mixture of the source distributions.
Alternately, Mansour, Mohri, and Rostamizadeh [19] proposed a distribution-weighted solution and defined the distribution-weighted combination hypothesis for a regression model.Hoffman and Mohri [9] extended the distribution-weighted combination hypothesis to a probabilistic model, as follows: Definition 9. Distribution-weighted combination hypothesis.For any λ ∈ ∆, η > 0 and (x, y) ∈ X × Y: where U(x) is the uniform distribution over X.
In the probabilistic model case, we will use L as the binary cross entropy loss: which maintain all of the required properties stated in Section 4.1.
The proof of Theorem 6 is detailed in [19].From this Theorem, it can be inferred that for any fixed target function f , the distribution-weighted combination hypothesis is a good hypothesis for the target domain.

A Good Hypothesis with Estimated Probabilities
On closer inspection of Definition 9, it is evident that constructing h η w requires access to the distributions of all domains, represented by p i (x) ∀i ∈ 1, ..., k.Yet, in practical settings, the true distributions p i may not be directly available to the learner.Instead, the learner relies on estimates pi derived from the available data.Thus, addressing the application of domain adaptation becomes essential for real-world scenarios where the true distributions remain unknown.
Our objective is to minimize the value of L(h i , pi , f ).To accomplish this, we will develop an upper bound for this loss function (similar to previous research [9,21]).By doing so, we can examine the impact of utilizing estimated distributions pi on the efficacy of our model and gain insights into the application of domain adaptation in real-world scenarios.First, let us recall Holder's inequality: Theorem 7. Holder's inequality: For any s and t in the open interval (1, ∞) with 1 s + 1 t = 1, and for {x j } and {y j } j ∈ {1, ..., k} be certain sets of real numbers, we have: Corollary 1.Let pi be an estimation of the original domain distribution p i .The following inequality holds for any α > 1: Proof of Corollary 1.For any hypothesis h and any distributions p, q, and for any α > 1, the following holds (the proof is based on a similar corollary proven in [9]): By Holder's inequality for s = α, and By Definition 3 For each i ∈ {1, ..., k}, by setting p := p i , q := pi and h := h i , we will find that:

By Proposition 1
Corollary 1 provides us an upper bound of the loss using the estimated distributions pi .When pi → p i , d α ( pi ||p i ) → 1 and we will remain with α−1 α M 1 α .We will set M = 1, since we use the loss function L(h, f ) = − log(h(x, f (x))) as the cross-entropy loss (log-loss).
Thus, when pi → p i , (d α ( pi ||p i ) ) By performing the aforementioned calculation with α < 1, it is possible to derive a lower bound for L(h i , pi , f ).This lower bound serves as a confirmation that the utilization of approximated probabilities does not lead to significant errors.For instance, if the lower bound exhibits a considerably large value, it indicates that our approximation is inadequate.Conversely, if the lower bound demonstrates a small value, it signifies the effectiveness of our approximation.Moreover, by employing both upper and lower bounds, we can obtain a more precise estimation of the loss.Theorem 8. Generalization of Holder's inequality [22]: Let 0 < s < 1 and t ∈ R with 1 s + 1 t = 1, and for {x j } and {y j } j ∈ {1, ..., n} be certain sets of real numbers, we have: Corollary 2. Let pi be an estimation of the original domain distribution p i .The following inequality holds for any α < 1: where Proof of Corollary 2. First, we will prove for 0 < α < 1, and then for α < 0. Let us set 0 < α < 1, s = α and t = α α−1 .For any hypothesis h and any distributions p, q, the following holds: By the generalization of Holder's inequality for

By Definition 3
Next, let us set α < 0, t = α and s = α α−1 (notice that α < 0 → 0 < s < 1).For any hypothesis h and any distributions p, q, the following holds: By the generalization of Holder's inequality for

By Definition 3
For each i ∈ {1, ..., k}, by setting p := p i , q := pi and h := h i , we will find that: We contend that the value of ψ = ∑ x∈X p i (x)L(h i , f )  As we can observe, as the estimated distribution p approaches the true distribution p (i.e., as µ approaches 3), the bounds on the loss function become increasingly similar.We can also see that the value of the lower bounds is not significantly large, which means that we can consider using the probability approximation to solve the MSA problem.It is also worth noting that when α deviates significantly from 1, the bounds move away from the actual value.Theorem 9. Let p T be an arbitrary target distribution.For any δ > 0, there exists η > 0 and w ∈ ∆, such that the following inequality holds for any α > 1 and any mixture parameter λ: Proof of Theorem 9. Let δ > 0. In the proof for Corollary 1, we showed that for any hypothesis h and any distributions p, q, and for any α > 1, the following holds: Hence, for q = p T , p = p λ and h = h η w we will find that: By Theorem 6, given δ > 0, there exist η > 0 and w ∈ ∆ such that L(h η w , p λ , f ) ≤ + δ for any mixture parameter λ.Therefore: Corollary 3. Let p T be an arbitrary target distribution.For any δ > 0, there exists η > 0 and w ∈ ∆, such that the following inequality holds for any α > 1 and any mixture parameter λ ∈ ∆: where pλ = ∑ k i=1 λ i pi (x) and ĥη w is our good hypothesis from Definition 9 but calculated with the estimated probabilities pi .

•
For every i ∈ {1, ..., k}: We can repeat the proof of Theorem 9 with * instead of , pi instead of p i and ĥη w instead of h η w .
In summary, we demonstrated that it is possible to use approximate distributions to calculate a good distribution-weighted combining rule.We have established that the error introduced by using estimated distributions is bounded.Thus, we can address the Multi-Source Adaptation (MSA) problem in real-world applications.

MSA Algorithm
Alongside the unknown probabilities, another crucial aspect is determining an appropriate vector of weights, denoted as w, to fully establish the distribution-weighted combining rule.The paper by Hoffman, Mohri, and Zhang [9] presents a new algorithm for determining the distribution-weighted combination solution for cross-entropy loss and other losses, based on Difference of Convex (DC) programming.Lemma 2. For any target function f ∈ F and any η, η ≥ 0, there exists w ∈ ∆ with w i = 0 for all i ∈ {1, ..., k}, such that the following holds: ∀i ∈ {1, ..., k} L(h where:γ = ∑ k j=1 w j L(h η w , p j , f ) The proof of Lemma 2 is detailed in [19].
Corollary 4. For any target function f ∈ F and any η ≥ 0, there exists w ∈ ∆ with w i = 0 for all i ∈ {1, ..., k}, such that the following holds: Proof of Corollary 4. By Lemma 2, we obtain: ∀i ∈ {1, ..., k} L(h Corollary 4 provides a single upper bound for the loss with respect to every p i .Thus, our problem consists of finding a parameter w verifying this property.This, in turn, can be formulated as the following optimization problem: Definition 10.DC Function [23]: Let C be a convex subset of R n .A real-valued function f : C → R is called DC on C, if there exist two convex functions g, h : C → R such that f can be expressed in the form: DC programming problems are programming problems dealing with DC functions.An important class of DC problems is the following: where gand h are two convex functions in R n , and X is a closed convex subset of R n .
Proposition 3. Assume that the problem w * is solvable.Then, a point x * ∈ X is an optimal solution to w * if and only if there is t * ∈ R, such that: Horst and Thoai [23] developed an algorithm for solving DC programming problems such as w * based on the above optimality condition.The assumptions in Proposition 3 apply to the MSA problem, since we know there is an optimal solution.The key lies in identifying two convex functions whose difference coincides with the solution of the MSA problem.Let us define the following functions: Note that: Let us define the following convex functions: ) is convex as a composition of the convex function − log with an affine function J w .Similarly, − log(K w ) is convex, which shows that the second term in the expression of v i (w, f ) is a convex function.The first term can be written in terms of the unnormalized relative entropy (the unnormalized relative entropy of P and Q is defined by: B(p||q) = ∑ x p(x) log p(x) q(x) + ∑ x (q(x) − p(x))).It can be shown that the relative entropy is jointly convex using the so-called log-sum inequality (based on the explanation in [9]).
Let us be reminded of our regression loss function: Proposition 4. Let L be the cross-entropy loss.Then, for i ∈ {1, ..., k} Proof of Proposition 4.

L(h
L is the cross entropy loss.
Using the proof above, our optimization problem min w∈∆,ρ∈R is a DC programming problem, since it is the difference between two convex functions.
In light of all of the above, our optimization problem can be cast as the following variational form of a DC-programming problem: let us set (w t ) to be the sequence defined by repeatedly solving the following convex optimization problem: • Target function: min ρ.• Constraints: 1.
where w 0 ∈ ∆ is an arbitrary starting value.Then, (w t ) is guaranteed to converge to a local minimum of the optimization problem [9].Given the fact that an optimal hypothesis h η w exists, we converted the MSA problem into an optimization problem and cast it to a DC programming form in order to find a local optimum.This way, we are able to find the parameter w which is used in the distribution-weighted combination rule.

Empirical Results
In this section we present two sets of experiments.The first set is designed to evaluate the accuracy of approximating distributions using the VRLU and VRS methods, and the second set demonstrates the application of these estimates for the MSA problem.

VRLU and VRS Experiments
We present a series of experiments conducted to evaluate the performance of VRLUα and VRSα + , α − and compare them to the performance of existing methods such as the Evidence Lower Bound (ELBO) and Rényi upper and lower bounds.The goal of these experiments is to assess the effectiveness of the proposed methods and to determine their advantages and limitations.The methods we will examine in this section are detailed below: All of our experiments were conducted using PyTorch.Throughout the experiments, we used K = 50 samples for Monte Carlo (MC) approximation; trained the VAE models using the ADAM optimizer [24]; and set the learning rate to 0.001 and the batch size to 128 for the training set, and 32 for the test set.Our VAE model includes a total of 6 linear layers.The first 3 are the encoder layers, and the last 3 are the decoder layers.The dimension of the latent space is 50.We suggest two perspectives to evaluate and compare performances: • Quality of the decoded signal-Reconstruction error, measured by Mean Square Error (MSE) and Cross-Entropy (CE).

•
Quality of the evidence approximation-Maximizing the evidence log-likelihood, log p(x); the higher the better.

Digits Experiment
In the following experiment, we used the 'MNIST', 'USPS', and 'SVHN' datasets, all of which contain digit images (See Figure 6).They all share 10 classes of digits.The 'USPS' dataset consists of 7291 training images and 2007 test images of size 16 × 16.The 'MNIST' dataset consists of 60,000 training images and 10,000 test images of size 28 × 28.'SVHN' is obtained from house numbers in Google Street View images.It has 73,257 training images and 26,032 test images of size 32 × 32.If we look at Figure 6, we can see that the graphical representation of digits in 'USPS', 'SVHN', and 'MNIST' is very diverse; hence, each domain has a very different distribution.We compared the learning curves of VRS α + ,α − with α − ∈ {−0.5, −2} and α + ∈ {0.5, 2} and VR α with α ∈ {0.5, 2, 5} over the 'MNIST' dataset.Figure 7 demonstrates that VRS α + ,α − converged faster than VR α and the resulting loss value is smaller for both α values.Also, we can see that VR 0.5 performs better than VR 2 , and VR 2 performs better than VR 5 .This observation is in sync with the results reported in [3].

Faces Experiment
We performed a similar experiment on a dataset of facial expressions known as PIE.The PIE dataset consists of a few parts, each corresponding to a different posture.Specifically, we choose PIE05 (left pose), PIE07 (top pose), and PIE09 (bottom pose).In each subset (pose), all face images were taken under different lighting, illumination, and expression conditions (see Figure 9).We divided each dataset into training and testing sets, in a ratio of 2:1.We created VAE, VR α and VRS α + ,α − models for each 'PIE' domain (left pose, up position, and down position).Each model was trained on its corresponding training set.We calculated the log-likelihood estimations for each domain and compared them.The results are presented in Figure 10.We can see that the VRS α + ,α − model achieved the best results.In addition, for α + = 0.5, we obtained slightly better results than for α + = 2, which is compatible with all previous results.
To summarize, we demonstrated the performance of the VRS α + ,α − algorithm on the digits datasets ('MNIST', 'USPS', 'SVHN') and 'PIE' datasets (left pose, up position, and down position), and compared them against the (KL divergence-based) VAE, the Variational Rényi VR α upper and lower bounds, and the VRLU α upper bound minimization.In all cases, the VRS α + ,α − algorithm presented good results, many of which are the best performances compared to the other methods.

MSA Experiments
In this section, we review a series of experiments designed to tackle the MSA problem.In all of the experiments, we used the DC-programming algorithm, presented in [9], to provide a solution.We used real-world datasets: the digit dataset and Office31 dataset.For all of the datasets, the probability distributions p i are not readily available to the learner.Thus, we used the VAE, VR α and VRS α + ,α − models to approximate the probabilities pi .More concretely, given an MSA scenario, where we have k source domains and one target domain, we train a variational inference model for each source domain i.We then use the estimated distributions as input to the DC programming algorithm, which, in turn, finds the optimal vector w used to construct the distribution-weighted combination hypothesis h η w (Definition 9) for the target domain.We term the technique described above as VRS-MSA.Finally, we compared our performances to the results presented by Cortes et al. in [20].

Digits Experiment
In the following experiment, we used the digits datasets, SVHN, MNIST, and USPS, as our source domains.For each domain, we trained a convolutional neural network (CNN) of the same architecture as in [25], and used the output from the softmax score layer as our base predictors h i .We also trained the VAE, VR α and VRS α + ,α − models for each domain using the respective training sets.We used these trained models to approximate the domains' distributions pi .
For the DC-programming algorithm, we used 1000 image-label pairs from each domain, thus being a total of 3000 labeled pairs, to learn the parameter w.We compared our VRS-MSA algorithm against the results presented in [20], and report performances on each of the three test datasets, on combinations of two test datasets, and on all test datasets combined.
Table 2 details the accuracy scores obtained by running our VRS-MSA model and the following models: The GMSA model: a generative MSA model using the DC programming algorithm.
To obtain the data distribution, GMSA used the last layer before softmax from each of the domains' classifiers.

•
The DMSA model: this is based on a discriminative technique using an estimate of the conditional probabilities (the probability that point x belongs to source i).Our VRS-MSA model demonstrates competitive performance, with particularly strong results on the union of the SVHN and MNIST test sets and the union of the SVHN, MNIST, and USPS test sets.Moreover, among VI models, VRS 0.5,−0.5 achieved the best average score.This result is consistent with our previous results, which state that the closer α is to zero, the better the approximation of the log evidence.
However, the performance on the SVHN domain is lower in comparison to the other classifiers.Taking a closer look at the parameter w = (w MN IST : 0.73, w USPS : 0.19, w SV HN : 0.08) reveals that the value assigned to the SVHN domain, denoted as w SV HN , is relatively low at 0.08.Since the distribution weighted combining rule is a weighted combination of all source hypotheses with weights assignment w, this indicates that the SVHN domain has a minimal impact on the calculation of h η w .Additionally, the log probability obtained for the SVHN domain using the VI models is quite low compared to the other domains.These low values result in very small probabilities when taking the exponent, which can be difficult to work with in practice.
Furthermore, we devised a method that uses Stochastic Gradient Descent (SGD), rather than DC programming, to get a good classifier for the target domain.For each image x, every possible label y 1 , ..., y c , and every source domain, we created the following input: (p 1 (x, y 1 ), ..., p 1 (x, y c ), ..., p k (x, y 1 ), ..., p k (x, y c ), h 1 (x, y 1 ), ..., h 1 (x, y c ), ..., h k (x, y 1 ), ..., h k (x, y c )) Given image x, the SGD model learns a matching between the input vector above and the true label of x.This method is termed VRS-SGD.Similarly to VRS-MSA, we used 1000 images from each domain to train the SGD model.The results of the VRS-SGD are reported at the last section of Table 2.
The SGD score for the SVHN test set stands out as the highest, leading to an improvement in the combined test set that includes both SVHN and USPS.One advantage of the VRS-SGD method is its ability to overcome the issue of misalignment among different VRS models by adjusting its learned weights to match the input scale.This makes the VRS-SGD method particularly valuable when working with source domains where the probabilities are smaller compared to other domains.

Office Experiment
In the following experiment, we used the Office31 dataset, which is used mainly in domain adaptation scenarios.The Office31 dataset contains 31 object categories in three domains: Amazon, DSLR, and Webcam (see Figure 11).The 31 categories in the dataset consist of objects commonly encountered in office settings, such as keyboards, file cabinets, and laptops.The Amazon domain contains on average 90 images per class and 2817 images in total.As these images were captured from a website of online merchants, they are captured against a clean background and at a unified scale.The DSLR domain contains 498 low-noise high-resolution images (4288 × 2848).There are 5 objects per category.Each object was captured from different viewpoints on average 3 times.For Webcam, the 795 images of low resolution (640 × 480) exhibit significant noise and color as well as white balance artifacts.We carried out the VRS-MSA experiment on Office31 dataset.We divided the dataset into two splits following [26].For the training data, we used 20 samples per category for Amazon and 7 for both DSLR and Webcam.We used the rest of the samples as test data.For each domain, we used ResNet50 architecture pre-trained on ImageNet, and trained it over the domain's training set.We extracted the penultimate layer output from ResNet50 architecture and trained our variational inference models VAE, VR α and VRS α + ,α − on this pre-trained feature.The VI models were used to approximate the distributions p i .For our predictors h i , we extracted the output from the ResNet50 architecture and used softmax layer to calculate the probabilities.We used a batch size of 32 in the training set and 16 in the test set.
We measured the performance of these baselines on each of the three test sets, on combinations of two test sets, and all test sets combined.We compared our VRS-MSA model against previous results presented by Cortes et al. [20].While Cortes et al.only provided results for individual test sets, we additionally presented results for various combinations of test sets, providing a more comprehensive comparison of the performance of VI models.Among the models tested, our VRS 0.5,−0.5 model achieved the highest results in most test set combinations and had the best overall score, which supports our previous findings that a value of α close to zero leads to a better approximation of the log evidence.
We compared our results to the DMSA algorithm, each source predictor (CNN for Amazon, DSLR and Webcam), the uniform combination, CNN-unif, a network jointly trained on all source data combined, CNN-joint, and GMSA with kernel density estimation [9].The results are reported in Table 3.Our VRS-MSA model demonstrates competitive achievements, with particularly strong results on the test set DSLR.We note that the DSLR's high score comes at the expense of Amazon's and Webcam's high scores.This is because the vector w = (w Amazon : 0.25, w DSLR : 0.71, w Webcam : 0.04) learned in the DC programming algorithm determined the weight of each domain.When the DSLR domain receives more weight, it comes at the expense of the weight given to the other domains.
Likewise, the VRS-SGD method achieved competitive scores compared to the models using the DC algorithm.We can see that the VRS-SGD score for the Amazon test set is the highest and, as a result, the scores on test sets that include Amazon were also improved.

Summary
In this study, we reviewed and analyzed the methods to estimate data probabilities where traditional computation methods have failed.Specifically, we examined variational inference (VI) models, such as Variational Autoencoder (VAE) [27], which we aimed to improve using different divergence methods.We examined the properties of the Kullback-Leibler divergence, the Rényi divergence (which is essentially a family of divergences parameterized by α ∈ R), and the χ divergence.We derived the ELBO, the VR, and the CUBO bounds for the log evidence, and presented a new upper bound, termed VRLU, for which its MC approximation remains an upper.We used VRLU to devise a new (sandwiched) upper-lower bound variational inference method (VRS).The VRS loss function combines the VR lower bound (with positive α) and the new VRLU upper bound (with negative α), thus providing a tighter estimate for the log evidence.
We performed several experiments designed to test the performance of the new VRS model.We compared VAE, VR, VRLU, and VRS models over the digits datasets and PIE datasets, using different values of positive and negative α.In all cases, the VRS algorithm presented good results, many of which are the best performances compared to the other methods.We note, in passing, that the selection of the α value may depend on the data, an observation that was indicated in previous studies, as well [3,14].
In addition, we demonstrated the usage of VRS in MSA applications.We combined the DC-programming algorithm (suggested in [9]) with our VRS model, to obtain more accurate density estimates and improve the accuracy of the hypothesis for the target domain.We performed experiments to compare the accuracy of the resulting hypothesis in two MSA datasets: the digits and Office31 datasets.We compared our new model using VAE, VR, and VRS to the previous models, GMSA and DMSA, presented in [20].
Our empirical evaluation revealed that the proposed VRS-MSA model demonstrated competitive performance, and in certain instances even surpassed the performance of models reported in previous studies.Additionally, among the VI models tested, the VRS model achieved the highest overall score, which supports the conclusion that accurate probability estimates are necessary for the success of the weighted combination hypothesis h η w .Nonetheless, it is important to note that the VRS-MSA model achieved lower scores in certain individual test sets, where the weight parameter w was assigned a low value for that particular domain.When the weight parameter is low, it is important to take into account both the probability p i (x) and the domain-specific hypothesis h i .For example, if the image x is from the SVHN domain, the probabilities p mnist (x) and p usps (x) should be relatively low in comparison to p svhn (x), such that the value of h svhn is the most prominent in the weighted combination hypothesis.Our VRS-MSA model operates by training a VRS model for each domain, which learns its latent space vectors based on a Gaussian distribution, and outputs the probability in relation to these latent vectors p θ (x|z).Consequently, for each domain, the Gaussian distribution may have slight variations in variance, which can influence the log evidence value output from the VRS model.Therefore, the DC programming model, which takes into account the probabilities from all domains simultaneously, may be affected by the different scales of the probability measurements across the domains.
Looking forward, further work is required to disentangle the complexities of the aforementioned VRS-MSA.Specifically, in this work, we have not formed a connection between the latent variables of each VRS model of the different domains.It will be interesting to see how such a connection (of normalization, scaling of the probability measurements, or latent space alignment) will affect the compatibility of the probabilities.In addition, some researchers suggest even using a common latent feature space in the autoencoder models [28].Building such a network using our VRS loss might improve the results of the VRS-MSA model.However, it is worth noting that such a common model would lack the separation and privacy of domains that we have achieved using distinct VRS models.
We would also like to extend our experiments on the VRS model: First, it will be interesting to examine the different values of negative and positive α values and search for the best combination of α − and α + .Second, since α may be data-dependent, it will be interesting to explore the possibility to make α a trainable parameter.It can also be used to adjust the degree of relative risk aversion.These directions are left for future research efforts.

Figure 1
Figure 1 illustrates d α and D α .One can see that d α achieves high values very quickly.D α (p||q) and d α (p||q) are non-decreasing as functions of α, and:

Figure 4 .
Figure 4. Comparison between VR α and VRS α + ,α − multiplication factors over fixed distribution p and different q distributions.

1 α
when examining the loss bound.As previously mentioned, we assume that L(h i , f ) ≤ M, where we have set M = 1.Consequently, we are left with (∑ x∈X p i (x)) α−.Since p i is a distribution, the sum equals 1.Let us set L α ( p, p) := (d α ( p||p)) α−1 α .We would like to present an example of different L α ( p, p) values calculated with a constant distribution p ∼ N(3, 10), and a distribution p ∼ N(µ, 10), where 0 < µ < 6.When µ = 3, p = p.The results are shown in Figure 5.

•
VAE-minimizing KL divergence-maximizing the ELBO.• VR-minimizing Rényi divergence using variational Rényi upper / lower bound with MC, for different values of α. • VRLU-minimizing Rényi divergence using our variational Rényi log upper bound with MC for different values of negative α. • VRS-minimizing Rényi divergence using the (sandwich) upper-lower bound with MC for different values of negative and positive α.

Figure 7 .
Figure 7.Comparison between VR α and VRS α + ,α − learning curves over 'MNIST' dataset.Training with different values of α.The y axis detailed the values of the VR and VRS bounds, which is the approximation of the log evidence (the higher the better).

Figure 8
Figure 8 depicts the mean squared error (MSE) for the different learning methods.We can see that the MSE reconstruction error of all Variational Rényi methods, and specifically VRS 0.5,−0.5 , are better than VAE reconstruction error in all of the datasets.

Figure 10 .
Figure 10.Comparison between the log-likelihood estimates, calculated using the models VAE, VR α and VRS α + ,α − with different values of α.Each model was trained on a specific domain of 'PIE'.

•
CNN-s, CNN-m, and CNN-u: each trained on the single source domain SVHN, MNIST, and USPS, respectively.• CNN-unif: a classifier trained on a uniform combination of the source domains' data.• CNN-joint: a global classifier trained on all of the source domains' data combined.•

Table 1 .
Special cases in the Rényi divergence family.