Elastic Information Bottleneck

Information bottleneck is an information-theoretic principle of representation learning that aims to learn a maximally compressed representation that preserves as much information about the labels as possible. Under this principle, two different methods have been proposed, i.e., the information bottleneck (IB) and the deterministic information bottleneck (DIB), and have made significant progress in explaining the representation mechanisms of deep learning algorithms. However, these theoretical and empirical successes are only valid under the assumption that training and test data are drawn from the same distribution, which is clearly not satisfied in many real-world applications. In this paper, we study their generalization abilities within a transfer learning scenario, where the target error can be decomposed into three components, i.e., the source empirical error, the source generalization gap (SG), and the representation discrepancy (RD). Comparing IB and DIB on these terms, we prove that DIB's SG bound is tighter than IB's, while DIB's RD is larger than IB's; therefore, neither is uniformly better. To balance the trade-off between SG and RD, we propose an elastic information bottleneck (EIB) that interpolates between the IB and DIB regularizers, which guarantees a Pareto frontier within the IB framework. Additionally, simulations and real data experiments show that EIB can achieve better domain adaptation results than IB and DIB, which validates our theory.


Introduction
Representation learning has recently become a core problem in machine learning, especially with the development of deep learning methods. Different from other statistical representation learning approaches, the information bottleneck principle formalizes the extraction of relevant features about Y from X as an information-theoretic optimization problem: min_{p(t|x)} L = min_{p(t|x)} f(X; T) − βg(Y; T), where p(t|x) amounts to the encoder of the input signal X, f stands for the compression of the representation T with respect to the input X, g stands for the information T preserves about the output Y, Y ↔ X ↔ T forms a Markov chain, and β is the trade-off parameter. The basic idea of the information bottleneck principle is to obtain the information that X provides about Y through a 'bottleneck' representation T. The Markov constraint requires that T is a (possibly stochastic) function of X and can only obtain information about Y through X.
Under this principle, various methods have been proposed, such as the information bottleneck (IB) [1], conditional entropy bottleneck (CEB) [2], Gaussian IB [3], multivariate IB [4], distributed IB [5], squared IB [6], deterministic information bottleneck (DIB) [7], etc. Almost all previous methods use the mutual information I(Y; T) as the information-preserving function g. As for the compression function f, there are two typical choices, which categorize these methods into two groups. The first group uses the mutual information I(X; T), a common measure of representation cost in channel coding, as the compression function. Typical examples include IB, CEB, Gaussian IB, multivariate IB, squared IB, etc. Please note that CEB uses the conditional mutual information I(X; T|Y) as the compression function; however, it has been proven to be equivalent to mutual information. Similarly, squared IB uses the square of the mutual information, I^2(X; T), as the compression function; still, we put it into the same category. Instead of the mutual information, the second group, including DIB, uses the entropy H(T) as the compression function, another common measure of representation cost as in source coding. The reason is that entropy is directly related to the quantity to be constrained, such as the number of clusters, and is thus expected to achieve better compression results.
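The relation between the two compression measures, H(T) = I(X; T) + H(T|X), can be checked numerically. The sketch below is our own illustration (the toy distributions are invented, not from the paper); it computes both quantities for a stochastic and a deterministic discrete encoder, showing that a deterministic encoder has H(T|X) = 0, so its H(T) and I(X; T) coincide:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def compression_terms(p_x, enc):
    """Return H(T), H(T|X), and I(X;T) = H(T) - H(T|X) for a discrete encoder."""
    n_t = len(enc[0])
    # Marginal p(t) = sum_x p(x) p(t|x)
    p_t = [sum(p_x[i] * enc[i][t] for i in range(len(p_x))) for t in range(n_t)]
    h_t = entropy(p_t)                                                    # H(T)
    h_t_given_x = sum(p_x[i] * entropy(enc[i]) for i in range(len(p_x)))  # H(T|X)
    return h_t, h_t_given_x, h_t - h_t_given_x

p_x = [1/3, 1/3, 1/3]                               # uniform over 3 inputs
enc_stoch = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]    # stochastic encoder p(t|x)
enc_det = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # deterministic encoder

for name, enc in [("stochastic", enc_stoch), ("deterministic", enc_det)]:
    h_t, h_tx, i_xt = compression_terms(p_x, enc)
    print(f"{name}: H(T)={h_t:.3f}  H(T|X)={h_tx:.3f}  I(X;T)={i_xt:.3f}")
```

For the stochastic encoder, H(T) strictly exceeds I(X; T) by H(T|X) > 0, which is exactly the extra term DIB compresses and IB does not.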
IB has been extensively studied over recent years in both theory and application. In theory, IB has been proven to enhance generalization [8] and adversarial robustness [9] through a generalization bound and an adversarial robustness bound. In application, IB has been successfully applied to evaluating the representations learned by deep neural networks (DNNs) [10][11][12] and to various tasks: geometric clustering, by iteratively solving a set of self-consistent equations to obtain the optimal solution of the IB optimization problem [13,14], as well as classification [2,15] and generation [16], by serving as the loss function of a DNN via variational methods. Recent research also shows that merging the loss function of IB with other models, such as BERT and Graph Neural Networks, yields better generalization and adversarial robustness [9,17]. Moreover, IB inspires new methods that improve generalization [18][19][20]. Likewise, DIB has been applied to geometric clustering [21] and has the potential to be used in applications similar to IB's. More wide-ranging work on IB in deep learning and communication is comprehensively summarized in the surveys found in [22][23][24].
Still, these theoretical results are only valid when the training and test data are drawn from the same distribution, which is rarely the case in practice. Therefore, it is unclear, especially theoretically, whether IB and DIB are able to learn a representation from a source domain that performs well on a target domain. Moreover, it is worth studying which objective function is better. This is exactly the motivation of this paper. To this end, we formulate the problem as transfer learning: based on transfer learning theory [25], the target domain test error can be decomposed into three parts, i.e., the source domain training error, the source domain generalization gap (SG), and the representation discrepancy (RD). Without loss of generality, we assume that both IB and DIB are able to achieve small source domain training errors, so our goals become bounding SG and RD and comparing the two methods on these terms.
For SG, Shamir et al. [8] have provided an upper bound related to I(X; T), indicating that minimizing I(X; T) leads to better generalization. However, this theory is not applicable for comparing IB and DIB because DIB minimizes H(T) = I(X; T) + H(T|X), and it is unclear whether further minimizing H(T|X) brings any advantage. Therefore, we need to derive a new bound. The difficulty is that the new bound must not only include both I(X; T) and H(T|X) for convenient comparison but also be tighter than the previous one. Since the previous bound in Shamir et al. [8] was expressed as a φ function of the variance, we tackle this problem with a different analysis of these two factors. Specifically, for the variance, instead of relating the variance to the L1 distance, then to the KL-divergence, and lastly to mutual information as in the previous proof, we bound the variances by functions of expectations and relate them to the entropy H(T). Furthermore, we prove a tighter bound for the φ function. Consequently, we obtain a tighter generalization bound, suggesting that minimizing H(T) is better than minimizing I(X; T). Therefore, our results indicate that DIB may generalize better than IB in the source domain.
As for RD, it is measured by the H∆H-distance, as in Ben-David et al. [26]. However, this term is difficult to compute because H is the hypothesis space of classifiers and varies across models. Inspired by the fact that the IB and DIB solutions differ mainly in the variance of their representations, we assume that data are generated from a Gaussian distribution. We then define a pair-wise L1 distance to bound the H∆H-distance and relate RD to the variance of the representations. Specifically, IB's representations have larger randomness and thus smaller RD. Moreover, the closer the two domains are, the greater the difference between IB and DIB on RD.
From the above theoretical findings, we conclude that there exists a better objective function under the IB principle. However, how to obtain the optimal objective function remains a challenging problem. Inspired by the trade-off between SG and RD, we propose an elastic information bottleneck (EIB) to interpolate between IB and DIB, i.e., min L_EIB = (1 − α)H(T) + αI(X; T) − βI(Y; T). We can see that EIB includes IB and DIB as special cases. In addition, we provide a variational method to optimize EIB with DNNs. We conduct both simulations and real data experiments in this paper. Our results show that EIB is more flexible to different kinds of data and achieves better accuracy than IB and DIB on classification tasks. We also provide an example of combining EIB with a previous domain adaptation algorithm by substituting the cross-entropy loss with the EIB objective function, which suggests a promising application of EIB. Our contributions are summarized as follows:

• We derive a transfer learning theory for IB frameworks and find a trade-off between SG and RD. Consequently, we propose a novel representation learning method, EIB, for better transfer learning results under the IB principle, which is flexible enough to suit different kinds of data and can be merged into domain adaptation algorithms as a regularizer.

• In the study of SG, we provide a tighter SG upper bound, which serves as a theoretical guarantee for DIB and further develops the generalization theory in the IB framework.

• Comprehensive simulations and experiments validate our theoretical results and demonstrate that EIB outperforms IB and DIB in many cases.

Problem Formulation
In domain adaptation, a scenario of transductive transfer learning, we have labeled training data from a source domain, and we wish to learn an algorithm on the training data that performs well on a target domain [27]. The learning task is the same on the two domains, but the population distributions of the source and target domains are different. Specifically, the source and target instances come from the same instance space X, and the same instance corresponds to the same label, even if it lies in different domains. However, the population instance distributions on the source and target domains are different, denoted p(X) and q(X). (We use "instance" (X) to distinguish from "label" (Y) and "example" (X, Y), and "population distribution" (p or q) to distinguish from "empirical distribution" (p̂ or q̂).) Assume that there exists a generic feature space that can be utilized to transfer knowledge between domains. That is to say, both the IB and DIB methods involve an encoder with a transition probability p(t|x) to convert an instance X to an intermediate feature T ∈ T. The induced marginal distributions are denoted p(t) and q(t) for the source and target domains, respectively. The IB and DIB methods also have a decoder or classifier h ∈ H mapping from the feature space T to the label space Y. p(y|x) denotes the ground-truth labeling function, and p(y|t) = Σ_x p(y|x)p(t|x)p(x) / Σ_{x'} p(t|x')p(x') is a labeling function induced by p(y|x) and p(t|x). In classification scenarios, deterministic labels are commonly used. Therefore, we define the deterministic ground-truth-induced labeling function as f : T → Y, f(t) = argmax_y p(y|t). If the maximum is not unique, i.e., {y_1, ..., y_n} = argmax_y p(y|t), we randomly choose some y_i, i ∈ {1, ..., n}, as the output. With the above definitions, the expected error on the source domain can be written as ε_S(h) = E_{t∼p(t)} I{f(t) ≠ h(t)}, where I{·} is the indicator function. Similarly, the expected error on the target domain is ε_T(h) = E_{t∼q(t)} I{f(t) ≠ h(t)}. Our problem can then be formulated as follows: when the IB and DIB methods are trained on the source domain, which method achieves a lower target domain expected error ε_T(h)?
According to previous work [26], the target domain error can be decomposed into three parts, as shown in the following theorem. A detailed proof is provided in Appendix A.1.
Please note that we assume that the ground-truth labeling function is the same on the two domains, a common assumption in many previous studies. If there is a conditional shift, i.e., a distance between f_S and f_T, where f_S and f_T are the deterministic ground-truth-induced labeling functions on the source and target domains, respectively, as defined in Zhao et al. [28], the above decomposition is no longer valid. The assumption is reasonable since the conditional shift phenomenon is not observed in our experiments.
According to the above theorem, to achieve a low expected error we need to minimize three terms: the training error on the source domain, the source generalization gap (SG), and the representation discrepancy (RD) between the marginal distributions on the two domains. Assuming that both the IB and DIB methods are able to achieve comparably small source training errors, we focus on SG and RD to compare the two methods.

Source Generalization Study of IB and DIB
In this subsection, we study the generalization of IB and DIB on the source domain. The generalization error quantifies the degree to which a supervised machine learning algorithm may overfit the training data. In statistical learning theory, the generalization gap is defined as the expected difference between the population risk and the empirical risk. Russo and Zou [29] and Xu and Raginsky [30] provided a generalization upper bound in terms of the mutual information between the input and output. Sefidgaran et al. [31,32] further related this mutual information to two other approaches to studying generalization error, i.e., compressibility and fractal dimensions, providing a unifying framework for the three directions of study.
In the theoretical study of the generalization and adversarial robustness of IB, however, a slightly different measure is used for the convenience of illustrating the characteristics of IB [8,9]. Specifically, in Section 4.1 of Shamir et al. [8], the classification error is proven to be upper bounded by a quantity exponential in −I(Y; T), indicating that I(Y; T) is a measure of performance. Therefore, the difference between the population performance I(Y; T) and the empirical performance Î(Y; T) is used to measure generalization ability. Shamir et al. [8] provided the following generalization upper bound, which involves the mutual information I(X; T).
Theorem 2 (The previous generalization bound). Denote empirical estimates with a hat (ˆ). For any probability distribution p(x, y), with a probability of at least 1 − δ over the draw of a sample of size m from p(x, y), we have that for all T, where B_0 is a constant and B_1, B_2, B_3 only depend on m, δ, |Y|, min_x p(x), and min_y p(y), where min refers to the minimum value other than 0.

However, this upper bound cannot be directly used to compare IB and DIB because both IB and DIB minimize I(X; T). Moreover, the regularizer of DIB, H(T) = I(X; T) + H(T|X), further minimizes the term H(T|X), but it is not clear from the previous theoretical result whether this brings any advantage. To tackle this problem, we prove a new generalization bound, as shown in the following theorem.
Theorem 3 (Our generalization bound). For any probability distribution p(x, y), with a probability of at least 1 − δ over the draw of a sample of size m from p(x, y), the gap |I(Y; T) − Î(Y; T)| is bounded for all T by terms depending on H(T), H(T|Y), and Ĥ(T|Y), with constants given in the appendix.

Proof. Here, we show a proof sketch to help us understand how our bound differs from the previous one. The complete proof is in the appendix. Similarly to the previous proof, SG is first divided into three parts. The main idea of the previous proof is to bound these parts by φ functions of the sample variances of p(t|x) in (5) and (9), where φ(x) is defined by (A1). The variances are then related to the L1 distances in (6) and (10); the KL-divergence in (7); and lastly, the mutual information in (8) and (11).

We also use the idea of bounding by φ functions of variances, as shown in (12), (15), and (18). However, the variance is the exact one, instead of the previous estimate on the samples, i.e., Var_X(p(t|x)) ≜ E_{p(x)}[(p(t|x) − E_{p(x)}[p(t|x)])^2]. Consequently, we can obtain a tighter bound. Specifically, we first use Lemma A4 to convert the variance to the expectation in (13), (16), and (19). Then, with the help of Lemma A5, we turn the φ functions into entropy, as from (13) to (14). With (14), (17), and (20), we finish the proof of Theorem 3.
We can see that this bound is similar to the previous one, but the new bound involves the regularization terms H(T), H(T|Y), and Ĥ(T|Y) instead of the mutual information I(X; T). Note that H(T|Y) and Ĥ(T|Y) are closely related, with |H(T|Y) − Ĥ(T|Y)| bounded by the sum of (17) and (20). Additionally, H(T) is equivalent to H(T|Y) as a regularizer because H(T) = H(T|Y) + I(Y; T), and the I(Y; T) term is already part of the objective (so the difference amounts to a shift in β). With these two observations, this bound directly depends on the entropy H(T), i.e., the regularizer of DIB. That is to say, compressing the entropy of the representation helps reduce SG. Now we illustrate why DIB generalizes better. Intuitively, since DIB directly minimizes H(T) while IB only minimizes a part of it, i.e., I(X; T), DIB may generalize better than IB. Owing to the lack of closed-form solutions, we cannot explicitly see the difference between IB and DIB with respect to the relevant information quantities. However, we can solve the IB and DIB problems through a self-consistent iterative algorithm, with α = 0 or 1 in Appendix B.1. The IB and DIB solutions are shown on the information plane in Appendix C with α = 0 or 1, showing that the IB solutions have larger H(T) than the DIB solutions, and thus DIB generalizes better in the sense of Theorem 3 in our paper.
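As a rough illustration of such a self-consistent iterative scheme, the sketch below implements the generalized-IB-style update q(t|x) ∝ exp((log q(t) − β·KL(p(y|x)‖q(y|t)))/α) on an invented discrete toy problem; α = 1 recovers the standard IB update, while small α approaches the deterministic DIB limit. This is our simplified reading of the algorithm referenced in Appendix B.1, not the authors' exact code:

```python
import math
import random

def kl(p, q):
    """KL divergence (nats) between two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

def eib_iterate(p_xy, n_t, alpha, beta, iters=200, seed=0):
    """Self-consistent iterations for the EIB/generalized-IB objective.

    Encoder update: q(t|x) ∝ exp((log q(t) - beta * KL(p(y|x) || q(y|t))) / alpha).
    alpha = 1 gives the IB update; alpha -> 0 approaches the DIB limit.
    """
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_given_x = [[p_xy[i][j] / p_x[i] for j in range(n_y)] for i in range(n_x)]
    # Random initialization of the encoder q(t|x).
    q_t_given_x = []
    for _ in range(n_x):
        w = [rng.random() + 1e-3 for _ in range(n_t)]
        s = sum(w)
        q_t_given_x.append([wi / s for wi in w])
    for _ in range(iters):
        # Marginal q(t) and decoder q(y|t) induced by the current encoder.
        q_t = [sum(p_x[i] * q_t_given_x[i][t] for i in range(n_x)) for t in range(n_t)]
        q_y_given_t = [[sum(p_xy[i][j] * q_t_given_x[i][t] for i in range(n_x))
                        / max(q_t[t], 1e-300) for j in range(n_y)] for t in range(n_t)]
        # Encoder update (softmax over t, numerically stabilized).
        for i in range(n_x):
            logits = [(math.log(max(q_t[t], 1e-300))
                       - beta * kl(p_y_given_x[i], q_y_given_t[t])) / alpha
                      for t in range(n_t)]
            mx = max(logits)
            w = [math.exp(l - mx) for l in logits]
            s = sum(w)
            q_t_given_x[i] = [wi / s for wi in w]
    return q_t_given_x

# Two groups of inputs with opposite label statistics.
p_xy = [[0.2, 0.05], [0.2, 0.05], [0.05, 0.2], [0.05, 0.2]]
enc_ib = eib_iterate(p_xy, n_t=2, alpha=1.0, beta=5.0)   # IB update (alpha = 1)
enc_dib = eib_iterate(p_xy, n_t=2, alpha=0.1, beta=5.0)  # near-DIB: small alpha
```

With small α the converged encoder becomes near-deterministic (each row of q(t|x) concentrates on a single t), matching the intuition that DIB additionally squeezes out H(T|X).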
Furthermore, our bound is tighter than the previous one. We provide a comparison of the orders in Appendix A.2, which shows that the IB bound is |X| times larger than the DIB bound for the first two terms and |Y| times larger for the third term. An empirical comparison is given in Section 4.1.2. Moreover, Theorems 2 and 3 require some constraints on the sample size; these are also summarized in Appendix A.2. The empirical results show that our bound requires a smaller sample size.
To sum up, we provide a tighter generalization bound in the IB framework, which serves as theoretical support for DIB's generalization performance. Experiments on MNIST validate this theoretical result in Section 4.2.1.

Representation Discrepancy of IB and DIB
In this section, we compare the RD of IB and DIB. According to the target error decomposition theorem in Section 1, RD is measured by the H∆H-distance, which is, however, difficult to compute directly because of the complex hypothesis space H. To remove the dependence on H, we propose to bound RD by a pair-wise L1 distance.
To start with the simplest case, assume that the sample sizes on the source and target domains are both m and that the samples on the two domains have a one-to-one correspondence (i.e., they are semantically close to each other). Then, RD can be bounded by the following pair-wise L1 distance.
Proposition 1. The distance between the overall representations on the source and target domains is bounded by the distance between individual instance representations,
where CP stands for the set of all correspondence pairs, and the residual term is small when the sample size is large.
In fact, Proposition 1 is valid for any (x_S, x_T) pairing, so there can be different upper bounds. To obtain the lowest upper bound, we define correspondent pairs as the instance pairs whose distributions of representations are closest across the two domains.

We parameterize p(t|x) as a d-dimensional Gaussian with a diagonal covariance matrix, as is usually assumed in variational methods such as Alemi et al. [15]. Compared with IB, DIB additionally reduces H(T|X), so its p(t|x) is almost deterministic. Therefore, the variances of p_DIB(t|x) in each dimension are significantly smaller than those of p_IB(t|x). On the other hand, since the only difference between IB and DIB is the H(T|X) term and the entropy of a Gaussian random variable depends only on its variance, the expectations of p_IB(t|x) and p_DIB(t|x) are comparable. Therefore, we need to determine how the variances affect RD.
Consider the representations of instances from the source and target domains in the same model. For every (x_S, x_T) ∈ CP, since x_S and x_T are semantically close, the expectations of their representations are close. Additionally, the discrepancy between their variances is small compared to the discrepancy in representation variances between IB and DIB, so we can neglect it. Therefore, denote p(t|x_S) and p(t|x_T) as diagonal Gaussians with means μ_1 and μ_2 and shared per-dimension standard deviations σ_i; their L1 distance is then governed by the terms Φ(|μ_{1i} − μ_{2i}|/(2σ_i)), where Φ is the cumulative distribution function of a standard Gaussian distribution.
As discussed above, compared with IB, DIB has significantly smaller variances in its representations, while μ_1 − μ_2 is comparable for IB and DIB. Therefore, the term |μ_{1i} − μ_{2i}|/(2σ_i) for IB is remarkably smaller than for DIB. Since Φ is monotonically increasing, ‖p(t|x_S) − p(t|x_T)‖_1 for IB is smaller than that for DIB. Figure 2 provides an intuitive understanding of how randomness helps reduce RD. Suppose that the blue and red lines are p(t|x_S) and p(t|x_T), where (x_S, x_T) ∈ CP. We can clearly see that their L1 distance drops as the variances grow. Moreover, because the derivative of Φ monotonically decreases on [0, +∞), when |μ_2 − μ_1| is smaller, the difference between IB and DIB is larger. This phenomenon is also found in the simulations; see Sections 4.1.1 and 4.1.3. When the sample sizes on the two domains are different and the pair-wise correspondence does not hold, we can take correlated instances in the two domains as a general form of correspondent pairs; the above comparison result then remains valid. Details are found in Appendix A.3.
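The effect can be checked numerically in one dimension: for two Gaussians with equal variance σ², the L1 distance has the closed form 2(2Φ(|μ_1 − μ_2|/(2σ)) − 1), which decreases as σ grows. The sketch below is our own illustration with invented numbers; it verifies the closed form against numerical integration:

```python
import math

def std_normal_cdf(x):
    """Φ, the standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_l1(mu1, mu2, sigma):
    """Closed-form L1 distance between N(mu1, sigma^2) and N(mu2, sigma^2):
    ||p - q||_1 = 2 (2 Φ(|mu1 - mu2| / (2 sigma)) - 1)."""
    return 2.0 * (2.0 * std_normal_cdf(abs(mu1 - mu2) / (2.0 * sigma)) - 1.0)

def gaussian_l1_numeric(mu1, mu2, sigma, lo=-20.0, hi=20.0, n=80000):
    """Midpoint-rule check of the closed form."""
    dx = (hi - lo) / n
    z = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        p = math.exp(-((x - mu1) ** 2) / (2.0 * sigma ** 2)) / z
        q = math.exp(-((x - mu2) ** 2) / (2.0 * sigma ** 2)) / z
        total += abs(p - q) * dx
    return total

# The same mean shift with a larger variance gives a much smaller L1 gap:
print(gaussian_l1(0.0, 1.0, 0.2))   # small sigma (DIB-like): near-disjoint densities
print(gaussian_l1(0.0, 1.0, 2.0))   # large sigma (IB-like): heavily overlapping
```

This matches the comparison above: the larger (IB-like) variance shrinks the per-pair L1 distance, and hence the RD bound.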
Please note that the assumption of correspondent pairs is reasonable. In transfer learning, it is widely assumed that there exist shared (high-level) features and common distributions of representations on the two domains. The correspondent pairs can be viewed as the instance pairs with the closest distributions of representations from the two domains. When the distributions are distinct, feature alignment is usually implemented in practice.
According to the above theoretical results, we obtain the comparisons between IB and DIB summarized in Table 1, with respect to the three terms of the target error decomposition. Clearly, there is a trade-off between IB and DIB in terms of SG and RD.

Elastic Information Bottleneck Method
From the previous results, IB and DIB each still have room for improvement, on RD and SG, respectively. To obtain a Pareto optimal solution under the IB principle, we propose a new bottleneck method, namely EIB. Please note that the generalized IB objective function in [7] is the same as the EIB objective function, but they are derived from different perspectives: the generalized IB was constructed for the convenience of solving the DIB problem, while EIB is proposed to balance SG and RD.

Definition 1 (Elastic Information Bottleneck).
The objective function of the elastic information bottleneck method is as follows: min L_EIB = (1 − α)H(T) + αI(X; T) − βI(Y; T). We can see that EIB is a linear combination of IB and DIB, which covers IB and DIB as special cases. Specifically, EIB reduces to IB when α = 1 and to DIB when α = 0. Since IB and DIB each outperform the other on RD or SG, respectively, a linear combination may lead to better target performance. In fact, the global optimal solution of a linear combination of objectives lies in the Pareto optimal solution set. Therefore, by adjusting α in [0, 1], we can balance SG and RD and achieve a Pareto frontier within the IB framework.
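The interpolation can be made concrete with a small discrete example. The sketch below is our own illustration (the toy joint distribution and encoder are invented); it evaluates L_EIB = (1 − α)H(T) + αI(X; T) − βI(Y; T) and shows that α = 1 gives the IB objective, α = 0 gives the DIB objective, and the objective is linear in α:

```python
import math

def H(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0)

def eib_objective(p_xy, enc, alpha, beta):
    """L_EIB = (1 - alpha) H(T) + alpha I(X;T) - beta I(Y;T) for discrete variables."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(p_xy[i][j] for i in range(n_x)) for j in range(n_y)]
    q_t = [sum(p_x[i] * enc[i][t] for i in range(n_x)) for t in range(n_t)]
    h_t = H(q_t)
    h_t_given_x = sum(p_x[i] * H(enc[i]) for i in range(n_x))
    i_xt = h_t - h_t_given_x                      # I(X;T) = H(T) - H(T|X)
    # Joint q(t, y) and I(Y;T) = H(T) + H(Y) - H(T, Y).
    q_ty = [sum(p_xy[i][j] * enc[i][t] for i in range(n_x))
            for t in range(n_t) for j in range(n_y)]
    i_yt = h_t + H(p_y) - H(q_ty)
    return (1 - alpha) * h_t + alpha * i_xt - beta * i_yt

p_xy = [[0.4, 0.1], [0.1, 0.4]]       # toy joint distribution p(x, y)
enc = [[0.9, 0.1], [0.2, 0.8]]        # toy stochastic encoder p(t|x)
ib = eib_objective(p_xy, enc, alpha=1.0, beta=2.0)    # IB:  I(X;T) - beta I(Y;T)
dib = eib_objective(p_xy, enc, alpha=0.0, beta=2.0)   # DIB: H(T)   - beta I(Y;T)
```

For a stochastic encoder, the DIB value exceeds the IB value by exactly H(T|X) > 0, the extra compression term discussed above.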
As a bottleneck method similar to IB, its optimal solution can be calculated by iterative algorithms when the joint distribution of instances and labels is known. The algorithm and EIB's optimal solutions on information planes are provided in Appendix B. However, the iterative methods are intractable when the data are numerous and high-dimensional. Therefore, we use variational approaches similar to the method in Fischer [2] to optimize the IB, DIB, and EIB objective functions with neural networks. Assume the representation is a K-dimensional Gaussian with independent dimensions. The network contains an encoder p(t|x), a decoder q(y|t), and a backward encoder b(t|y). Specifically, the encoder is an MLP that outputs the K-dimensional expectations μ_1 and the K diagonal elements of the covariance matrix σ_1 and then yields the Gaussian representation via the reparameterization trick [15]. The decoder is a simple logistic regression model. The backward encoder is an MLP that takes the one-hot encoding of the classification outcome as input and outputs the K-dimensional expectations μ_2 and the K diagonal elements of the covariance matrix σ_2. The loss function of variational EIB is as follows, where CE is the cross-entropy loss. Our simulations and real data experiments show that EIB outperforms IB and DIB. It is also worth noting that EIB can be plugged into previous transfer learning algorithms, which expands its applications. An example is given in the next section.

Experiments
In this section, we evaluate IB, DIB, and EIB by both toy data simulations and real data experiments.

Toy Data Simulations
We conduct simulations to study the performance of EIB in the transfer learning scenario, to compare RD between IB and DIB, and to compare our generalization bound with the previous one.

Performance of EIB in Transfer Learning
We design a toy binary classification problem, where the instance X ∈ {0, 1}^10 and the label Y ∈ {0, 1}. Define X_0 = [1,1,1,1,1,0,0,0,0,0], X_1 = [0,0,0,0,0,1,1,1,1,1], Y_0 = 0, and Y_1 = 1. First, in the source and target datasets, half of the examples are (X_0, Y_0) and the other half are (X_1, Y_1). Then, we introduce random noise into the instances in the following way. We define the reverse-digit operation as choosing one digit of the instance and flipping it (changing 0 to 1 or vice versa). For each instance, we perform the reverse-digit operation ⌊N⌋ times (rounding down), where N is a real-valued uniform random variable, N ∼ U[0, R]. R is called the noise level because, as R increases, more instances differ substantially from X_0 and X_1. As a result, we can adjust the similarity between the two domains by modifying the parameter R. Lastly, we set R ∈ (1, 3) for the source domain and R = 3 for the target domain to obtain toy data in a transfer learning scenario.
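The generation procedure above can be sketched as follows (our reconstruction from the description; the function names are ours):

```python
import random

def make_instance(base, R, rng):
    """Apply the reverse-digit operation floor(N) times, with N ~ U[0, R]."""
    x = list(base)
    n_flips = int(rng.uniform(0.0, R))   # round down
    for _ in range(n_flips):
        i = rng.randrange(len(x))
        x[i] = 1 - x[i]                  # flip the chosen digit
    return x

def make_dataset(m, R, seed=0):
    """m examples: half (X0, 0) and half (X1, 1), with noise level R."""
    rng = random.Random(seed)
    X0 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    X1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    data = []
    for k in range(m):
        base, y = (X0, 0) if k < m // 2 else (X1, 1)
        data.append((make_instance(base, R, rng), y))
    return data

source = make_dataset(1000, R=1.5)   # source domain, noise level R in (1, 3)
target = make_dataset(1000, R=3.0)   # target domain, noise level R = 3
```

Raising R toward 3 makes the source dataset statistically closer to the target one, which is how the simulations vary domain similarity.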
We test our EIB model on the toy data. The parameter α ranges over [0, 1], indicating different EIB models. The other parameter β is chosen to be 10^4 for R = 2, 1.5, 1.433 and 5 × 10^3 for R = 1.375, 1.25 in order to discriminate the models' performance and to obtain high accuracy. The results are shown in Table 2. From the results, we can see that EIB outperforms IB and DIB when R = 1.5, 1.433, 1.375, 1.25. More specifically, as the two domains become more similar, i.e., as the noise level R of the source domain approaches R = 3 of the target domain, the α of the best model changes from 0 to 1, i.e., IB gradually becomes more advantageous. This is in accordance with our theory that when the two domains are similar, the effect of RD is apparent and IB has more of an advantage in transfer learning. This result also provides an empirical rule for tuning the parameter α.

Comparison of the Generalization Bounds

In this simulation, we compare the DIB generalization bound (ours) and the IB generalization bound [8] in terms of bound value and required sample size.
Assume that the data, representations, and labels are discrete, with |X| = 3, |T| = 2, |Y| = 2, and δ = 0.1. The sample size increases exponentially from 10^1 to 10^6. The bound and error rate are computed as follows. First, we generate distributions p(x, y) and p(t|x) and sample an empirical distribution p̂(x, y). To compare the bounds in a general case, p(x, y) is randomly valued by numbers generated from a uniform or a normal distribution, and the value of p(t|x) is also randomly generated from a uniform or a normal distribution. p̂(x, y) is the empirical estimate of p(x, y). Second, we determine whether the constraints for the bound in Appendix A.3 are met. If so, we calculate the value of the bound. If not, we record the number of times the constraints are not satisfied. Third, we repeat this process 100 times to obtain the rate of constraint violations (error rate) and the two average bound values. Finally, we repeat the previous three steps under different sample sizes. Figure 3 (left) shows that our bound consistently needs fewer samples than the previous one to satisfy the constraints. Moreover, the more uniform the distribution is, the fewer samples are needed to satisfy the constraints, which is evident from the formula of the constraints. Figure 3 (right) shows that the generalization bounds decrease as the sample size increases, and the DIB generalization bound is always smaller than the IB generalization bound.

Comparison of RD between IB and DIB
To support the claims about RD, we approximate RD by using a classifier to predict which domain a representation sample t comes from, which has been shown to be an adequate approximation of RD [26].
The data generation, model, and parameters (R, β) are consistent with the simulation in Section 4.1.1. The data are generated twice under different noise levels R and are trained with EIB models with α = 0 or 1. Then, five samples are drawn from the Gaussian representation p(t|x) for each instance x. If x comes from the source domain, we label the samples as positive; otherwise, we label them as negative. We then train a linear classifier to classify the samples. Each classifier is trained 20 times. The average error rate is shown in Table 3, with the standard error of the mean (SEM) in parentheses. The results show that IB has a larger error rate than DIB in most cases, indicating that IB has a smaller RD, which is consistent with the main result in Section 3.2. Furthermore, the gap between IB's and DIB's error rates becomes larger with the growth of R, which validates that when the two domains become more similar, IB's advantage on RD becomes more significant, as claimed in Sections 3.2 and 4.1.1. This trend does not seem to hold between R = 1.433 and R = 1.375 because β, which is chosen to minimize the error rate, differs between the two settings.
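The domain-classifier proxy for RD can be sketched in one dimension: with paired Gaussian representations whose means are close across domains, a larger representation variance (IB-like) raises the error of the best linear classifier, i.e., lowers the RD proxy. The sketch below is our own illustration with invented numbers, not the experiment's actual classifier:

```python
import random

def domain_error(src, tgt):
    """Error of the best 1-D threshold classifier separating two samples;
    a high error means the domains are hard to distinguish, i.e., a small
    discrepancy between the representation distributions."""
    best = 0.5
    for thr in sorted(set(src + tgt)):
        # classify t > thr as "target"; also try the flipped rule
        err = (sum(s > thr for s in src)
               + sum(t <= thr for t in tgt)) / (len(src) + len(tgt))
        best = min(best, err, 1.0 - err)
    return best

rng = random.Random(0)
mu_src, mu_tgt = 0.0, 0.3           # close means: semantically paired domains
for sigma in (0.05, 1.0):           # DIB-like (small) vs IB-like (large) variance
    src = [rng.gauss(mu_src, sigma) for _ in range(400)]
    tgt = [rng.gauss(mu_tgt, sigma) for _ in range(400)]
    print(f"sigma={sigma}: best domain-classifier error = {domain_error(src, tgt):.3f}")
```

With a small variance the two domains are almost perfectly separable (error near 0), while a large variance pushes the error toward 0.5, mirroring the IB-versus-DIB gap reported in Table 3.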

Real Data Experiments
Similar to IB, EIB can also be utilized in many representation learning algorithms as a regularizer. As an example, we combine EIB with DFA-MCD in Wang et al. [33], an adversarial domain adaptation algorithm with feature alignment. We replace the features in DFA-MCD with Gaussian representations and add a backward encoder after the source classifier, as in variational EIB. Since the features are already Gaussian, we remove the KLD regularizer in DFA-MCD, which is the KL-divergence penalty between the source domain features and a Gaussian prior. Then, we substitute the cross-entropy loss with the EIB loss. The adversarial training and feature alignment designs are retained.
We test the model on a common transfer learning task, where the source dataset is MNIST [34] and the target dataset is USPS [35]. We only use 2/55 of the training data, with λ = 15 and β = 10^9; the other parameters remain the same as in Wang et al. [33]. The experiment is repeated with six random seeds, and the results are shown in Figure 4 (left). For each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the '+' symbol. The plot is automatically generated by the boxplot function in MATLAB. We can see that the model with EIB performs better than IB and DIB in most cases, e.g., α = 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, and some of these settings beat the DFA-MCD baseline model, which suggests that EIB works as an effective regularizer for transfer learning. Please note that this example is intended to demonstrate that EIB can be combined with domain adaptation algorithms and perform better than IB and DIB, so we simply inherit the structure and hyper-parameters of the original networks in Wang et al. [33]. Further parameter tuning could achieve better experimental results.

Source Generalization Analysis
We use variational EIB on MNIST to compare the SG of IB and DIB. The networks were trained for 100 epochs and converge at about 60 epochs. The average generalization gap is calculated as the mean discrepancy between the training error and the testing error over the 25 epochs with the least testing error. We randomly initialize the network 14 times and train EIB models with α = 0 (DIB), α = 0.5, and α = 1 (IB) under the same initializations. The results are shown in Figure 4 (right) and Table 4. The baseline is a model without a regularizer, i.e., β = +∞ in EIB. First, let us analyze the results in terms of β. When β is small, the models over-compress the representations, so the error rate is large. When β is large, the weight of the regularization term in the objective function is small, so the three models' performances become close. When β = 10^5, the three models achieve the best accuracy. Then, in terms of α: as α decreases, the generalization gap becomes smaller. When β = 10^4, 5 × 10^4, and 10^5, the p-value of the t-test on "DIB's SG < IB's SG" is less than 0.05, indicating that the source generalization error of DIB is statistically significantly smaller than that of IB. (The p-value is defined as the probability of obtaining a test statistic as extreme as, or more extreme than, the one actually observed, assuming the null hypothesis is true; conventionally, if the p-value is below 0.05, the null hypothesis is rejected and the result is considered statistically significant.) This validates that DIB generalizes better than IB.
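The SG estimate and the one-sided test described above can be sketched as follows. The helper names and the synthetic error curves in the usage are illustrative assumptions, and scipy is assumed for the t distribution:

```python
import numpy as np
from scipy import stats

def source_generalization_gap(train_err, test_err, k=25):
    """SG estimate as described above: mean train/test error discrepancy
    over the k epochs with the least testing error."""
    train_err, test_err = np.asarray(train_err), np.asarray(test_err)
    best = np.argsort(test_err)[:k]
    return float(np.mean(test_err[best] - train_err[best]))

def one_sided_paired_p(sg_dib, sg_ib):
    """p-value for H1: DIB's SG < IB's SG, paired over the shared random
    initializations (H0: no difference)."""
    d = np.asarray(sg_dib) - np.asarray(sg_ib)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return float(stats.t.cdf(t, df=len(d) - 1))    # P(T <= t) under H0

# --- illustrative usage with synthetic numbers ---
rng = np.random.default_rng(0)
epochs = np.arange(100)
test_err = 0.10 + 0.50 * np.exp(-epochs / 20.0)    # fake decaying test error
train_err = test_err - 0.05                        # fake constant 0.05 gap
gap = source_generalization_gap(train_err, test_err)

sg_ib = rng.normal(0.05, 0.005, size=14)           # fake SG over 14 seeds
sg_dib = sg_ib - rng.normal(0.01, 0.002, size=14)  # consistently smaller
p = one_sided_paired_p(sg_dib, sg_ib)
```

A paired test is the natural choice here because IB and DIB runs share the same 14 initializations.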

Conclusions and Future Work
This work studies the two objective functions of the information bottleneck principle. The motivation comes from our theoretical analysis showing that neither IB nor DIB is optimal in terms of generalization ability under the transfer learning scenario. Specifically, we theoretically analyze the SG and RD of IB and DIB and find that there is a trade-off between them. To tackle this problem, we propose a new method, EIB, which interpolates between IB and DIB. Consequently, EIB can not only achieve better transfer learning performance but can also be plugged into existing domain adaptation methods as a regularizer to suit different kinds of data, as shown by our simulations and real data experiments.
We believe that our results take an important step towards understanding different information bottleneck methods and provide some insights into the design of stronger deep domain adaptation algorithms. We qualitatively suggest how to choose α, but in practice, even when the distance between the two domains is fixed, the optimum still requires careful tuning. Therefore, how to choose the best parameters α and β in EIB efficiently remains a question that we plan to study in the future.
(2) If there exists x_i > 1/2, then since ∑_i x_i = 1, at most one x_i can be larger than 1/2; we denote it by x_n. It is easy to verify that f
The proofs of these three parts are similar, and we use the first part as an example to illustrate where our proof differs from the original proof. For the first term of (A3),

|H(T) −
Our proof begins to deviate from the original proof at (A6), where the original proof adds the sample mean and then bounds the sample variance. From this point on, the two proofs process the variance in two entirely different ways, leading to two different bounds: we treat the variance with our lemmas, while the proof of the original bound uses the triangle inequality and an inequality linking the KL-divergence to the L1 norm. To grasp the detailed distinction between the two proofs, we recommend reading both. We now first continue our proof from (A10) and then show how Shamir et al. [8] processed the variance.
From (A18), (A31), and (A36), we conclude Theorem 3. Here we provide some discussion of the results. A. A comparison of the order of the previous bound and our bound. Then, with Equation (22) and the analysis in our paper, IB's RD is smaller than that of DIB. For batch = 1 to batch-number: Step 1: update E, D_1, D_2, BE_1, BE_2 to minimize the objective, where E stands for the encoder, D_1 and D_2 stand for the decoders, BE_1 and BE_2 stand for the backward encoders, Ŷ_t1 and Ŷ_t2 stand for the target label predictions by the two decoders, and X̂_s and X̂_t stand for the source and target reconstructed instances. The dimension of the Gaussian representations is K = 768. The model was trained for 200 epochs with a batch size of 128 and an initial learning rate of 0.0002, and the experiments were repeated with different random seeds. Our computing infrastructure: GPU memory: 24,268 MiB; operating system: Linux; PyTorch: 1.9.0+cu111.

Figure 1. The previous proof vs. our proof.

Figure 2. The effect of randomness on RD.

Figure 3. "Pre." denotes the previous bound. Left: comparison of the two bounds with respect to the constraint error as the sample size grows. Right: comparison of the two bounds with respect to tightness when samples are sufficient.

Figure 4. Left: target domain accuracy of EIB in transferring MNIST to USPS; BL (baseline) is the DFA-MCD method. Right: SG on MNIST by EIB.

Figure A2. Top: α = 1; β takes 500 points evenly in [1.5, 51.5]; colors from blue to red correspond to β from small to large. Bottom: β = 4.5; α takes 201 points evenly in [0, 1]; colors from blue to red correspond to α from small to large. The blue horizontal line is H(Y), the upper bound of I(Y;T).

Figure A3. Left: each curve is drawn by altering β. Right: each curve is drawn by altering α.

Appendix D. Experimental Details
Our code for variational EIB is based on the code of VIB [15]. For variational EIB on toy data, the dimension of the Gaussian representations is K = 5. The model was trained for 50 epochs with a batch size of 100 and an initial learning rate of 0.01, and the experiments were repeated with different random seeds. For variational EIB on MNIST, the dimension of the Gaussian representations is K = 256. We trained the variational EIB model for 50 epochs with a batch size of 100 and an initial learning rate of 10^-4, and the experiments were repeated with different random seeds. Our code for EIB-DFA-MCD is based on the code of DFA-MCD in Wang et al. [33]. The pseudocode of EIB-DFA-MCD is shown in Algorithm A2. The constants C_1, C_2, C_3, C_4, C_5 depend only on δ, |Y|, min_{x,y} p(x|y), min_{t,y} p(t|y), min_x p(x), and min_t p(t).

Table 1. A comparison of IB and DIB.

Table 2. Accuracy of elastic information bottleneck (EIB) on simulated data with different noise levels R. The highest accuracy under each noise level is marked in bold.

Table 3. The average error rate of classification between representation samples on the two domains with different noise levels. A larger error rate indicates a smaller RD. The highest error rate under each noise level is marked in bold.

Table 4. T-test on "DIB's SG < IB's SG". Statistically significant results are marked in bold.