Exploring the Trade-Off in the Variational Information Bottleneck for Regression with a Single Training Run

Kudo, Sota; Ono, Naoaki; Kanaya, Shigehiko; Huang, Ming

doi:10.3390/e26121043


¹ Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma 630-0192, Japan

² Institute of Advanced Computing and Digital Engineering, Shenzhen Institute of Advanced Technology, Shenzhen 518055, China

* Author to whom correspondence should be addressed.

Entropy 2024, 26(12), 1043; https://doi.org/10.3390/e26121043

Submission received: 6 October 2024 / Revised: 21 November 2024 / Accepted: 29 November 2024 / Published: 30 November 2024

(This article belongs to the Section Information Theory, Probability and Statistics)


Abstract

An information bottleneck (IB) enables the acquisition of useful representations from data by retaining necessary information while reducing unnecessary information. In its objective function, the Lagrange multiplier $β$ controls the trade-off between retention and reduction. This study analyzes the Variational Information Bottleneck (VIB), a standard IB method in deep learning, in the setting of regression problems and derives its optimal solution. Based on this analysis, we propose a framework for regression problems that can obtain the optimal solution of the VIB for all $β$ values with a single training run. This is in contrast to conventional methods that require one training run for each $β$. The optimization performance of this framework is theoretically discussed and experimentally demonstrated. Our approach not only enhances the efficiency of exploring $β$ in regression problems but also deepens the understanding of the IB’s behavior and its effects in this setting.

Keywords: information bottleneck; deep learning; regression model; supervised learning

1. Introduction

1.1. Information Bottleneck

Information extraction refers to obtaining a new representation by retaining the necessary information from a given input while reducing unnecessary information. An information bottleneck (IB) [1] formalizes this process from an information-theoretic perspective. Consider a source random variable X and a target random variable Y, with their joint distribution assumed to be known. The objective of the IB is to derive a random variable Z from X that compresses the information contained in X while retaining as much information as possible about Y. Formally, with mutual information $I(\cdot; \cdot)$, this can be expressed as follows:

$$\max_{Z \in Δ} I(Z; Y) \quad \text{s.t.} \quad I(X; Z) \leq r \tag{1}$$

Here, $Δ$ represents the set of all random variables Z that satisfy the Markov chain $Y \leftrightarrow X \leftrightarrow Z$. We call this the IB objective. In many practical cases, to avoid constrained optimization, the following objective is used instead of directly optimizing the IB objective. This is the Lagrangian relaxation [2] of the IB objective and is called the IB Lagrangian [3].

$$L_{\text{IB}}(Z; β) = I(Z; Y) - β I(X; Z) \tag{2}$$

Here, $β > 0$ is called the Lagrange multiplier. The advantage of the IB method lies in its ability to control the trade-off between compression $I(X; Z)$ and prediction $I(Z; Y)$ through $β$. When $β$ is small, the resulting representation tends to be more predictive, while a larger $β$ leads to a more concise representation.
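To make the role of $β$ concrete, the following minimal sketch evaluates the IB Lagrangian of Equation (2) for a toy discrete problem. The toy distribution (X uniform on four values, Y the parity of X) and the three candidate encoders are illustrative assumptions and do not come from the paper.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p_a[a] * p_b[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0.0
    )

def ib_lagrangian(p_xy, p_z_given_x, beta):
    """L_IB = I(Z;Y) - beta * I(X;Z) for an encoder p(z|x) under Y <-> X <-> Z."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(p_z_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, z) = p(x) p(z|x)
    p_xz = [[p_x[x] * p_z_given_x[x][z] for z in range(n_z)] for x in range(n_x)]
    # p(z, y) = sum_x p(x, y) p(z|x), which uses the Markov chain assumption
    p_zy = [[sum(p_xy[x][y] * p_z_given_x[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    return mutual_information(p_zy) - beta * mutual_information(p_xz)

# Hypothetical toy problem: X uniform on {0, 1, 2, 3}, Y = X mod 2.
p_xy = [[0.25, 0.0], [0.0, 0.25], [0.25, 0.0], [0.0, 0.25]]
encoders = {
    "identity": [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)],
    "parity": [[1.0 if z == x % 2 else 0.0 for z in range(2)] for x in range(4)],
    "constant": [[1.0] for _ in range(4)],
}
scores = {name: ib_lagrangian(p_xy, enc, beta=0.5) for name, enc in encoders.items()}
```

In this toy case the parity encoder keeps everything relevant to Y while discarding half of the information in X, so it scores highest among the three candidates at $β = 0.5$.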

1.2. Methods of IB

Mutual information often involves integrals, which are difficult to calculate, making it challenging to solve the IB Lagrangian in general situations. Consequently, early IB methods were proposed for limited scenarios, such as when X and Y take on a relatively small number of discrete values [1] or when they follow a joint Gaussian distribution [4]. Later, the Variational Information Bottleneck (VIB) [5] introduced a method to solve the IB Lagrangian in more general situations using variational approximations and deep neural networks (DNNs). This approach has become a standard method of solving IBs in DNN-based applications. The details of the VIB are introduced in Section 3. While this paper focuses on analyzing the VIB, it is important to note that there are other methods for IBs [6,7,8,9,10], as well as other objective functions inspired by IBs [11,12,13,14,15].

1.3. Effects and Applications of IB in DNNs

There has been extensive research on the use of IBs in supervised learning. As we will see later, maximizing $I(Z; Y)$ is related to likelihood maximization. In addition to this likelihood maximization, an IB can remove unnecessary information, which can sometimes be detrimental to robustness, through compression. This can have a beneficial effect on deep learning models, which can occasionally become overly flexible. Theoretically, ref. [16] has explored IB-based statistical learning theory using DNNs, and [17] has provided a generalization gap between true mutual information and empirical mutual information in probabilistic models. In particular, the IB framework has shown many experimental benefits in classification problems with DNNs. These include improvements in generalization performance [5,8], invariance to nuisance factors [18], robustness against adversarial attacks [5,8,9], out-of-distribution detection [9,19], domain generalization [20,21], and calibration [19]. Some works have applied IBs in regression tasks [6,10,22,23,24]. Ngampruetikorn and Schwab [22] studied the quantification of overfitting in the context of IBs specifically in regression tasks with a linear data generation process. Recently, ref. [10] proposed a superior IB method for regression tasks using Cauchy–Schwarz divergence. While the specific effects of compression in regression problems have not been explored as extensively as in classification, this area holds promise for future research. In addition to supervised learning, the IB has also been used in unsupervised learning [25,26] and as a tool to understand DNNs [27,28,29].

1.4. Efficient $β$ Exploration

In general, the optimal value of $β$ to be used during training is not known beforehand. This is because the optimal trade-off between prediction and compression, which maximizes generalization, depends on the task-specific distribution [17]. Therefore, practitioners usually train multiple models with varying $β$ values and select the model with the best properties. However, the drawback of this approach is that it requires multiple training runs and is computationally expensive.

To make this search more efficient, several studies have been conducted. Ref. [30] showed that, even within the range $β \in (0, 1)$, inappropriate values of $β$ can render the IB Lagrangian unlearnable (i.e., the optimal Z becomes independent of X). They theoretically established sufficient conditions for $β$ to be learnable and proposed an algorithm to estimate the range of $β$ values that are likely to be learnable. This research is valuable in narrowing down the range of $β$ values to be explored from the perspective of search efficiency. However, finding a useful $β$ still requires multiple training runs. Ref. [31] developed a method to achieve the desired compression rate r in a single training run by establishing a one-to-one mapping between $β$ and r. However, the challenge of identifying the appropriate compression rate ultimately still requires trial and error. Additionally, ref. [9] used supervised disentangling to obtain a maximally compressed representation Z from the training data without reducing $I(Z; Y)$. Although this training process does not depend on $β$, it is important to note that this representation is not necessarily the most useful. This is because our goal is often to maximize the true $I(Z; Y)$, not the empirical $I(Z; Y)$ observed in the training data. In fact, similar to the bias–variance trade-off, the true $I(Z; Y)$ is determined by the trade-off between the empirical $I(Z; Y)$ and $I(X; Z)$ in the training data [17]. Previously, we proposed a framework called the Flexible Variational Information Bottleneck (FVIB) [32] that enables the VIB objective to be learned for all values of $β$ in a single training run for classification tasks. The FVIB allows models with all $β$ values to be obtained without requiring additional parameters or training time, as compared to a single training process for the VIB, thereby enabling efficient $β$ exploration.

1.5. Contributions and the Structure of This Paper

In this paper, we propose FVIB-Regression (FVIB-R) for regression tasks, which allows the VIB objective to be learned for all $β$ values in a single training run. First, in Section 3, we derive the closed-form solution for the VIB in regression tasks. This provides insights into the behavior of the VIB in regression problems and is expected to be useful for future theoretical analysis and improvements to the IB or VIB in this setting. Next, in Section 4, we use this analysis to design FVIB-R and theoretically discuss its optimization performance. Finally, in Section 5.1.1, we experimentally demonstrate the optimization performance of FVIB-R.

2. Related Work

2.1. Theory of IB

This study analytically derives the optimal solution of the VIB, and its findings have the potential to contribute to the theoretical understanding of both the VIB and IB in the future. Therefore, we discuss the relationship between this work and previous theoretical studies on IBs. Ref. [33] discovered that, in IBs, when $β$ is varied, there is a point at which a significant change in the prediction accuracy occurs, which they referred to as the IB phase transition. They also demonstrated that new correlations are learned during this phase transition. Our study elucidates the behavior of the VIB optimal solution with respect to $β$, which could potentially connect with the theoretical research on the IB phase transition. Kolchinsky et al. [34] studied IBs in the settings of stochastic models and deterministic scenarios. While our research is constrained to regression problems, it does not rely on the deterministic scenario constraint, meaning that it also addresses datasets with multiple different labels for the same input data. Amjad and Geiger [35] theoretically demonstrated several issues when training classification problems using IBs with deterministic DNNs, and they pointed out that stochastic DNNs such as the VIB can avoid these problems. In contrast, our study focuses on the VIB in the context of regression problems. Many works study the relationship between information-theoretic quantities and generalization [16,36,37,38,39,40,41,42,43]. In particular, ref. [16] explored statistical learning theory using mutual information in the context of deterministic deep neural networks. Ref. [17] provided a generalization gap between the true mutual information and the empirical mutual information in stochastic models. This research confirms that the VIB is effective for generalization as a stochastic model.

2.2. Variational Autoencoders

Variational autoencoders (VAE) [44] and $β$-VAE [45] have loss functions that include both distortion (prediction) and rate (compression) terms, and they can be interpreted as special cases of the VIB [5,46,47]. Therefore, our analysis is related to the existing analyses of VAE. Many studies have sought to understand the characteristics of the optimal solutions in $β$-VAE [48,49,50]. Our analysis revisits these studies in the context of the VIB, which is a supervised learning framework. By doing so, our analysis extends the insights gained from VAE to the domain of IBs and has the potential to provide a profound understanding of how the interplay between distortion and rate terms can be leveraged in supervised learning scenarios.

3. Analysis of VIB in Regression Tasks

In this section, we analyze the VIB in the settings of regression tasks. We begin by explaining the VIB, followed by the introduction of a specific model setup for our analysis. Finally, we derive the optimal solution for the VIB with this model.

3.1. Variational Information Bottleneck

The VIB enables the learning of IBs in general settings by providing a lower bound on Equation (2) through variational approximation. When predicting Y from X, the random variable Z is obtained via the feature extractor $p_{θ}(Z | X)$. First, let us consider the prediction term, i.e., the first term in Equation (2). By using a new model $q_{ϕ}(Y | Z)$ as a variational approximation of $p_{θ}(Y | Z)$, the following variational lower bound is obtained:

$$I(Z; Y) \geq \int dx\, dy\, dz\; p(x, y)\, p_{θ}(z | x) \log q_{ϕ}(y | z) + H(Y) \tag{3}$$

Here, $H(Y)$ represents the entropy of Y (or the differential entropy in the case of a regression problem). Estimating $H(Y)$ from observed data is not a straightforward task when Y is a continuous variable. However, since this value remains constant throughout the learning process, it can be ignored during training [51]. Next, regarding the compression term $I(X; Z)$ in Equation (2), by using $r(Z)$ as a variational approximation of $p_{θ}(Z)$, an upper bound can be obtained:

$$I(X; Z) \leq \int dx\; p(x)\, D_{\text{KL}}[p_{θ}(Z | x) \,\|\, r(Z)] \tag{4}$$

In practice, a fixed distribution is used for $r(Z)$. By combining these elements, a lower bound on the IB Lagrangian can be obtained. Given data $\{(x_{1}, y_{1}), \dots, (x_{N}, y_{N})\}$, and by using the empirical distribution as the joint distribution of X and Y, the objective function of the VIB is derived:

$$L_{\text{VIB}}(θ, ϕ; β) := \frac{1}{N} \sum_{i = 1}^{N} \left[ E_{p_{θ}(Z | x_{i})}[\log q_{ϕ}(y_{i} | Z)] - β D_{\text{KL}}[p_{θ}(Z | x_{i}) \,\|\, r(Z)] \right] \tag{5}$$

The first term, which is the prediction term, corresponds to the expected value of the log-likelihood, revealing the relationship between maximizing $I(Z; Y)$ and maximum likelihood estimation.

3.2. Model Settings for Analysis

Below, the model setup for the analysis is introduced. All subsequent lemmas and theorems are based on the following setup. Consider a d-dimensional regression task, using input data $x \in \mathcal{X}$ and the continuous label $y \in \mathbb{R}^{d}$. The random variable Z is defined as $z \in \mathbb{R}^{κ}$ with $κ \geq d$. We define $p_{θ}(Z | x)$ as follows:

$$p_{θ}(Z | x) = N(μ(x), Σ(x)) \tag{6}$$

Here, let $μ : \mathcal{X} \to \mathbb{R}^{κ}$ and $Σ : \mathcal{X} \to \{A \in \mathbb{R}^{κ \times κ} \mid A \succ 0\}$, where $Σ$ represents the covariance matrix. Although the covariance matrix is typically restricted to a diagonal matrix, our analysis does not impose this restriction. Based on these settings, the parameters are defined as $θ = \{μ, Σ\}$. The distribution $r(Z)$ is typically set as $r(Z) = N(0, I)$, and we adopt this in our study as well.

The predictor $q_{ϕ}$ is often composed of a single fully connected layer in practice. With this in mind, in our analysis, we define the fully connected layer as $v(z) := Wz + b$, where $W \in \mathbb{R}^{d \times κ}$ and $b \in \mathbb{R}^{d}$. Using this, $q_{ϕ}$ is defined as follows:

$$q_{ϕ}(y | z) = N(v(z), \frac{1}{2} I) \tag{7}$$

This setting corresponds to the mean squared error (MSE):

$$\log q_{ϕ}(y | z) = -\| v(z) - y \|_{2}^{2} - \frac{d}{2} \log π \tag{8}$$

Therefore, $ϕ = \{W, b\}$. It should be noted that the second term is constant during training.
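Because $\log q_{ϕ}$ in Equation (8) is quadratic in $z$ and both $p_{θ}(Z | x)$ and $r(Z)$ are Gaussian, each summand of Equation (5) can be written in closed form. The block below is a sketch of that computation using standard Gaussian identities under the settings of this section; it is not copied from the paper's appendix.

```latex
\begin{aligned}
E_{p_{θ}(Z | x_{i})}[\log q_{ϕ}(y_{i} | Z)]
  &= -\| W μ(x_{i}) + b - y_{i} \|_{2}^{2}
     - \operatorname{tr}\left( W Σ(x_{i}) W^{⊤} \right)
     - \frac{d}{2} \log π, \\
D_{\text{KL}}[p_{θ}(Z | x_{i}) \,\|\, r(Z)]
  &= \frac{1}{2} \left( \operatorname{tr} Σ(x_{i}) + \| μ(x_{i}) \|_{2}^{2}
     - κ - \log \det Σ(x_{i}) \right).
\end{aligned}
```

The first line uses $E\|Wz + c\|_{2}^{2} = \|Wμ + c\|_{2}^{2} + \operatorname{tr}(WΣW^{⊤})$ for $z \sim N(μ, Σ)$, and the second is the usual KL divergence from a Gaussian to the standard normal $N(0, I)$.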

3.3. The Optimal Solution

Under the above settings, we analyze the VIB objective. To this end, we introduce several new definitions. We split the data indices into equivalence classes defined by

$$[i] := \{j \in \{1, 2, \dots, N\} \mid x_{j} = x_{i}\} \tag{9}$$

This means that we are grouping data indices with the same input into a single set. Using this, for each label, we reassign the average of the labels whose indices belong to the same equivalence class:

$$\tilde{y}_{i} := \frac{1}{|[i]|} \sum_{j \in [i]} y_{j} \tag{10}$$

We define the mean and the covariance matrix for $\tilde{y}_{i}$:

$$m_{\tilde{y}} := \frac{1}{N} \sum_{i = 1}^{N} \tilde{y}_{i} \tag{11}$$

$$S_{\tilde{y}} := \frac{1}{N} \sum_{i = 1}^{N} (\tilde{y}_{i} - m_{\tilde{y}}) (\tilde{y}_{i} - m_{\tilde{y}})^{⊤} \tag{12}$$
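The definitions in Equations (9) to (12) can be sketched in a few lines. The example below is a minimal illustration for scalar labels ($d = 1$) with hashable inputs; the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def averaged_labels(xs, ys):
    """Replace each y_i by the mean label of all samples sharing the same x_i (Eq. 10)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)          # groups[x] collects the labels of class [i] (Eq. 9)
    class_mean = {x: sum(v) / len(v) for x, v in groups.items()}
    return [class_mean[x] for x in xs]

def label_moments(y_tilde):
    """Mean (Eq. 11) and 1/N-normalised variance (Eq. 12 in the scalar case d = 1)."""
    n = len(y_tilde)
    m = sum(y_tilde) / n
    s = sum((y - m) ** 2 for y in y_tilde) / n
    return m, s

xs = [0.0, 0.0, 1.0, 2.0]       # hypothetical inputs; the first two coincide
ys = [1.0, 3.0, 5.0, -1.0]
y_tilde = averaged_labels(xs, ys)
m_y, s_y = label_moments(y_tilde)
```

For vector-valued inputs, each `x` would first need to be converted to a hashable key such as a tuple.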

Additionally, in the following, $\operatorname{diag}(a)$ denotes a diagonal matrix with the vector $a$ as its diagonal elements, and $[a_{i}]_{i}$ represents a vector with its i-th component being $a_{i}$. Using these definitions, we derive the optimal solution for the VIB objective in the context of regression tasks.

Lemma 1.

Consider the model settings in Section 3.2. For any $β > 0$, the feature dimension $κ \geq d$ does not affect the value of $\max_{θ, ϕ} L_{\text{VIB}}(θ, ϕ; β)$. Below, we consider the case where $κ = d$. For any $β > 0$, $L_{\text{VIB}}(θ, ϕ; β)$ is maximized if the parameters satisfy the following conditions.

$μ(x_{i}) = \frac{2}{β} (I + \frac{2}{β} W^{⊤} W)^{-1} W^{⊤} (\tilde{y}_{i} - m_{\tilde{y}})$ for all $i = 1, 2, \dots, N$.
$Σ(x_{i}) = (I + \frac{2}{β} W^{⊤} W)^{-1}$ for all $i = 1, 2, \dots, N$.
$W = P \operatorname{diag}([\pm \sqrt{\max(0, λ_{i} - \frac{β}{2})}]_{i}) R$, where $S_{\tilde{y}} =: P \operatorname{diag}([λ_{i}]_{i}) P^{⊤}$ by orthogonal diagonalization and R is an arbitrary orthogonal matrix.
$b = m_{\tilde{y}}$.

Refer to Appendix A.1 for the proof. Below, the properties of the optimal solution of the VIB in regression tasks are considered based on Lemma 1. We discuss how well the optimal solution of the VIB optimizes the IB objective given by Equation (1). It is important to note that the joint distribution of X and Y considered here is not the true distribution but the empirical distribution of the training data. The IB framework assumes a Markov chain $Y \leftrightarrow X \leftrightarrow Z$, and, according to the data processing inequality, we have $I(Z; Y) \leq I(X; Z)$. This means that when the IB objective maximizes $I(Z; Y)$ under a given limit on $I(X; Z)$, the feasible region is constrained by $I(Z; Y) \leq I(X; Z)$. Furthermore, equality holds if and only if $I(X; Z | Y) = 0$. Now, consider Lemma 1. We see that the representation Z in the optimal solution is determined solely by Y and is conditionally independent of X given Y when X and Y lie within the support of the empirical distribution. This implies $I(X; Z | Y) = 0$, and thus $I(Z; Y) = I(X; Z)$ is achieved. This indicates that the set of optimal solutions for the VIB in regression tasks includes, at the very least, the optimal solutions of the IB objective under the empirical distribution. This is particularly interesting when considering that the VIB is designed through Lagrangian relaxation and the variational approximation of the original IB objective.

Furthermore, as we show in Section 4, FVIB-R achieves the condition of Lemma 1 through training, thereby reaching the optimal solutions not only of the VIB objective but also of the IB objective.
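As a worked special case (derived here from the conditions of Lemma 1 rather than stated in the paper), take $d = κ = 1$, so that $S_{\tilde{y}}$ reduces to a single eigenvalue $λ$, the $1/N$-normalised variance of the $\tilde{y}_{i}$:

```latex
\begin{aligned}
W^{2} &= \max\left(0,\; λ - \tfrac{β}{2}\right), \\
Σ &= \left(1 + \tfrac{2}{β} W^{2}\right)^{-1}
   = \begin{cases} \dfrac{β}{2λ}, & β < 2λ, \\[4pt] 1, & β \geq 2λ, \end{cases} \\
μ(x_{i}) &= \tfrac{2}{β}\, Σ\, W\, (\tilde{y}_{i} - m_{\tilde{y}})
   = \begin{cases} \dfrac{W}{λ}\, (\tilde{y}_{i} - m_{\tilde{y}}), & β < 2λ, \\[4pt] 0, & β \geq 2λ, \end{cases} \\
W μ(x_{i}) + b &= m_{\tilde{y}} + \max\left(0,\; 1 - \tfrac{β}{2λ}\right) (\tilde{y}_{i} - m_{\tilde{y}}).
\end{aligned}
```

In this scalar case, $β \geq 2λ$ collapses the representation onto the prior $N(0, 1)$, while for smaller $β$ the predicted mean is the averaged label shrunk toward $m_{\tilde{y}}$ by the factor $1 - β/(2λ)$.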

4. Methods

4.1. FVIB-R

Based on the above analysis, we design a framework called FVIB-R, which allows the optimization of the VIB objective for all values of $β$ in a single training run for regression problems. In this section, we first introduce the model structure and objective function of FVIB-R. Then, we theoretically discuss FVIB-R’s ability to maximize the VIB objective.

For FVIB-R, we train a model $h_{ψ} : \mathcal{X} \to \mathbb{R}^{d}$ with parameters $ψ \in Ψ$ by maximizing the following objective function:

$$J_{\text{FVIB-R}}(ψ) := -\frac{1}{N} \sum_{i = 1}^{N} \| h_{ψ}(x_{i}) - (\tilde{y}_{i} - m_{\tilde{y}}) \|_{2}^{2} \tag{13}$$

Note that, unlike the VIB objective, this objective function does not depend on $β$. Next, using the trained $h_{ψ}$, for any $β > 0$, we set the VIB parameters as follows:

$\tilde{μ}_{β, ψ}(x) := \frac{2}{β} (I + \frac{2}{β} W^{⊤} W)^{-1} W^{⊤} h_{ψ}(x)$;
$\tilde{Σ}_{β}(x) := (I + \frac{2}{β} W^{⊤} W)^{-1}$;
$\tilde{W}_{β} := P \operatorname{diag}([\sqrt{\max(0, λ_{i} - \frac{β}{2})}]_{i})$, where $S_{\tilde{y}} =: P \operatorname{diag}([λ_{i}]_{i}) P^{⊤}$ by orthogonal diagonalization;
$\tilde{b} := m_{\tilde{y}}$.

As a result, the value of $β$ does not influence the training process and can be adjusted during the evaluation phase. This is in contrast to the VIB framework, where $β$ affects the training setup and cannot be changed afterward.
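The post-training step described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the function name `fvib_r_parameters` and the toy values are hypothetical, and the sketch assumes that the $W$ appearing in the expressions for $\tilde{μ}_{β, ψ}$ and $\tilde{Σ}_{β}$ is $\tilde{W}_{β}$.

```python
import numpy as np

def fvib_r_parameters(h_x, S, m, beta):
    """Map a trained output h_psi(x) to VIB parameters for a given beta > 0.

    h_x : (d,) output of the trained model for one input x
    S   : (d, d) covariance of the averaged labels (Eq. 12)
    m   : (d,) mean of the averaged labels (Eq. 11)
    Assumption: the W in the mu/Sigma formulas is W_tilde_beta.
    """
    lam, P = np.linalg.eigh(S)                      # S = P diag(lam) P^T
    W = P @ np.diag(np.sqrt(np.maximum(0.0, lam - beta / 2.0)))
    Sigma = np.linalg.inv(np.eye(S.shape[0]) + (2.0 / beta) * W.T @ W)
    mu = (2.0 / beta) * Sigma @ W.T @ h_x
    return mu, Sigma, W, m

# Hypothetical 2-d example; the same h_x is reused for every beta, with no retraining.
S = np.diag([1.0, 4.0])
m = np.array([0.5, -0.5])
h_x = np.array([1.0, 2.0])
sweep = {beta: fvib_r_parameters(h_x, S, m, beta) for beta in (0.1, 1.0, 10.0)}
```

Only the eigendecomposition of $S_{\tilde{y}}$ and a small matrix inverse are recomputed per $β$, which is why sweeping $β$ at evaluation time is cheap in this sketch.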

In the following, we theoretically argue that, despite the structural advantages mentioned above, this setup and objective function can still learn the VIB objective.

Theorem 1.

Consider the model settings in Section 3.2. If $\lim_{t \to \infty} J_{\text{FVIB-R}}(ψ_{t}) = 0$, then, for $β > 0$, the sequence $\{L_{\text{VIB}}(\tilde{μ}_{β, ψ_{t}}, \tilde{Σ}_{β}, \tilde{W}_{β}, \tilde{b}; β)\}_{t \in \mathbb{N}}$ converges to $\max_{θ, ϕ} L_{\text{VIB}}(θ, ϕ; β)$ as $t \to \infty$.

This is derived from the fact that FVIB-R with $J_{\text{FVIB-R}}(ψ) = 0$ is the optimal solution to the VIB objective, and from the continuity of the VIB objective. A detailed proof is provided in Appendix A.2. This theorem shows that, for any $β$, when $J_{\text{FVIB-R}}(ψ)$ is sufficiently maximized, the VIB objective approaches its maximum. This characteristic indicates that FVIB-R can learn the VIB for any $β$, despite being a learning setup that is independent of $β$. Note that Theorem 1 is valid when $J_{\text{FVIB-R}}(ψ)$ converges to zero. However, in cases where gradient descent reaches a local optimum, or when the training is stopped early to prevent overfitting, $J_{\text{FVIB-R}}(ψ)$ does not strictly converge to zero. Even in such cases, the following property is beneficial regardless of the value that $J_{\text{FVIB-R}}(ψ)$ reaches.

Theorem 2.

Consider the model settings in Section 3.2 and assume

$$\{[h_{ψ}(x_{1})\; h_{ψ}(x_{2})\; \dots\; h_{ψ}(x_{N})] \mid ψ \in Ψ\} = \mathbb{R}^{d \times N}. \tag{14}$$

Then, for any $β > 0$,

$$\min_{ψ : J_{\text{FVIB-R}}(ψ) = α} L_{\text{VIB}}(\tilde{μ}_{β, ψ}, \tilde{Σ}_{β}, \tilde{W}_{β}, \tilde{b}; β) \tag{15}$$

is monotonically increasing with respect to $α$.

A detailed proof can be found in Appendix A.3. The assumption of Equation (14) is justified by the flexibility of neural networks, as shown in the universal approximation theorem [52]. Theorem 2 demonstrates that optimizing $J_{\text{FVIB-R}}(ψ)$ monotonically increases the worst-case value of the VIB objective for any $β$. In other words, this is a process similar to increasing a lower bound of the objective function, which is commonly done in practical machine learning applications. It should be noted that this theorem does not strictly guarantee an increase in the VIB objective. However, as seen later in Figure 1 and Figure 2, in the experiments, the VIB objective function does indeed increase. These two theorems together show that the proposed setup and objective function are effective in simultaneously maximizing the VIB objective for all values of $β$.
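The convergence behaviour in Theorem 1 can be illustrated numerically in the scalar case $d = κ = 1$. The sketch below is an illustration under stated assumptions, not the paper's experiment: the labels are made up, all inputs are assumed distinct, the posterior variance is shared across samples, and a constant offset `eps` added to the ideal output $\tilde{y}_{i} - m_{\tilde{y}}$ stands in for imperfect training, so that $J_{\text{FVIB-R}} = -\text{eps}^{2}$.

```python
import math

def vib_objective_1d(ys, mus, var, W, b, beta):
    """Eq. (5) for d = kappa = 1 with a shared posterior variance `var`.

    Uses E[-(W z + b - y)^2] = -(W mu + b - y)^2 - W^2 var for z ~ N(mu, var)
    and the Gaussian KL to N(0, 1); the constant -(1/2) log(pi) is included.
    """
    total = 0.0
    for y, mu in zip(ys, mus):
        pred = -(W * mu + b - y) ** 2 - W ** 2 * var - 0.5 * math.log(math.pi)
        kl = 0.5 * (var + mu ** 2 - 1.0 - math.log(var))
        total += pred - beta * kl
    return total / len(ys)

def fvib_r_objective_value(ys, hs, beta):
    """VIB objective after plugging outputs h_psi(x_i) into the FVIB-R parameters."""
    n = len(ys)
    m = sum(ys) / n
    lam = sum((y - m) ** 2 for y in ys) / n
    W = math.sqrt(max(0.0, lam - beta / 2.0))
    var = 1.0 / (1.0 + 2.0 * W ** 2 / beta)
    mus = [(2.0 / beta) * var * W * h for h in hs]
    return vib_objective_1d(ys, mus, var, W, m, beta)

ys = [0.3, -1.2, 2.0, 0.7, -0.4]          # hypothetical labels, distinct inputs assumed
m = sum(ys) / len(ys)
beta = 0.2
# h = (y_i - m) + eps with a shrinking error, so J_FVIB-R = -eps^2 tends to 0.
values = [fvib_r_objective_value(ys, [y - m + eps for y in ys], beta)
          for eps in (1.0, 0.3, 0.1, 0.0)]
```

In this toy setting the VIB objective rises as the FVIB-R error shrinks, and the final entry of `values` corresponds to $J_{\text{FVIB-R}} = 0$.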

4.2. Relation to FVIB

Previously, we proposed the FVIB [32], which enables the learning of the VIB for all values of $β$ in a single training run for classification problems. In this study, we address a similar challenge in the context of regression problems. Here, we summarize the relationship between FVIB-R and FVIB. Regarding the model setup during the analysis, while the feature extractor $p_{θ}$ remains the same, the predictor $q_{ϕ}$ differs to reflect the problem setting. Specifically, in FVIB-R, $\log q_{ϕ}$ is expressed as a quadratic function, allowing the expectation in the prediction term of the VIB objective to be computed analytically. On the other hand, in FVIB, the expectation in the prediction term cannot be solved analytically, and, instead, the Taylor approximation of $\log q_{ϕ}$ is used. As a result, FVIB-R directly analyzes the VIB objective, whereas FVIB analyzes an approximation of the VIB objective. Additionally, FVIB assumes a deterministic scenario, where no data points exist with the same X but different Y values [34]. In contrast, FVIB-R does not require this assumption. This is because, as seen in Equation (A1) in Appendix A.1, any non-deterministic situation in regression problems can be transformed into a deterministic situation without changing the VIB objective by replacing $y_{i}$ with $\tilde{y}_{i}$. Overall, for regression tasks, it is possible to analyze the VIB with fewer assumptions and without relying on approximations.
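The reason label averaging is harmless can be seen from a standard decomposition of the squared error within one equivalence class. The identity below is a sketch of that argument under the settings of Section 3.2; the precise statement used by the paper is Equation (A1), which is not reproduced here. For any vector $v$ shared by all indices in $[i]$ (for example, $v = Wz + b$ with $z \sim p_{θ}(Z | x_{i})$):

```latex
\sum_{j \in [i]} \| v - y_{j} \|_{2}^{2}
  = |[i]| \, \| v - \tilde{y}_{i} \|_{2}^{2}
  + \sum_{j \in [i]} \| y_{j} - \tilde{y}_{i} \|_{2}^{2},
\qquad \text{since } \sum_{j \in [i]} (y_{j} - \tilde{y}_{i}) = 0 .
```

The last sum does not depend on the model parameters, so replacing $y_{j}$ with $\tilde{y}_{i}$ alters the prediction term by at most a parameter-independent constant and leaves the maximizer unchanged.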

5. Experiments

5.1. Experimental Setup

We demonstrate the optimization performance of FVIB-R using two real datasets and further investigate its compression effects with a synthetic dataset. Below, we introduce the datasets and the experimental procedure for each setting. The detailed configurations are presented in Table 1.

5.1.1. Real Dataset

The first dataset is the California housing prices dataset [55], which consists of 20,640 samples. This dataset, based on the 1990 California census, involves a regression problem where the goal is to predict a single variable representing house prices using eight source variables, including the latitude, longitude, and the number of rooms. We use the version of the dataset distributed in the scikit-learn package [56]. The second dataset is the Nutrition5k dataset [57]. Nutrition5k contains around 5000 images of food. The task is to predict five continuous variables that indicate calories, mass, and three major nutrients, using these image data as input. The data acquisition and normalization procedures follow [58]. For Nutrition5k, we use parameters pretrained on ImageNet1k [59] as the parameters’ initial values to compensate for the limited data.

The following is an overview of the experimental procedure. We split the data into training, validation, and test sets in a ratio of 80%, 10%, and 10%, respectively. First, for each model, we conduct a grid search over the candidate learning settings, such as the epochs and learning rates, as shown in Table 1. During this step, $β$ is fixed at 0.01 for the California housing prices dataset and 0.05 for the Nutrition5k dataset, and the evaluation metric is the MSE on the validation set. Next, using the best training settings obtained from the grid search, we retrain the models on the training data with $β \in \{10^{-6}, \dots, 10^{-2}, 0.1, 0.2, \dots, 1.0\}$ for the California housing prices dataset and $β \in \{5 \times 10^{-6}, \dots, 5 \times 10^{-2}, 0.5, 1.0, 1.5, \dots, 5.0\}$ for the Nutrition5k dataset. Finally, we evaluate these trained models on the test data to assess their performance. All experiments are conducted using PyTorch [60]. All models use a predictor composed of an affine transformation on top of the feature extractor described in Table 1.

5.1.2. Synthetic Dataset

As a synthetic dataset, we sample x and y from the data generation process represented below:

$$x \sim N(0, 1) \tag{16}$$

$$y = \sin 5x + ε \tag{17}$$

where $ε \sim N(0, 1)$. For both training and evaluation, 100 samples are used. In experiments that are repeated multiple times, different data samples are used for each trial. To examine the effects of noise, overfitting, and the behavior of compression in response to these factors, a relatively large model is employed for the small dataset. The other settings are shown in Table 1.
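The data generation process of Equations (16) and (17) can be sketched with the standard library alone. The function name, the seed values, and the reading of "100 samples" as 100 per split are assumptions made for illustration.

```python
import math
import random

def sample_synthetic(n=100, seed=0):
    """Draw (x, y) pairs with x ~ N(0, 1) and y = sin(5x) + eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(5.0 * x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

# Different seeds give different samples for repeated trials (seed values are arbitrary).
train_x, train_y = sample_synthetic(seed=0)
eval_x, eval_y = sample_synthetic(seed=1)
```

Using a dedicated `random.Random` instance per call keeps each trial reproducible without touching the global random state.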

5.2. Results and Discussion

First, we verify the ability of FVIB-R to optimize the VIB objective. To do this, we compare the VIB objective values across the training epochs between FVIB-R and VIB. Figure 1 and Figure 2 show the VIB objective values for each training epoch on the California housing prices dataset and the Nutrition5k dataset, respectively. It is important to note that the VIB objective values presented here are computed using the training data. While the VIB is trained separately for each value of $β$, FVIB-R training is conducted only once. In both datasets, the VIB objective values for FVIB-R increase monotonically and converge to values similar or superior to those obtained by directly training the VIB for each $β$. These results align with the theoretical properties of FVIB-R discussed in Section 4. The reason that FVIB-R outperforms the VIB in optimization for $β = 0.5$ and $2.5$ in Figure 2 can be attributed to the fact that the VIB restricts the covariance matrix $Σ$ of Z to be diagonal for computational efficiency, whereas FVIB-R does not impose such a restriction (see Section 3.2). This makes the model more flexible. Additionally, FVIB-R tends to converge faster than direct VIB training. This faster convergence might be attributed to the different handling of the expectation of the prediction term in the VIB objective between the VIB and FVIB-R. In the VIB, the expected value is computed based on the sampled Z, which introduces fluctuations in the gradient direction. In contrast, FVIB-R analytically computes the expected value and directly learns the optimal solution, avoiding these fluctuations. This difference in the training methods could be the reason for the observed faster convergence in FVIB-R.

Next, we examine FVIB-R’s optimization performance from a different perspective. Figures 3 and 4 (California housing prices dataset) and Figures 5 and 6 (Nutrition5k dataset) illustrate the trade-off between the compression and prediction terms of the VIB objective. More precisely, prediction is minus the MSE and compression is the sample mean of the KL divergence in Equation (5). Each point in these plots corresponds to a different value of $β$. Both the training and test datasets are shown in these figures. Once again, it is important to note that, while the VIB requires multiple training processes for different values of $β$, FVIB-R is trained only once per dataset. Although FVIB-R is plotted at the discrete $β$ values used for evaluation, the model can move continuously along this curve. In Figure 3 and Figure 4, FVIB-R is compared to the VIB and squared-VIB (sq-VIB) on the California housing prices dataset. The squared versions, proposed by [34], use a squared value for the compression term in the objective function for the purpose of exploring the entire trade-off. In these figures, all methods are evaluated and plotted for $β \in \{10^{-6}, \dots, 10^{-2}, 0.1, 0.2, \dots, 1.0\}$. In Figure 3, we observe that FVIB-R achieves compression–prediction performance on the test data that is comparable to that of the VIB. On the training data, FVIB-R shows better predictions at similar levels of compression compared to the VIB. This difference is likely not due to the framework’s inherent performance, but rather a result of the grid search process selecting models that were either underfitting or overfitting. Moreover, Figure 4 shows that FVIB-R is comparable to sq-VIB on both the training set and the test set.

Figure 5 and Figure 6 show the optimization performance of FVIB-R compared to the VIB or sq-VIB on the Nutrition5k dataset. In these figures, the VIB and sq-VIB are evaluated for $β \in \{5 \times 10^{-6}, \dots, 5 \times 10^{-2}, 0.5, 1.0, 1.5, \dots, 5.0\}$. In Figure 5, FVIB-R demonstrates performance comparable to the VIB’s on both the training and test data. Figure 6 shows that the highest value of prediction for FVIB-R is similar to that of sq-VIB, while FVIB-R is more predictive than sq-VIB when the representation is compressed to some extent. In conclusion, despite the more efficient training process of FVIB-R, its performance is comparable (or sometimes superior) to that of the traditional VIBs on both the training and test data.

Finally, we present the quantitative evaluation of the performance. Here, we show the MSE as it reflects

I (Z; Y)

, as shown in Equations (3) and (8). In addition to the VIB, we compare FVIB-R with the Nonlinear Information Bottleneck (NIB) [6], as well as the squared-VIB (sq-VIB) and squared-NIB (sq-NIB). Table 2 shows the MSE on the test data and the number of parameters required for each IB method. For each method, the best model is selected based on the MSE in the validation set across

β \in \{10^{-6}, \dots, 10^{-2}, 0.1, 0.2, \dots, 1.0\}

for the California housing prices dataset and

β \in \{5 \times 10^{-6}, \dots, 5 \times 10^{-2}, 0.5, 1.0, 1.5, \dots, 5.0\}

for the Nutrition5k dataset and then evaluated on the test set. In the parameter count notation, the last multiplication indicates the number of models required for the

β

search. FVIB-R outperforms the VIB on the California housing prices dataset (0.1910 versus 0.1979), while the VIB outperforms FVIB-R on the Nutrition5k dataset (0.0805 versus 0.0881). The comparison with the other IB methods is similarly split: sq-VIB, NIB, and sq-NIB achieve a lower MSE than FVIB-R on the California housing prices dataset, whereas FVIB-R achieves a lower MSE than all three on the Nutrition5k dataset. Overall, FVIB-R is competitive with the other IB methods while using fewer parameters (a single model rather than 15), which makes it an efficient alternative to traditional IB methods.
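As a rough illustration of the search cost, the sketch below builds the two β grids and counts their elements. The spacing hidden behind the dots is an assumption (powers of ten for the small values, evenly spaced steps for the large ones), and `beta_grid` is a hypothetical helper name, not something from the paper's code.

```python
import numpy as np

def beta_grid(scale=1.0):
    """Build the beta grid under the assumed spacing (hypothetical helper)."""
    # Assumed: the small values are powers of ten, 1e-6, 1e-5, ..., 1e-2.
    small = scale * 10.0 ** np.arange(-6, -1)
    # Assumed: the large values are evenly spaced, 0.1, 0.2, ..., 1.0.
    large = scale * np.linspace(0.1, 1.0, 10)
    return np.concatenate([small, large])

california = beta_grid(1.0)   # grid used for California housing prices
nutrition = beta_grid(5.0)    # grid used for Nutrition5k (same grid scaled by 5)

print(len(california), len(nutrition))
```

Under this reading each grid has 15 values, which agrees with the factor of 15 in the parameter counts of Table 2 for the methods that need one model per β.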

With the synthetic data, we investigate how the distributional difference between the true

p (y | x)

and its prediction changes with

β

. For this purpose, we use the expected KL divergence given below.

E_{x} [D_{KL} (p (y | x) \, \| \, p (\hat{y} | x))]

(18)

where

\hat{y}

is the predicted random variable obtained from

q_{ϕ} (y | z)

. For the estimation, the sample mean of the analytically derived KL divergence is calculated. The results are shown in Figure 7. It is observed that the predicted distribution moves closer to the true distribution with suitable compression.
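The estimate of Equation (18) can be sketched as follows, assuming that both the true and the predicted conditional distributions are univariate Gaussians, so that the KL divergence has a closed form. The function names and the toy means and variances are hypothetical and only serve as an illustration.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for univariate Gaussians, in nats."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def expected_kl(mu_true, var_true, mu_pred, var_pred):
    """Sample mean over x of the analytic KL divergence, as in Equation (18)."""
    return float(np.mean(gaussian_kl(mu_true, var_true, mu_pred, var_pred)))

# Hypothetical toy inputs: one entry per sampled x.
x = np.linspace(-1.0, 1.0, 5)
mu_true, var_true = np.sin(x), np.full_like(x, 0.1)
mu_pred, var_pred = 0.8 * np.sin(x), np.full_like(x, 0.2)

print(expected_kl(mu_true, var_true, mu_pred, var_pred))
```

The divergence is zero only when the predicted mean and variance match the true ones at every x, so both the smoothing of the mean and the widening of the variance enter the score.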

The distributions of the true

p (y | x)

and the predicted

p (\hat{y} | x)

for each

β

are shown in Figure 8. These results demonstrate that compression leads to a smoother mean line and wider variance. When

β = 1.0

, the predicted distribution is the closest match to the true distribution, which is consistent with the observations in Figure 7. The detailed behavior of compression and its limitations are discussed in the Limitations section.

6. Limitations and Future Work

Lemma 1 highlights the potential shortcomings of the VIB in regression problems, while also showing that FVIB-R inherits these shortcomings. The key points are summarized below. The first limitation lies in the simplicity of the compression procedure. Lemma 1 demonstrates the optimal solution of the VIB for regression problems. In particular,

μ (x)

can be viewed as a combination of

({\tilde{y}}_{i} - m_{\tilde{y}})

and a linear transformation controlled by

β

. Here, the former is determined by the data and thus includes noise, while the latter performs compression through its transformation. However, since this is a simple linear transformation, the noise that it can eliminate is limited. FVIB-R, which is designed based on this solution, shares the same limitation. Future research could compare it with other IB methods in this regard and investigate more flexible compression approaches.

The second limitation is a departure from known properties of the IB. It is theoretically known that the IB undergoes phase transitions, and, to the best of our knowledge, the phase transitions of the VIB have been experimentally confirmed in classification problems [33]. However, since Lemma 1 gives the solution of the VIB in regression problems, it suggests that phase transitions may not occur in such cases. This is because the VIB (and FVIB-R) aims to increase the IB's lower bound for the empirical distribution, and there is no guarantee that it reaches the IB optimal solution for the true distribution. Improvements to the VIB and FVIB-R in this regard are left for future work.
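The linear nature of the compression can be made concrete with a small sketch. Combining Equations (A7), (A12), (A16), (A19) and (A23) from Appendix A.1, our reading is that the optimal decoder mean equals the mean of ỹ plus (I − (β/2) C^{-1}) applied to the centered target, and that C has eigenvalues max(λ_i, β/2) in the eigenbasis of S_ỹ. Each eigendirection is therefore scaled by max(0, 1 − β/(2λ_i)). The eigenvalues below are hypothetical, and `shrink_factors` is an illustrative name.

```python
import numpy as np

def shrink_factors(lam, beta):
    """Scaling applied to the centered target along each eigendirection of S_y.

    Derived from (A12), (A19), (A23): 1 - (beta/2) / max(lam_i, beta/2).
    """
    lam = np.asarray(lam, dtype=float)
    return 1.0 - 0.5 * beta / np.maximum(lam, 0.5 * beta)

lam = np.array([4.0, 1.0, 0.1])   # hypothetical eigenvalues of S_y
for beta in (1e-6, 0.1, 1.0, 4.0):
    print(beta, shrink_factors(lam, beta))
```

Directions with λ_i ≤ β/2 are removed entirely and the others are only rescaled, which is why noise that is not aligned with low-variance directions cannot be eliminated by this transformation.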

7. Conclusions

In this study, we explore the VIB in the context of regression problems and analytically derive its optimal solution. We anticipate that this analysis will contribute to the theoretical understanding and improvement of both the VIB and the broader IB framework in the future. Using this analysis, we propose a framework that allows the VIB objective to be learned for all values of

β

in a single training run for regression problems. Additionally, this framework makes the compression mechanism explicit, enabling a more intuitive understanding. Finally, we theoretically justify the optimization performance of this framework and validate its properties through experiments. This method not only offers direct practical applications but also serves as a valuable tool for understanding the behavior and impact of IBs in regression problems. We believe that this framework will facilitate further research and practical advancements in the field of IBs.

Author Contributions

S.K. (Sota Kudo): conceptualization, methodology, software, validation, formal analysis, investigation, writing—original draft; N.O.: investigation, resources, supervision; S.K. (Shigehiko Kanaya): investigation, resources, supervision; M.H.: conceptualization, methodology, writing—review and editing, project administration, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by JSPS KAKENHI, Grant Number 23K11305.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets used in this study are publicly available and described in Section 5.1.1.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Proof of Lemma 1

Proof.

We start with the prediction term of the VIB objective. The quotient set is denoted as

Λ := \{[i] \mid i \in \{1, 2, \dots, N\}\}

. The step from the second to the third equality uses Equation (8).

\begin{aligned}
& \frac{1}{N} \sum_{i = 1}^{N} E_{z \sim p_{ϕ} (Z | x_{i})} [\log q_{ϕ} (y_{i} | z)] \\
&= \frac{1}{N} \sum_{[i] \in Λ} \sum_{j \in [i]} E_{z \sim p_{ϕ} (Z | x_{i})} [\log q_{ϕ} (y_{j} | z)] \\
&= \frac{1}{N} \sum_{[i] \in Λ} E_{z \sim p_{ϕ} (Z | x_{i})} \Big[ \sum_{j \in [i]} - \| v (Z) - y_{j} \|^{2} \Big] - \frac{d}{2} \log π \\
&= \frac{1}{N} \sum_{[i] \in Λ} \Big( E_{z \sim p_{ϕ} (Z | x_{i})} \Big[ \sum_{j \in [i]} - \| v (Z) - \tilde{y}_{j} \|^{2} \Big] + \sum_{j \in [i]} ( \| \tilde{y}_{j} \|^{2} - \| y_{j} \|^{2} ) \Big) - \frac{d}{2} \log π \\
&= \frac{1}{N} \sum_{i = 1}^{N} \Big( E_{z \sim p_{ϕ} (Z | x_{i})} [\log q_{ϕ} (\tilde{y}_{i} | z)] + \| \tilde{y}_{i} \|^{2} - \| y_{i} \|^{2} \Big)
\end{aligned}

(A1)

Furthermore,

\begin{aligned}
& E_{z \sim p_{ϕ} (Z | x_{i})} [\log q_{ϕ} (\tilde{y}_{i} | z)] \\
&= E_{v \sim N (W μ (x_{i}) + b, W Σ (x_{i}) W^{⊤})} [- \| v - \tilde{y}_{i} \|^{2}] - \frac{d}{2} \log π \\
&= 2 \tilde{y}_{i}^{⊤} (W μ (x_{i}) + b) - \mathrm{tr} (W Σ (x_{i}) W^{⊤}) - \| W μ (x_{i}) + b \|^{2} - \| \tilde{y}_{i} \|^{2} - \frac{d}{2} \log π
\end{aligned}

(A2)

The KL divergence in the VIB objective under the model settings in Section 3.2 is represented as

D_{KL} (p_{θ} (Z | x_{i}) \, \| \, r (Z)) = \frac{1}{2} \Big( \mathrm{tr} (Σ (x_{i})) - \log | Σ (x_{i}) | + \| μ (x_{i}) \|_{2}^{2} - κ \Big)

(A3)

From the above, we obtain

\begin{aligned}
& L_{\mathrm{VIB}} (θ, ϕ; β) \\
&= \frac{1}{N} \sum_{i = 1}^{N} \Big[ - \frac{β}{2} \Big( \mathrm{tr} \big( (I + \frac{2}{β} W^{⊤} W) Σ (x_{i}) \big) + \log | Σ (x_{i})^{- 1} | \Big) - \frac{β}{2} μ (x_{i})^{⊤} (I + \frac{2}{β} W^{⊤} W) μ (x_{i}) \\
&\qquad + 2 (\tilde{y}_{i} - b)^{⊤} W μ (x_{i}) - \| \tilde{y}_{i} - b \|^{2} + \| \tilde{y}_{i} \|^{2} - \| y_{i} \|^{2} - \frac{d}{2} \log π + \frac{β}{2} κ \Big]
\end{aligned}

(A4)

Here, we use the following proposition.

If matrix B is positive definite, then

B = \underset{D ≻ 0}{argmin} (tr (B D^{- 1}) + log | D |)

(A5)

One can refer to Lemma 2 in [50] for the proof. From

I + \frac{2}{β} W^{⊤} W ≻ 0

, by maximizing Equation (A4) with respect to

Σ (x_{i})

, we obtain

\hat{Σ} (x_{i}) = \Big( I + \frac{2}{β} W^{⊤} W \Big)^{- 1}

(A6)

By completing the square with respect to the quadratic form of

μ (x_{i})

, we obtain

μ (x_{i})

that maximizes the VIB objective.

\hat{μ} (x_{i}) = \frac{2}{β} \hat{Σ} (x_{i}) W^{⊤} ({\tilde{y}}_{i} - b)

(A7)

Here, the above is feasible because the same

\tilde{y}

is associated with the same input x (i.e., a deterministic scenario), from the definition of

\tilde{y}

.
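Equations (A6) and (A7) can be checked numerically with a short sketch. It evaluates only the terms of Equation (A4) that depend on Σ(x_i) and μ(x_i) for a single sample, and compares the closed-form maximizer against random perturbations. The dimensions, the value of β, and the random W and residual are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa, beta = 2, 3, 0.5            # hypothetical sizes and trade-off value
W = rng.normal(size=(d, kappa))       # decoder weight, d x kappa
r = rng.normal(size=d)                # stands in for y_tilde_i - b
A = np.eye(kappa) + (2.0 / beta) * W.T @ W

def per_sample_terms(Sigma, mu):
    """Sigma- and mu-dependent terms of Equation (A4) for one sample."""
    _, logdet = np.linalg.slogdet(Sigma)
    # log|Sigma^{-1}| = -log|Sigma|
    sigma_part = -0.5 * beta * (np.trace(A @ Sigma) - logdet)
    mu_part = -0.5 * beta * mu @ A @ mu + 2.0 * r @ W @ mu
    return sigma_part + mu_part

Sigma_hat = np.linalg.inv(A)                       # Equation (A6)
mu_hat = (2.0 / beta) * Sigma_hat @ W.T @ r        # Equation (A7)
best = per_sample_terms(Sigma_hat, mu_hat)

perturbed = []
for _ in range(100):
    E = 0.1 * rng.normal(size=(kappa, kappa))
    Sigma = Sigma_hat + E @ E.T        # stays positive definite
    mu = mu_hat + 0.1 * rng.normal(size=kappa)
    perturbed.append(per_sample_terms(Sigma, mu))
max_perturbed = max(perturbed)

print(best, max_perturbed)
```

The perturbation of Σ only explores positive semidefinite increments, so this is a sanity check rather than a proof; the concavity of the objective in Σ and μ is what guarantees a global maximum.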

From these, we obtain

\begin{aligned}
& L_{\mathrm{VIB}} (\hat{μ}, \hat{Σ}, ϕ; β) \\
&= \frac{1}{N} \sum_{i = 1}^{N} \Big[ (\tilde{y}_{i} - b)^{⊤} \Big( \frac{2}{β} W \hat{Σ} (x_{i}) W^{⊤} - I \Big) (\tilde{y}_{i} - b) - \frac{β}{2} \log | \hat{Σ} (x_{i})^{- 1} | + \| \tilde{y}_{i} \|^{2} - \| y_{i} \|^{2} - \frac{d}{2} \log π \Big]
\end{aligned}

(A8)

Considering singular value decomposition, we have

W = : U \tilde{D} V^{⊤}

, where

U \in R^{d \times d}, V \in R^{κ \times κ}

are orthogonal matrices. In this case,

\tilde{D}

is represented as

\tilde{D} = \begin{bmatrix} δ_{1} & & 0 & 0 & \cdots & 0 \\ & \ddots & & \vdots & & \vdots \\ 0 & & δ_{d} & 0 & \cdots & 0 \end{bmatrix} \in R^{d \times κ}

(A9)

We have

\hat{Σ} (x_{i}) = \Big( V (I + \frac{2}{β} \tilde{D}^{⊤} \tilde{D}) V^{⊤} \Big)^{- 1} = V \hat{D} V^{⊤}

(A10)

where

\hat{D} : = diag (\frac{1}{1 + 2 β^{- 1} δ_{1}^{2}}, \dots, \frac{1}{1 + 2 β^{- 1} δ_{d}^{2}}, 1, \dots, 1) \in R^{κ \times κ}

(A11)

Then, we calculate

\frac{2}{β} W \hat{Σ} (x_{i}) W^{⊤} - I

and

log | \hat{Σ} {(x_{i})}^{- 1} |

in Equation (A8) as follows.

\begin{aligned}
\frac{2}{β} W \hat{Σ} (x_{i}) W^{⊤} - I &= \frac{2}{β} U \tilde{D} \hat{D} \tilde{D}^{⊤} U^{⊤} - I \\
&= - U \Big( - \frac{2}{β} \tilde{D} \hat{D} \tilde{D}^{⊤} + I \Big) U^{⊤} \\
&= - U \Big( \frac{2}{β} \tilde{D} \tilde{D}^{⊤} + I \Big)^{- 1} U^{⊤} \\
&= - \frac{β}{2} \Big( W W^{⊤} + \frac{β}{2} I \Big)^{- 1}
\end{aligned}

(A12)

\begin{aligned}
\log | \hat{Σ} (x_{i})^{- 1} | &= \log | I + \frac{2}{β} \tilde{D}^{⊤} \tilde{D} | \\
&= \log | I + \frac{2}{β} \tilde{D} \tilde{D}^{⊤} | \\
&= \log | U (I + \frac{2}{β} \tilde{D} \tilde{D}^{⊤}) U^{⊤} | \\
&= \log | W W^{⊤} + \frac{β}{2} I | + d \log \frac{2}{β}
\end{aligned}

(A13)
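The two identities in Equations (A12) and (A13) are easy to confirm numerically. The sketch below compares the first and last expressions of each chain for a random W; the sizes and β are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, kappa, beta = 3, 5, 0.7            # hypothetical sizes and trade-off value
W = rng.normal(size=(d, kappa))

Sigma_hat = np.linalg.inv(np.eye(kappa) + (2.0 / beta) * W.T @ W)   # (A6)
C = W @ W.T + 0.5 * beta * np.eye(d)                                # (A15)

# Equation (A12): (2/beta) W Sigma_hat W^T - I  versus  -(beta/2) C^{-1}
lhs_a12 = (2.0 / beta) * W @ Sigma_hat @ W.T - np.eye(d)
rhs_a12 = -0.5 * beta * np.linalg.inv(C)

# Equation (A13): log|Sigma_hat^{-1}|  versus  log|C| + d log(2/beta)
lhs_a13 = -np.linalg.slogdet(Sigma_hat)[1]
rhs_a13 = np.linalg.slogdet(C)[1] + d * np.log(2.0 / beta)

print(np.allclose(lhs_a12, rhs_a12), np.isclose(lhs_a13, rhs_a13))
```

Both comparisons involve a κ × κ quantity on one side and a d × d quantity on the other, which is the reason the final objective no longer depends on κ.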

Substituting Equations (A12) and (A13) into Equation (A8), we have

L_{\mathrm{VIB}} (\hat{μ}, \hat{Σ}, ϕ; β) = - \frac{β}{2} \Big( \frac{1}{N} \sum_{i = 1}^{N} (\tilde{y}_{i} - b)^{⊤} C^{- 1} (\tilde{y}_{i} - b) + \log | C | \Big) + a

(A14)

where

C : = W W^{⊤} + \frac{β}{2} I \in R^{d \times d}

(A15)

and a is a constant that does not depend on the parameters or on

κ

. Since

C^{- 1}

is positive definite, the objective is maximized with respect to b at

\hat{b} := \frac{1}{N} \sum_{i = 1}^{N} \tilde{y}_{i} = m_{\tilde{y}}

(A16)

Substituting

\hat{b}

into Equation (A14) gives

L_{V I B} (\hat{μ}, \hat{Σ}, \hat{b}, W; β) = - \frac{β}{2} (tr (C^{- 1} S_{\tilde{y}}) + log | C |) + a

(A17)

Finally, we consider the optimal W. While the representation above indicates that this is the same situation as in the maximum likelihood estimate of probabilistic PCA [61], we introduce another method to find it.

We utilize Ruhe’s trace inequality.

Let A and B be

n \times n

positive semidefinite matrices and

λ_{i}

be a function that maps a matrix to its i-th smallest eigenvalue. Then,

tr (A B) \geq \sum_{i = 1}^{n} λ_{i} (A) λ_{n - i + 1} (B)

(A18)

Refer to H.1.h. in [62] for the proof.
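A quick numerical sanity check of Ruhe's trace inequality (A18) is sketched below on random positive semidefinite matrices. The matrix size and the number of trials are arbitrary, and `ruhe_lower_bound` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_psd(n):
    """Random positive semidefinite matrix M M^T."""
    M = rng.normal(size=(n, n))
    return M @ M.T

def ruhe_lower_bound(A, B):
    """Right-hand side of (A18): smallest eigenvalues of A paired with largest of B."""
    a = np.linalg.eigvalsh(A)          # ascending: lambda_1 <= ... <= lambda_n
    b = np.linalg.eigvalsh(B)[::-1]    # descending: lambda_n, ..., lambda_1
    return float(np.sum(a * b))

gaps = []
for _ in range(1000):
    A, B = random_psd(4), random_psd(4)
    gaps.append(np.trace(A @ B) - ruhe_lower_bound(A, B))
min_gap = min(gaps)

print(min_gap)
```

The bound is attained when A and B share eigenvectors with their eigenvalues in opposite order, which is the situation exploited in the proof by setting U = P.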

Using orthogonal diagonalization, we denote

S_{\tilde{y}} = : P diag (λ_{1}, λ_{2} \dots, λ_{d}) P^{⊤}

, with

0 \leq λ_{1} \leq λ_{2} \dots \leq λ_{d}

. Without loss of generality, we assume

| δ_{1} | \leq | δ_{2} | \dots \leq | δ_{d} |

in Equation (A9). We find the optimal U,

δ_{i}

and V to derive the optimal W.

C = U \, \mathrm{diag} \Big( \big[ \frac{β}{2} + δ_{i}^{2} \big]_{i} \Big) U^{⊤}

(A19)

C is independent of V, indicating that V does not affect the objective.

\begin{aligned}
\log | C | &= \log \prod_{i = 1}^{d} \Big( \frac{β}{2} + δ_{i}^{2} \Big) \\
&= \sum_{i = 1}^{d} \log \Big( \frac{β}{2} + δ_{i}^{2} \Big)
\end{aligned}

(A20)

Since

log | C |

is independent of U, it is sufficient to consider minimizing

tr (C^{- 1} S_{\tilde{y}})

for the optimal U. Using Ruhe's trace inequality, we show below that the trace is minimized when

U = P

. Indeed, when

U = P

,

\begin{aligned}
\mathrm{tr} (C^{- 1} S_{\tilde{y}}) &= \mathrm{tr} \Big( U \, \mathrm{diag} \Big( \big[ (\frac{β}{2} + δ_{i}^{2})^{- 1} \big]_{i} \Big) U^{⊤} P \, \mathrm{diag} ( [λ_{i}]_{i} ) P^{⊤} \Big) \\
&= \sum_{i = 1}^{d} \frac{λ_{i}}{\frac{β}{2} + δ_{i}^{2}}
\end{aligned}

(A21)

This attains the lower bound in Ruhe's trace inequality (A18), so no other choice of U gives a smaller trace. In this case, the objective is

L_{V I B} (\hat{μ}, \hat{Σ}, \hat{b}, U = P, {[δ_{i}]}_{i}; β) = - \frac{β}{2} \sum_{i = 1}^{d} (\frac{λ_{i}}{\frac{β}{2} + δ_{i}^{2}} + log (\frac{β}{2} + δ_{i}^{2})) + a

(A22)

Here, consider a function

f (w) : = \frac{λ}{w} + log w

for

w > 0

. This function monotonically decreases for

w \leq λ

, achieving the minimum at

w = λ

, and monotonically increases for

w \geq λ

. Given this, we can find the optimal

δ

as follows. If

λ_{i} > \frac{β}{2}

, maximizing the objective with respect to

δ_{i}

requires

\frac{β}{2} + δ_{i}^{2} = λ_{i}

, indicating that

δ_{i} = \pm \sqrt{λ_{i} - \frac{β}{2}}

. If

λ_{i} \leq \frac{β}{2}

,

δ_{i} = 0

maximizes the objective. In summary,

δ_{i} = \pm \sqrt{max (0, λ_{i} - \frac{β}{2})}

(A23)

maximizes the objective, and this meets the assumption of

| δ_{1} | \leq | δ_{2} | \dots \leq | δ_{d} |

. Even if the representation of

S_{\tilde{y}} = : P diag (λ_{1}, λ_{2} \dots, λ_{d}) P^{⊤}

does not satisfy

0 \leq λ_{1} \leq λ_{2} \dots \leq λ_{d}

, setting

U = P

and Equation (A23) result in the same objective value. Additionally, substituting Equation (A23) into Equation (A22) shows that

{max}_{θ, ϕ} L (θ, ϕ; β)

is independent of the value of

κ \geq d

. Thus, we consider the case where

κ = d

. From the above, the optimal W is

W = P diag ([{\pm \sqrt{max (0, λ_{i} - \frac{β}{2})}]}_{i}) R

(A24)

where

S_{\tilde{y}} = : P diag ([{λ_{i}]}_{i}) P^{⊤}

by orthogonal diagonalization and R is an arbitrary orthogonal matrix. □
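The closed form in Equation (A24) can be sketched in a few lines. The example below builds W from the eigendecomposition of a sample covariance, taking R = I, κ = d and the positive sign in (A23), and compares the reduced objective of Equation (A17) (without the constant a) against random candidates. The data, β, and function names are hypothetical.

```python
import numpy as np

def optimal_W(S, beta):
    """Equation (A24) with R = I, kappa = d and the + sign in (A23)."""
    lam, P = np.linalg.eigh(S)                              # ascending eigenvalues
    delta = np.sqrt(np.maximum(0.0, lam - 0.5 * beta))      # Equation (A23)
    return P @ np.diag(delta)

def reduced_objective(W, S, beta):
    """-(beta/2) (tr(C^{-1} S) + log|C|), i.e. (A17) without the constant a."""
    C = W @ W.T + 0.5 * beta * np.eye(S.shape[0])
    return -0.5 * beta * (np.trace(np.linalg.solve(C, S)) + np.linalg.slogdet(C)[1])

rng = np.random.default_rng(2)
# Hypothetical targets with very different variances per coordinate.
Y = rng.normal(size=(200, 3)) * np.array([2.0, 1.0, 0.1])
S = np.cov(Y, rowvar=False, bias=True)
beta = 1.0

W_star = optimal_W(S, beta)
best = reduced_objective(W_star, S, beta)
random_best = max(reduced_objective(rng.normal(size=(3, 3)), S, beta) for _ in range(1000))

print(best, random_best, np.linalg.matrix_rank(W_star))
```

Directions whose eigenvalue falls below β/2 receive δ_i = 0, so the rank of the optimal W drops as β grows; in this toy setting the low-variance coordinate is discarded.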

Appendix A.2. Proof of Theorem 1

Proof.

From Equation (A4), we obtain

\begin{aligned}
& L_{\mathrm{VIB}} (\tilde{μ}_{ψ, β}, \tilde{Σ}_{β}, \tilde{W}_{β}, \tilde{b}; β) \\
&= - \frac{2}{N β} \sum_{i = 1}^{N} \Big[ \big( h_{ψ} (x_{i}) - (\tilde{y}_{i} - m_{\tilde{y}}) \big)^{⊤} \tilde{W}_{β} \tilde{Σ}_{β} (x_{i}) \tilde{W}_{β}^{⊤} \big( h_{ψ} (x_{i}) - (\tilde{y}_{i} - m_{\tilde{y}}) \big) \Big] \\
&\qquad + g_{β} (\tilde{Σ}_{β}, \tilde{W}_{β}, \tilde{b})
\end{aligned}

(A25)

where

g_{β}

is a function of

Σ

, W and

b

, meaning that it is independent of

h_{ψ} (x_{i})

. From Lemma 1, substituting

{\tilde{y}}_{i} - m_{\tilde{y}}

into

h_{ψ} (x_{i})

in Equation (A25) results in

{max}_{ϕ, θ} L_{V I B} (ϕ, θ; β)

. Thus,

\begin{aligned}
& L_{\mathrm{VIB}} (\tilde{μ}_{β, ψ_{t}}, \tilde{Σ}_{β}, \tilde{W}_{β}; β) - \max_{ϕ, θ} L_{\mathrm{VIB}} (ϕ, θ; β) \\
&= - \frac{2}{N β} \sum_{i = 1}^{N} \Big[ \big( h_{ψ_{t}} (x_{i}) - (\tilde{y}_{i} - m_{\tilde{y}}) \big)^{⊤} \tilde{W}_{β} \tilde{Σ}_{β} (x_{i}) \tilde{W}_{β}^{⊤} \big( h_{ψ_{t}} (x_{i}) - (\tilde{y}_{i} - m_{\tilde{y}}) \big) \Big]
\end{aligned}

(A26)

If

{lim}_{t \to \infty} J_{F V I B - R} (ψ_{t}) = 0

, then, for all

i = 1, 2 \dots, N

,

h_{ψ_{t}} (x_{i}) \to ({\tilde{y}}_{i} - m_{\tilde{y}})

, indicating that

L_{V I B} ({\tilde{μ}}_{β, ψ_{t}}, {\tilde{Σ}}_{β}, {\tilde{W}}_{β}; β) \to {max}_{ϕ, θ} L_{V I B} (ϕ, θ; β)

as

t \to \infty

. □

Appendix A.3. Proof of Theorem 2

Proof.

Let

h_{ψ, \mathrm{all}} \in R^{d N}

be the vector concatenating

h_{ψ} (x_{i})

, i.e.,

h_{ψ, \mathrm{all}} := [h_{ψ} (x_{1})^{⊤} \dots h_{ψ} (x_{N})^{⊤}]^{⊤}

, and let

\hat{h}_{\mathrm{all}}

be the vector formed similarly by concatenating

\hat{h}_{i} := \tilde{y}_{i} - m_{\tilde{y}}

. For

β > 0

,

\tilde{W}_{β} \tilde{Σ}_{β} (x_{i}) \tilde{W}_{β}^{⊤}

is positive semidefinite. Thus, from Equation (A25), for

β > 0

,

L_{\mathrm{VIB}} (\tilde{μ}_{β, ψ}, \tilde{Σ}_{β}, \tilde{W}_{β}; β)

is concave as a function of

h_{ψ, \mathrm{all}}

and its global maximizer is

\hat{h}_{\mathrm{all}}

. Furthermore,

J_{\mathrm{FVIB-R}} (ψ) = - \frac{1}{N} \| h_{ψ, \mathrm{all}} - \hat{h}_{\mathrm{all}} \|_{2}^{2}

, and, given the assumption,

\{h_{ψ, \mathrm{all}} \mid ψ \in Ψ\} = R^{d N}

. It therefore suffices to show the following statement, applied with f taken as the negated VIB objective.

Let

f : R^{n} \to R

be convex with

\hat{x} \in R^{n}

as its global minimizer. If

0 \leq a_{1} < a_{2}

, then

\max_{x \in R^{n} : \| x - \hat{x} \| = a_{1}} f (x) \leq \max_{x \in R^{n} : \| x - \hat{x} \| = a_{2}} f (x)

.

For

a_{1} = 0

, this statement holds true immediately. Now, consider the case where

a_{1} > 0

. For any

x_{1}

such that

| | x_{1} - \hat{x} | | = a_{1}

, define

x_{2} = \hat{x} + \frac{a_{2}}{a_{1}} (x_{1} - \hat{x})

(A27)

so that

| | x_{2} - \hat{x} | | = a_{2}

. Let

λ : = \frac{a_{1}}{a_{2}}

; then,

0 < λ < 1

and

x_{1} = λ x_{2} + (1 - λ) \hat{x}

(A28)

Since f is convex,

f (x_{1}) \leq λ f (x_{2}) + (1 - λ) f (\hat{x}) \leq \max \{f (x_{2}), f (\hat{x})\} = f (x_{2})

(A29)

Hence, the assertion is proven. □
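The radial scaling argument of Equations (A27) to (A29) can be illustrated numerically. The sketch below uses a convex quadratic as a stand-in for f (an assumption made only for this example), maps random points on the inner sphere to the outer sphere via (A27), and checks that the function value does not decrease.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.normal(size=(n, n))
Q = M @ M.T                      # positive semidefinite, so f below is convex
x_hat = rng.normal(size=n)       # global minimizer of f

def f(x):
    """Convex quadratic with minimum value 0 at x_hat (illustrative choice)."""
    diff = x - x_hat
    return float(diff @ Q @ diff)

a1, a2 = 0.5, 2.0
ok = True
for _ in range(1000):
    u = rng.normal(size=n)
    x1 = x_hat + a1 * u / np.linalg.norm(u)       # point on the sphere of radius a1
    x2 = x_hat + (a2 / a1) * (x1 - x_hat)         # Equation (A27)
    on_outer_sphere = np.isclose(np.linalg.norm(x2 - x_hat), a2)
    ok = ok and bool(on_outer_sphere) and f(x1) <= f(x2)

print(ok)
```

Because every point on the inner sphere is dominated by a point on the outer sphere, the maximum over the outer sphere cannot be smaller, which is the statement used in the proof.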

References

1. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
2. Lemaréchal, C. Lagrangian relaxation. In Computational Combinatorial Optimization: Optimal or Provably Near-Optimal Solutions; Springer: Berlin/Heidelberg, Germany, 2001; pp. 112–156.
3. Gilad-Bachrach, R.; Navot, A.; Tishby, N. An information theoretic tradeoff between complexity and accuracy. In Proceedings of the Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, 24–27 August 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609.
4. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. Adv. Neural Inf. Process. Syst. 2003, 16, 165–188.
5. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.
6. Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear information bottleneck. Entropy 2019, 21, 1181.
7. Ma, W.D.K.; Lewis, J.; Kleijn, W.B. The HSIC bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5085–5092.
8. Yu, X.; Yu, S.; Príncipe, J.C. Deep deterministic information bottleneck with matrix-based entropy functional. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3160–3164.
9. Pan, Z.; Niu, L.; Zhang, J.; Zhang, L. Disentangled information bottleneck. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 9285–9293.
10. Yu, S.; Yu, X.; Løkse, S.; Jenssen, R.; Principe, J.C. Cauchy-Schwarz Divergence Information Bottleneck for Regression. arXiv 2024, arXiv:2404.17951.
11. Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630.
12. Fischer, I. The conditional entropy bottleneck. Entropy 2020, 22, 999.
13. Piran, Z.; Shwartz-Ziv, R.; Tishby, N. The dual information bottleneck. arXiv 2020, arXiv:2006.04641.
14. Wang, Z.; Huang, S.L.; Kuruoglu, E.E.; Sun, J.; Chen, X.; Zheng, Y. Pac-bayes information bottleneck. arXiv 2021, arXiv:2109.14509.
15. An, S.; Jammalamadaka, N.; Chong, E. Maximum entropy information bottleneck for uncertainty-aware stochastic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3809–3818.
16. Kawaguchi, K.; Deng, Z.; Ji, X.; Huang, J. How Does Information Bottleneck Help Deep Learning? Int. Conf. Mach. Learn. 2023, 202, 16049–16096.
17. Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711.
18. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905.
19. Alemi, A.A.; Fischer, I.; Dillon, J.V. Uncertainty in the variational information bottleneck. arXiv 2018, arXiv:1807.00906.
20. Ahuja, K.; Caballero, E.; Zhang, D.; Gagnon-Audet, J.C.; Bengio, Y.; Mitliagkas, I.; Rish, I. Invariance principle meets information bottleneck for out-of-distribution generalization. Adv. Neural Inf. Process. Syst. 2021, 34, 3438–3450.
21. Li, B.; Shen, Y.; Wang, Y.; Zhu, W.; Li, D.; Keutzer, K.; Zhao, H. Invariant information bottleneck for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 7399–7407.
22. Ngampruetikorn, V.; Schwab, D.J. Information bottleneck theory of high-dimensional regression: Relevancy, efficiency and optimality. Adv. Neural Inf. Process. Syst. 2022, 35, 9784–9796.
23. Guo, L.; Wu, H.; Wang, Y.; Zhou, W.; Zhou, T. IB-UQ: Information bottleneck based uncertainty quantification for neural function regression and neural operator learning. J. Comput. Phys. 2024, 510, 113089.
24. Yang, H.; Sun, Z.; Xu, H.; Chen, X. Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck. arXiv 2024, arXiv:2405.15505.
25. Yan, X.; Lou, Z.; Hu, S.; Ye, Y. Multi-task information bottleneck co-clustering for unsupervised cross-view human action categorization. ACM Trans. Knowl. Discov. Data (TKDD) 2020, 14, 1–23.
26. Hu, S.; Wang, R.; Ye, Y. Interactive information bottleneck for high-dimensional co-occurrence data clustering. Appl. Soft Comput. 2021, 111, 107837.
27. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–5.
28. Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810.
29. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020.
30. Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the information bottleneck. In Proceedings of the Uncertainty in Artificial Intelligence, Virtual, 3–6 August 2020; pp. 1050–1060.
31. Rodríguez Gálvez, B.; Thobaben, R.; Skoglund, M. The convex information bottleneck lagrangian. Entropy 2020, 22, 98.
32. Kudo, S.; Ono, N.; Kanaya, S.; Huang, M. Flexible Variational Information Bottleneck: Achieving Diverse Compression with a Single Training. arXiv 2024, arXiv:2402.01238.
33. Wu, T.; Fischer, I. Phase transitions for the information bottleneck in representation learning. arXiv 2020, arXiv:2001.01878.
34. Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. arXiv 2018, arXiv:1808.07593.
35. Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2225–2239.
36. Alabdulmohsin, I. An information-theoretic route from generalization in expectation to generalization in probability. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 92–100.
37. Nachum, I.; Yehudayoff, A. Average-case information complexity of learning. In Proceedings of the Algorithmic Learning Theory, Chicago, IL, USA, 22–24 March 2019; pp. 633–646.
38. Negrea, J.; Haghifam, M.; Dziugaite, G.K.; Khisti, A.; Roy, D.M. Information-theoretic generalization bounds for SGLD via data-dependent estimates. Adv. Neural Inf. Process. Syst. 2019, 32.
39. Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening mutual information-based bounds on generalization error. IEEE J. Sel. Areas Inf. Theory 2020, 1, 121–130.
40. Steinke, T.; Zakynthinou, L. Reasoning about generalization via conditional mutual information. In Proceedings of the Conference on Learning Theory, Graz, Austria, 9–12 July 2020; pp. 3437–3452.
41. Haghifam, M.; Negrea, J.; Khisti, A.; Roy, D.M.; Dziugaite, G.K. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Adv. Neural Inf. Process. Syst. 2020, 33, 9925–9935.
42. Neu, G.; Dziugaite, G.K.; Haghifam, M.; Roy, D.M. Information-theoretic generalization bounds for stochastic gradient descent. In Proceedings of the Conference on Learning Theory, Boulder, CO, USA, 15–19 August 2021; pp. 3526–3545.
43. Aminian, G.; Bu, Y.; Toni, L.; Rodrigues, M.; Wornell, G. An exact characterization of the generalization error for the Gibbs algorithm. Adv. Neural Inf. Process. Syst. 2021, 34, 8106–8118.
44. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
45. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
46. Alemi, A.; Poole, B.; Fischer, I.; Dillon, J.; Saurous, R.A.; Murphy, K. Fixing a broken ELBO. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 159–168.
47. Tschannen, M.; Bachem, O.; Lucic, M. Recent advances in autoencoder-based representation learning. arXiv 2018, arXiv:1812.05069.
48. Lucas, J.; Tucker, G.; Grosse, R.B.; Norouzi, M. Don’t blame the elbo! A linear vae perspective on posterior collapse. Adv. Neural Inf. Process. Syst. 2019, 32.
49. Kumar, A.; Poole, B. On Implicit Regularization in β-VAEs. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5480–5490.
50. Sicks, R.; Korn, R.; Schwaar, S. A Generalised Linear Model Framework for β-Variational Autoencoders based on Exponential Dispersion Families. J. Mach. Learn. Res. 2021, 22, 1–41.
51. Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; Tucker, G. On variational bounds of mutual information. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5171–5180.
52. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
54. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
55. Pace, R.K.; Barry, R. Sparse spatial autoregressions. Stat. Probab. Lett. 1997, 33, 291–297.
56. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
57. Thames, Q.; Karpur, A.; Norris, W.; Xia, F.; Panait, L.; Weyand, T.; Sim, J. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8903–8911.
58. Parinayok, S.; Yamakata, Y.; Aizawa, K. Open-vocabulary segmentation approach for transformer-based food nutrient estimation. In Proceedings of the ACM Multimedia Asia 2023, New York, NY, USA, 6–8 December 2023.
59. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
60. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
61. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Series B Stat. Methodol. 1999, 61, 611–622.
62. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications; Academic Press: Cambridge, MA, USA, 1979.

Figure 1. VIB objective values of VIB and FVIB-R during training on California housing prices dataset.

Figure 2. VIB objective values of VIB and FVIB-R during training on Nutrition5k dataset.

Figure 3. Plots of compression versus prediction obtained by FVIB-R and VIB in California housing prices dataset. Units are nat.

Figure 4. Plots of compression versus prediction obtained by FVIB-R and sq-VIB in California housing prices dataset. Units are nat.

Figure 5. Plots of compression versus prediction obtained by FVIB-R and VIB in Nutrition5k dataset. Units are nat.

Figure 6. Plots of compression versus prediction obtained by FVIB-R and sq-VIB in Nutrition5k dataset. Units are nat.

Figure 7. The expected KL divergence between the true y and its prediction is computed for each

β

. A

95 %

confidence interval, obtained from 30 trials, is also shown. Units are nat.

Figure 8. The true distribution of y and its prediction given each x. The standard deviation is also shown for both distributions. Units are nat.

Table 1. Experimental setup. In the description of the feature extractor, the subscripts of each module indicate the output dimension.

	California Housing Prices	Nutrition5k	Synthetic Dataset
Feature extractor	$x \in R^{8}$	$x \in R^{3 \times 480 \times 640}$	$x \in R$
	$\to \mathrm{Linear}_{128}$	$\to \mathrm{ResNet18}_{κ \text{ or } 2κ}$ [53]	$\to \mathrm{Linear}_{128}$
	$\to \mathrm{ReLU} \to \mathrm{Linear}_{128}$		$\to \mathrm{ReLU} \to \mathrm{Linear}_{128}$
	$\to \mathrm{ReLU} \to \mathrm{Linear}_{κ \text{ or } 2κ}$		$\to \mathrm{ReLU} \to \mathrm{Linear}_{128}$
			$\to \mathrm{ReLU} \to \mathrm{Linear}_{128}$
			$\to \mathrm{ReLU} \to \mathrm{Linear}_{1}$
$κ$ in non-FVIB-R models	32	256	−
# Epochs	{50, 100, 150, 200}	{50, 100, 150, 200}	200
Learning rate	{ $1.0 \times 10^{- 4}$ , $1.0 \times 10^{- 3}$ }	{ $1.0 \times 10^{- 4}$ , $1.0 \times 10^{- 3}$ }	$1.0 \times 10^{- 3}$
Optimizer	Adam [54]	Adam	Adam

Table 2. Mean squared error and the number of parameters of FVIB-R compared to other IB methods.

	California Housing Prices		Nutrition5k
Method	MSE	# Parameters	MSE	# Parameters
FVIB-R	0.1910	$1.8 \times 10^{4} \times 1$	0.0881	$1.1 \times 10^{7} \times 1$
VIB	0.1979	$2.6 \times 10^{4} \times 15$	0.0805	$1.1 \times 10^{7} \times 15$
sq-VIB	0.1786	$2.6 \times 10^{4} \times 15$	0.0893	$1.1 \times 10^{7} \times 15$
NIB	0.1816	$2.2 \times 10^{4} \times 15$	0.1549	$1.1 \times 10^{7} \times 15$
sq-NIB	0.1790	$2.2 \times 10^{4} \times 15$	0.1431	$1.1 \times 10^{7} \times 15$

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
