# Nonlinear Information Bottleneck

## Abstract

## 1. Introduction

- We represent the distribution over X and Y using a finite number of data samples.
- We represent the encoding map $p(m|x)$ and the decoding map $p(y|m)$ as parameterized conditional distributions.
- We use a variational lower bound for the prediction term $I(Y;M)$ and a non-parametric upper bound for the compression term $I(X;M)$, which we developed in earlier work [35].
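
To make the compression bound concrete, the following is a minimal sketch of one pairwise-distance upper bound on $I(X;M)$. It assumes, which this section does not state, that the encoder adds isotropic Gaussian noise with variance `sigma2` to a deterministic bottleneck activation, so that the KL divergence between $p(m|x_i)$ and $p(m|x_j)$ reduces to a scaled squared distance. The function name `kde_upper_bound_ixm` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def kde_upper_bound_ixm(h, sigma2):
    """Pairwise-distance upper bound on I(X;M), in nats.

    Assumption (not from the source text): M = h(x) + noise, with
    noise ~ N(0, sigma2 * I), and X represented by the N rows of `h`.
    """
    h = np.asarray(h, dtype=float)
    # Squared Euclidean distances between all pairs of noiseless activations.
    sq_norms = np.sum(h ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (h @ h.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    # For equal-covariance Gaussians, KL(p(m|x_i) || p(m|x_j)) = d2 / (2 * sigma2).
    log_k = -d2 / (2.0 * sigma2)
    # Stable computation of log( (1/N) * sum_j exp(log_k[i, j]) ) for each i.
    row_max = log_k.max(axis=1, keepdims=True)
    log_mean = row_max[:, 0] + np.log(np.mean(np.exp(log_k - row_max), axis=1))
    return -np.mean(log_mean)
```

Under these assumptions the estimate lies between 0 (all activations coincide) and $\log N$ (all activations are far apart relative to the noise scale), which gives a quick sanity check on any implementation.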

## 2. Proposed Approach

## 3. Relation to Prior Work

#### 3.1. Variational IB

#### 3.2. Neural Networks and Kernel Density Entropy Estimates

#### 3.3. Auto-Encoders

## 4. Experiments

#### 4.1. Implementation

https://github.com/artemyk/nonlinearIB

https://github.com/burklight/nonlinear-IB-PyTorch

#### 4.2. Results

`scikit-learn` package [58]). It consists of $N=20,640$ total samples, with one dependent variable (the house price) and 8 independent variables (such as “longitude”, “latitude”, and “number of rooms”). We used the log-transformed house price as the dependent variable Y, which made the distribution of Y closer to a Gaussian. To prepare the training and testing data, we first dropped 992 samples in which the house price was equal to or greater than \$500,000 (prices were clipped at this upper value in the dataset, which distorted the distribution of the dependent variable). We then randomly split the remaining samples into an 80% training and 20% testing dataset (the training dataset was then further split into the actual training dataset and a validation dataset, see above).
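
The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name `prepare_housing`, the random seed, and the use of the natural logarithm are hypothetical choices, and prices are assumed to be given in dollars. The further split of the training data into training and validation sets is not shown.

```python
import numpy as np

def prepare_housing(features, price, cap=500_000.0, test_frac=0.2, seed=0):
    """Drop clipped samples, log-transform the target, and split 80/20.

    Assumptions (not from the source text): prices are in dollars, the
    log is natural, and the split uses a seeded random permutation.
    """
    features = np.asarray(features, dtype=float)
    price = np.asarray(price, dtype=float)
    # Drop samples whose price is equal to or greater than the clipping value.
    keep = price < cap
    x = features[keep]
    y = np.log(price[keep])  # log-transformed dependent variable
    # Random split of the remaining samples into training and testing sets.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return (x[train_idx], y[train_idx]), (x[test_idx], y[test_idx])

# Usage on a small synthetic array (not the real dataset).
demo_x = np.arange(20, dtype=float).reshape(10, 2)
demo_price = np.array([1e5, 2e5, 5e5, 3e5, 1.5e5, 5e5, 2.5e5, 4e5, 4.5e5, 3.5e5])
(train_x, train_y), (test_x, test_y) = prepare_housing(demo_x, demo_price)
```

If the data are loaded from a source that stores prices in other units, the `cap` argument would need to be rescaled to match.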

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

**Figure 1.** **Top row**: Info-plane diagrams for nonlinear IB and variational IB (VIB) on the MNIST training (**left**) and testing (**right**) data. The solid lines indicate means across five runs, and the shaded regions indicate the standard error of the mean. The black dashed line is the data-processing inequality bound $I(Y;M)\le I(X;M)$; the black dotted line indicates the value of $I(Y;M)$ achieved by a baseline model trained only to optimize cross-entropy. **Bottom row**: Principal component analysis (PCA) projection of bottleneck layer activity (on testing data, no noise) for models trained with regular cross-entropy loss (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. The location of the nonlinear IB and VIB models shown in the bottom row is indicated by the green vertical line in the top right panel.

**Figure 2.** **Top row**: Info-plane diagrams for nonlinear IB and VIB on the FashionMNIST dataset. **Bottom row**: PCA projection of bottleneck layer activations for models trained only to optimize cross-entropy (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. See caption of Figure 1 for details.

**Figure 3.** **Top row**: Information plane diagrams for nonlinear IB and VIB on the California housing prices dataset. **Bottom row**: PCA projection of bottleneck layer activations for models trained only to optimize mean squared error (MSE) (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. See caption of Figure 1 for details.

**Table 1.** Amount of prediction $I(Y;M)$ achieved at compression level $I(X;M)=\log 10$ for both nonlinear IB and VIB.

Dataset | Split | Nonlinear IB | VIB |
---|---|---|---|
MNIST | Training | 3.22 | 3.09 |
MNIST | Testing | 2.99 | 2.88 |
FashionMNIST | Training | 2.85 | 2.67 |
FashionMNIST | Testing | 2.58 | 2.46 |
California housing | Training | 1.37 | 1.26 |
California housing | Testing | 1.13 | 1.07 |

