# Exact and Soft Successive Refinement of the Information Bottleneck


## Abstract


## 1. Introduction

#### 1.1. Conceptualisation and Organisation Outline

#### 1.2. Related Work

#### 1.3. Technical Preliminaries

#### 1.3.1. Notations and Conventions

#### 1.3.2. General Facts and Notions

- A bottleneck must saturate the information constraint, i.e., solutions $T$ to (1) must satisfy ${I}_{\lambda}(X;T)=\lambda$. In other words, the primal trade-off parameter equals the complexity cost of the corresponding bottleneck.
- The function ${I}_{Y}:\lambda \mapsto {I}_{\lambda}(T;Y)$ is constant for $\lambda \ge H(X)$. We will thus always assume, without loss of generality, that $\lambda \in [0,H(X)]$.
- In the discrete case, a bottleneck cardinality $\left|\mathcal{T}\right|=\left|\mathcal{X}\right|+1$ is enough to obtain optimal solutions. Thus, we always assume, without loss of generality, that $\left|\mathcal{T}\right|\le \left|\mathcal{X}\right|+1$, where $\left|\mathcal{T}\right|<\left|\mathcal{X}\right|+1$ may occur if needed to make $T$ full support.
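These facts can be illustrated numerically. The sketch below implements the standard self-consistent IB iterations (in the Lagrangian formulation rather than the constrained form (1)); the joint distribution `p_xy` and all function names are illustrative choices, not taken from the paper:

```python
import numpy as np

def mutual_info(p_ab):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])).sum())

def iterative_ib(p_xy, beta, n_t=4, iters=2000, seed=0):
    """Self-consistent IB iterations (Lagrangian form): returns (I(X;T), I(Y;T))."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]
    q_t_x = rng.random((n_t, len(p_x)))            # encoder q(t|x), random init
    q_t_x /= q_t_x.sum(axis=0)
    for _ in range(iters):
        q_t = q_t_x @ p_x                          # marginal q(t)
        q_ty = (q_t_x * p_x) @ p_y_x               # joint q(t,y)
        q_y_t = q_ty / (q_t[:, None] + 1e-30)      # decoder q(y|t)
        # KL divergence D( p(Y|x) || q(Y|t) ) for every pair (t, x)
        log_ratio = np.log(p_y_x[None] + 1e-30) - np.log(q_y_t[:, None] + 1e-30)
        kl = (p_y_x[None] * log_ratio).sum(axis=2)
        q_t_x = q_t[:, None] * np.exp(-beta * kl)  # IB self-consistency update
        q_t_x /= q_t_x.sum(axis=0)
    return mutual_info(q_t_x * p_x), mutual_info((q_t_x * p_x) @ p_y_x)

p_xy = np.array([[0.30, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.05, 0.05, 0.15]])
i_xt, i_yt = iterative_ib(p_xy, beta=5.0)
```

By the data-processing inequality, any encoder produced this way satisfies $I(Y;T)\le \min\{I(X;T),\,I(X;Y)\}$, so the resulting point lies in the achievable region of the information plane.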

**Definition 1.**

**Definition 2.**

**Definition 3.**

**Definition 4.**

**Notation 1.**

## 2. Exact Successive Refinement of the IB

#### 2.1. Formal Framework and First Results

**Definition 5.**

- ${T}_{1}$ is a bottleneck with parameter ${\lambda}_{1}$;
- For every $2\le i\le n$, the variable ${T}_{i}:=({T}_{i-1},{S}_{i})$ is a bottleneck with parameter ${\lambda}_{i}$.

**Proposition 1.**

- (i) There is successive refinement for parameters $({\lambda}_{1},\cdots ,{\lambda}_{n})$;
- (ii) There exist bottlenecks ${T}_{1},\cdots ,{T}_{n}$, of common source X and relevancy Y, with respective parameters ${\lambda}_{1},\cdots ,{\lambda}_{n}$, and an extension $q(X,{T}_{1},\cdots ,{T}_{n})$ of the ${q}_{i}:={q}_{i}(X,{T}_{i})$, such that, under q, we have the Markov chain $$X-{T}_{n}-\cdots -{T}_{1};$$
- (iii) There exist bottlenecks ${T}_{1},\cdots ,{T}_{n}$, of common source X and relevancy Y, with respective parameters ${\lambda}_{1},\cdots ,{\lambda}_{n}$, and an extension $q(Y,X,{T}_{1},\cdots ,{T}_{n})$ of the ${q}_{i}:={q}_{i}(Y,X,{T}_{i})$, such that, under q, we have the Markov chain $$Y-X-{T}_{n}-\cdots -{T}_{1}.$$

**Proof.**
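Conditions (ii) and (iii) can be probed numerically on a candidate extension: a chain such as $X-T_2-T_1$ holds under $q$ exactly when the conditional mutual information $I(X;T_1|T_2)$ vanishes. A minimal sketch on hypothetical toy joints (illustrative, not from the paper):

```python
import numpy as np

def cond_mutual_info(q):
    """I(A;C | B) in nats for a joint array indexed as q[a, b, c]."""
    q_b = q.sum(axis=(0, 2))
    q_ab = q.sum(axis=2)
    q_bc = q.sum(axis=0)
    cmi = 0.0
    for a in range(q.shape[0]):
        for b in range(q.shape[1]):
            for c in range(q.shape[2]):
                if q[a, b, c] > 0:
                    cmi += q[a, b, c] * np.log(
                        q[a, b, c] * q_b[b] / (q_ab[a, b] * q_bc[b, c]))
    return cmi

# X uniform on {0,1,2,3}; T2 = X // 2 is a coarse-graining of X,
# and T1 = T2, so the chain X - T2 - T1 holds by construction.
chain = np.zeros((4, 2, 2))
for x in range(4):
    chain[x, x // 2, x // 2] = 0.25

# Here T1 = X % 2 carries information about X beyond T2: chain violated.
broken = np.zeros((4, 2, 2))
for x in range(4):
    broken[x, x // 2, x % 2] = 0.25
```

Here `cond_mutual_info(chain)` returns exactly 0, while `cond_mutual_info(broken)` returns $\ln 2 \approx 0.69$ nats.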

**Remark 1.**

**Proposition 2.**

**Proposition 3.**

**Proof.**

#### 2.2. The Convex Hull Characterisation and the Case $\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=2$

**Proposition 4.**

**Proof.**

**Remark 2.**

**Proposition 5.**

**Proof.**

#### 2.3. Numerical Results on Minimal Examples

**Definition 6.**

**Notation 2.**

## 3. Soft Successive Refinement of the IB

#### 3.1. Formalism

**Definition 7.**

**Definition 8.**

**Proposition 6.**

**Proof.**

**Proposition 7.**

**Remark 3.**

**Definition 9.**

#### 3.2. Numerical Results on Minimal Examples

## 4. Alternative Interpretations: Decision Problems and Deep Learning

#### 4.1. Successive Refinement, Decision Problems, and Orders on Encoder Channels

**Definition 10.**

**Definition 11.**

**Proposition 8.**

- (i) There is successive refinement for parameters $({\lambda}_{1},{\lambda}_{2})$.
- (ii) There are bottlenecks ${T}_{1},{T}_{2}$ of respective parameters ${\lambda}_{1},{\lambda}_{2}$ such that $$q({T}_{2}|X)\sqsupseteq_{\mathcal{X}}q({T}_{1}|X).$$
- (iii) There are bottlenecks ${T}_{1},{T}_{2}$ of respective parameters ${\lambda}_{1},{\lambda}_{2}$ such that $$q({T}_{2}|X)\sqsupseteq_{\mathcal{X}}^{\prime}q({T}_{1}|X).$$

**Proof.**
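Proposition 8 characterises successive refinement through an order on encoder channels. One standard way to compare channels with a common input is degradedness: $q(T_1|X)$ is a garbling of $q(T_2|X)$ if it can be obtained by post-processing $T_2$ through some channel $M$. This comparison is offered purely as an illustration of how such orders can be tested, not necessarily as the exact order of the proposition; existence of $M$ is a linear feasibility problem (sketch below uses `scipy`, and all names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def is_garbling(k1, k2):
    """Does a row-stochastic M exist with k1 = k2 @ M?
    Channels are arrays k[x, t] = q(t | x), rows indexed by inputs x."""
    nx, n2 = k2.shape
    n1 = k1.shape[1]
    nvar = n2 * n1                       # entries of M, flattened row-major
    rows, rhs = [], []
    for x in range(nx):                  # equality constraints: k2 @ M == k1
        for j in range(n1):
            row = np.zeros(nvar)
            for i in range(n2):
                row[i * n1 + j] = k2[x, i]
            rows.append(row)
            rhs.append(k1[x, j])
    for i in range(n2):                  # each row of M sums to 1
        row = np.zeros(nvar)
        row[i * n1:(i + 1) * n1] = 1.0
        rows.append(row)
        rhs.append(1.0)
    res = linprog(np.zeros(nvar), A_eq=np.array(rows), b_eq=np.array(rhs),
                  bounds=[(0, 1)] * nvar, method="highs")
    return res.status == 0               # feasible <=> a garbling M exists

identity = np.eye(2)                     # noiseless binary channel
bsc = np.array([[0.8, 0.2],              # binary symmetric channel, p = 0.2
                [0.2, 0.8]])
```

A binary symmetric channel is a garbling of the noiseless channel, but not conversely: `is_garbling(bsc, identity)` is `True` while `is_garbling(identity, bsc)` is `False`.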

#### 4.2. Successive Refinement and Deep Learning

- (i) Satisfy the Markov chain (10); and
- (ii) Are each bottlenecks with source X and relevancy Y, for respective trade-off parameters ${\lambda}_{1}>\cdots >{\lambda}_{n}$.

**Remark 4.**

- We know that the variables ${L}_{1},\cdots ,{L}_{n}$ must satisfy $X-{L}_{1}-\cdots -{L}_{n}$, i.e., we know that the joint distribution $q:=q(X,{L}_{n},\cdots ,{L}_{1})$ must be in ${\Delta}_{SR}$;
- And we want to know “how close” we can choose this joint distribution q to one whose marginals $q(X,{L}_{1}),\cdots ,q(X,{L}_{n})$ coincide with bottleneck distributions ${q}_{1}:={q}_{1}(X,{T}_{1}),\cdots ,{q}_{n}:={q}_{n}(X,{T}_{n})$, respectively, of parameters ${\lambda}_{1}>\cdots >{\lambda}_{n}$, i.e., we want to know how close we can choose q to the set ${\Delta}_{{q}_{1},\cdots ,{q}_{n}}$.

## 5. Limitations and Future Work

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| IB | Information Bottleneck |
| SR | Successive Refinement |
| UI | Unique Information |
| DNN | Deep Neural Network |

## Appendix A. Section 1 Details

#### Appendix A.1. Effective Cardinality

**Proposition A1.**

**Proof.**

**Proposition A2.**

**Proof.**

## Appendix B. Section 2 Details

#### Appendix B.1. Proof of Proposition 1

#### Appendix B.2. Operational Interpretation of Successive Refinement

**Definition A1.**

**Definition A2.**

**Definition A3.**

**Proposition A3.**

- (i) We have the Markov chain $Y-X-{T}_{n}-\cdots -{T}_{1}$;
- (ii) The variables ${T}_{1},\cdots ,{T}_{n}$ are each bottlenecks with respective parameters ${\lambda}_{1},\cdots ,{\lambda}_{n}$.

**Proof.**

- We have the Markov chain $Y-X-{T}_{n}-\cdots -{T}_{1}$; and
- We have, for all $i=1,\cdots ,n$, $$I(X;{T}_{i})\le {\lambda}_{i},\qquad I(Y;{T}_{i})\ge {I}_{Y}({\lambda}_{i}).$$

#### Appendix B.3. Proof of Proposition 2

**Proposition A4.**

- (i) There is successive refinement for Lagrangian parameters $({\beta}_{1},\cdots ,{\beta}_{n})$.
- (ii) There exist Lagrangian bottlenecks ${T}_{1},\cdots ,{T}_{n}$, of common source X and relevancy Y, with respective parameters ${\beta}_{1},\cdots ,{\beta}_{n}$, and an extension $q(Y,X,{T}_{1},\cdots ,{T}_{n})$ of the ${q}_{i}:={q}_{i}(Y,X,{T}_{i})$, such that, under q, we have the Markov chain $$Y-X-{T}_{n}-\cdots -{T}_{1}.$$

**Proof.**

**(Proof of Proposition 2).**

**Proposition A5.**

**Proposition A6.**

**Proof.**

#### Appendix B.4. Proof of Proposition 3

#### Appendix B.5. Proof of Proposition 4

**Proposition A7.**

- (i) There exists an extension $\tilde{q}(U,V,W)$ of $q(U,W)$ and $q(V,W)$ under which the Markov chain $U-V-W$ holds.
- (ii) For each $u\in \mathcal{U}$, there exists a family of convex combination coefficients $\{{\alpha}_{v,u},\ v\in \mathcal{V}\}$ such that $$q(W|u)=\sum _{v}{\alpha}_{v,u}\,q(W|v).$$

**Proof.**
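Condition (ii) asks whether each $q(W|u)$ lies in the convex hull of the family $\{q(W|v)\}_{v}$; Appendix B.6 formulates this as a linear program. The sketch below is a generic convex hull membership test by linear feasibility (illustrative, using `scipy`; not necessarily the paper's exact program):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Is `point` a convex combination of the rows of `points`?"""
    n = points.shape[0]
    # unknowns: coefficients alpha_v >= 0 with sum_v alpha_v = 1
    # and points^T @ alpha = point
    a_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(np.zeros(n), A_eq=a_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * n, method="highs")
    return res.status == 0            # feasible <=> point is in the hull

# rows play the role of decoder distributions q(W|v);
# the query point plays the role of a candidate q(W|u)
dirac_rows = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
```

For instance, `[0.3, 0.7]` lies in the hull of the two Dirac rows above, whereas with rows `[1, 0]` and `[0.6, 0.4]` the point `[0.2, 0.8]` does not (any convex combination has first coordinate in $[0.6, 1]$).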

**Lemma A1.**

**Proof.**

#### Appendix B.6. Linear Program Used to Compute the Convex Hull Condition (7)

#### Appendix B.7. Proof of Proposition 5

**Figure A1.** The function ${F}_{\beta}$ for example values of $\beta$ and $p(X,Y)$, where the source and relevancy are binary. Here, on the x-axis, p parameterises the binary distribution $[p,1-p]$.

**Proposition A8.**

**Lemma A2.**

**Proof.**

**Lemma A3.**

**Lemma A4.**

**Proof.**

#### Appendix B.8. Computation of Bifurcation Values

## Appendix C. Section 3 Details

#### Appendix C.1. Proof of Proposition 6

## Appendix D. The Unicity and Injectivity Conjecture, and Technical Subtleties It Would Solve

**Conjecture 1.**

- (i) The pair $(q(T),q(X|T))$ is, up to permuting bottleneck symbols, uniquely determined.
- (ii) The channel $q(X|T)$, seen as a linear operator on probability distributions, is injective.
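For condition (ii), full column rank of the decoder matrix is a simple sufficient criterion: if the columns $\{q(X|t)\}_{t}$ are linearly independent, the channel is injective on all of $\mathbb{R}^{|\mathcal{T}|}$, hence on distributions (injectivity restricted to the simplex only requires affine independence of the columns). A quick numerical check (illustrative; the matrices below are hypothetical examples):

```python
import numpy as np

def decoder_is_injective(q_x_t):
    """Sufficient test: q_x_t[x, t] = q(x|t) has full column rank."""
    return np.linalg.matrix_rank(q_x_t) == q_x_t.shape[1]

# columns are the decoder distributions q(X|t)
distinct = np.array([[0.8, 0.1],
                     [0.1, 0.1],
                     [0.1, 0.8]])
repeated = np.array([[0.5, 0.5],
                     [0.3, 0.3],
                     [0.2, 0.2]])   # two bottleneck symbols share one decoder
```

Here `decoder_is_injective(distinct)` is `True`, while `decoder_is_injective(repeated)` is `False`, since the repeated column collapses two bottleneck symbols onto the same decoder distribution.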

**Conjecture 2.**

## Appendix E. Sample $\mathbf{p}(\mathbf{Y}|\mathbf{X})$ Used in Section 2.3 and Section 3.2

**Figure A2.** Plot of the sample distributions $p(Y|X)$ used in, respectively, from top to bottom: (i) Figure 4 and Figure 7; (ii) Figure 5 and Figure 8; (iii) Figure 6 and Figure 9. The simplex depicted here is ${\Delta}_{\mathcal{Y}}$, where $\left|\mathcal{Y}\right|=3$, and each black square corresponds to a symbol-wise conditional probability $p(Y|x)\in {\Delta}_{\mathcal{Y}}$. Note that the corresponding $p(X)\in {\Delta}_{\mathcal{X}}$ is shown in the left parts of Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9, which depict the simplex ${\Delta}_{\mathcal{X}}$, where, here, we also have $\left|\mathcal{X}\right|=3$. The explicit values of the corresponding $p(X,Y)$ can be found at: https://gitlab.com/uh-adapsys/successive-refinement-ib/.

## Appendix F. Additional Plots for Exact and Soft Successive Refinement

**Figure A3.** Additional examples for $\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=3$: comparison of bottleneck trajectories (left) with exact SR patterns (center) and unique information landscapes (right). See Figure 4 and Figure 7 for more details on the legends. The conditional distributions $p(Y|X)$ corresponding to each row in this figure are plotted in Figure A4. The explicit values of the corresponding $p(X,Y)$ can be found at: https://gitlab.com/uh-adapsys/successive-refinement-ib/.

**Figure A4.** Sample distributions $p(Y|X)$ used in Figure A3, where the vertical order here corresponds to that of Figure A3. The simplex depicted here is ${\Delta}_{\mathcal{Y}}$, where $\left|\mathcal{Y}\right|=3$, and each black square corresponds to a symbol-wise conditional probability $p(Y|x)\in {\Delta}_{\mathcal{Y}}$. Note that the corresponding $p(X)\in {\Delta}_{\mathcal{X}}$ is shown in the left parts of each row in Figure A3, which depict the simplex ${\Delta}_{\mathcal{X}}$, where, here, we also have $\left|\mathcal{X}\right|=3$. The explicit values of the corresponding $p(X,Y)$ can be found at: https://gitlab.com/uh-adapsys/successive-refinement-ib/.

## References


**Figure 1.** Examples of distributions $q(X|T)$, visualised as families of points $\{q(X|t),\ t\in \mathcal{T}\}$ on the source simplex ${\Delta}_{\mathcal{X}}$, where, here, $\left|\mathcal{X}\right|=3$. Each of the triangle’s vertices represents the Dirac probability of some $x\in \mathcal{X}$. The bottleneck’s effective cardinality is $k=2$ on the left and $k=3$ on the right.

**Figure 2.** Successive refinement visualised on the information plane. On the left, adding the information from the variable ${S}_{2}$ (the supplement variable) is not efficient enough to achieve successive refinement. On the right, it is. See main text for details (the values of $I(X;{S}_{2}|{T}_{1})$ and $I(Y;{S}_{2}|{T}_{1})$ have been chosen arbitrarily to illustrate each case).

**Figure 3.** Illustration of the convex hull condition. The black triangle represents the source simplex ${\Delta}_{\mathcal{X}}$ with, here, $\left|\mathcal{X}\right|=3$, and the pointwise bottleneck decoder probabilities $\{q(X|t),\ t\in \mathcal{T}\}$ are represented on it (in cyan for the coarser bottleneck ${T}_{1}$ and in red for the finer one ${T}_{2}$). The convex hulls of the respective families of points are shaded with the corresponding color. On the left, the condition is not satisfied; on the right, it is.

**Figure 4.** Left: bottleneck trajectories for an example distribution $p(X,Y)$ such that $\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=3$, i.e., trajectory of ${q}_{\lambda}(X|T)$, represented as the family of points $\{{q}_{\lambda}(X|t),\ t\in \mathcal{T}\}$ on the source simplex ${\Delta}_{\mathcal{X}}$, as a function of $\lambda =I(X;T)$ (crosses: value of ${q}_{{\lambda}_{c}}(X|T)$ just before a symbol split at a critical parameter ${\lambda}_{c}$, where the crosses’ color corresponds to the value of ${\lambda}_{c}$). The conditional distribution ${q}_{\lambda}(X|T)$ is defined by the single point $p(X)$ when $\lambda =0$ (dark blue cross on the black square), or by two distinct points between the first and second symbol splits (dark blue to cyan), or by three distinct points after the second symbol split (cyan to red). Note the discontinuity of ${q}_{\lambda}(X|T)$ at each symbol split (without the discontinuity, the trajectory around a symbol split would look like a branching). Right: corresponding SR pattern, i.e., corresponding output for the convex hull condition (blue: satisfied; red: not satisfied; dashed white lines: critical values ${\lambda}_{c}(i)$ of either ${\lambda}_{1}$ or ${\lambda}_{2}$). For instance, the critical value ${\lambda}_{c}(2)\approx 0.33$ corresponds, on the bottleneck trajectories (left), to the symbol split from two to three symbols (cyan crosses). Note that ${\lambda}_{c}(1)\approx 0$. The respective $p(Y|X)$ corresponding to this figure and to Figure 5 and Figure 6 are plotted in Appendix E.

**Figure 5.** Same as Figure 4, with a different example distribution $p(X,Y)$ such that $\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=3$.

**Figure 6.** Same as Figure 4, with a different example distribution $p(X,Y)$ such that $\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=3$.

**Figure 7.** Left: example trajectory of ${q}_{\lambda}(X|T)$ as a function of $\lambda =I(X;T)$ (crosses: value of ${q}_{{\lambda}_{c}}(X|T)$ just before a symbol split at a critical parameter ${\lambda}_{c}$). Right: corresponding unique information, in bits (color), expressed as a function of the pair of trade-off parameters (white dashed lines indicate critical values ${\lambda}_{c}(i)$ of either ${\lambda}_{1}$ or ${\lambda}_{2}$). For instance, the critical value ${\lambda}_{c}(2)\approx 0.33$ (right) corresponds, on the bottleneck trajectories (left), to the symbol split from two to three symbols (cyan crosses). The respective $p(Y|X)$ corresponding to this figure and to Figure 8 and Figure 9 are plotted in Appendix E.

**Figure 10.** New example of an exact SR pattern and the corresponding UI landscape over trade-off parameters ${\lambda}_{1}<{\lambda}_{2}$, where, here, $\left|\mathcal{X}\right|=5$ and $\left|\mathcal{Y}\right|=3$. Left: exact SR pattern, i.e., output for the convex hull condition (blue: satisfied, red: not satisfied). Right: corresponding UI landscape, in bits (color). White dashed lines indicate critical values ${\lambda}_{c}(i)$ of either ${\lambda}_{1}$ or ${\lambda}_{2}$. Note that (i) the binary notion of exact SR (left) filters out most of the structure unveiled by UI (right), (ii) the UI landscape seems highly impacted by IB bifurcations, and (iii) the UI is in any case always small, even though not entirely negligible. See main text for more details.

**Figure 11.** Same as Figure 10, with a new example distribution $p(X,Y)$, where, here, $\left|\mathcal{X}\right|=5$ and $\left|\mathcal{Y}\right|=3$. Besides the white orthogonal dashed lines, other white dots correspond to values of $({\lambda}_{1},{\lambda}_{2})$ for which the algorithm did not converge (see main text for a comment on this lack of convergence).

**Figure 12.** Same as Figure 10, with a new example distribution $p(X,Y)$, where, here, $\left|\mathcal{X}\right|=7$ and $\left|\mathcal{Y}\right|=5$.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Charvin, H.; Catenacci Volpi, N.; Polani, D.
Exact and Soft Successive Refinement of the Information Bottleneck. *Entropy* **2023**, *25*, 1355.
https://doi.org/10.3390/e25091355
