The Information Bottleneck’s Ordinary Differential Equation: First-Order Root Tracking for the Information Bottleneck

The Information Bottleneck (IB) is a method of lossy compression of relevant information. Its rate-distortion (RD) curve describes the fundamental tradeoff between input compression and the preservation of relevant information embedded in the input. However, it conceals the underlying dynamics of optimal input encodings. We argue that these typically follow a piecewise-smooth trajectory as the input information is compressed, as was recently shown for RD. These smooth dynamics are interrupted when an optimal encoding changes qualitatively: at a bifurcation. By leveraging the IB's intimate relations with RD, we provide substantial insight into its solution structure, highlighting caveats in its finite-dimensional treatments. Sub-optimal solutions are seen to collide or exchange optimality at its bifurcations. Despite the acceptance of the IB and its applications, there are surprisingly few techniques for solving it numerically, even for finite problems whose distribution is known. We derive anew the IB's first-order ordinary differential equation, which describes the dynamics underlying its optimal tradeoff curve. To exploit these dynamics, we not only detect IB bifurcations but also identify their type, in order to handle them accordingly. Rather than approaching the IB's optimal tradeoff curve from sub-optimal directions, this allows us to follow a solution's trajectory along the optimal curve, under mild assumptions. We thereby translate an understanding of IB bifurcations into a surprisingly accurate numerical algorithm.


Introduction
The Information Bottleneck (IB) describes the fundamental tradeoff between the compression of information on an input X and the preservation of relevant information on a hidden reference variable Y. Formally, let X and Y be random variables defined respectively on finite source and label alphabets X and Y, and let p_{Y|X}(y|x) p_X(x) be their joint probability distribution, or p(y|x)p(x) for short. One seeks [Tishby et al., 1999] to maximize the information I(Y;X̂) over all Markov chains Y ↔ X ↔ X̂, subject to a constraint on the mutual information I(X;X̂) := E_{p(x̂|x)p(x)} log [p(x̂|x)/p(x̂)],

    I_Y(I_X) := max_{p(x̂|x)} { I(Y;X̂) : I(X;X̂) ≤ I_X }.   (1)

The latter maximization is over conditional probability distributions, or encoders, p(x̂|x). The graph of I_Y(I_X) is the IB curve. We write T := |X̂| for the size of a codebook or representation alphabet X̂. An encoder p(x̂|x) which achieves the maximum in (1) is IB-optimal, or simply optimal. Written in a Lagrangian formulation L := I(X;X̂) − β I(Y;X̂) with β > 0, [Tishby et al., 1999] showed that a necessary condition for extrema in (1) is that the IB Equations hold. Namely,

    p(x̂|x) = p(x̂)/Z(x,β) · exp{−β D_KL[p(y|x) ‖ p(y|x̂)]},   (2)
    p(y|x̂) = Σ_x p(y|x) p(x|x̂),   (3)
    p(x̂) = Σ_x p(x̂|x) p(x).   (4)

In these, Z(x,β) := Σ_x̂ p(x̂) exp{−β D_KL[p(y|x) ‖ p(y|x̂)]} is the partition function, p(x|x̂) in (3) is defined by the Bayes rule p(x̂|x)p(x)/p(x̂), and D_KL is the Kullback-Leibler divergence, D_KL[p‖q] := Σ_i p(i) log [p(i)/q(i)]. The IB Equations (2)-(4) are a necessary condition for an extremum of L also when it is considered as a functional in three independent families of normalized distributions {p(x̂|x)}, {p(y|x̂)} and {p(x̂)}, [Tishby et al., 1999, Section 3.3], rather than in {p(x̂|x)} alone. While satisfying them is necessary to achieve the curve (1), it is not sufficient. Indeed, Equations (2)-(4) have solutions that do not achieve the curve (1), and so are sub-optimal. This results in sub-optimal IB curves, which intersect or bifurcate as the multiplier β varies (see Section 3.4 there).
Iterating over the IB Equations (2)-(4) is essentially the Blahut-Arimoto algorithm variant for the IB (BA-IB) due to [Tishby et al., 1999], brought here as Algorithm 1. While problem (1) can be solved exactly in special cases, [Witsenhausen and Wyner, 1975, Section IV], exact solutions of an arbitrary finite IB problem whose distribution is known are usually obtained nowadays using BA-IB; cf., [Zaidi et al., 2020, Section 3] for a survey of other computational approaches. We write BA_β for a single iteration of the BA-IB Algorithm 1. Since BA_β encodes an iteration over the IB Equations (2)-(4), an encoder p(x̂|x) is its fixed point, BA_β[p(x̂|x)] = p(x̂|x), if and only if it satisfies the IB Equations; or equivalently, if p(x̂|x) is a root of the IB operator

    Id − BA_β,   (5)

in a manner similar to Agmon et al. [2021]. We shall then call it an IB root. Agmon et al. used a similar formulation of rate-distortion (RD) theory and its relations [Gilad-Bachrach et al., 2003] to the IB to show that BA-IB suffers from critical slowing down near critical points, where the marginal p(x̂) of a representor x̂ in an optimal encoder vanishes gradually. That is, the number of BA-IB iterations required till convergence increases dramatically as one approaches such points. Formulating fixed points of an iterative algorithm as operator roots can also be leveraged for computational purposes in a constrained-optimization problem, as noted recently by Agmon [2022a] for RD. Indeed, let F(·, β) be a differentiable operator on R^n for some n > 0, F : R^n × R → R^n, where β is a (real) constraint parameter,

    F(x, β) = 0.   (6)

Suppose now that (x, β) is a root of F, such that x = x(β) is a differentiable function of β. Write D_x F := (∂F_i/∂x_j)_{i,j} for its Jacobian matrix, and D_β F := (∂F_i/∂β)_i for its vector of partial derivatives with respect to β. The point (x, β) of evaluation is omitted whenever understood.
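To make the fixed-point view concrete, one iteration over the IB Equations (2)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration under our own conventions, not the paper's Algorithm 1 verbatim; the name `ba_ib_step` and the array shapes are assumptions of this sketch.

```python
import numpy as np

def ba_ib_step(enc, pxy, beta):
    """One Blahut-Arimoto IB iteration over the IB Equations (2)-(4).

    enc: encoder p(xhat|x), shape (|X|, T); pxy: joint p(x, y), shape (|X|, |Y|).
    Assumes p(y|x) > 0 everywhere, so the KL divergences below are finite.
    """
    px = pxy.sum(axis=1)                              # p(x)
    py_x = pxy / px[:, None]                          # p(y|x)
    pxhat = enc.T @ px                                # (4): p(xhat)
    # Bayes rule p(x|xhat), then (3): p(y|xhat) = sum_x p(y|x) p(x|xhat)
    px_xhat = (enc * px[:, None]).T / np.maximum(pxhat[:, None], 1e-300)
    py_xhat = px_xhat @ py_x
    # (2): p(xhat|x) proportional to p(xhat) exp(-beta * D_KL[p(y|x)||p(y|xhat)])
    dkl = np.array([[np.sum(py_x[x] * np.log(py_x[x] / py_xhat[t]))
                     for t in range(len(pxhat))] for x in range(len(px))])
    new_enc = pxhat[None, :] * np.exp(-beta * dkl)
    return new_enc / new_enc.sum(axis=1, keepdims=True)   # divide by Z(x, beta)

# Example: BSC(0.3) with a uniform source, at a beta well above its bifurcation
pxy = 0.5 * np.array([[0.7, 0.3], [0.3, 0.7]])
enc = np.array([[0.9, 0.1], [0.2, 0.8]])              # arbitrary initial encoder
for _ in range(5000):
    enc = ba_ib_step(enc, pxy, beta=10.0)
```

After enough iterations, `ba_ib_step` reproduces `enc` unchanged: the encoder is then a fixed point of BA_β, i.e., an IB root.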
As is often discussed along with the Implicit Function Theorem, e.g., [de Oliveira, 2014], applying the multivariate chain rule to F(x(β), β) in (6) yields an implicit ordinary differential equation (ODE),

    dx/dβ = −(D_x F)^{-1} D_β F,   (7)

similar to [Rose, 1994, Section IV.C]. A common alternative approach to bifurcations is perturbative. That is, suppose that the first variation of the IB Lagrangian L vanishes,

    δL[∆p(x̂|x)] = 0,   (8)

for every perturbation ∆p(x̂|x). This condition is necessary for extremality and implies the IB Equations (2)-(4), [Tishby et al., 1999]. Then, (p(x̂|x), β) is said to be a phase transition only if there exists a particular direction ∆q(x̂|x) at which p(x̂|x) can be perturbed without affecting the Lagrangian's value to second order,

    δ²L[∆q(x̂|x)] = 0.   (9)

Gedeon et al. [2012] and Ngampruetikorn and Schwab [2021] take such an approach. Zaslavsky [2019] similarly analyzes one type of IB bifurcation. While a perturbative approach is common in analyzing phase transitions, it has several shortcomings when applied to the IB, as noted by Agmon [2022b]. First, the IB's Lagrangian L is constant on a linear manifold of encoders p(x̂|x), [Gedeon et al., 2012, Section 3.1], and so condition (9) leads to false detections. While this was addressed there, and in its sequel [Parker and Dimitrov, 2022], by giving subtle conditions on the nullity of the second variation in (9), in practice it is difficult to tell whether a particular direction ∆q(x̂|x) is in the kernel due to a bifurcation or due to other reasons, as they note. Second, note that a finite IB problem can be written as an infinite RD problem, [Harremoës and Tishby, 2007]. As discussed in Section 5, representing an IB root by a finite-dimensional vector leads to inherent subtleties in its computation. Among other things, these may well result in a bifurcation not being detectable under certain circumstances (Section 5.3). To our understanding, many of the difficulties that hindered the understanding of IB bifurcations throughout the years are, in fact, artifacts of finite dimensionality.
Third, conditions (8)-(9) do not suffice to reveal the type of the bifurcation, information which is necessary for handling it when following a root's path. While [Parker and Dimitrov, 2022, Section 2.9] give conditions for identifying the type, these partially agree with our findings and do not suggest a straightforward way for handling a bifurcation.
Rather than imposing conditions on the scalar functional L, our approach to IB bifurcations follows that of [Agmon, 2022a] for RD. That is, we rely on the fact that the IB's local extrema are fixed points of an iterative algorithm, and so they also satisfy a vector equation F = 0 (6). We shall now consider a toy problem to motivate our approach. "Bifurcation Theory can be briefly described by the investigation of problem (6) in a neighborhood of a root where D_x F is singular", [Kielhöfer, 2012]. Indeed, recall that if D_x F is non-singular at a root (x_0, β_0), then by the Implicit Function Theorem (IFT), there exists a function x(β) through the root, x(β_0) = x_0, which satisfies F(x(β), β) = 0 (6) in the vicinity of β_0. The function x(β) is then not only unique in some neighborhood of (x_0, β_0), but further, x(β) inherits the differentiability properties of F, [Kielhöfer, 2012, I.1.7]. In particular, if the operator F is real-analytic in its variables, as with the IB operator (5), then so is its root x(β). While a bifurcation can occur only if D_x F is singular, singularity is not sufficient for a bifurcation to occur. For example, consider the operator on R²,

    F(x, y; β) := (x − β, 0).   (10)

Its roots consist of the vertical line x = β, {(β, y) : y ∈ R}, for every β ∈ R. For a fixed y, each such root is real-analytic in β. However, one cannot deduce this directly from the IFT, as the Jacobian of F (10), the matrix [[1, 0], [0, 0]], is always singular. Note, however, that in this particular example, the x coordinate alone suffices to describe the problem's dynamics, and so its y coordinate is redundant. One can ignore the y coordinate by considering the "reduction" F̃(x; β) := x − β of F to R¹. Further, discarding y also removes, or mods-out, the direction (0, 1)ᵀ from ker D_x F, which does not pertain to a bifurcation in this case. This results in the non-singular 1×1 Jacobian matrix (1) of F̃, and so it is now possible to invoke the IFT on the reduced problem.
The root guaranteed by the IFT can always be considered in R 2 by putting back a redundant y coordinate at some fixed value. Agmon [2022a] used a similarly defined reduction of finite RD problems to show that their dynamics are piecewise real-analytic under mild assumptions.
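The toy example above is easy to check numerically. The sketch below (our own illustration, not part of the paper) confirms that the full Jacobian of F(x, y; β) = (x − β, 0) is always singular, while the reduced problem admits the implicit ODE dx/dβ = −(D_x F)⁻¹ D_β F of Equation (7).

```python
import numpy as np

# Toy operator F(x, y; beta) = (x - beta, 0) on R^2, as in Equation (10).
def F(v, beta):
    return np.array([v[0] - beta, 0.0])

DxF = np.array([[1.0, 0.0],    # Jacobian w.r.t. (x, y): singular for every beta,
                [0.0, 0.0]])   # so the IFT cannot be invoked directly
assert np.linalg.matrix_rank(DxF) == 1

# Reduction: discard the redundant y coordinate. For F~(x; beta) = x - beta,
# the reduced Jacobian is the non-singular 1x1 matrix (1), and the implicit
# ODE dx/dbeta = -(DxF~)^(-1) DbetaF~ recovers the root path x(beta) = beta.
DxF_red = np.array([[1.0]])
DbetaF_red = np.array([-1.0])      # d/dbeta of (x - beta)
dx_dbeta = -np.linalg.solve(DxF_red, DbetaF_red)
assert np.isclose(dx_dbeta[0], 1.0)
```

The recovered derivative dx/dβ = 1 matches the explicit root path x(β) = β.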
The intuition behind our approach is similar to [Harremoës and Tishby, 2007, Section III], who observed that "in the IB one can also get rid of irrelevant variables within the model". Nevertheless, the details differ. Mathematically, we consider the quotient V/W of a vector space V by its subspace W. Elements of V are identified in the quotient if they differ by an element of W: v_1 ∼ v_2 ⇔ v_1 − v_2 ∈ W, for v_1, v_2 ∈ V. This way, one "mods-out" W, collapsing it to a single point in the quotient vector space V/W. The resulting problem is smaller and so easier to handle, whether for theoretical or practical purposes. This is how the one-dimensional vector space ker D_x F in our toy example (10) was reduced to the trivial ker D_x F̃ = {0}. However, one needs to understand the solution structure, for example, to ensure that the directions in W are not due to a bifurcation. We note in passing that V/W has a simple geometric interpretation as the translations of W in V, in a manner reminiscent of its better-known counterparts, quotient groups and rings; e.g., [Dummit and Foote, 2004, Section 10.2]. To keep things simple, however, we shall not use quotients explicitly. Instead, the reader may simply consider the sequel as a removal of redundant coordinates. Indeed, we shall only remove coordinates that the reader does not care about anyway, as in the toy example above.
To pursue this approach, one needs to consider the IB in a coordinate system that permits a simple reduction as in (10), and to understand its solution structure. We achieve these in Section 5 by exploiting two properties of the IB which are often overlooked. First, proceeding with the coordinates exchange of Section 2, the intimate relations [Harremoës and Tishby, 2007, Gilad-Bachrach et al., 2003] of the IB with RD suggest a "minimally sufficient" coordinate system for the IB, just as the x axis is for problem (10). Reducing an IB root to these coordinates is a natural extension of reduction in RD, [Agmon, 2022a]. Reduction of IB roots facilitates a clean treatment of IB bifurcations. These are roughly divided into continuous and discontinuous bifurcations, in Subsections 5.2 and 5.3 respectively. While understanding continuous bifurcations is straightforward, the IB's relations with RD allow us to understand the discontinuous bifurcation examples of which we are aware as a support-switching bifurcation in RD, by leveraging [Agmon, 2022a, Section 6]. A second property is the analyticity of the IB operator (5), which stems from the analyticity of the IB Equations (2)-(4). By building on the first property, analyticity leads us to argue that the Jacobian of the IB operator (5) is generally non-singular (Conjecture 5) when considered in reduced coordinates as above. As an immediate consequence, the dynamics underlying the IB curve (1) are piecewise real-analytic in β, in a manner similar to RD. Indeed, the fact that there exist dynamics underlying the IB curve (1) in the first place can arguably be attributed to analyticity; cf., the discussion following Conjecture 5. Combining both properties sheds light on several subtle yet important practical caveats in solving the IB (Subsection 5.3), due to using finite-dimensional representations of its roots. These subtleties are compatible with our numerical experience.
The results here suggest that, unlike RD, the IB is inherently infinite-dimensional, even for finite problems.
Finally, Section 6 combines the modified Euler method of Section 4 with the understanding of IB bifurcations in Section 5 to obtain Algorithm 5 (IBRT1) for following the path of an optimal IB root, in Subsection 6.1: First-order Root-Tracking for the IB. For simplicity, we focus mainly on continuous IB bifurcations, as these are the ones most often encountered in practice; cf., the comments in Subsection 6.3. The resulting approximations in the information plane are surprisingly close to the true IB curve (1), even on relatively sparse grids (i.e., with large step sizes), as seen in Figure 1. See Subsection 6.2 for the numerical results underlying the latter. The reasons for this are discussed in Subsection 6.3, along with the algorithm's basic properties. Unlike BA-IB, which suffers from an increased computational cost near bifurcations, our Algorithm 5 suffers from a reduced accuracy there, in a manner similar to root-tracking for RD, [Agmon, 2022a].

Figure 1: Despite the algorithm's approximation errors (Section 6.2), the approximate curves it yields are visually indistinguishable from the true IB curve (1), even on relatively few grid points. The reasons for this are discussed below (Section 6.3). Our First-order Root-Tracking for the IB (IBRT1) Algorithm 5 was used to approximate the optimal IB roots of a binary symmetric channel with crossover probability 0.3 and a uniform source, BSC(0.3), for several grid densities. The points in the information plane yielded from these approximations are plotted on top of the problem's exact solution (see Appendix E).
With that, we note that there are standard techniques in Bifurcation Theory for handling a non-trivial kernel of D_x F at a root. For example, the Lyapunov-Schmidt reduction replaces the high-dimensional problem F = 0 (6) on R^n by a smaller but equivalent problem Φ = 0, where Φ(·, β) maps vectors in the (right) kernel of D_x F to vectors in its left kernel. To achieve this, it separates the kernel and non-kernel directions of the problem, essentially handling each in its turn; e.g., [Kielhöfer, 2012, Theorem I.2.3] or [Teschl, 2022, Section 9.7]. This technique is generic, as it does not rely on any particular property of the problem at hand. As such, it is considerably more involved than removing redundant coordinates, which requires an understanding of the solution structure. In contrast, reduction in the IB is straightforward. For the purpose of following a root's path, carrying on with redundant kernel directions is burdensome, computationally expensive, and sensitive to approximation errors. Parker and Dimitrov [2022] use a variant of the Lyapunov-Schmidt reduction to consider IB bifurcations due to symmetry breaking. While our findings agree with theirs for continuous IB bifurcations, they differ for discontinuous ones (see Subsections 5.2 and 5.3).
Notations. Vectors are written in boldface x, scalars in a regular font x. A distribution p pertaining to a particular Lagrange multiplier value β (in Equations (2)-(4)) is denoted with a subscript, p_β. The probability simplex on a set S is denoted ∆[S] (see Section 5.1). The support of a probability distribution p on S is supp p := {s ∈ S : p(s) ≠ 0}. The source, label and representation alphabets of an IB problem are denoted X, Y and X̂, respectively; we write T := |X̂|. δ denotes Kronecker's delta, δ_{i,j} = 1 if i = j and zero otherwise.

Algorithm 1 (BA-IB): repeat iterating over the IB Equations (2)-(4), incrementing the iteration counter i at each pass, until convergence.
Output: a fixed point p(x̂|x) of the IB Equations.

2 Coordinates exchange for the IB

Just as a point in the plane can be described by different coordinate systems, so can IB roots. As demonstrated recently by Agmon [2022a] for the related rate-distortion theory, picking the right coordinates matters when analyzing bifurcations. The same holds for the IB. Our primary motivations for exchanging coordinates are to reduce computational costs and to mod-out irrelevant kernel directions, as explained in Section 1. In this section, we discuss three natural choices of a coordinate system for parametrizing IB roots and the reasoning behind our choice for the sequel, before setting out to derive the IB ODE in the following Section 3. This discussion is complemented by the later Subsection 5.1, which facilitates a transparent analysis of IB bifurcations.
IB roots have been classically parameterized in the literature by (direct) encoders p(x̂|x), following Tishby et al. [1999]. Considering the BA-IB Algorithm 1 reveals two other natural choices, illustrated by Equation (11) below. First, an encoder p(x̂|x) determines a cluster marginal p(x̂) and an inverse encoder p(x|x̂), via lines 1.4 and 1.5, respectively. These can be interpreted geometrically as p(x̂)-weighted points q_x̂(x) in the simplex ∆[X] of X, so long as these are well-defined, ∀x̂ p(x̂) ≠ 0. No more than |X| + 1 points in the simplex are required to represent an IB root, [Witsenhausen and Wyner, 1975]. The latter is readily seen to analyze the IB in these coordinates 6, though it pre-dates [Tishby et al., 1999]. Second, an inverse encoder determines a decoder p(y|x̂), via line 1.6. Along with the cluster marginal, (p(y|x̂), p(x̂)) can be similarly interpreted as p(x̂)-weighted points r_x̂(y) in the simplex ∆[Y] of Y. This choice of coordinates is implied already by [Tishby et al., 1999, Theorem 5]. Cycling around Equation (11), a decoder (p(y|x̂), p(x̂)) determines via lines 1.7 and 1.8 a new encoder, which may differ from the one with which we have started. For notational simplicity, we shall usually write (p(y|x̂), p(x̂)) rather than (r_x̂(y), p(x̂)) for decoder coordinates (similarly for inverse-encoder coordinates).
The above allows us to define three BA operators as the composition of three consecutive maps in Equation (11), encoding an iteration of Algorithm 1. When starting at an encoder p(x̂|x), its output is a newly-defined encoder. Similarly, when starting at one of the other two vertices, it sends an inverse-encoder (p(x|x̂), p(x̂)) or a decoder pair (p(y|x̂), p(x̂)) to a newly-defined one. By abuse of notation, we denote all three compositions by BA_β, with the choice of coordinate system mentioned accordingly. Indeed, these are representations of a single BA-IB iteration in three different coordinate systems, and so may be considered as distinct representations of the same operator. For completeness, BA_β in decoder coordinates is spelled out explicitly at Equation (27) in Appendix A. A newly-defined encoder (or inverse-encoder or decoder) at a cycle's completion need not generally equal the one at which we have started. These are equal precisely at IB roots, when the IB Equations (2)-(4) hold. Therefore, the choice of a coordinate system does not matter then, and so moving around Equation (11) from one vertex to another yields different parameterizations of the same root, at least when ∀x̂ p(x̂) ≠ 0. In particular, this shows that the inverse-encoders q_x̂ in ∆[X] of an IB root are in bijective correspondence with its decoders r_x̂ in ∆[Y], an observation which shall come in handy in Section 5.
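As a concrete sketch, one full BA-IB cycle can be expressed directly in decoder coordinates, sending a pair (p(y|x̂), p(x̂)) to a newly-defined one. This is our own illustration of the composition around Equation (11), not the explicit form (27) of Appendix A; it assumes strictly positive decoders, and the helper name `ba_decoder_step` is ours.

```python
import numpy as np

def ba_decoder_step(py_xhat, pxhat, pxy, beta):
    """One BA-IB cycle in decoder coordinates: (p(y|xhat), p(xhat)) -> new pair.

    py_xhat: (T, |Y|); pxhat: (T,); pxy: joint p(x, y) of shape (|X|, |Y|).
    Assumes all decoder entries are strictly positive.
    """
    px = pxy.sum(axis=1)
    py_x = pxy / px[:, None]
    # Decoder -> encoder, via the IB encoder Equation (2)
    dkl = np.array([[np.sum(py_x[x] * np.log(py_x[x] / py_xhat[t]))
                     for t in range(len(pxhat))] for x in range(len(px))])
    enc = pxhat[None, :] * np.exp(-beta * dkl)
    enc /= enc.sum(axis=1, keepdims=True)
    # Encoder -> new cluster marginal (4) and inverse encoder (Bayes rule)
    new_pxhat = enc.T @ px
    px_xhat = (enc * px[:, None]).T / new_pxhat[:, None]
    # Inverse encoder -> new decoder, via Equation (3)
    return px_xhat @ py_x, new_pxhat

# At an IB root the pair is reproduced exactly. Example: BSC(0.3), uniform source.
pxy = 0.5 * np.array([[0.7, 0.3], [0.3, 0.7]])
py_xhat, pxhat = np.array([[0.8, 0.2], [0.2, 0.8]]), np.array([0.5, 0.5])
for _ in range(5000):
    py_xhat, pxhat = ba_decoder_step(py_xhat, pxhat, pxy, beta=10.0)
```

Once converged, applying `ba_decoder_step` once more leaves the pair unchanged, in agreement with the fixed-point characterization of IB roots above.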
Next, we consider how well each of these coordinate systems can serve for following the path of an IB root. The minimal number of symbols x̂ needed to write down an IB root typically varies with the constraints; cf., [Tishby et al., 1999, Section 3.4] or [Witsenhausen and Wyner, 1975, Section II.A].
6 Although known among IB practitioners, this reference has generally escaped broader attention.
Therefore, inverse-encoder and decoder coordinates are better suited than encoder coordinates for considering the dynamics of a root with β, as they allow us to consider its evolution via a varying number of points in a fixed space, ∆[X] or ∆[Y], respectively. Indeed, a direct encoder p(x̂|x) can be interpreted geometrically as a point in the |X|-fold product ∆[X̂]^X of simplices ∆[X̂], [Gedeon et al., 2012, Section 2]. So, if a particular symbol x̂' is not in use anymore, p(x̂') = 0, then one is forced to choose between replacing ∆[X̂] by a smaller space ∆[X̂ \ {x̂'}], or carrying on with a redundant symbol x̂'. The latter leads to non-trivial kernels in the IB due to duplicate clusters 7 (e.g., Section 3.1 there), making it difficult to tell whether a particular kernel direction pertains to a bifurcation. In contrast, when considered in decoder coordinates, for example, an IB root is nothing but p(x̂)-weighted paths r_1, ..., r_T in ∆[Y], with β ↦ r_x̂(β) a path for each r_x̂. And so, once a symbol x̂ is not needed anymore, one can discard the path r_x̂ without replacing the underlying space ∆[Y]. This permits the clean treatment of IB bifurcations in Section 5.
The computational cost of solving the first-order ODE (7) for dx/dβ numerically depends on dim x. Much of this cost is due to computing a linear pre-image under D_x F, which is of order O((dim x)³), [Cormen et al., 2001, Section 28.4]; cf., Section 6. Representing an IB root on T clusters requires 8 |X| · T dimensions in encoder coordinates, (|X| + 1) · T in inverse-encoder coordinates, and (|Y| + 1) · T in decoder coordinates. Thus, when there are fewer possible labels Y than input symbols X, the computational cost is lowest in decoder coordinates.

Figure 2: The problem of Figure 1; see Appendix E. The derivative's L2-norm is plotted in green for encoder coordinates and blue for decoder coordinates. The solution barely changes at high β values, and so the derivative in decoder coordinates is smaller (see main text). Nevertheless, the derivative in encoder coordinates does not vanish then, due to Equation (12). At low β values, however, the derivative in either coordinate system may generally be large. Both vanish to the left of the bifurcation in this problem (dashed red vertical), as the solution there is trivial (single-clustered). The derivatives diverge near the bifurcation (to its right) regardless of the coordinate system, as might be expected from the implicit ODE (7); see also Section 6.1.
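The dimension counts above are easy to tabulate. The helper below (a hypothetical name, for illustration only) compares the three choices for a problem with many more inputs than labels.

```python
# Coordinates counted as independent variables, ignoring normalization
# constraints. Since the linear solve behind the ODE costs O(dim^3),
# the smallest count wins.
def coord_dims(n_inputs, n_labels, T):
    return {"encoder": n_inputs * T,
            "inverse-encoder": (n_inputs + 1) * T,
            "decoder": (n_labels + 1) * T}

dims = coord_dims(n_inputs=100, n_labels=3, T=5)
# With far fewer labels than inputs, decoder coordinates are cheapest:
assert dims["decoder"] < dims["encoder"] < dims["inverse-encoder"]
```

For |X| = 100, |Y| = 3 and T = 5, the counts are 500, 505 and 20 respectively, so the decoder-coordinate linear solve is cheaper by orders of magnitude.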
A priori, one might expect derivatives with respect to β to vanish whenever the solution barely changes, regardless of the choice of coordinate system. For example, at a very large "β = ∞" value, an obvious IB root is the diagonal encoder 9, as can be seen by a direct examination of the IB Equations (2)-(4). It consists of one IB cluster of weight (or mass) p(x) at p_{Y|X=x} ∈ ∆[Y] for each x ∈ X, and so one might expect that it would barely change so long as β is very large. However, the logarithmic 10 derivative d log p_β(x̂|x)/dβ in encoder coordinates need not vanish even when the derivatives in decoder coordinates do, as seen to the right of Figure 2. Indeed, given the derivative in decoder coordinates, one can exchange it to encoder coordinates by Equation (12), where J^enc_dec and J^enc_mrg are the two coordinate-exchange Jacobian matrices, of orders (T · |X|) × (T · |Y|) and (T · |X|) × T respectively, given by Equations (94) and (96). Even when the derivatives in decoder coordinates vanish, the last two terms of Equation (12) need not. This unintuitive behavior of the derivative in encoder coordinates is due to the explicit dependence of the IB's encoder Equation (2) on β. This dependence is the source of the last two terms in Equation (12) (see Equation (99)). The comparison between encoder and inverse-encoder coordinates is similar. See Appendix B.4 for further details.

7 In addition to the IB's "perpetual kernel", [Gedeon et al., 2012, Parker and Dimitrov, 2022].
8 With the coordinates considered as independent variables, ignoring normalization constraints.
In light of the above, we proceed with decoder coordinates in the sequel.
3 Implicit derivatives at an IB root and the IB's ODE

We now specialize the implicit ODE (7) (of Section 1) to the IB, using the decoder coordinates of the previous Section 2. This allows us to compute first-order implicit derivatives at an IB root (Theorem 1) to remarkable accuracy, under one primary assumption: namely, that the root is a differentiable function of β. While differentiability breaks at IB bifurcations (Section 5), this assumption allows us to reconstruct a solution path from its local approximations in the following Section 4, so long as it holds.
To simplify calculations, we take the logarithms (log p(y|x̂), log p(x̂)) of the decoder coordinates of Section 2 as our variables. Exchanging the BA_β operator to log-decoder coordinates is immediate, by writing log BA_β[exp(log p(y|x̂)), exp(log p(x̂))]. For short, we denote it BA_β[log p(y|x̂), log p(x̂)] when in these coordinates, by abuse of notation. Similarly, exchanging the IB ODE (below) back to non-logarithmic coordinates is immediate, via d/dβ log p = (1/p) d/dβ p. In Section 6 we shall assume that p(x̂) never vanishes. To ensure that taking logarithms is well-defined, we also require 11 that no coordinate of p(y|x̂) vanishes. A sufficient condition for that is that p(y|x) > 0 for every x and y (Lemma 14 in Appendix A).
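The chain rule used here for exchanging between logarithmic and plain coordinates, df/d log p = p · df/dp, is easy to verify numerically. The snippet below is a standalone sanity check of this identity (our own illustration, with an arbitrary test function).

```python
import numpy as np

# With u := log p, df/du = p * df/dp. Check by central differences for
# f(p) = p**3 at p = 0.7 (so df/dp = 3 p**2 and df/du = 3 p**3).
f = lambda p: p ** 3
p, h = 0.7, 1e-6
df_dp = (f(p + h) - f(p - h)) / (2 * h)       # derivative w.r.t. p
u = np.log(p)
df_du = (f(np.exp(u + h)) - f(np.exp(u - h))) / (2 * h)  # derivative w.r.t. log p
assert np.isclose(df_du, p * df_dp, rtol=1e-6)
```

The same identity, applied coordinate-wise, is what gives meaning to the Jacobian with respect to the logarithmic variables below.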
Next, define a variable x ∈ R^{T·(|Y|+1)} as the concatenation of the vector (log p_β(y|x̂))_{y∈Y, x̂∈X̂} with (log p_β(x̂))_{x̂∈X̂}. Differentiation ∂/∂ log p with respect to log-probabilities is given by p · ∂/∂p, by the chain rule 12. This gives meaning to the Jacobian matrix D_x(·) with respect to our logarithmic variable x. The Jacobian D_{log p(y|x̂), log p(x̂)} BA_β of a single Blahut-Arimoto iteration in these log-decoder coordinates is a square matrix of order T·(|Y|+1). Its (T·|Y|)×(T·|Y|) upper-left block (in (13) below) corresponds to perturbations in BA's output log-decoder log p(y|x̂) due to varying an input log-decoder log p(y'|x̂'). Since we prime input but not output coordinates, this is to say that the columns of this block are indexed by pairs (y', x̂') and its rows by (y, x̂). Its (T·|Y|)×T upper-right block corresponds to perturbations in BA's output log-decoder log p(y|x̂) due to varying an input log-marginal log p(x̂'). That is, its columns are indexed by x̂' and its rows by (y, x̂). Similarly for the bottom-left and bottom-right blocks, of respective sizes T×(T·|Y|) and T×T. See (51) ff. in Appendix B.2, and the end result at Equation (70) there. Explicitly, when evaluated at an IB root (log p(y|x̂), log p(x̂)), BA's Jacobian matrix is given by Equation (13), where δ_{i,j} = 1 if i = j and 0 otherwise. As mentioned above, primed coordinates y' and x̂' index the columns, and un-primed coordinates y and x̂ the rows. Indices y'' and x̂'' with more than a single prime are summation variables. A, B and C are a scalar, a vector, and a matrix, each involving two IB clusters; they are defined by Equation (14). In these, y indexes B and the rows of C, y' the columns of C, and x is a summation variable. These (x̂, x̂')-labeled tensors have only |Y| entries along each axis, thanks to the choice of decoder coordinates. A and B can be expressed in terms of C via some obvious relations; see Equation (58) and below in Appendix B.2. Appendix B.1 elaborates on the mathematical subtleties involved in calculating the Jacobian (13). See also Equation (71) in Appendix B.2 for an implementation-friendly form of (13). Together with D_β BA_β (Equations (84) and (83) in Appendix B.3), we have both of the first-order derivative tensors of BA_β in log-decoder coordinates. This allows us to specialize the implicit ODE (7) (of Section 1) to the IB, in terms of our variable x. By abuse of notation, we write (log p_β(y|x̂), log p_β(x̂))_{y,x̂} for its |Y|·T + T coordinates, and similarly for its derivatives vector v (15) below.

9 That is, set X̂ := X and p(x̂|x) := 1 if x̂ = x and 0 otherwise.
10 See Section 3 below on the use of logarithmic derivatives.
11 While a decoder p(y|x̂) may have a well-defined derivative d/dβ p(y|x̂) even without this requirement, the calculation details below would differ.
12 Defining u := log p, the u-derivative of f(p) is given by df/du = (df/dp)(dp/du), or equivalently df/d log p = p · df/dp. See also Appendix B.1 for a gentler treatment.
Theorem 1 (The IB's ODE). Let (p(y|x̂), p(x̂)) be an IB root, and suppose that it can be written as a differentiable function β ↦ (p_β(y|x̂), p_β(x̂)) of β. If none of its coordinates vanish, then the vector of its implicit logarithmic derivatives,

    v := d/dβ (log p_β(y|x̂), log p_β(x̂)),   (15)

is well-defined and satisfies an ordinary differential equation in β,

    (I − D_{log p(y|x̂), log p(x̂)} BA_β) · v = D_β BA_β,   (16)

where I is the identity matrix of order T·(|Y|+1), and the Jacobian matrix D_{log p(y|x̂), log p(x̂)} BA_β at the given IB root is given by Equation (13). The right-hand side of (16) is indexed as in (15), by (y, x̂) at its top and x̂ at its bottom coordinates.
While the IB ODE was discovered by Agmon [2022b], it is derived here anew in log-decoder coordinates due to the considerations in Section 2. It is analogous to the RD ODE, due to Agmon [2022a]; Corollary 4 and its surroundings (in Section 5.1) provide a relation between these two ODEs. We emphasize that the first assumption of Theorem 1, that the IB root is a differentiable function of β, is essential. It comprises two parts: (i) that the root can be written as a function of β, and (ii) that this function is differentiable. These are precisely the assumptions needed to compute the first-order implicit multivariate derivative v (15) at the given root, [Agmon, 2022a, Section 2.1]. Continuous IB bifurcations violate (ii) (Subsection 5.2), while discontinuous ones violate (i) (Subsection 5.3). In contrast, the requirement that no coordinate vanishes is a technical one, due to our choice of logarithmic coordinates.
It is not necessary for the Jacobian of the IB operator (5) (on the left of (16)) to be non-singular in order to solve the IB ODE numerically. Nevertheless, non-singularity of the Jacobian will follow from the sequel (see Conjecture 5 in Section 5). With that, the derivatives v = d/dβ (log p_β(y|x̂), log p_β(x̂)) (15) computed numerically from the IB ODE (16) at an exact root are remarkably accurate, as demonstrated in Figure 3. As in RD, [Agmon, 2022a], calculating implicit derivatives numerically loses accuracy when approaching a bifurcation, because the Jacobian is increasingly ill-conditioned there. For comparison, the BA-IB Algorithm 1 also loses accuracy near a bifurcation. This is a consequence of BA's critical slowing down, [Agmon et al., 2021], just as with its corresponding RD variant.
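Numerically, evaluating the derivative vector v from the IB ODE (16) amounts to one linear solve per grid point. The sketch below (a hypothetical helper of ours, not the paper's implementation) also monitors the condition number of I − D BA_β, since accuracy degrades as this matrix becomes ill-conditioned near a bifurcation.

```python
import numpy as np

def implicit_derivative(jac_ba, dbeta_ba, cond_warn=1e12):
    """Solve (I - D BA_beta) v = D_beta BA_beta for the derivative vector v.

    jac_ba: square Jacobian D BA_beta (e.g., in log-decoder coordinates);
    dbeta_ba: the vector D_beta BA_beta. Returns (v, condition number).
    """
    A = np.eye(jac_ba.shape[0]) - jac_ba
    cond = np.linalg.cond(A)
    if cond > cond_warn:
        # Near a bifurcation A is nearly singular; fall back to least squares
        v = np.linalg.lstsq(A, dbeta_ba, rcond=None)[0]
    else:
        v = np.linalg.solve(A, dbeta_ba)
    return v, cond

# Sanity check on a synthetic, well-conditioned Jacobian:
v, cond = implicit_derivative(0.5 * np.eye(2), np.array([1.0, 1.0]))
```

Monitoring `cond` gives an inexpensive warning of an upcoming bifurcation, where the computed derivatives should be trusted less.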
Each coordinate of (p(y|x̂), p(x̂)) is treated by the IB ODE (16) as an independent variable. However, normalization of p(y|x̂) imposes one constraint per cluster x̂. Thus, one might expect the behavior of BA's Jacobian (13) to be determined by fewer than T·(|Y|+1) coordinates, at least qualitatively. This intuition is justified by the following Lemma 2, which allows us to consider the kernel of the IB operator (5) via a smaller and simpler matrix S; see Appendix C for its proof.
Lemma 2. Given an IB root as above, define a square matrix S of order T · |Y| by (17). Then, the nullity of the Jacobian I − D log p(y|x),log p(x) BA β of the IB operator (5) equals that of I − S, where I is the identity matrix (of the respective order) and S is defined by (17). Specifically, write v := (v y,x ) y,x for a left eigenvector which corresponds to 1 ∈ eig S. Then, there is a bijective correspondence (18) between the left kernels at both sides, mapping v to (v, u), where u := (u x ) x is defined by u x := (1 − β)/β · Σ y v y,x .

Figure 3: Top: Derivatives were computed at the problem's exact solution using the IB ODE (16) and compared to the problem's exact derivatives. These are accurate beyond the machine's precision, except when approaching the bifurcation (red vertical), since the Jacobian of the IB operator (5) is ill-conditioned there. Bottom: The L ∞ -errors of the solutions produced by the BA-IB Algorithm 1, with a 10 −8 stopping condition and uniform initial conditions. Error is measured from the true direct encoder to avoid biases due to clusters of low mass. Both plots are as in [Agmon, 2022a, Figure 2.3].
In addition to offering a form more transparent than BA's Jacobian in (13), Lemma 2 also reduces the computational cost of testing I − D log p(y|x),log p(x) BA β (16) for singularity, by using the smaller I − S (17) in its place. This makes it easier to detect upcoming bifurcations (see Conjecture 5 in Section 5). Further, one can verify directly that the IB ODE (16) indeed follows the right path. Indeed, if the ODE is non-singular, then, by the Implicit Function Theorem, there is (locally) a unique IB root, which is a differentiable function of β; hence, there is a unique solution path for a numerical approximation to follow. Finally, we note that a relation similar to (18) also holds for eigenvalues of D log p(y|x),log p(x) BA β (13) other than 1. This can be seen either empirically or by tracing the proof of Lemma 2.
In Section 5, we shall proceed with this line of thought, removing irrelevant coordinates. In the following Section 4, we turn to reconstructing a solution path from implicit derivatives at a point, ignoring bifurcations for now.

A modified Euler method for the IB
We follow the path of a given IB root away from bifurcations by using its implicit derivatives, computed from the IB ODE (16) of Section 3. We follow the classic Euler method for simplicity, modifying it slightly to get the most out of the calculated derivatives. Improvements using more sophisticated numerical methods are left to future work. The detection and handling of IB bifurcations are deferred to the next Section 5, and thus are ignored in this Section.
Let dx/dβ = f(x, β) and x(β 0 ) = x 0 define an initial value problem. In numerical approximations of ordinary differential equations (ODEs), the Euler method for this problem is defined by setting x n+1 := x n + ∆β · f(x n , β n ), (20) where β n+1 := β n + ∆β, and |∆β| is the step size. The global truncation error max n ‖x n − x(β n )‖ ∞ is the largest deviation of the approximations x n from the true solutions x(β n ). A numerical method for solving ODEs is said to be of order d if its global truncation error is of order O(|∆β| d ), for step sizes |∆β| small enough. Euler's method error analysis is a standard result, e.g., [Butcher, 2016, Theorem 212A] or [Atkinson et al., 2011, Theorem 2.4], brought as Theorem 3 below. It shows that Euler's method (20) is of order d = 1, under mild assumptions, as demonstrated in Figure 4. The immediate generalization of (20) using derivatives up to order d is Taylor's method, which is a method of order d.
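As a minimal self-contained illustration (ours, not tied to the IB), the following Python snippet implements the vanilla Euler step (20) and checks its first-order behavior on the test problem dx/dβ = −x, whose exact solution is exp(−β):

```python
import numpy as np

def euler(f, x0, b0, bf, n):
    """Vanilla Euler method x_{n+1} = x_n + db * f(x_n, b_n) on [b0, bf]."""
    xs, x, db = [x0], x0, (bf - b0) / n
    for k in range(n):
        x = x + db * f(x, b0 + k * db)
        xs.append(x)
    return np.array(xs)

# First-order convergence check on dx/db = -x, x(0) = 1 (exact: exp(-b)).
f = lambda x, b: -x
errs = []
for n in (100, 200, 400):
    xs = euler(f, 1.0, 0.0, 1.0, n)
    grid = np.linspace(0.0, 1.0, n + 1)
    errs.append(np.max(np.abs(xs - np.exp(-grid))))  # global truncation error
# Halving the step roughly halves the global error (order d = 1):
ratios = [errs[i] / errs[i + 1] for i in range(2)]
```

The observed error ratios are close to 2 under step halving, matching order d = 1, just as the slope 0.99 regression in Figure 4 does for the IB.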
Theorem 3 (Euler's method error analysis). Let an initial-value problem be defined on [β 0 , β f ] by dx/dβ = f(x, β) as above 15 , and suppose that f satisfies the Lipschitz condition ‖f(x, β) − f(x′, β)‖ ≤ L ‖x − x′‖ with some constant L > 0. Then, Euler method's (20) global truncation error satisfies (21); in particular, it is of order O(|∆β|) whenever the initial error ‖x 0 − x(β 0 )‖ is.

Specializing Euler's method to our needs, replace x in (20) above by the log-decoder coordinates of an IB root, as in Section 3. So long as an IB root p β := (p β (y|x), p β (x)) is a differentiable function of β in the vicinity of β n , it can be approximated to first order by log p β n+1 (y|x) ≈ log p β n (y|x) + ∆β · d log p β (y|x)/dβ, and similarly for log p β (x), (22)

15 The initial condition x 0 in (21) is allowed to deviate from the true solution x(β 0 ).

Figure 4: Error by step-size for a vanilla Euler method using the IB ODE (16), and with an added BA-IB iteration at each step, for BSC(0.3) with a uniform source (Appendix E). The linear regression (dashed black) of the third leftmost markers for the vanilla Euler method is of slope 0.99 (R² ≈ 1), matching the theory's prediction almost perfectly. A similar regression (not shown) for Euler's method with a single added BA iteration is of nearly double slope, 1.93. For comparison, reverse deterministic annealing with a single BA iteration at each grid point yields a slope of 0.91 in this example. Taking a larger (pre-determined) number of iterations at each grid point pushes the error downwards, as expected. Yet, the resulting slopes approach 1 as the number of iterations is increased (not shown). See main text and Appendix D for details. The error was calculated as the supremum of the pointwise errors as in Figure 3, over the interval [β c + 1/10, β 0 ], which contains no bifurcation. Each method was initialized with the exact solution at β 0 = 2 5 , with ∆β = − 103 32 halved between consecutive markers.
where d log p β (y|x)/dβ and d log p β (x)/dβ are calculated from the IB ODE (16). Thus, applying (22) repeatedly, we obtain an Euler method for the IB. We shall take only negative steps ∆β < 0 when approximating the IB, for reasons explained in Subsection 5.3 (after Proposition 12). In contrast to the BA-IB Algorithm 1, Euler's method (22) can be used to extrapolate intermediate points, yielding a piecewise-linear approximation of the root.
The problem of tracking an operator's root belongs, in general, to a family of hard-to-solve numerical problems, known as stiff, if the problem has a bifurcation, [Agmon, 2022a, Section 7.2]; see, e.g., Butcher [2016] or Atkinson et al. [2011] on stiff differential equations. Stopping early in the vicinity of a bifurcation restricts the computational difficulty and permits convergence guarantees. Early stopping in the IB shall be handled later, in Subsection 5.2. [Agmon, 2022a, Theorem 5] proves that Euler's method convergence guarantees (Theorem 3) hold for the closely related Euler method for RD with early stopping. While Euler's method may inadvertently switch between solution branches, the latter guarantees ensure that it indeed follows the true solution path between bifurcations, if the step size |∆β| is small enough and the initialization is close enough to the true solution. While we do not dive into these details for brevity, we note that similar convergence guarantees can also be proven here. Alternatively, Euler's method can be ensured to follow the true solution path by noting that an optimal IB root is (strongly) stable when negative steps ∆β < 0 are taken; these details are deferred to Subsection 6.3, as they depend on Section 5.
Following the discussion in Section 2, there is a subtle disadvantage in choosing decoder coordinates as our variables, compared to the other two coordinate systems there. Indeed, recall that the IB is defined as a maximization over Markov chains Y ←→ X ←→X. An (arbitrary) encoder p(x|x) defines a joint probability distribution p(x|x)p(y|x)p(x), which is Markov. An inverse-encoder pair similarly defines a Markov chain. In contrast, an arbitrary decoder pair p(y|x), p(x) need not define a Markov chain. Rather, by invoking the error analysis of Euler's method, one can see that Markovity is approximated to increasingly better quality as the step-size |∆β| in (22) becomes smaller. To enforce Markovity, we perform a single BA iteration (in decoder coordinates) after each Euler method step. This ensures that the newly generated decoder pair satisfies the Markov condition, as it is now generated from an encoder.
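A single BA-IB iteration in decoder coordinates can be sketched as follows; this minimal Python illustration (ours) of the IB Equations (2)-(4) runs on a toy joint distribution, computing the encoder from the current decoder pair and then regenerating the decoder pair from that encoder, so the output is Markov by construction:

```python
import numpy as np

def ba_ib_step(dec, w, pyx, px, beta):
    """One BA-IB iteration in decoder coordinates (IB Equations (2)-(4)).

    dec: T x |Y| decoders p(y|xhat);  w: length-T marginal p(xhat)
    pyx: |X| x |Y| matrix p(y|x);     px: source distribution p(x)
    Returns a decoder pair generated from an encoder, hence Markov.
    """
    # KL[p(y|x) || p(y|xhat)] for every pair (x, xhat):
    kl = np.array([[np.sum(pyx[x] * np.log(pyx[x] / dec[t]))
                    for t in range(len(w))] for x in range(len(px))])
    enc = w * np.exp(-beta * kl)             # unnormalized p(xhat|x), Eq. (2)
    enc /= enc.sum(axis=1, keepdims=True)    # divide by Z(x, beta)
    w_new = px @ enc                         # p(xhat) = sum_x p(xhat|x) p(x)
    # Bayes: p(x|xhat) = p(xhat|x) p(x) / p(xhat), then p(y|xhat), Eq. (3):
    dec_new = (enc * px[:, None]).T @ pyx / w_new[:, None]
    return dec_new, w_new

# Toy problem: |X| = |Y| = T = 2.
pyx = np.array([[0.7, 0.3], [0.3, 0.7]])
px = np.array([0.5, 0.5])
dec = np.array([[0.6, 0.4], [0.4, 0.6]])
w = np.array([0.5, 0.5])
dec, w = ba_ib_step(dec, w, pyx, px, beta=3.0)
```

Since the new decoders are generated from an encoder, the marginal of Y is preserved exactly: the weighted decoders reproduce p(y) = Σ x p(x) p(y|x).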
As a side effect, adding a single BA-IB iteration after each Euler method step improves the approximation's quality significantly. By linearizing BA β around a fixed point, one can show that deterministic annealing with a fixed number of BA iterations per grid point is a first-order method, just as Euler's method is. A similar argument shows that adding a single BA iteration after each Euler method step yields a second-order method. However, while a larger number of added BA iterations obviously improves the approximation's quality, it does not improve the method's order. See Appendix D for an approximate error analysis. The predicted orders are in good agreement with those found empirically, shown in Figure 4. We note that while [Agmon, 2022a] did not attempt an added BA iteration, they do discuss a variety of other improvements to root-tracking (see Section 3.4 there).
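The qualitative effect can be seen on a toy problem unrelated to the IB (a generic sketch of ours, not the analysis of Appendix D): a contraction map T β stands in for a BA-IB iteration, with the tracked root x*(β) = sin β as its fixed point, and a single contraction after each Euler step visibly shrinks the tracking error:

```python
import numpy as np

def track(b0, bf, n, add_iteration):
    """Euler-track the root x*(b) = sin(b) of a toy problem on [b0, bf].

    The ODE is dx/db = cos(b); T_b(x) = sin(b) + 0.5*(x - sin(b)) is a
    contraction with fixed point sin(b), standing in for a BA-IB iteration.
    """
    x, db = np.sin(b0), (bf - b0) / n
    for k in range(n):
        b_next = b0 + (k + 1) * db
        x = x + db * np.cos(b0 + k * db)                 # Euler step
        if add_iteration:                                # one "BA" contraction
            x = np.sin(b_next) + 0.5 * (x - np.sin(b_next))
    return abs(x - np.sin(bf))

err_vanilla = track(0.0, 1.0, 64, add_iteration=False)
err_with_it = track(0.0, 1.0, 64, add_iteration=True)
```

Each contraction pulls the iterate back toward the true solution branch, so the per-step errors no longer accumulate as they do for vanilla Euler; the IB-specific claim that this raises the order to two is the subject of Appendix D.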

On IB bifurcations
For a bifurcation to exist, it is necessary that the Jacobian of the IB operator (5) be singular, as illustrated by Figure 5. However, a priori, singularity is not sufficient to detect a bifurcation (cf., [Gedeon et al., 2012, Section 3.1]), nor does it allow one to distinguish between bifurcations of different types. In order to exploit the IB's ODE (16) (Section 3), we now take a closer look at its bifurcations. These can be broadly classified into two types: those where an optimal root is continuous in β and those where it is not. As noted after Theorem 1, each type violates an assumption necessary to compute implicit derivatives. Sections 5.2 and 5.3 provide the means to identify bifurcations, distinguish between their types, and handle them accordingly (for continuous bifurcations). To facilitate the discussion, Section 5.1 considers the IB as a rate-distortion problem, following Harremoës and Tishby [2007] and others. This allows us to leverage recent insights on RD bifurcations, [Agmon, 2022a], while suggesting a "minimally sufficient" choice of coordinates for the IB. The latter permits a clean treatment of continuous IB bifurcations in Section 5.2. Viewing the IB as an infinite-dimensional RD problem facilitates the understanding of its discontinuous bifurcations, which in turn highlight subtleties in its finite-dimensional coordinate systems (of Section 2). These provide insight into the IB and also have practical implications (Section 5.3), and so are necessary for our algorithms in Section 6.

The IB as a rate-distortion problem
We now explore the intimate relation between the IB and RD, following Harremoës and Tishby [2007] and Gilad-Bachrach et al. [2003]. This leads to a "minimally sufficient" coordinate system for the IB, thereby completing the work of Section 2. In this coordinate system, results of [Agmon, 2022a] on the dynamics of RD roots are readily considered in an IB context. This leads to Conjecture 5, that the IB operator (5) in these coordinates is typically non-singular. The discussion here facilitates the treatment of IB bifurcations in the following Sections 5.2 and 5.3.

Figure 5: While the Jacobian D log p(y|x),log p(x) (Id − BA β ) must be singular at a bifurcation, this does not suffice to identify its type. The Jacobian eigenvalues of BA β (13) with respect to log-decoder coordinates are plotted for BSC(0.3) with a uniform source, as in Figure 1; see Appendix E for its exact solution. An eigenvalue reaches one (dashed green) precisely at the bifurcation (dashed red vertical), as expected by Conjecture 5 in Section 5.1. In particular, the Jacobian is increasingly ill-conditioned when approaching the bifurcation, as noted in Figure 3 (top). While this allows one to detect the bifurcation, identifying its type is necessary for handling it.
First, recall a few definitions. A rate-distortion problem on a source alphabet X and a reproduction alphabet X is defined by a distortion measure 9 d : X × X → R ≥0 and a source distribution p X (x). One seeks the minimal rate I(X;X) subject to a constraint D on the expected distortion E[d(x,x)], [Shannon, 1948, 1959]; the graph of this minimum is the rate-distortion curve (23). The minimization is over test channels p(x|x). A test channel that attains the RD curve (23) is called an achieving distribution. We say that an RD problem is finite if both of the alphabets X and X are finite. Using Lagrange multipliers for (23) with 10 I(X;X) + β E[d(x,x)], one obtains a pair of fixed-point equations (24) in the marginal p(x) and test channel p(x|x), similar to the IB Equations (2) and (4). Iterating over these is Blahut's algorithm for RD, Blahut [1972], denoted BA RD β here. As with the IB (1), β parametrizes the slope of the optimal curve (23) also for RD. See Cover and Thomas [2006] or Berger [1971] for an exposition of rate-distortion theory.
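As a concrete illustration, a minimal (unoptimized) Python sketch of Blahut's fixed-point iteration might read as follows; the toy binary-Hamming problem at the end is our choice, not one from the text:

```python
import numpy as np

def blahut_rd(px, d, beta, iters=500):
    """Blahut's fixed-point iteration for rate-distortion [Blahut, 1972].

    Alternates the two updates:
      q(xhat|x) ∝ p(xhat) exp(-beta * d(x, xhat))   (test channel)
      p(xhat)   = sum_x p(x) q(xhat|x)              (reproduction marginal)
    d is an |X| x T distortion matrix (e.g., d_IB of a tangent RD problem).
    """
    T = d.shape[1]
    w = np.full(T, 1.0 / T)                  # uniform initial marginal
    boltz = np.exp(-beta * d)                # Boltzmann factors, fixed in beta
    for _ in range(iters):
        q = w * boltz
        q /= q.sum(axis=1, keepdims=True)    # normalized test channel
        w = px @ q                           # updated reproduction marginal
    return q, w

# Toy problem: uniform binary source with Hamming distortion.
px = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
q, w = blahut_rd(px, d, beta=2.0)
```

By the symmetry of this toy problem, the reproduction marginal stays uniform, while the test channel concentrates on the low-distortion symbol as β grows.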
We clarify a definition needed to rewrite the IB as an RD problem. We define the simplex ∆[S] on a (possibly infinite) set S as the collection of finite formal convex combinations Σ s a s · s of elements of S. That is, as the S-indexed vectors 11 (a s ) s∈S that satisfy Σ s a s = 1 and a s ≥ 0, with a s non-zero for only finitely many elements s (the support of (a s ) s ). Addition and multiplication are defined pointwise, as in Σ s a s · s + Σ s b s · s = Σ s (a s + b s ) · s. ∆[S] is closed under finite convex combinations because the sum of finitely supported vectors is finitely supported. When taking S = {e 1 , . . . , e n } the standard basis vectors (e i ) j = δ i,j of R n , one can identify the formal operations with those in R n , reducing the simplex ∆[S] to its usual definition. We write r for an element of ∆[Y]. In particular, an element of ∆[∆[Y]] is merely a finite convex combination of points in ∆[Y], which is a special case 13 of the decoder coordinates of Section 2. Now, let a finite IB problem be defined by a joint probability distribution p Y|X p X , as in Section 1. To write it down as an RD problem, [Harremoës and Tishby, 2007, Gilad-Bachrach et al., 2003], define the IB distortion measure by d IB (x, r) := D KL [p(y|x)||r]; (25) d IB and p X define an RD problem on the continuous reproduction alphabet ∆[Y]. Minimizing the IB Lagrangian L (at Section 1) is equivalent to minimizing the Lagrangian of this RD problem, [Harremoës and Tishby, 2007, Theorem 5]. That is, the IB is a rate-distortion problem when considered in these coordinates 14 . IB clusters r ∈ ∆[Y] assume the role of RD reproduction symbols, while an IB root (considered now as an RD root) is equivalently described either by the probabilities of each cluster, namely by a point in ∆[∆[Y]], or by a test channel p(r|x). Unlike the finite-dimensional coordinate systems of Section 2, this definition of the IB entails no subtleties due to finite-dimensionality, such as duplicate clusters (see more below).
However, while this allows one to spell out the IB explicitly as an RD problem, handling an infinite reproduction alphabet is difficult for practical purposes. Since no more than |X| + 1 reproduction symbols are needed to write down an IB root (see therein), this motivates one to consider the IB's local behavior, with clusters fixed. So instead, one may require the reproduction symbols of d IB (25) to be in a list 15 (r x ) x indexed by some finite set X , with each r x in ∆[Y]. This defines a finite RD problem, for which d IB (25) is merely an |X|-by-T matrix. Yet, placing identical clusters in the list (r x ) x inadvertently introduces degeneracy to the matrix d IB (25), as discussed below. [Gilad-Bachrach et al., 2003, Section 6] take (r x ) x to be the decoders defined by a given encoder p(x|x), as in Equation (11) (Section 2). We shall then refer to d IB (25) as the distortion matrix defined by p(x|x). When p β0 (x|x) is an optimal IB root, the problem (d IB , p X ) defined by it is called the tangent RD problem. Indeed, its RD curve (23) coincides 16 with the IB curve (1) at this point. However, the curves differ outside this point, since IB clusters usually vary with β, while the distortion of the tangent problem is defined at p β0 (x|x) and so is fixed. By definition (1), it follows that the IB curve is a lower envelope of the curves of its tangent RD problems, [Gilad-Bachrach et al., 2003, Corollary 2]. We note that a similar construction can also be carried out in inverse-encoder coordinates; cf., [Witsenhausen and Wyner, 1975].
Regardless of the formulation used to rewrite the IB as an RD problem, the associated RD problem has an expected distortion E[d IB ] of I(X; Y) − I(X̂; Y) at an IB root, [Harremoës and Tishby, 2007, Section 5], [Gilad-Bachrach et al., 2003, Lemma 8]. That is, the IB is a lossy compression method that strives to preserve the relevant information I(X; Y). Due to the Markov condition, information on Y is available only through X. Thus, one may intuitively consider the IB as a lossy compression method of the information on Y that is embedded in X. These intimate relations between the IB and RD suggest that studying bifurcations in either context could be leveraged to understand the other. Bifurcations in finite RD problems are discussed at length in [Agmon, 2022a, Section 6]. To facilitate the study of IB bifurcations in the sequel (Sections 5.2 and 5.3) using results from RD, we need a "minimally-sufficient" coordinate system for the IB.

11 Equivalently, as functions mapping each element s of S to a real number a s .
12 Note that ∆[S] is a set.
13 Unlike ∆[X ] here, the decoder coordinates of Section 2 are not required to have their clusters r distinct.
14 The astute reader might notice that the IB Equations (2) and (4) are then equivalent to RD's fixed-point Equations (24), with (3) implied by the IB's Markovity. The IB's Y-information I(Y;X) equals the expected distortion E[d IB (x,x)] at (23) up to a constant, [Harremoës and Tishby, 2007, Section 5], and so is linear in the test channel p(r|x).
15 We use here a tuple (r x 1 , . . . , r x T ) rather than a set, since the points r x need not be distinct a priori.
16 Note that an optimal choice of IB clusters is already encoded into d IB (25) here. With d IB and p X given, solving the IB boils down to finding the optimal cluster weights p(x), which is an RD problem.
Consider an IB root in decoder coordinates as finitely-many p(x)-weighted points r x (y) in ∆[Y], as in Section 2. Exchanging to decoder coordinates (Equation (11) there) is well-defined so long as there are no zero-mass clusters, ∀x p(x) ≠ 0. Yet, even then, the points r x in ∆[Y] yielded by Equations 1.4 through 1.6 need not be distinct. Namely, they may yield identical clusters r x = r x′ at distinct indices x ≠ x′. This leads to a discussion of structural symmetries of the IB (its degeneracies), which is not of use for our purposes; cf., Gedeon et al. [2012]. To avoid such subtleties, we shall say that an IB root is reduced if it has no zero-mass clusters, ∀x p(x) ≠ 0, and all its clusters are distinct, x ≠ x′ ⇔ r x ≠ r x′ . A root that is not reduced is degenerate, or represented degenerately. An IB root can be reduced by removing clusters of zero mass and merging identical clusters of distinct indices; see Algorithm 2 in Section 5.2 below. It is straightforward to see from the IB Equations (2)-(4) that reduction preserves the property of being an IB root. Similarly, reducing a root does not change its location in the information plane. So, a root achieves the IB curve (1) if and only if its reduction does. Therefore, reduction decreases the dimension in which the problem is considered while preserving all its essential properties. This allows one to represent an IB root on the smallest number of clusters possible, its effective cardinality, by factoring out the IB's structural symmetries; cf., [Zaslavsky, 2019, 2.3 in Chapter 7], upon which this definition is based.
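In code, reduction in the spirit of Algorithm 2 (whose exact form is given in Section 5.2; this Python sketch and its tolerances are ours) might read:

```python
import numpy as np

def reduce_root(dec, w, mass_atol=1e-10, merge_atol=1e-8):
    """Reduce an IB root given in decoder coordinates: drop zero-mass
    clusters, then merge (numerically) identical ones, accumulating
    their masses.  Tolerances are heuristic choices."""
    keep = w > mass_atol                       # drop zero-mass clusters
    dec, w = dec[keep], w[keep]
    out_dec, out_w = [], []
    for r, m in zip(dec, w):
        for i, s in enumerate(out_dec):
            if np.allclose(r, s, atol=merge_atol):
                out_w[i] += m                  # identical cluster: merge mass
                break
        else:
            out_dec.append(r)                  # a genuinely new cluster
            out_w.append(m)
    return np.array(out_dec), np.array(out_w)

# A degenerate root: two copies of the same cluster plus a zero-mass one.
dec = np.array([[0.7, 0.3], [0.7, 0.3], [0.2, 0.8]])
w = np.array([0.5, 0.5, 0.0])
dec_r, w_r = reduce_root(dec, w)
```

The reduced root carries the same point in the information plane on its effective cardinality, here a single cluster of full mass.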
While the purpose of reduction is to mod out redundant kernel coordinates (see Section 1), it highlights the differences between the various IB definitions found in the literature 17 , bringing to light a subtle caveat of finite dimensionality. To see this, note that reduction could have been defined above in terms of the other coordinate systems of the IB. Its definition in inverse-encoder coordinates is nearly identical to that above, while defining it in encoder coordinates is a straightforward exercise. Since the coordinate systems of Section 2 are equivalent at an IB root (without zero-mass clusters), the precise definition does not matter then. Each of these parameterizations encodes the coordinates r(y) of a root's clusters r using a finite-dimensional vector x (note Equation (11)). This enables one to represent duplicate clusters x ≠ x′ with r x = r x′ , and obliges one to choose the order in which clusters are encoded into the coordinates of x. A finite-dimensional representation x of an IB root is invariant to interchanging clusters x ≠ x′ precisely when they are identical, r x = r x′ . The IB's functionals (e.g., its X- and Y-information) are invariant to any cluster permutation; cf., [Gedeon et al., 2012, Parker and Dimitrov, 2022]. Both of these structural symmetries result from using a finite-dimensional parameterization, with the former eliminated by reduction. In contrast, the elements of ∆[Y] are distinct by definition (since ∆[Y] is a set), and so parametrizing the IB by points in ∆[∆[Y]] does not permit identical clusters. An element Σ r p(r) · r of ∆[∆[Y]] assigns a probability mass p(r) to every point r in ∆[Y], with only finitely-many points r supported. Thus, it implicitly encodes all the entries r(y) of every probability distribution r ∈ ∆[Y] in a "one size fits all" approach, leaving no room for the choices above.
This leads us to argue that the IB's structural symmetries are not an inherent property but rather an artifact of using its finite-dimensional representations.
In rate-distortion, the reduction of a finite RD problem is defined similarly, [Agmon, 2022a, Section 3.1], by removing a symbol x from the reproduction alphabet X and its column d(·, x) from the distortion matrix once it is no longer in use (of zero mass). A distortion matrix d is non-degenerate if its columns are distinct, d(·, x) ≠ d(·, x′) for all x ≠ x′. Non-degeneracy arises naturally when considering the RD problem tangent to a given IB root p(x|x). Indeed, the distortion matrix d IB (25) defined by p(x|x) has duplicate columns if the root has identical clusters, while the other direction holds under mild assumptions 18 . Under these assumptions, the distortion matrix induced by an IB root p(x|x) is reduced and non-degenerate precisely when p(x|x) is a reduced IB root.
Reduction in RD provides the means to show that the dynamics underlying the RD curve (23) are piecewise analytic in β, [Agmon, 2022a], under mild assumptions. Just as in definition (5) of the IB operator, [Agmon et al., 2021, Eq. (5)] similarly define the RD operator Id − BA RD β in terms of Blahut's algorithm for RD, [Blahut, 1972]. By using their Theorem 1, [Agmon, 2022a, Section 3.1] observed that reducing a finite RD problem to the support 19 of a given RD root mods out redundant kernel coordinates if the distortion measure is finite and non-degenerate. That is, the Jacobian D(Id − BA RD β ) of the RD operator on the reduced problem is then non-singular 20 , just as with our toy problem (10) in Section 1. By the Implicit Function Theorem, there is therefore a unique RD root of the reduced problem through the given one; this root is real-analytic in β (details there). Considering this for the RD problem tangent to a reduced IB root immediately yields the following.

Corollary 4. Let p β0 (x|x) be a reduced IB root of a finite IB problem defined by p Y|X p X , such that the matrix p Y|X is of rank |Y|. Then, near β 0 , there is a unique function continuous in β which is a root of the tangent RD problem through p β0 (x|x); it is real-analytic in β.
Corollary 4 shows that the local approximation of an IB problem (the roots of its tangent RD problem) is guaranteed to be as well-behaved as one could hope for, provided that the IB is viewed in the right coordinate system. Note, however, that the RD root through p β0 (x|x) of the tangent problem does not in general coincide with the IB root outside of β 0 , since the IB distortion d IB (25) varies along with the clusters that define it. However, when the IB clusters are fixed, one might expect the Jacobian (13) of BA β in log-decoder coordinates to coincide with the Jacobian of its RD variant. Indeed, the Jacobian matrix of BA RD β is the T × T bottom-right sub-block of the Jacobian (13) of BA β , up to a multiplicative factor; cf., Equations (5)-(6) in [Agmon et al., 2021], Equations (14) and (13) in Section 3, and (51) in Appendix B.2.
As in RD, we argue that reduction in the IB also provides the means to show that the dynamics underlying the optimal curve (1) are piecewise analytic in β. Corollary 4 concludes that, under mild assumptions, through every reduced IB root passes a unique real-analytic RD root. However, its crux is that the Jacobian of the RD operator Id − BA RD β is non-singular at a reduced root. Due to the IB's close relations with RD, and since reduction in the IB is a natural extension of reduction in RD, we argue that the same is also to be expected of the IB operator Id − BA β (5) in decoder coordinates. To see this, note that IB roots are finitely supported, [Witsenhausen and Wyner, 1975, Lemma 2.2(i)], and so one may take finitely-supported probability distributions in ∆[∆[Y]] for the IB's optimization variable. Thus, the IB's BA β operator in decoder coordinates (of Section 2) may be considered as an operator on ∆[∆[Y]]. Next, consider the RD problem defined by p X and d IB (25) on the continuous reproduction alphabet ∆[Y], as in [Harremoës and Tishby, 2007]. This defines on ∆[∆[Y]] also the BA operator BA RD β for RD. Now that both BA operators are considered on an equal footing, we note the following. First, while BA RD β iterates over 21 the IB Equations (2) and (4), its IB variant BA β iterates also over the decoder Equation (3). The latter equation is a necessary condition 22 for Y → X → X to be Markov, and so can be understood as an enforcement of Markovity. That is, IB roots are RD roots with an extra constraint. Second, by [Agmon et al., 2021, Theorem 1], reducing Id − BA RD β from the continuous reproduction alphabet ∆[Y] to a root of finite support renders it non-singular, under mild assumptions. Due to the similarity between these operators, and since reduction in the IB is a natural extension of reduction in RD, this suggests that reducing Id − BA β (5) from ∆[∆[Y]] to a root's effective cardinality should also render it non-singular.
19 The support of a distribution p(x) is defined by supp p(x) := {x : p(x) > 0}.
20 When considered in the right coordinate system; see therein for details.
21 To see this, plug the IB distortion measure d IB (25) into the Equations (24) defining BA RD β .
22 For comparison, only p(y|x) = Σ x p(y|x,x)p(x|x) holds for an arbitrary triplet (Y, X, X) of random variables.

In line with the discussion of Section 1 on reduction, we therefore state the following.
Conjecture 5. The Jacobian matrix I − D log p(y|x),log p(x) BA β at (16) of the IB operator (5) in log-decoder coordinates is non-singular at reduced IB roots so long as it is well-defined, except perhaps at points of bifurcation.
The intuition behind this conjecture stems from analyticity, as follows. The IB operator Id − BA β (5) is real-analytic, since each of the Equations 1.4-1.8 defining it (in the BA-IB Algorithm 1) is real-analytic in its variables. For a root x 0 of a real-analytic operator F, one might expect that, in general, (i) no roots other than x 0 exist in its vicinity, and (ii) D x F | x0 has no kernel; that is, unless the operator is degenerate at x 0 in some manner, or x 0 is a bifurcation. To see this, recall [Dieudonné, 1969, IX.3] that a real-valued function F i in x ∈ R n is real-analytic in some open neighborhood of x 0 if it is a power series in x = (x 1 , . . . , x n ), within some 23 radius of convergence. For every practical purpose, one may replace F i by a polynomial in (x 1 , . . . , x n ) when x is close enough to the base-point x 0 , by truncating the power series. Viewed this way, a root of an operator F(x) = (F 1 (x), . . . , F n (x)) is nothing but a solution of n polynomial equations in n variables. However, a square polynomial system typically has only isolated roots, which is (i). This is best understood in terms of Bézout's Theorem; see [Gowers et al., 2008, 6 in IV.4] for example. For (ii), a vector v is in ker D x F precisely when it is orthogonal to each of the gradients ∇F i . However, ∇F i is the vector of the first-order monomial coefficients of x 1 , . . . , x n in F i . In general position, these n coefficient vectors ∇F 1 , . . . , ∇F n are linearly independent, and so v must vanish, as claimed. If F is degenerate such that F i = F j for particular i ≠ j, for example, then both (i) and (ii) fail, of course; cf., also [Coolidge, 1959, I.2] for (i) and (ii). This intuition accords with the comments of [Berger, 1971, Section 2.4] on RD: "usually, each point on the rate distortion curve [...] is achieved by a unique conditional probability assignment.
However, if the distortion matrix exhibits a certain form of symmetry and degeneracy, there can be many choices of [a minimizer]". Indeed, the fact that the dynamics underlying the RD curve (23) are piecewise real-analytic (under mild assumptions), [Agmon, 2022a], can be similarly understood to stem from the analyticity of the RD operator Id − BA RD β . Subject to Conjecture 5, a Jacobian eigenvalue of the IB operator (5) must vanish gradually 24 as one approaches a bifurcation, causing the critical slowing down of the BA-IB Algorithm 1, [Agmon et al., 2021]. When an IB root traverses a bifurcation in which its effective cardinality decreases, it is no longer reduced. One can then handle the bifurcation by reducing the root anew. To ensure proper handling by the bifurcation's type, we consider the latter closely in Sections 5.2 and 5.3 below. In a nutshell, following the IB's ODE (16) along with proper handling of its bifurcations is the idea behind our Algorithm 5 (Section 6), for approximating the IB numerically.
Conjecture 5 is compatible with our numerical experience. However, we leave its proof to future work. To that end, one could, for example, examine closely the smaller matrix S (17) (of Lemma 2 in Section 3). Even if Conjecture 5 were violated, one could detect that easily by inspecting the Jacobian's eigenvalues. Conjecture 5 also implies that IB roots are locally unique outside of bifurcations when presented in a reduced form. Non-uniqueness of optimal roots is detectable by inspecting the Jacobian's eigenvalues; see Corollary 10 in Section 5.3 and the discussion following it, and cf., [Agmon, 2022a, Section 6.3] for the respective discussion in RD. With that, most of the results in Sections 5.2 and 5.3 below do not depend on the validity of Conjecture 5.
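For instance, one could probe such spectra numerically by approximating the Jacobian of a BA-like map with finite differences and inspecting the eigenvalues of the corresponding operator; in this minimal sketch (ours), a toy contraction stands in for BA β, and an eigenvalue of I − J approaching zero would flag an upcoming bifurcation:

```python
import numpy as np

def fd_jacobian(F, x, eps=1e-6):
    """Central finite-difference Jacobian of a map F: R^n -> R^n at x."""
    n = len(x)
    J = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (F(x + e) - F(x - e)) / (2 * eps)
    return J

# Toy stand-in for a BA-like map with fixed point at the origin and
# contraction factors 0.5 and 0.9 along its two coordinates:
F = lambda x: np.array([0.5 * x[0], 0.9 * x[1]])
J = fd_jacobian(F, np.zeros(2))
# Eigenvalues of the operator I - J; a value near 0 would signal trouble.
eigs = np.sort(np.linalg.eigvals(np.eye(2) - J).real)
```

Here the spectrum of I − J is {0.1, 0.5}; a BA eigenvalue creeping toward 1 (so an I − J eigenvalue toward 0) mirrors what Figure 5 shows near the bifurcation.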

Continuous IB bifurcations: cluster-vanishing and cluster-merging
Following Agmon [2022b], we consider the evolution of IB roots which are a continuous function of β. By representing an IB root in its reduced form (Section 5.1), it is evident that there are two types of continuous IB bifurcations. We provide a practical heuristic (Algorithm 2) for identifying and handling such bifurcations. The discussion here is complemented by Subsection 5.3 below, which considers the case where continuity does not hold.
The evolution of an IB root in β obeys the ODE (16) so long as it can be written as a differentiable function of β, as in Theorem 1. Considering the root in decoder coordinates, this amounts to an evolution of a T-tuple of points r x in ∆[Y] and their weights p(x). These typically traverse the simplex smoothly as the constraint β is varied, as demonstrated in Figure 6. We now consider two cases where this evolution does not obey the ODE (16), due to a violation of differentiability.

Figure 6: A cluster-merging bifurcation. The reduced form of the optimal IB root in decoder coordinates as a function of β, for the exact solution of BSC(0.3) with a uniform source, as in Figure 1 (see Appendix E). At high enough β, the root is comprised of two clusters (in green and blue), each of marginal probability 1/2. The clusters collide at β c = 6 1/4 (dashed red vertical) and merge into one, yielding the trivial solution: a single cluster of probability 1 at p Y . Carefully note that only a single IB root is plotted here, in its reduced form, with one cluster to the left of β c and two to its right. The violation of the clusters' differentiability at β c can be observed visually (top), and the root is otherwise real-analytic in β, as can be deduced from Figure 5. Since the trivial solution is an IB root for every β > 0 (not shown), β c is indeed a bifurcation, where the trivial and non-trivial roots intersect. To see this, consider the degenerate form of the trivial solution on two copies of p Y , each of probability 1/2. The marginals p(x) (bottom) appear to be discontinuous at β c because the root was reduced before being plotted (the latter degenerate form of the trivial root is not plotted to the left of β c ).
Consider an optimal IB root in its reduced form (see Section 5.1). Namely, consider the reduced form of a root that achieves the IB curve (1). Suppose that its decoders r_x̄ and weights p(x̄) are continuous in β. Then, a qualitative change in the root can occur only if either (i) two (or more) of its clusters collide or (ii) the marginal probability p(x̄) of a cluster x̄ vanishes. In either case, the minimal number of points in ∆[Y] required to represent the root decreases. That is, its effective cardinality decreases 25. We call the first a cluster-merging bifurcation and the second a cluster-vanishing bifurcation, or continuous bifurcations collectively. Both types were already observed in [Rose, 1994, Section IV.C], in the related setting of RD problems with a continuous source alphabet.
At a continuous bifurcation, IB roots of distinct effective cardinalities collide and merge into one, as discussed in Section 5.3 below. Specifically, one root achieves the minimal value of the IB Lagrangian and so is stable, while the other root is sub-optimal. Thus, continuous IB bifurcations are pitchfork bifurcations 26, e.g., [Strogatz, 2018, Section 3.4], in accordance with Parker and Dimitrov [2022]. Even though the optimal root is continuous in β (by assumption), its differentiability is lost at the point of bifurcation, as noted after Theorem 1 and demonstrated in Figure 6. Of the two types, cluster-vanishing bifurcations are more frequent in practice than cluster-merging ones. This can be understood by considering cluster trajectories in the simplex: in general position, one would expect clusters to seldom be at the same "time" and place (β and r ∈ ∆[Y]).
With that, we note that cluster-vanishing bifurcations cannot be detected directly by standard local techniques (i.e., considering the derivative's kernel directions at the bifurcation), whether one considers the Hessian of the IB's loss function as in [Gedeon et al., 2012] or the Jacobian of the IB operator (5) as here. The technical reason for this is as follows, while the root cause underlying it is discussed in Subsection 5.3 (after Proposition 12). Observe that the I(Y;X̄) and I(X;X̄) functionals do not depend on the coordinates (r(y))_y of clusters r of zero mass. Thus, the directions corresponding to these coordinates are always in the kernel, regardless of whether one evaluates at a bifurcation or not, and so cannot be used to detect a bifurcation 27. Indeed, with its dynamics in β reversed, "a new symbol grows continuously from zero mass" in a cluster-vanishing bifurcation, as [Rose, 1994, IV.C] comments in a related setting. It is then not clear a priori which point in ∆[Y] should be chosen for the new symbol, rendering the perturbative condition in Equation (9) difficult to test. In accordance with this, [Gedeon et al., 2012, Section 5] offers a perturbative condition for detecting arbitrary IB bifurcations, while [Zaslavsky, 2019, 3.2 in Part III] offers a condition for detecting cluster-merging bifurcations by analyzing cluster stability. However, the two conditions are equivalent (Appendix F), and so must detect the same type of bifurcations. In contrast, a cluster-splitting (or merging) bifurcation is straightforward to detect, because the stability of a particular cluster x̄ is a property of the root itself; see Appendix F and the references therein for details.
One may wonder whether bifurcations exist in the IB for the same reason as they do in RD. As in the IB, RD problems typically have many sub-optimal curves, [Agmon, 2022a, Section 6.1]. While bifurcations in the IB stem from restricting the effective cardinality 28, [Tishby et al., 1999, Section 3.4], in RD they stem from the various restrictions that a reproduction alphabet has. E.g., a reproduction alphabet X̄ := {r_1, r_2, r_3} of an RD problem may be restricted to the distinct subsets {r_1, r_2} and {r_2, r_3}, usually yielding distinct sub-optimal RD curves; cf. [Agmon, 2022a, Figure 6.1]. In contrast to RD, the IB's distortion d_IB (25), defined by a root's clusters, is determined a posteriori by the problem's solution rather than a priori by the problem's definition. As a result, both reasons for the existence of bifurcations coincide. To see this, consider the IB as an RD problem whose reproduction symbols X̄ are a finite subset of ∆[Y] which is allowed to vary (e.g., as if defining the tangent RD problem anew at each β). Distinct restrictions of a reproduction alphabet X̄ can be forced to agree by altering the symbols themselves, so long as they are of the same size. For example, restricting the set {r_1, r_2, r_3} of reproduction symbols to {r_1, r_2} is the same as restricting it to {r_2, r_3} instead, and then replacing r_3 with r_1 ∈ ∆[Y] in the restricted problem 29.
26 Strictly speaking, several copies of the root of larger effective cardinality collide at a continuous bifurcation. When two clusters r ≠ r' collide in (i), the root itself is invariant to interchanging their coordinates after the collision but not before it, breaking the IB's first structural symmetry discussed in Subsection 5.1. Interchanging the coordinates of r and r' (and their marginals) before the collision yields two distinct copies of essentially the same root. For (ii), the IB's functionals (e.g., its X- and Y-information) do not depend on the coordinates (r(y))_y of a vanished cluster r, rendering these redundant; cf. [Gedeon et al., 2012, Section 3.1]. Before the cluster r vanishes, there is one copy of the root for each index x̄, with r placed at its x̄ coordinates. Considered in reduced coordinates, these coincide to a single copy after the cluster vanishes. This breaks the second structural symmetry.
27 The direction corresponding to a cluster's marginal is useless when one does not know which cluster (r(y))_y to pick.
28 At least for continuous IB bifurcations.
29 This is not to be confused with cluster permutations, which change the order among clusters but do not alter the symbols themselves.

[Listing of the root-reduction Algorithm 2 not reproduced here; its final steps delete the coordinates of a vanished cluster x̄ from p(x̄) and p(y|x̄), and return p(y|x̄), p(x̄).]

The dynamical point of view above, considering an IB root as weighted points traversing ∆[Y], offers a straightforward way to identify and handle continuous IB bifurcations. It is spelled out as our root-reduction Algorithm 2. For cluster-vanishing bifurcations, one can set a small threshold value δ1 > 0 and consider the cluster x̄ as vanished if p(x̄) < δ1 (step 2.3), as in [Agmon, 2022a, Section 3.1]. Similarly, for cluster-merging bifurcations, one can set a small threshold δ2 > 0 and consider the clusters x̄ ≠ x̄' to have merged if ∥r_x̄ − r_x̄'∥_∞ < δ2 (step 2.9). A vanished cluster is then erased (and merged clusters are replaced by one), resulting in an approximate IB root on fewer clusters. This not only identifies continuous IB bifurcations but also handles them, since the output of the root-reduction Algorithm 2 is a numerically-reduced root, represented in its effective cardinality. To re-gain accuracy, we shall later invoke the BA-IB Algorithm 1 on the reduced root, as part of Algorithm 5 (in Section 6). We note that one should pick the thresholds δ1 and δ2 small enough to avoid false detections, yet not so small as to cause mis-detections. Mis-detections are handled later, using the heuristic Algorithm 4 (Subsection 6.1).
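The thresholding just described can be sketched in a few lines (a sketch in decoder coordinates; the array layout, names, and the choice to keep the first of two merged clusters rather than, say, a mass-weighted average, are our own assumptions):

```python
import numpy as np

def reduce_root(dec, p_t, delta1=1e-3, delta2=1e-3):
    """Numerically reduce an approximate IB root given in decoder coordinates:
    dec[:, t] = p(y|x̄=t), p_t[t] = p(x̄=t); thresholds δ1, δ2 as in the text.
    Assumes at least one cluster survives the mass threshold."""
    keep = p_t >= delta1                      # drop nearly-vanished clusters
    dec, p_t = dec[:, keep], p_t[keep]
    out_dec, out_p = [], []
    for t in range(p_t.size):                 # merge δ2-close decoders (sup-norm)
        for j, r in enumerate(out_dec):
            if np.abs(dec[:, t] - r).max() < delta2:
                out_p[j] += p_t[t]            # accumulate the merged mass
                break
        else:
            out_dec.append(dec[:, t])
            out_p.append(p_t[t])
    return np.stack(out_dec, axis=1), np.array(out_p)
```

Note that a vanished cluster's (tiny) mass is simply discarded here, so the returned marginal need not sum exactly to 1; re-running BA-IB on the reduced root, as in Algorithm 5, restores accuracy afterwards.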
Using the root-reduction Algorithm 2 allows one to stop early in the vicinity of a bifurcation when following the path of an IB root. As mentioned in Section 4, early stopping limits the computational difficulty of root tracking, [Agmon, 2022a]. Further, reducing the root before invoking BA-IB (Algorithm 1) allows one to avoid BA's critical slowing down, [Agmon et al., 2021]. This is because reduction removes the nearly-vanished Jacobian eigenvalues that pertain to the nearly-vanished (or nearly-merged) cluster(s), which are the cause of BA's critical slowing down; cf. Proposition 12 (Section 5.3) and the discussion around it. See also [Agmon, 2022a, Figure 3.1(C) and Section 3.2] for the respective behavior in RD. Finally, we comment that the root-reduction Algorithm 2 can also be implemented in the other two coordinate systems of Section 2.

Discontinuous IB bifurcations and linear curve segments
In the previous Subsection 5.2, we considered continuous IB bifurcations; namely, those at which the clusters r_x̄ ∈ ∆[Y] and weights p(x̄) of an IB root are continuous functions of β. By exploiting the intimate relations between the IB and RD (Section 5.1), we now consider IB bifurcations where these cannot be written as continuous functions of β. Although in our experience discontinuous bifurcations are infrequent in practice, the theory they evoke has several subtle yet important consequences, with practical implications for computing IB roots (in Section 6). We start with several examples before diving into the theory. The examples of discontinuous IB bifurcations of which we are aware can be understood in an RD context as follows. Consider the IB as an RD problem on the continuous reproduction alphabet ∆[Y], with IB roots parametrized by points in ∆[∆[Y]] (see Section 5.1). In RD, the existence of linear curve segments is well known, [Berger, 1971]. See, for example, Figure 2.7.6 there and its reproduction in [Agmon, 2022a, Figure 6.2]. [Agmon, 2022a, Section 6.5] offers an explanation of these in terms of a support-switching bifurcation: a bifurcation where two RD roots of distinct supports exchange optimality at a particular multiplier value β_c. Both roots evolve smoothly in β, exchanging optimality only at the bifurcation. At β_c itself, every convex combination of these two roots is also an RD root. This is manifested by a linear segment of slope −β_c in the RD curve (see panels E and F in [Agmon, 2022a, Figure 6.2]). In particular, the optimal RD root cannot be written as a continuous function of β. The sudden emergence of an entire segment of RD roots is best understood in light of Bézout's Theorem; cf. point (i) in the discussion following Conjecture 5.
For one example of linear curve segments in the IB, say that a matrix M decomposes if it can be written (non-trivially) as a block-diagonal matrix by permuting its rows and columns. In light of the above, we have the following refinement of [Witsenhausen and Wyner, 1975, Theorem 2.6].

Theorem 6. The IB curve (1) has a linear segment at β = 1 if and only if the problem's definition p_{Y|X} p_X decomposes.

Figure 7 demonstrates a simple decomposable problem, exhibiting a support-switching bifurcation at β_c = 1 between the trivial and non-trivial roots there. For other examples, a symmetric binary erasure channel can also be seen to exhibit a support-switching bifurcation, [Witsenhausen and Wyner, 1975, IV.B], which is manifested by a linear segment of slope 1/β_c, for β_c > 1. The same holds for Hamming channels with a uniform input, [Witsenhausen and Wyner, 1975, IV.E] and [Witsenhausen, 1974], whose problem definition p_{Y|X} p_X is of full support. We argue that in the IB, support-switching bifurcations exhibit the same behavior as in RD: two roots that evolve smoothly in β and exchange optimality at the bifurcation. While the sequel can justify this in general, there is a simple way to see this in practice: follow the two roots of Figure 7 through the bifurcation 30 by using BA-IB with deterministic annealing, [Rose et al., 1990a]. As deterministic annealing usually follows a solution branch continuously, this immediately reveals either root in the region where it is sub-optimal (not displayed).
A support-switching bifurcation evidently has similar characteristics 31 to a transcritical bifurcation; e.g., [Strogatz, 2018, Section 3.2]. This extends the results of [Parker and Dimitrov, 2022], who conclude that 32 bifurcations in the IB "are only of pitchfork type". To see the reason for this discrepancy, note that they employ the mathematical machinery of bifurcations under symmetry in [Golubitsky et al., 1988]. As pitchfork bifurcations are "common in physical problems that have a symmetry", [Strogatz, 2018, Section 3.4], detecting only pitchforks by using the above machinery might not come as a surprise. Both [Gedeon et al., 2012] and its sequel [Parker and Dimitrov, 2022] consider the IB's symmetry to interchanging the coordinates of identical clusters 33. However, this is a structural symmetry of the IB which stems from representing IB roots by finite-dimensional vectors (Subsection 5.1), and is broken at continuous IB bifurcations (Subsection 5.2). On the other hand, discontinuous IB bifurcations need not break this symmetry, as can be seen by inspecting the roots of Figure 7 closely 34.
A few convexity results from rate-distortion theory are needed to consider discontinuous bifurcations in general. These have subtle practical implications, which are of interest in their own right.
Theorem 7 (Theorem 2.4.2 in Berger [1971]). The set of conditional probability distributions p(x|x) which achieve a point (D, R(D)) on the rate-distortion curve (23) is convex.
Viewing the IB as an RD problem as in [Harremoës and Tishby, 2007] immediately yields an identical result for the IB.

Corollary 8. The set of IB encoders that achieve a point (I_X, I_Y) on the IB curve (1) is convex.
The proof is provided below for completeness. We note that a version of Corollary 8 in inverse-encoder coordinates can also be synthesized from the ideas leading to Theorem 2.3 in Witsenhausen and Wyner [1975].
Proof of Corollary 8. Consider a finite IB problem p Y |X p X as an RD problem (d IB , p X ) on the continuous reproduction alphabet ∆[Y], as defined by (25) in Section 5.1. As noted above, its encoders (or test channels) are conditional probability distributions p(r|x), with r ∈ ∆[Y], supported on finitely many coordinates (r, x).
Let p_1(r|x) and p_2(r|x) be encoders achieving a point (I_X, I_Y) on the IB curve (1). By [Harremoës and Tishby, 2007, Theorem 5], these may be considered as test channels achieving the curve (23) of the RD problem (d_IB, p_X). The reproduction symbols r ∈ ∆[Y] supporting 35 a convex combination p_λ := λ·p_1 + (1−λ)·p_2, 0 ≤ λ ≤ 1, are contained in the supports of p_1 and p_2, supp p_λ ⊆ supp p_1 ∪ supp p_2, and so p_λ is finitely supported. Although Berger's Theorem 7 assumes that the reproduction alphabet is finite, one can readily see that its proof works just as well when the distributions involved are finitely supported. Thus, by Theorem 7, p_λ achieves the above point on the RD curve (23). Since this point (I_X, I_Y) is on the IB curve (1), p_λ is an optimal IB root.

30 That is, following the trivial root of Figure 7 from left to right and the non-trivial one from right to left, through the bifurcation at β_c = 1 there.
31 Strictly speaking, the two roots do not intersect as in a classical transcritical bifurcation, and so a support-switching bifurcation should perhaps be classified as an imperfect transcritical.
32 Theorem 5 in [Parker and Dimitrov, 2022] says that the bifurcations detected by their Theorem 3 are degenerate rather than transcritical. It is then concluded "that the bifurcation guaranteed by Theorem 3 is [generically] pitchfork-like".
33 e.g., [Parker and Dimitrov, 2022, Definition 1(1)].
34 The trivial solution to the left of β_c (Figure 7, left panel) may be given a degenerate bi-clustered representation, which is fully supported on p_Y but has a second cluster r of zero mass. One may choose r ≠ p_Y, in which case the root possesses no symmetry to interchanging cluster coordinates, on either side of the bifurcation there.
35 Defined here by supp p(r|x) := supp p(r), where p(r) is defined from p(r|x) by marginalization, as in (4).
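For concreteness, the information-plane point (I_X, I_Y) of a given encoder can be computed directly from the definitions (a sketch; the array layout and names are our own):

```python
import numpy as np

def mutual_info(p_joint):
    """I(A;B) in nats from a joint distribution p_joint[a, b] (zeros allowed)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

def info_plane_point(enc, p_yx, p_x):
    """(I_X, I_Y) := (I(X;X̄), I(Y;X̄)) for an encoder enc[t, x] = p(x̄|x),
    under the Markov chain Y - X - X̄."""
    joint_xt = enc * p_x                 # p(x̄, x)
    joint_yt = (p_yx * p_x) @ enc.T      # p(y, x̄) = Σ_x p(y|x) p(x) p(x̄|x)
    return mutual_info(joint_xt), mutual_info(joint_yt)
```

For a BSC(0.3) with a uniform source, the identity encoder yields I_X = log 2 and I_Y = I(X;Y), while the trivial (constant) encoder yields (0, 0), as expected from the data-processing inequality.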
The RD curve (23) is the envelope of lines of slope −β and intercept min_{p(x̄|x)} { I(X;X̄) + β E[d(x, x̄)] } along the R-axis, e.g., [Berger, 1971]. Thus, Theorem 7 can be generalized by considering the achieving distributions that pertain to a particular slope value rather than to a particular curve point (D, R(D)); see [Agmon, 2022a, Section 6.3].
Theorem 9. For any β > 0, the set of test channels that achieve the RD curve (23) at slope −β is convex.

As with Corollary 8, we immediately have an identical result for roots achieving the IB curve (1).

Corollary 10. For any β > 0, the set of optimal IB encoders that correspond to β is convex.
See also [Witsenhausen and Wyner, 1975, IV] for an argument in inverse-encoder coordinates. In particular, note the duality technique leading to (b) and (c) in Theorem 4.1 there. This duality boils down to describing a compact convex set in the plane by its lines of support, as in the observation leading to Theorem 9. Commensurate with the IB being a special case of RD, Corollary 10 can also be proven directly from the IB's definitions in direct-encoder terms, Benger [2019]. Note that the requirement that the IB root indeed achieves the curve is necessary. Otherwise, one could take convex combinations with the trivial IB root 36 p(r|x) = δ_{r, p_Y}. This yields an absurdity, since the trivial root contains no information on either X or Y.
As in [Agmon, 2022a, Section 6.3], the convexity of optimal IB roots (Corollary 10) has several important consequences. For one, unlike the (local) bifurcations we have considered so far, bifurcation theory also has global bifurcations. These are "bifurcations that cannot be detected by looking at small neighborhoods of fixed points", [Kuznetsov, 2004, Section 2.3]. From convexity, it immediately follows that:

Corollary 11. There are no global bifurcations in finite IB problems.
Indeed, if at a given β value there exists more than one optimal root, then the Jacobian of the IB operator Id − BA β (5) must have a kernel vector pointing along the line connecting these optimal roots, by Corollary 10.
With that comes an important practical caveat. Corollaries 8 and 10 hold for the IB when parametrized by points in ∆[∆[Y]]. However, the above kernel vector (which exists due to convexity) may not be detectable if an IB root is improperly represented by a finite-dimensional vector. For example, consider the bifurcation in Figure 7, where a line segment at β_c connects the trivial (single-clustered) root to the 2-clustered root. Obviously, the bifurcation there cannot be detected by the Jacobian of the IB operator (5) when it is computed on T = 1 clusters (a Jacobian of order 1 · (|Y| + 1)). Indeed, the root of effective cardinality two cannot be represented on a single cluster, and so the line segment connecting it to the trivial root does not exist in a 1-clustered representation. This is demonstrated in Figure 8, which compares Jacobian eigenvalues at reduced representations to those at 2-clustered representations. The same reasoning gives the following necessary condition.

Proposition 12 (A necessary condition for detectability of IB bifurcations). A bifurcation at β_c in a finite IB problem which involves roots of effective cardinalities T_1 and T_2 is detectable by a non-zero vector in ker(Id − D_{log p(y|x̄), log p(x̄)} BA_{β_c}) only if the latter is evaluated at a representation on at least max{T_1, T_2} clusters.

36 One can verify directly that this satisfies the IB Equations (2)-(4) for every β > 0.
Indeed, suppose that T_1 < T_2 (the conclusion is trivial if T_1 = T_2). By definition, a root of effective cardinality T_2 does not exist in representations with fewer than T_2 clusters. Thus, there is no bifurcation in a T-clustered representation if T < T_2, and so there is then nothing to detect. As a special case of this argument, note that Conjecture 5 (Section 5.1) implies that the Jacobian is non-singular in a T_1-clustered representation of the T_1-clustered root (namely, at its reduced representation). With that, we have observed numerically that the eigenvalues of D_{log p(y|x̄), log p(x̄)} BA_β do not depend on the representation's dimension if computed on strictly 37 more clusters than the effective cardinality. Rather, only the eigenvalues' multiplicities vary by dimension. We omit practical caveats on exchanging between the coordinate systems of Section 2 for brevity.

Figure 8: Jacobian eigenvalues around the support-switching bifurcation of Figure 7. The eigenvalues are evaluated at solutions obtained by the BA-IB Algorithm 1 (stopping condition 10^{−9}), initialized anew at random for each β. While the random initializations account for much of the eigenvalues' spread, they reveal the solution's behavior through its various approximations. Other factors which contribute to this spread are the degeneracy of the solutions (when β < 1, to the right), BA's loss of accuracy near the bifurcation (Figure 3, bottom), and the decoders' proximity to the simplex boundaries (see Equation (13)). Left: when computed at reduced representations (on T = 1 clusters to the left, T = 2 to the right), the eigenvalues at the trivial solution give no indication of the upcoming bifurcation (at β < 1), unlike the eigenvalues at the 2-clustered root (β > 1). Right: the bifurcation's presence is clearly noticed also at the trivial solution (β < 1) when evaluated at its degenerate 2-clustered representations. Indeed, the trivial solution is then represented on the same number of clusters (T = 2) as the root to the right (β > 1); see Proposition 12. However, due to the bifurcation, the eigenvalues' trajectories are not smooth at β_c = 1. Both: a similar dependency on the representation's dimension also exists in the other bifurcation examples in this paper (though without a spread of eigenvalues).
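The role of degenerate representations in detectability can be checked numerically. The sketch below finite-differences one BA-IB iteration (following the IB Equations (2)-(4); all names and the finite-difference scheme are ours) at the degenerate 2-clustered representation of BSC(0.3)'s trivial root, with both decoders placed at p_Y. Its Jacobian exhibits an eigenvalue fixed at 1 (a degenerate mass-shuffling direction, which therefore always lies in the kernel of Id − J, in the spirit of footnote 27), alongside an eigenvalue growing linearly in β that crosses 1 exactly at β_c = 6 1/4:

```python
import numpy as np

def ba_step(z, p_yx, p_x, beta, T):
    """One BA-IB iteration in raw (decoder, marginal) coordinates,
    z = concat(dec.ravel(), p_t), dec[y, t] = p(y|x̄=t), p_t[t] = p(x̄=t)."""
    ny = p_yx.shape[0]
    dec, p_t = z[:ny * T].reshape(ny, T), z[ny * T:]
    # D_KL[p(y|x) || p(y|x̄)]; assumes full support
    kl = np.einsum('yx,yxt->xt', p_yx,
                   np.log(p_yx[:, :, None] / dec[:, None, :]))
    logits = np.log(p_t)[None, :] - beta * kl
    enc = np.exp(logits - logits.max(axis=1, keepdims=True))
    enc = (enc / enc.sum(axis=1, keepdims=True)).T
    p_t_new = enc @ p_x
    dec_new = (p_yx * p_x) @ enc.T / p_t_new
    return np.concatenate([dec_new.ravel(), p_t_new])

# Degenerate 2-clustered representation of BSC(0.3)'s trivial root:
p_x = np.array([0.5, 0.5])
p_yx = np.array([[0.7, 0.3], [0.3, 0.7]])
z0 = np.array([0.5] * 6)           # both decoders at p_Y, p(x̄) = (1/2, 1/2)
beta, eps = 5.0, 1e-6
J = np.empty((6, 6))
for i in range(6):                 # finite-difference Jacobian of the BA map
    e = np.zeros(6); e[i] = eps
    J[:, i] = (ba_step(z0 + e, p_yx, p_x, beta, 2)
               - ba_step(z0 - e, p_yx, p_x, beta, 2)) / (2 * eps)
evs = np.linalg.eigvals(J)
```

At β = 5 the spectrum contains an eigenvalue 1 (the degenerate direction) and an eigenvalue 0.16·β = 0.8 along the decoder-splitting direction; the latter reaches 1 at β = 6 1/4, where Id − J becomes singular along a physically meaningful direction.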
To complete the discussion on continuous bifurcations (Section 5.2), we argue that cluster-merging and cluster-vanishing are indeed bifurcations, where IB roots of distinct effective cardinalities collide and merge into one. We offer two ways to see this. First, using the inverse-encoder 38 formulation of the IB in [Witsenhausen and Wyner, 1975, Section II.A], one can consider an optimization problem in which the number of IB clusters is constrained explicitly. By the arguments therein, the constrained problem has an optimal root (due to compactness), which achieves the optimal curve of the constrained problem. The latter curve must be sub-optimal if fewer clusters are allowed than are needed to achieve the IB curve (1). Thus, whenever the effective cardinality of an optimal root (in the un-constrained problem) decreases, it must collide with an optimal root of the constrained IB problem, by Corollary 10. This accords with [Tishby et al., 1999, Section 3.4], which describes IB bifurcations as a separation of optimal and sub-optimal IB curves according to their effective cardinalities. Second, consider the reduced form of an IB root at the point of a continuous bifurcation. Since its effective cardinality strictly decreases there, say from T_2 to T_1, the root can be represented on T_1 clusters at the bifurcation itself. However, the Jacobian of the IB operator (5) in log-decoder coordinates is non-singular when represented on T_1 clusters, as noted after Proposition 12. Thus, by the Implicit Function Theorem, there is a unique IB root on T_1 clusters through this point. It exists on both sides of the bifurcation (above and below the critical point). When represented on T_2 clusters, however, the latter intersects at the bifurcation with the root of effective cardinality T_2, and so the two roots collide and merge there into one.
This argument is identical to [Agmon, 2022a, Section 6.2], which proves that distinct RD roots collide and merge at cluster-vanishing bifurcations in RD.
The arguments above imply that cluster-vanishing bifurcations cannot be detected directly by considering kernel directions of the IB operator (5) at the bifurcation, as argued in Subsection 5.2. Indeed, consider a continuous bifurcation, where roots p_1 and p_2 of respective effective cardinalities T_1 < T_2 intersect. These are paths in ∆[∆[Y]] that coincide at the bifurcation itself, p_1(β_c) = p_2(β_c), and so in particular are of the same effective cardinality T_1 there. Asking whether a bifurcation is detectable amounts to considering the evaluation of ker D(Id − BA_β) at a finite-dimensional representation (or "projection") of p. The Jacobian D(Id − BA_β) of the IB operator (5) is non-singular at β_c when evaluated on T_1 clusters in log-decoder coordinates, as noted after Proposition 12. We argue that evaluating it on representations with more clusters T > T_1 does not allow one to detect the bifurcation, not even if T ≥ T_2. See Appendix H for a formal argument. Intuitively, this follows because picking a degenerate representation amounts to duplicating clusters of the reduced representation or adding clusters of zero mass; cf. reduction in Subsection 5.1. Introducing degeneracies to a reduced root adds no information about the problem at hand.
Due to the above, cluster-vanishing bifurcations cannot be detected by following a root p_1 of effective cardinality T_1 through a "cluster growing" bifurcation, but only by following a root p_2 with T_2 > T_1 till its collision with p_1. As discussed after Conjecture 5 (Subsection 5.1), the Jacobian of Id − BA_β in reduced log-decoder coordinates can then be used to indicate an upcoming collision of p_2 with p_1, in addition to the root-reduction Algorithm 2. The exact same arguments apply to cluster-merging bifurcations. However, as noted in Subsection 5.2 (and Appendix F), the stability of a particular IB cluster x̄ is a property of the root itself. Thus, these are detectable by standard local techniques at the bifurcation point. Unlike continuous bifurcations, discontinuous bifurcations are inherently detectable due to the line segment in ∆[∆[Y]] connecting the roots at the bifurcation (Corollary 10), so long as the IB root is represented on sufficiently many clusters (Proposition 12); see Figure 8. These results make sense, considering that cluster-vanishing bifurcations seem to appear more frequently in practice than other types. Intuitively, branching from a sub-optimal root p_1 to an optimal one p_2 is harder than the other way around, just as learning new relevant information is harder than discarding it. Cases where both directions are equally difficult are the exception, as one might expect. This is consistent with the later discussion in Subsection 6.3 on the stability of optimal IB roots (Appendix G).
When following the path of a reduced IB root (as in Section 4), one would like to ensure that its bifurcations are indeed detectable by BA's Jacobian. Due to the caveats above, it is computationally preferable 39 to follow its path as the effective cardinality decreases rather than increases. As a result, we take only negative step sizes ∆β < 0, since the effective cardinality of an optimal IB root cannot decrease with β. To see this, first note that the IB curve I_Y(I_X) (1) is concave, and so its slope 1/β cannot increase with I_X. That is, β cannot decrease with I_X. Second, note that allowing more clusters cannot decrease the X-information I(X;X̄) achieved by the IB's optimization variables. Indeed, a T-clustered variable (p(x̄|x), p(x̄)) (not necessarily a root) can always be considered as (T + 1)-clustered, by adding a cluster of zero mass; cf. the construction in [Witsenhausen and Wyner, 1975, II.A]. Thus, the effective cardinality of an optimal root cannot decrease as the constraint I_X on the X-information is relaxed. Combining both points, the effective cardinality cannot decrease with β, as argued. In contrast to the IB, we note that the behavior of RD problems is more complicated, e.g., [Berger, 1971, Example 2.7.3 and Problems 2.8-2.10], since the distortion of each reproduction symbol is fixed a priori.
Finally, we proceed with the argument of Section 5.2 for the case of discontinuous IB bifurcations. That is, consider the reduced form of an optimal IB root, and suppose that either its decoders or its weights (or both) cannot be written as a continuous function of β in the vicinity of β_c. Write r_x̄^+ and r_x̄^− for its distinct decoders as β → β_c^+ and β → β_c^−, respectively, and similarly p^+(x̄) and p^−(x̄) for its non-zero weights. Consider the tangent RD problem on the reproduction alphabet defined by these decoders, as in Section 5.1; cf. [Agmon et al., 2021, Section V], upon which this argument is based. By construction, the IB coincides with its tangent RD problem at the two points (r_x̄^+, p^+(x̄)) and (r_x̄^−, p^−(x̄)). Since both points achieve the optimal curve at the same slope value 1/β_c, the linear segment of distributions connecting these points is also optimal, by Theorem 9. Alternatively, one could apply Corollary 10 directly to the IB problem. Either way, there exists a line segment of optimal IB roots, which pertain to the given slope value. In summary:

Theorem 13. Let a finite IB problem have a discontinuous bifurcation at β_c > 0. Then, its IB curve (1) has a linear segment of slope 1/β_c.

Unless the decoder sets {r_x̄^+}_x̄ and {r_x̄^−}_x̄ are identical, this is a support-switching bifurcation, as in Figure 7; cf. [Agmon, 2022a, Section 6.5]. A priori, the IB roots (r_x̄^+, p^+(x̄)) and (r_x̄^−, p^−(x̄)) may achieve the same point in the information plane, in which case the linear curve segment is of length zero. However, we are unaware of such examples. Yet, even if such bifurcations exist, they would be detectable by the Jacobian of BA-IB (when represented on enough clusters), subject to Conjecture 5.

First-order root-tracking for the Information Bottleneck
Gathering the results of Sections 2 through 5, we can now not only follow the evolution of an IB root along the first-order equation (16), but also identify and handle IB bifurcations. This is summarized by our First-order Root-Tracking Algorithm 5 for the IB (IBRT1) in Section 6.1, with some numerical results in Section 6.2. Section 6.3 discusses the basic properties of IBRT1, and mainly the surprising quality of the approximations of the IB curve (1) that it produces, as seen in Figure 1. We focus on continuous bifurcations (Section 5.2), as in our experience these are far more frequent than discontinuous ones and are straightforward to handle; cf. Section 6.3.

The IBRT1 Algorithm 5
To assist the reader, we first present a simplified version in Algorithm 3, with edge cases handled in Algorithm 4; clarifications follow. These two combined form our IBRT1 Algorithm 5, specified below.
[Listing of Algorithm 3 not reproduced here; its final steps ensure the accuracy of the reduced root by running the BA-IB Algorithm 1 till convergence, and return the results.]
We now elaborate on the main steps of the Simplified First-order Root-Tracking for the IB (Algorithm 3), which follows Root-Tracking for RD, Algorithm 3 in [Agmon, 2022a]. Its purpose is to follow the path of a given IB root p β0 (x|x) in a finite IB problem. The initial condition p β0 (x|x) is required to be reduced and IB-optimal. Its optimality is needed below to ensure that the path traced by the algorithm is indeed optimal. The step-size ∆β is negative, for reasons explained in Section 5.3 (Proposition 12 ff.). The cluster-mass and cluster-merging thresholds are as in the root-reduction Algorithm 2 (Section 5.2).
Denote by p̃ (line 3 of Algorithm 3) the distributions generated from an encoder (cf. Equation (11) in Section 2). Algorithm 3 iterates over grid points p̃, with each while-iteration generating the reduced form of the next grid point, as follows. On line 6, evaluate the IB ODE (16) at the current root p̃, solving the linear equations numerically. By Conjecture 5 (Section 5.1), the IB ODE has a unique numerical solution v if p̃ is a reduced root and not a bifurcation. Lines 7 and 8 approximate the root at the next grid point at β + ∆β, by exponentiating the Euler-method step (22) (Section 4). Normalization is enforced on line 9, since it is assumed throughout. Off-grid points can be generated by repeating lines 7 through 9 for intermediate ∆β values, if desired. The approximate root at β + ∆β is reduced on line 11, by invoking the root-reduction Algorithm 2 (Section 5.2). Note that Algorithm 2 returns its input root unmodified unless it reduces it numerically. If reduced, the root is a vector of a lower dimension: either a cluster mass p(x̄) has nearly vanished or distinct clusters have nearly merged. To re-gain accuracy, we invoke (on line 14) the Blahut-Arimoto Algorithm 1 for the IB till convergence, on the encoder defined at line 13 by the reduced root. Although BA-IB is invoked near a bifurcation, this does not incur a hefty computational cost due to critical slowing down, [Agmon et al., 2021]; see the comments at the bottom of Section 5.2. Invoking BA (on line 14) before reducing (on line 11) would have inflicted a hefty computational cost on BA-IB, due to the nearby bifurcation. Finally, a single BA-IB iteration in decoder coordinates is invoked on the approximate root (line 17), whether it was reduced earlier or not. This enforces Markovity while improving the algorithm's order (see Section 4, and Figure 4 in particular). Algorithm 3 continues this way (line 4) until the approximate solution is trivial (single-clustered), or β is non-positive.
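The exponentiated Euler step followed by normalization (lines 7 through 9) can be sketched as follows; the derivative v is taken as given, as if obtained by solving the IB ODE (16), and the names are our own:

```python
import numpy as np

def euler_step_log_simplex(log_dec, v, dbeta):
    """Take an Euler step in log-decoder coordinates, then exponentiate and
    re-normalize each decoder column back onto the simplex ∆[Y].
    log_dec[y, t] = log p(y|x̄=t); v is the implicit derivative d(log_dec)/dβ."""
    new_log = log_dec + dbeta * v                  # Euler step in log coordinates
    dec = np.exp(new_log - new_log.max(axis=0))    # exponentiate (numerically stable)
    return dec / dec.sum(axis=0, keepdims=True)    # enforce normalization
```

With v = 0 the step is the identity, and for any v each returned column is a probability distribution, as assumed throughout.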
In the IB, the trivial solution is always optimal for tradeoff values β < 1. However, here β plays the role of the ODE's independent variable instead. Thus, we allow Algorithm 3 to continue beyond β = 1, so long as 40 β > 0 (which we assume throughout). This will be useful for overshooting - see below.
With that, there are caveats in Algorithm 3, which stem from passing either too far from or too close to a bifurcation. For one, suppose that the error accumulated from the true solution is too large for a bifurcation to be detected. The approximations generated by the algorithm will then overshoot the bifurcation - namely, proceed with more clusters than needed, until the conditions for reduction are met later on (see Section 6.3 below), as demonstrated by the two sparse grids in Figure 9 (Section 6.2). For another, suppose that the current grid point p̂ is too close to a bifurcation. This might happen for a variety of numerical reasons - e.g., thresholds δ_1, δ_2 that are too small, or the particular grid layout. The coefficients matrix 41 I − D_{log p(y|x̂), log p(x̂)} BA_β of the IB ODE (16) would then be ill-conditioned (cf. Conjecture 5 ff. in Section 5.1), typically resulting in very large implicit numerical derivatives v (on line 6). Any inaccuracy 42 in v might then send the next grid point astray, derailing the algorithm from there on. Indeed, the derivatives dx/dβ = −(D_x F)^{−1} D_β F defined by 43 the implicit ODE (7) are in general unbounded near a bifurcation of F. This can be seen in Figure 2 (Section 2), for example, where the derivatives "explode" in the bifurcation's vicinity. See also [Agmon, 2022a, Section 7.2] on the computational difficulty incurred by a bifurcation. While overshooting a bifurcation is not a significant concern for our purposes (see Section 6.3), passing too close to one is. The latter is especially important when the step-size |∆β| is small. While decreasing |∆β| generally improves the error of Euler's method, it also makes it easier for the approximations to come close to a bifurcation, thus potentially worsening the approximation dramatically if the algorithm derails. This motivates one to consider how singularities of the IB ODE (16) should be handled.
Next, we elaborate on our heuristic for handling singularities of the IB ODE (16), brought as Algorithm 4.
40 The condition β > |∆β| is required on line 4, to ensure that the target β value of the next grid point is non-negative.
41 That is, the Jacobian D(Id − BA_β) of the IB operator (5) in log-decoder coordinates.
42 e.g., due to the accumulated approximation error, or due to the error incurred when computing implicit derivatives in the vicinity of a bifurcation (see Figure 3 top, in Section 3).
43 Note that D_x F here is always non-singular outside bifurcations, due to Conjecture 5 and the use of reduced coordinates.

Algorithm 4 A heuristic for handling singularities of the IB ODE (16)
1: function Handle_singularity((p_{Y|X} p_X), (p(y|x̂), p(x̂)), v, β)
Input: An IB problem definition p_{Y|X} p_X, with p_X(x) > 0 for every x. An approximate root (p(y|x̂), p(x̂)) of the given problem, near a singularity of the IB ODE (16). Approximate numerical derivatives v at the given root. The β > 0 value of the next (output) grid point.
Output: An approximate IB root p̂ at β, on one fewer cluster.
2: identify the two fastest-moving clusters, from the derivatives v
3-6: merge them into a single cluster, yielding a new decoder-marginal pair on one cluster less than the input
7: define an encoder from the merged pair
8: regain accuracy, by the BA-IB Algorithm 1
9: return p̂
10: end function

Algorithm 4's inputs are defined as in Algorithm 3. It starts from the assumption that the coefficients matrix I − D_{log p(y|x̂), log p(x̂)} BA_β of the IB ODE (16) is nearly singular at the current grid point p̂, due to 44 a nearby bifurcation. As a result, the implicit derivatives v at p̂ should not be used directly to extrapolate the next grid point, as explained above. Instead, we use them to identify the two 45 fastest-moving clusters, on line 2 of Algorithm 4. These are replaced by a single cluster (lines 3 through 6), resulting in an approximate root on one fewer cluster. To regain accuracy, the BA-IB Algorithm 1 is then invoked (at line 8) on the encoder generated (at line 7) from the latter root, thereby producing the next grid point. If the fast-moving clusters have merged (in the true solution) by the following grid point, then the output of Algorithm 4 is an IB-optimal root whenever its input grid point is; namely, the branch followed by the algorithm remains an optimal one. Otherwise, if these clusters merge only shortly after the next grid point, then Algorithm 4 yields a sub-optimal branch. However, optimality is regained shortly afterward, since in continuous IB bifurcations the sub-optimal branch collides and merges with the optimal one (Section 5.3). Figure 9 below demonstrates Algorithm 4; cf. the similar heuristic in [Agmon, 2022a, Section 3.2] for root-tracking in RD, which may also lose optimality near a bifurcation and regain it shortly after.
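The core of the heuristic (lines 2 through 6) can be sketched as follows. The speed measure and the mass-weighted merge below are our assumptions, for illustration; the text specifies only that the two fastest-moving clusters are replaced by a single one:

```python
import numpy as np

def merge_fastest_clusters(dec, marg, v_dec):
    """Merge the two fastest-moving clusters (a sketch of Algorithm 4's core).

    dec: (T, Y) decoder rows p(y|x_hat); marg: (T,) cluster masses p(x_hat);
    v_dec: (T, Y) implicit derivatives d log p(y|x_hat)/d beta.
    A cluster's "speed" is taken here as the norm of its decoder's derivative;
    this particular choice is our assumption.
    """
    speed = np.linalg.norm(v_dec, axis=1)
    a, b = np.argsort(speed)[-2:]            # the two fastest-moving clusters
    # Replace them by a single cluster: mass-weighted decoder, summed mass.
    w = marg[a] + marg[b]
    merged_row = (marg[a] * dec[a] + marg[b] * dec[b]) / w
    keep = [i for i in range(len(marg)) if i not in (a, b)]
    new_dec = np.vstack([dec[keep], merged_row])
    new_marg = np.append(marg[keep], w)
    return new_dec, new_marg
```

The merged pair would then be mapped to an encoder and polished by BA-IB (lines 7 and 8), as described above.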
The heuristic Algorithm 4 is motivated by cluster-merging bifurcations. In these, the implicit derivatives are very large only 46 at the coordinates d log p(y|x̂)/dβ of the points colliding in ∆[Y]. While intended for cluster-merging bifurcations, this heuristic also works nicely in practice for cluster-vanishing ones. To see why, note that one can always add a cluster of zero mass to an IB root without affecting the root's essential properties, regardless of its coordinates in ∆[Y]; cf. Section 5.1 on reduction in the IB. Therefore, a numerical algorithm may, in principle, do anything with the coordinates p(y|x̂) ∈ ∆[Y] of a nearly vanished cluster x̂ with p(x̂) ≈ 0, without affecting the approximation's quality too much. Thus, for numerical purposes, one may treat a cluster-vanishing bifurcation as a cluster-merging one. Conversely, in a cluster-merging bifurcation, a numerical algorithm may, in principle,
44 While a priori the Jacobian D_{log p(y|x̂), log p(x̂)}(Id − BA_β) may be singular also for other reasons, by Conjecture 5 it is non-singular at the approximations generated so far, since these are assumed to be in their reduced form. cf. Section 5.1.
45 While this can be refined to handle more than two fast-moving clusters at once, that is not expected to be necessary for typical bifurcations.
46 Note that cluster masses barely change in the vicinity of a cluster-merging, till the point of bifurcation itself.
zero the mass of one cluster while adding it to the remaining one, again without affecting the approximation's quality too much. To conclude, for numerical purposes, cluster-vanishing is very similar to cluster-merging, and a variety of treatments between these two extremes may be possible for a numerical algorithm. Empirically, we have observed that our ODE-based algorithm treats both as cluster-merging bifurcations. To our understanding, this is because our algorithm operates in decoder coordinates, unlike the BA-IB Algorithm 1, for example, which operates in encoder coordinates. Finally, we combine the simplified root-tracking Algorithm 3 with the heuristic Algorithm 4 for handling singularities, yielding our IBRT1 Algorithm 5. It follows the lines of the simplified Algorithm 3, except that after solving for the implicit derivatives on line 6, we test the IB ODE (16) for singularity. To that end, we propose to use the matrix S (17) (from Lemma 2 in Section 3), since its order T · |Y| is smaller than the order T · (|Y| + 1) of the ODE's coefficients matrix. This might make it computationally cheaper to test for singularity (on lines 7 and 8). Our heuristic Algorithm 4 is invoked (on line 9) if the ODE (16) is found to be nearly singular; otherwise, we proceed as in Algorithm 3.
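A sketch of the singularity test on S. The specific statistic (smallest relative singular value) and its comparison against a threshold δ_3 are our assumptions; the text specifies only that S (17) is tested in place of the larger coefficients matrix:

```python
import numpy as np

def nearly_singular(S, delta3=1e-2):
    """Test the IB ODE for near-singularity via the (smaller) matrix S (17).

    We flag near-singularity when the smallest singular value of S is tiny
    relative to the largest; the statistic and the threshold delta3 are our
    illustrative choices.
    """
    sv = np.linalg.svd(S, compute_uv=False)  # singular values, descending
    return sv[-1] < delta3 * sv[0]
```

Testing S rather than the full T · (|Y| + 1) coefficients matrix keeps this check cheaper, as noted above.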

16: if old dim = dim p̂(x̂) then
17: p̂(x̂|x) ← encoder defined from (p̂(y|x̂), p̂(x̂)), via Equations 1.7-1.8
18: p̂ ← BA-IB(p̂(x̂|x); p_{Y|X} p_X, β + ∆β)

To demonstrate the IBRT1 Algorithm 5, we present the numerical results used to approximate the IB curve in Figure 1 (Section 1) - see Section 6.3 below on the approximation quality and the algorithm's basic properties. This example was chosen both because it has an analytical solution (Appendix E), and because it allows one to get a good idea of the bifurcation handling added (in Section 6.1) on top of the modified Euler method (from Section 4). The source code used to generate these results is provided for readers who wish to examine the details (bottom of Section 1). Carefully note that only a single IB root is plotted here; its two clusters merge at β_c, as seen in Figure 6 (Section 5.2). At 20 and 100 grid points, the approximations overshoot the bifurcation, terminating due to (approximate) cluster collision. On 1200 grid points, by contrast, the approximations pass too close to the bifurcation, terminating due to the nearby singularity. This can be seen in the inset to the right: the leftmost green marker has passed the cluster-merging threshold (dashed green lines), and so was numerically reduced to the trivial (single-clustered) solution by the root-reduction Algorithm 2. On the other hand, the orange markers to the right are still far from the cluster-merging threshold; the leftmost of them was reduced by the singularity-handling heuristic Algorithm 4, since the IB ODE (16) is nearly singular there. Indeed, the numerical derivative there is about five orders of magnitude larger than at the algorithm's initial condition (see Figure 2), due to the bifurcation's proximity. The leftmost green and orange markers were drawn after the reductions took place. See the main text and Section 6.1 for details, Figure 10 for errors, and Figure 1 (in Section 1) for the approximate IB curves.
The marginals p(x̂) are not shown, as they barely deviate from their true values in this problem. For each step-size ∆β, the algorithm was initialized at the problem's exact solution at β = 2.5, with thresholds set to δ_i = 10^{−2}, for i = 1, 2, 3. The lines connecting consecutive markers are for visualization only.
We now discuss the numerical examples of this section in light of the explanations provided in Section 6.1. The error of the IBRT1 Algorithm 5 generally improves as the step-size |∆β| becomes smaller, as expected. The single BA-IB iteration added to Euler's method (in Section 4) typically allows one to achieve the same error using far fewer grid points, thus lowering computational costs. For example, the two denser grids in Figure 9 require about an order of magnitude fewer points than Euler's method for the IB to achieve the same error; this can be seen from Figure 4 (Section 4).
In sparse grids, the approximations often pass too far from a bifurcation for the root-reduction Algorithm 2 to detect it. When overshooting it, the conditions for numerical reduction are generally met later on, as discussed in Section 6.3 below. Decreasing |∆β| further often leads the approximations too close to a bifurcation, as can be seen in the densest grid of Figure 9. The implicit derivatives are typically very large in the proximity of a bifurcation, while also being least accurate there (see Section 6.1). As these might send subsequent grid points off-track, the heuristic Algorithm 4 is invoked to handle the nearby singularity (see the inset of Figure 9). As noted earlier, the computational difficulty in tracking IB roots (or in root-tracking in general) stems from the presence of a bifurcation, manifested here by large approximation errors in its vicinity. While the algorithm's error peaks at the bifurcation, it typically decreases afterward when overshooting, as seen in Figure 10. See Section 6.3 for details.
Figure 10: The deviation of the approximations in Figure 9 from the exact solutions; the error is measured as in Figure 3 (bottom). Increasing the grid density decreases the error, as one might expect. While the error peaks at the bifurcation, it decreases afterward - see the main text and Section 6.3 below. The rightmost marker for each grid density is missing, since the initial error is zero.

Basic properties of the IBRT1 Algorithm 5 and why it works
Apart from presenting the basic properties of the IBRT1 Algorithm 5, the primary purpose of this section is to understand why it approximates the problem's true IB curve (1) so well, despite its apparent errors in approximating the IB roots. While shown here only in Figures 1 and 9 (Sections 1 and 6.2), this behavior is consistent across the few numerical examples that we have tested. We offer an explanation of why this may be true in general.
To understand why the IBRT1 Algorithm 5 approximates the true IB curve (1) so well, we first explain why overshooting is not a significant concern, as noted earlier in Section 6.1. To that end, consider the implicit ODE (7), dx/dβ = −(D_x F)^{−1} D_β F, from Section 1. So long as D_x F and D_β F on its right-hand side are well-defined, it defines a vector field on the entire phase space of admissible x values, at least where D_x F is non-singular - that is, even for x's which are not roots (6) of F. Ignoring several technicalities, the IB ODE (16) therefore defines a vector field also outside IB roots. Indeed, due to Conjecture 5, the Jacobian of the IB operator Id − BA_β (5) is non-singular in the vicinity of a reduced root 47. Now, suppose that p_β is an optimal IB root, and consider a point p ≠ p_β in its vicinity. An argument based on a strong notion of Lyapunov stability (in Appendix G) shows that p flows along the IB's vector field towards p_β in regions that do not contain a bifurcation, though only when flowing in decreasing β, as done by our IBRT1 Algorithm 5. An approximation p would then be "pulled" towards the true root. Stability in this direction of β is very reasonable, considering that p_β follows a path of decreasingly informative representations as β decreases. Indeed, all the paths to oblivion lead to one place - the trivial solution, whose representation in reduced coordinates is unique. As a result, a numerical approximation p gradually settles in the vicinity of the true root p_β, as seen in Figures 9 and 10, so long as p_β does not change much and the step-size |∆β| is small enough. While this explanation obviously breaks down near a bifurcation, it does suggest that the approximation error should decrease when overshooting one (see Section 6.1), once the true reduced root has settled down. In a sense, overshooting is similar to being in the right place but at the wrong time.
The above suggests that the IBRT1 Algorithm 5 should generally approximate the true IB curve (1) well, despite its errors in approximating IB roots. To see this, note that while β^{−1} is the slope of the optimal curve (1) of the IB [Tishby et al., 1999, Equation (32)], for the IB ODE (16) it is merely a "time-like" independent variable. When solving for the optimal curve (1), one is not interested in an optimal root or in its β value, but rather in its image (I(X; X̂), I(Y; X̂)) in the information plane. As a result, achieving the optimal roots but at the wrong β values still yields the true IB curve (1), as required. This is the reason that the true curve (1) is achieved in Figure 1 (Section 1) even on sparse grids, despite the apparent approximation errors in Figures 9 and 10 (Section 6.2). With that, we expect the approximate IB curve produced by the IBRT1 Algorithm 5 to be of lesser quality when there are more than two possible labels y. To see this, note that the space ∆[Y] traversed by the approximate clusters is then no longer one-dimensional, and so it is possible to maneuver around the clusters of an optimal IB root.
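To illustrate why only the information-plane image matters, here is a sketch of mapping a root's encoder to (I(X; X̂), I(Y; X̂)); the names and array conventions are our own:

```python
import numpy as np

def information_plane_point(enc, p_x, p_y_given_x):
    """Map a root's encoder to its information-plane image (I(X;Xh), I(Y;Xh)).

    enc: (T, N) encoder p(x_hat|x); p_x: (N,); p_y_given_x: (N, Y).
    When tracing the curve (1), only this image matters, not the beta value.
    Mutual informations are in nats (natural logarithm).
    """
    joint = enc * p_x                               # p(x_hat, x), shape (T, N)
    p_xhat = joint.sum(axis=1)                      # p(x_hat)
    dec = (joint @ p_y_given_x) / p_xhat[:, None]   # decoder p(y|x_hat), (T, Y)
    p_y = p_x @ p_y_given_x                         # p(y)
    with np.errstate(divide='ignore', invalid='ignore'):
        # I(X;Xh) = sum p(x_hat,x) log[p(x_hat|x)/p(x_hat)]; nansum drops 0*log 0
        I_x = np.nansum(joint * np.log(enc / p_xhat[:, None]))
        # I(Y;Xh) = sum p(y,x_hat) log[p(y|x_hat)/p(y)]
        I_y = np.nansum((dec * p_xhat[:, None]) * np.log(dec / p_y))
    return I_x, I_y
```

Two roots with identical images contribute the same point to the curve, regardless of the β values attributed to them.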
Next, we briefly discuss the basic properties of the IBRT1 Algorithm 5. Its computational complexity is determined by the complexity of a single grid point. The latter is readily seen to be dominated by the complexity O(T² · |Y|² · (|X| + T · |Y|)) of computing the coefficients matrix of the IB ODE (16) and of solving it numerically (on line 6). To that, one should add the complexity of the BA-IB Algorithm 1 each time a root is reduced. However, the critical slowing down of BA-IB [Agmon et al., 2021] is avoided, since we reduce the root before invoking BA-IB (see Section 5.2). The complexity is only linear in |X| thanks to the choice of decoder coordinates; had we chosen one of the other coordinate systems in Section 2, then solving the ODE would have been cubic in |X| rather than linear (see there). The computational difficulty in following IB roots stems from the existence of bifurcations (Section 4), as it generally does when following an operator's root, [Agmon, 2022a, Section 7.2].
As noted in Section 4, convergence guarantees can be derived for Euler's method for the IB away from bifurcations, in terms of the step-size |∆β|, in a manner similar to [Agmon, 2022a, Theorem 5] for RD. These imply similar guarantees for the IBRT1 Algorithm 5, since adding a single BA-IB iteration to our modified Euler method improves its order (see there). These details are omitted for brevity, however.
For a numerical method of order d > 0 (see Section 4) with a fixed step-size |∆β| and a fixed computational cost per grid point, the cost-to-error tradeoff is given as in [Agmon, 2022a, Equation (3.6)], when |∆β| is small enough; see Butcher [2016], for example. Figure 3.4 in [Agmon, 2022a] demonstrates for RD that methods of higher order achieve a better tradeoff, as expected, as with the fixed-order Taylor methods employed there. Since computing implicit derivatives of higher orders requires the calculation of many more derivative tensors of Id − BA_β (5) than done here, [Agmon, 2022a, Section 2.2], we have used only first-order derivatives, for simplicity. However, while the vanilla Euler method for the IB is of order d = 1, the discussion in Section 4 (and Figure 4 in particular) suggests that the order d of the modified Euler method used by the IBRT1 Algorithm 5 is nearly twice that; cf. Section 6.2.
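For intuition, the generic scaling behind such a tradeoff (a standard argument for fixed-order one-step methods, not a reproduction of the cited Equation (3.6)) can be sketched as follows:

```latex
% Global error of an order-d method with step |\Delta\beta|, over a
% beta-interval of length B, at a fixed cost c per grid point:
\mathrm{error} \;\approx\; C\,|\Delta\beta|^{d},
\qquad
\mathrm{cost} \;\approx\; \frac{c\,B}{|\Delta\beta|}
\quad\Longrightarrow\quad
\mathrm{error} \;\propto\; \mathrm{cost}^{-d}.
```

A higher order d thus yields a steeper decrease of the error with the invested computation, which is why the near-doubling of the order by the added BA-IB iteration is worthwhile.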
With that, we comment on the behavior of the IBRT1 Algorithm 5 at discontinuous bifurcations. Consider the problem in Figure 7 (Section 5.3), for example. When Algorithm 5 follows the optimal 2-clustered root there, the Jacobian's singularity (in Figure 8) is detectable by it, because the step-size ∆β is negative; cf. the discussion in Section 5.3. Indeed, due to Conjecture 5 ff., the algorithm can detect discontinuous bifurcations in general. Whether a particular discontinuous bifurcation is detected by Algorithm 5 in practice depends on the details 48, of course, as with continuous bifurcations. Indeed, the details may or may not cause a particular example to be detected by the conditions on lines 7 and 8 (of Algorithm 5). If it is missed, Algorithm 5 will continue to follow the 2-clustered root in Figure 7 to the left of the bifurcation, where it is sub-optimal, just as BA-IB with reverse deterministic annealing would. Once it is detected, though, one may wonder whether the heuristic Algorithm 4 also works well for discontinuous bifurcations. The example of Figure 7 has just one single-clustered root to the left of the bifurcation; thus, the BA-IB Algorithm 1 invoked on line 8 (of Algorithm 4) must converge to it. In general, however, there may be more than a single root of smaller effective cardinality to the left of the bifurcation, to which BA-IB may converge. The handling of discontinuous bifurcations is left to future work. Such handling is expected to be easier in the IB than in RD since, in contrast to RD, the effective cardinality of an optimal IB root cannot decrease with β (bottom of Section 5.3); see [Berger, 1971, Problems 2.8-2.10] for counter-examples in RD. This makes detecting discontinuous bifurcations easier in the IB, and is also expected to assist with their handling.
We list the assumptions used along the way, for reference. These are needed to guarantee the optimality of the IBRT1 Algorithm 5 in the limit of small step-sizes |∆β|, except in a bifurcation's vicinity. In Section 1, it was assumed without loss of generality 49 that the input distribution p_X is of full support, p(x) > 0 for every x. The requirement p(y|x) > 0 was added in Section 3 as a sufficient technical condition for exchanging to logarithmic coordinates (Lemma 14 in Appendix A), and could perhaps be alleviated in alternative derivations. Together, these are equivalent to having a never-vanishing IB problem definition, p(y|x)p(x) > 0 for every x and y. The algorithm's initial condition is assumed to be a reduced and optimal IB root, as reduction is needed by Conjecture 5 in Section 5.1. Finally, the given IB problem is assumed to have only continuous bifurcations, except perhaps for its first (leftmost) one. While these assumptions are sufficient to guarantee optimality, we note that milder conditions might suffice in a particular problem.
48 e.g., on the threshold value δ_3 for detecting singularity, and on the precise grid-points layout.
49 Otherwise, one may remove symbols x with p_X(x) = 0 from the source alphabet.

Concluding remarks
The IB is intimately related to several problems in adjacent fields, Zaidi et al. [2020], including coding problems, inference, and representation learning. Despite its importance, there are surprisingly few techniques to solve it numerically. This work attempts to fill this gap by exploiting the dynamics of IB roots.
The end result of this work is a new numerical algorithm for the IB, which follows the path of a root along the IB's optimal tradeoff curve (1). A combination of several novelties was required to achieve this goal. First, the dynamics underlying the IB curve (1) obeys an ODE, Agmon [2022b]. Following the discussion around Conjecture 5 (in Section 5.1), the existence of such dynamics stems from the analyticity of the IB's fixed-point Equations (2)-(4), typically resulting in piecewise-smooth dynamics of IB roots. Several natural choices of a coordinate system for the IB were considered, both for computational purposes and to facilitate a clean treatment of IB bifurcations. The IB's ODE (16) was derived anew in appropriate coordinates, allowing an efficient computation of implicit derivatives at an IB root. Combining BA-IB with Euler's method yields a modified numerical method whose order is higher than that of either.
Second, one needs to understand where the IB ODE (16) is not obeyed, i.e., where the differentiability of an optimal root with respect to β is violated. To that end, one not only needs to detect IB bifurcations, but also to identify their type in order to handle them properly. Unlike standard techniques, our approach is to remove redundant coordinates, following root-tracking for RD, Agmon [2022a]; cf. Section 1. To achieve a reduction, we follow the arguably better definition of the IB in Harremoës and Tishby [2007]: namely, a finite IB problem is an RD problem on the continuous reproduction alphabet ∆[Y]. The IB may therefore be intuitively considered as a method of lossy compression of the information on Y embedded in X. Viewing a finite IB problem as an infinite RD problem suggests a particular choice of a coordinate system for the IB, which enables reduction in the IB; this extends reduction in RD, Agmon [2022a]. Furthermore, this point of view highlights subtleties caused by computing finite-dimensional representations of IB roots. To our understanding, these subtleties have hindered the understanding of IB bifurcations throughout the years.
Combining the above allows us to translate an understanding of IB bifurcations into a new numerical algorithm for the IB (the IBRT1 Algorithm 5). There are several directions that one could consider to improve our algorithm. Near bifurcations, one could improve the handling of discontinuous ones. While we used implicit derivatives only of the first order for simplicity, higher-order derivatives generally offer a better cost-to-error tradeoff away from bifurcations. See also [Agmon, 2022a, Section 3.4] on possible improvements for following an operator's root.

Appendix A The BA-IB operator in decoder coordinates
For reference, we give an explicit expression for the BA-IB operator in decoder coordinates, defined in Section 2.
Denote by p_{Y|X̂} and p_X̂ the vectors whose coordinates are p(y|x̂) and p(x̂), respectively. We denote the evaluation of BA_β at this point by BA_β[p_{Y|X̂}, p_X̂]. Its output is again a decoder-marginal pair, whose coordinates are denoted respectively by BA_β[p_{Y|X̂}, p_X̂](y|x̂) and BA_β[p_{Y|X̂}, p_X̂](x̂). Explicitly, BA_β in decoder coordinates is given by
BA_β[p_{Y|X̂}, p_X̂](x̂) = Σ_x p(x̂|x) p(x), and
BA_β[p_{Y|X̂}, p_X̂](y|x̂) = Σ_x p(y|x) p(x̂|x) p(x) / BA_β[p_{Y|X̂}, p_X̂](x̂),
with p(x̂|x) := p(x̂) exp{−β D_KL[p(y|x)||p(y|x̂)]} / Z(x, β),
where Z(x, β) is defined in terms of p(y|x̂) and p(x̂) as in the IB's encoder Equation (2) (Section 1).
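A sketch of a single BA_β evaluation in decoder coordinates, composed from the IB Equations (2)-(4); the array shapes and names are our choices:

```python
import numpy as np

def ba_ib_step(dec, marg, p_y_given_x, p_x, beta):
    """One BA-IB iteration in decoder coordinates (a sketch of BA_beta).

    dec: (T, Y) decoder p(y|x_hat); marg: (T,) marginal p(x_hat);
    p_y_given_x: (N, Y); p_x: (N,). Returns the updated (decoder, marginal).
    Assumes p(y|x) > 0 and p(x_hat) > 0, as in the surrounding text.
    """
    # Encoder step (2): p(x_hat|x) ~ p(x_hat) exp(-beta * KL[p(y|x)||p(y|x_hat)])
    kl = np.array([[np.sum(py * np.log(py / dec[t])) for t in range(len(marg))]
                   for py in p_y_given_x])          # shape (N, T)
    enc = marg * np.exp(-beta * kl)                 # unnormalized, (N, T)
    enc /= enc.sum(axis=1, keepdims=True)           # Z(x, beta) normalizes
    # Marginal step (4): p(x_hat) = sum_x p(x_hat|x) p(x)
    new_marg = enc.T @ p_x
    # Decoder step (3), via Bayes: p(y|x_hat) = sum_x p(y|x) p(x_hat|x) p(x) / p(x_hat)
    new_dec = (enc * p_x[:, None]).T @ p_y_given_x / new_marg[:, None]
    return new_dec, new_marg
```

Both outputs are normalized by construction, so iterating this map stays within decoder-marginal pairs.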
The following lemma is handy when exchanging to logarithmic coordinates in Section 3.
Lemma 14. Let p(y|x)p(x) define a finite IB problem, such that p(y|x) > 0 for every x and y. Let p(y|x̂) be the decoder of an IB root, and let x̂ be such that p(x̂) > 0. Then p(y|x̂) > 0 for every y.
Proof of Lemma 14. This follows immediately from the IB's decoder Equation (3), since p(x|x̂) is a well-defined normalized conditional probability distribution whenever p(x̂) > 0.

B The first-order derivative tensors of Blahut-Arimoto for the IB
We calculate the first-order derivative tensors of the Blahut-Arimoto operator BA_β in log-decoder coordinates (see Sections 2 and 3): namely, its Jacobian matrix D_{log p(y|x̂), log p(x̂)} BA_β, and the vector D_β BA_β of its partial derivatives with respect to β. cf. Appendix A for explicit formulae of BA_β in decoder coordinates. While these are "just" differentiations, many subtleties are involved in getting the math right. For example, one needs to correctly identify the inputs and outputs of BA_β when it is considered as an operator on log-decoder coordinates. For another, one must take special care as to which variable depends on which, and especially on which it does not depend, as multiple variables are involved. Above all, these calculations require a careful application of the chain rule. With that, a common caveat in such calculations is that the BA_β operator (and the equations defining it) should be differentiated before it is evaluated. While this is obvious for real functions, where f'(3) stands for the derivative function of f(x) evaluated at x = 3, for the BA_β operator this might get obfuscated by the myriad of variables and variable-dependencies of which it is comprised. Although calculating the derivative of BA_β (at an arbitrary point) first, and only then evaluating at a fixed point, might appear to be a mere technical necessity, it is required by this work - for example, when considering the vector field defined by the IB operator (5) in Section 6.3. cf. [Agmon, 2022a, Section 5] for the derivative tensors, of arbitrary order, of Blahut's algorithm [Blahut, 1972] for RD.
The subtleties involved in these differentiations are discussed in Appendix B.1, with the bulk of the calculations carried out in B.1.1. The latter are gathered and simplified in Appendix B.2, to obtain the Jacobian matrix D_{log p(y|x̂), log p(x̂)} BA_β, and in Appendix B.3, to obtain the partial-derivatives vector D_β BA_β. The results provided here naturally depend on the choice of coordinate system. To compare results between log-decoder and log-encoder coordinates in Section 2 (e.g., in Figure 2), we derive in Appendix B.4 the coordinate-exchange Jacobians between these two coordinate systems.

B.1 Calculation setups and partial derivatives of unnamed functions
We explain the mathematical subtleties relevant to the sequel.
As we are interested in the derivatives of the Blahut-Arimoto Algorithm 1 for the IB (in Section 1), we shall follow its notation. Namely, distributions are subscripted i or i + 1 by the algorithm's iteration number. A subscript i usually denotes an input distribution, and a subscript i + 1 an output distribution, e.g., p_i(x̂) or p_{i+1}(y|x̂). These need not be IB roots, but rather are arbitrary distributions. On the other hand, a subscript β denotes a distribution of an IB root at a tradeoff value β, as in p_β(y|x̂) for a root's decoders. To avoid subtleties due to zero-mass clusters, we usually assume p_i(x̂) ≠ 0 in the sequel, for any x̂. cf. Sections 2 and 5.1 on root-reduction in the IB.
It is important to distinguish which variables are dependent and which are independent in a particular calculation, e.g., in Appendix B.3. Since this task is easier for a single real variable (as opposed to distributions, for example), we consider simplifications to the real case. Note that each of the Equations 1.4 through 1.8 defining the BA-IB Algorithm 1 yields a new distribution in terms of already-specified ones. These define unnamed functions, whose variables and values are probability distributions. For example, one could have formally defined p_i(x|x̂) in 1.5 by the function
F[p_i(x̂|x), p_i(x̂)](x|x̂) := p_i(x̂|x) p(x) / p_i(x̂), (28)
where p_i(x̂|x) and p_i(x̂) are the variables of F, and its output is a conditional probability distribution, with x conditioned upon x̂. As the input and representation alphabets X and X̂ are finite, N := |X| and T := |X̂|, the arguments p_i(x̂|x), p_i(x̂) and the values p_i(x|x̂) of F (28) are merely real vectors. Thus, enumerating the variables x_1, ..., x_N and x̂_1, ..., x̂_T allows one to spell out (28) by its coordinates,
F[p_i(x̂_1|x_1), ..., p_i(x̂_T|x_N), p_i(x̂_1), ..., p_i(x̂_T)](x_j|x̂_k) = p_i(x̂_k|x_j) p(x_j) / p_i(x̂_k). (29)
While (29) is too cumbersome to work with, it does highlight that F is merely a vector of N · T real-valued functions, in T + N · T real variables. This allows us to use partial derivatives rather than their infinite-dimensional counterparts (namely, variational derivatives), as in
∂F[p_i(x̂|x), p_i(x̂)](x|x̂) / ∂p_i(x̂_j|x_k). (30)
This is the derivative of F (29) with respect to a particular (j, k)-entry of its argument, by definition. However, to maintain a concise notation, we shall carry on with unnamed function definitions, writing ∂p_i(x|x̂)/∂p_i(x̂_j|x_k) for the partial derivative of (28), rather than its explicit form (30). If disoriented, the reader is encouraged to return to the definitions (30). We often exchange variables implicitly to logarithmic coordinates, as in Section 3. For example, ∂F[p_i(x̂|x), p_i(x̂)]/∂ log p_i(x̂_j|x_k) is to be understood as exchanging variables to u_i(x̂, x) := log p_i(x̂|x) and u_i(x̂) := log p_i(x̂). The output of F may similarly be exchanged to logarithmic coordinates, as in log F[exp u_i(x̂, x), exp u_i(x̂)].
To proceed, carefully note the dependencies between the various variables in a BA-IB iteration, in 1.4 through 1.8. These are summarized compactly by the dependencies diagram (32), in their order of appearance in the BA-IB Algorithm 1: each distribution defined by a BA-IB equation is a node, with edges from the distributions on which it depends. The diagram extends to both sides with the iteration number i. Each node in (32) serves both as a function of the nodes preceding it and as a variable for those succeeding it; it is therefore a "function-variable".
To differentiate along the dependencies graph (32), we shall need the multivariate chain rule for a function f(y, z(y)),
d/dy f(y, z(y)) = ∂f/∂y + (∂f/∂z) · (dz/dy). (33)
As the dependencies graph (32) involves multiple function-variables, such as z(y), we pause on the definition's subtleties. The partial derivative of a function g in several variables x_1, ..., x_N with respect to its i-th entry is defined by
∂g/∂x_i := lim_{h→0} [g(x_1, ..., x_i + h, ..., x_N) − g(x_1, ..., x_N)] / h. (34)
We emphasize that the variables x_1, ..., x_{i−1}, x_{i+1}, ..., x_N other than x_i are held fixed when calculating ∂g/∂x_i. And so, it makes no difference in (34) whether or not they depend on x_i, as in x_j = x_j(x_i) for j ≠ i.
Next, suppose that we would like to calculate how changing an input distribution affects some output distribution. This is relevant in Appendix B.2, for example, when considering how a change in a coordinate of an input decoder p_i(y|x̂) or marginal p_i(x̂) affects a particular coordinate of the output decoder or marginal. For simplicity of exposition, though, suppose that we would like to calculate how a change in the (k_1, k_2) coordinate p_i(x̂_{k_1}|x_{k_2}) of an input encoder affects the (j_1, j_2) coordinate p_{i+1}(x̂_{j_1}|x_{j_2}) of the output encoder. That is, differentiating the rightmost node in (32) with respect to a coordinate of the leftmost one,
∂ log p_{i+1}(x̂_{j_1}|x_{j_2}) / ∂ log p_i(x̂_{k_1}|x_{k_2}), (35)
where we have exchanged to logarithmic coordinates to simplify the calculations. To calculate (35), one needs to apply the multivariate chain rule (33) along all the possible dependencies of the output log p_{i+1}(x̂_{j_1}|x_{j_2}) on the input coordinate log p_i(x̂_{k_1}|x_{k_2}). This amounts to following all the paths in (32) connecting these two nodes, summing the contributions of every possible path. For example, traversing from the input p_i(x̂_{k_1}|x_{k_2}) rightwards in (32) to p_i(x̂), then downwards to Z_i(x̂, β), and then to the output p_{i+1}(x̂_{j_1}|x_{j_2}), yields the term corresponding to this path, at particular intermediate x̂ coordinates. To collect the contribution from every intermediate function-variable coordinate, we need to sum the latter over those intermediate x̂ coordinates. Writing down all such paths, one obtains (35) as the sum of the contributions of all of them. Repeated unbound variables are understood to be summed over, as in Einstein's summation convention.
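The path-summation just described can be checked numerically on a toy dependency chain; the functions below are hypothetical stand-ins for the nodes of (32), not the BA-IB equations themselves:

```python
# A toy dependency chain: each node is a function of its predecessors, and
# the total derivative is the sum of contributions over all paths, as in (35).
def y_of(x):        return x**2          # node y, depends on x
def z_of(x, y):     return y + 3.0*x     # node z, depends on x directly and via y
def out_of(y, z):   return y*z           # output node

x0 = 1.7
y0, z0 = y_of(x0), z_of(x0, y_of(x0))

# Partial derivatives of each edge (all other variables held fixed, as in (34)):
dy_dx   = 2*x0          # edge x -> y
dz_dx   = 3.0           # edge x -> z (direct)
dz_dy   = 1.0           # edge y -> z
dout_dy = z0            # edge y -> out
dout_dz = y0            # edge z -> out

# Sum the contribution of every path from x to out:
#   x -> z -> out,  x -> y -> z -> out,  x -> y -> out
total = dout_dz*dz_dx + dout_dz*dz_dy*dy_dx + dout_dy*dy_dx

# Compare against a centered finite difference of the full composition:
h = 1e-6
f = lambda x: out_of(y_of(x), z_of(x, y_of(x)))
fd = (f(x0 + h) - f(x0 - h)) / (2*h)
assert abs(total - fd) < 1e-5   # the two agree up to finite-difference error
```

Here f(x) = x⁴ + 3x³, so the path sum reproduces f'(x₀) = 4x₀³ + 9x₀², illustrating that summing over all dependency paths recovers the total derivative.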

B.1.1 Differentiating along the dependencies graph
Next, we differentiate each edge in (the logarithm of) the dependencies graph (32). These edge derivatives are needed to evaluate derivatives along dependency paths, which underlie the calculations of the subsequent sections.
Equation 1.4 in the BA-IB Algorithm 1 defines the cluster marginal in terms of the direct encoder, p_i(x̂) = Σ_x p_i(x̂|x) p(x). In the first and second equalities we have used the identity ∂/∂x y = y · ∂/∂x log y for the differentiation of a function's logarithm, where y is a function of x.
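A minimal numeric sketch of both ingredients — the marginal equation 1.4 and the logarithmic-derivative identity — with random stand-in distributions (sizes and seed are arbitrary, not the paper's examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: |X| inputs, T clusters.  All distributions are random
# stand-ins, not taken from the paper.
n_x, T = 4, 3
p_x = rng.dirichlet(np.ones(n_x))              # p(x)
enc = rng.dirichlet(np.ones(T), size=n_x).T    # enc[x̂, x] = p_i(x̂|x)

# BA-IB Equation 1.4: cluster marginal from the direct encoder,
#   p_i(x̂) = Σ_x p_i(x̂|x) p(x).
p_xhat = enc @ p_x
assert np.isclose(p_xhat.sum(), 1.0)

# The logarithmic-derivative identity  d/dt y = y · d/dt log y,
# checked by finite differences for a positive scalar function y(t).
y = lambda t: t**2 + 1.0
t0, h = 0.7, 1e-6
dy = (y(t0 + h) - y(t0 - h)) / (2 * h)
dlogy = (np.log(y(t0 + h)) - np.log(y(t0 - h))) / (2 * h)
assert np.isclose(dy, y(t0) * dlogy, atol=1e-5)
```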
Following the comments around the definition (34) of a partial derivative, note that 1.5 defines the inverse encoder log p_i(x|x̂) as a function of the variables log p_i(x̂|x) and log p_i(x̂) (and of p(x), which we ignore under differentiation). Thus, differentiating this equation with respect to an entry of the variable log p_i(x̂|x) implies that the entries of the other variable log p_i(x̂) are held fixed, and vice versa. So, when differentiating the Bayes rule 1.5 with respect to an entry of log p_i(x̂), the term log p_i(x̂_{j_2}|x_{j_1}) at the right-hand side differs from the variable log p_i(x̂) of differentiation, and so its partial derivative vanishes. Next, we differentiate 1.5 with respect to a coordinate of its other variable log p_i(x̂|x). Using again the logarithmic derivative identity ∂/∂x y = y · ∂/∂x log y, the decoder Equation 1.6 yields the corresponding derivative. Next, consider the KL-divergence term in the definition 1.7 of the partition function Z_i,
∂/∂ log p_i(y'|x̂') D_KL[p(y|x) || p_i(y|x̂)] = −p(y|x) · δ_{x̂',x̂} · δ_{y,y'} = −δ_{x̂',x̂} · p(y'|x), (41)
with the repeated index y summed over. Since the partition function 1.7 depends on the decoder p_i(y|x̂) only via the KL-divergence, its derivative with respect to the decoder follows directly. For the derivative of the partition function with respect to the marginal p_i(x̂), the second equality follows from the logarithmic derivative identity. Finally, for the encoder Equation 1.8: the first two terms to the right, p_i(x̂) and Z_i(x, β), take the role of a variable in Equation 1.8. In contrast, we consider the last divergence term as a shorthand for summing over p_i(y|x̂); thus, the latter is a variable of (46). With (41), we thus obtain the derivative of the encoder equation with respect to the decoder. For the other derivatives of the encoder Equation 1.8, note that the variable p_i(x̂) of Equation 1.8 differs from the variables Z_i and p_i(y|x̂), on which the crossed-out terms depend.
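The Bayes rule 1.5 and its normalization can likewise be sanity-checked numerically on random stand-in distributions (all sizes and the seed are arbitrary):

```python
import numpy as np

# Bayes rule 1.5 in matrix form:  p_i(x|x̂) = p_i(x̂|x) p(x) / p_i(x̂),
# with the marginal  p_i(x̂) = Σ_x p_i(x̂|x) p(x).  We only check that
# the resulting inverse encoder is a proper conditional distribution.
rng = np.random.default_rng(5)
n_x, T = 4, 3
p_x = rng.dirichlet(np.ones(n_x))
enc = rng.dirichlet(np.ones(T), size=n_x).T     # enc[x̂, x] = p_i(x̂|x)
p_xhat = enc @ p_x                              # p_i(x̂)
inv_enc = enc * p_x[None, :] / p_xhat[:, None]  # inv_enc[x̂, x] = p_i(x|x̂)
assert np.allclose(inv_enc.sum(axis=1), 1.0)
```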
We summarize the calculations of this subsection in the following diagram: (50) A differentiation variable is denoted with commas, at an arrow's source in this diagram. A coordinate of the function which we differentiate is written without commas, at an arrow's end.

B.2 The Jacobian matrix of BA-IB in log-decoder coordinates
By gathering the results of Appendix B.1.1 and following the lines of B.1, we calculate the Jacobian matrix (13) (in Section 3) of the Blahut-Arimoto operator BA β in log-decoder coordinates, defined in Section 2.
The derivative of BA_β in decoder coordinates boils down to four quantities: the effect d log p_{i+1}(y|x̂) / d log p_i(y'|x̂') that varying a coordinate log p_i(y'|x̂') of an input cluster has on a coordinate log p_{i+1}(y|x̂) of an output cluster, the effect d log p_{i+1}(y|x̂) / d log p_i(x̂') that varying an input marginal coordinate log p_i(x̂') has on a coordinate log p_{i+1}(y|x̂) of an output cluster, and so forth. And so, the Jacobian D_{log p(y|x̂), log p(x̂)} BA_β is a block matrix, (51). Its rows correspond to the output coordinates of BA_β. We index its upper rows by y ∈ Y and x̂ ∈ {1, …, T}, while its lower rows are indexed by x̂ alone. Similarly, its columns correspond to the input coordinates of BA_β. We index its leftmost columns by y' and x̂', and its rightmost columns by x̂' alone. Each block in (51) is comprised of the contributions along all the distinct paths connecting two vertices in the dependencies graph (32). For example, the lower-left block in (51) is comprised of the contributions along all the paths in (32) connecting p_i(y'|x̂') to p_{i+1}(x̂). We now spell out the paths contributing to each block in (51), with repeated dummy indices understood to be summed over. Afterward, we shall calculate the contributing paths explicitly, carrying out the summations. The upper-left block of (51) is comprised of (52). This Equation (52) encodes the four paths connecting the vertex p_i(y'|x̂') to p_{i+1}(y|x̂) in (32). When accumulating the contributions in (52), one must carefully sum only over repeated dummy indices that appear in the given term. E.g., the two paths in (52) which traverse the edge ∂ log p_{i+1}(x_1|x̂_2) / ∂ log p_{i+1}(x̂_4|x_5) (pointing from p_{i+1}(x̂|x) to p_{i+1}(x|x̂)) do not involve a summation over x̂_3. In contrast, the two paths involving ∂ log p_{i+1}(x_1|x̂_2) / ∂ log p_{i+1}(x̂_3) · ∂ log p_{i+1}(x̂_3) / ∂ log p_{i+1}(x̂_4|x_5) there do entail a summation over x̂_3. This is relevant for the calculations below, as in (57) for example.
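As a sanity check on calculations of this kind, the block Jacobian (51) can be approximated numerically. The sketch below is a minimal stand-in (random joint distribution; the sizes, seed and β are arbitrary, not the paper's examples): it iterates BA-IB in decoder coordinates to a root and then estimates the Jacobian in log-decoder coordinates by central differences, perturbing the coordinates as free variables, matching their treatment as independent variables in the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy IB problem: a random joint p(x, y).
n_x, n_y, T, beta = 4, 3, 2, 5.0
p_xy = rng.dirichlet(np.ones(n_x * n_y)).reshape(n_x, n_y)
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

def ba_step(dec, mrg):
    """One BA-IB step in decoder coordinates.

    dec[k, y] = p_i(y|x̂_k), mrg[k] = p_i(x̂_k).
    """
    # KL[p(y|x) || p_i(y|x̂)] for every pair (x, x̂):
    log_ratio = np.log(p_y_given_x[:, None, :]) - np.log(dec[None, :, :])
    kl = np.einsum('xy,xky->xk', p_y_given_x, log_ratio)
    w = mrg[None, :] * np.exp(-beta * kl)    # p_i(x̂) e^{-β KL}; Z_i is the row sum
    enc = w / w.sum(axis=1, keepdims=True)   # p_{i+1}(x̂|x)
    mrg_new = enc.T @ p_x                    # p_{i+1}(x̂)
    # p_{i+1}(y|x̂) = Σ_x p(y|x) p_{i+1}(x̂|x) p(x) / p_{i+1}(x̂):
    dec_new = (enc.T * p_x[None, :]) @ p_y_given_x / mrg_new[:, None]
    return dec_new, mrg_new

def ba_log(v):
    # The same step in log-decoder coordinates (log p(y|x̂), log p(x̂)).
    dec = np.exp(v[:T * n_y]).reshape(T, n_y)
    mrg = np.exp(v[T * n_y:])
    dec_new, mrg_new = ba_step(dec, mrg)
    return np.concatenate([np.log(dec_new).ravel(), np.log(mrg_new)])

# Converge to a root by iterating BA_β.
dec0 = rng.dirichlet(np.ones(n_y), size=T)
v = np.concatenate([np.log(dec0).ravel(), np.log(np.full(T, 1.0 / T))])
for _ in range(10000):
    v = ba_log(v)

# Central-difference Jacobian; coordinates are perturbed freely,
# without re-normalizing.
n = T * n_y + T
J, h = np.empty((n, n)), 1e-6
for j in range(n):
    e = np.zeros(n)
    e[j] = h
    J[:, j] = (ba_log(v + e) - ba_log(v - e)) / (2 * h)
```

Comparing such a finite-difference Jacobian against a closed-form implementation of (70) is a quick way to catch index or sign errors block by block.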
Similarly, the upper-right block of (51) is given by (53), the lower-left block by (54), and, last, the lower-right block by (55). Next, by using the intermediate results summarized in (50) (Section B.1.1), we calculate each of the four blocks of (51) explicitly. For the upper-left block (52) we have
· [β δ_{x̂_4,x̂} p(y'|x_5) + (−δ_{x_5,x_6}) β p_{i+1}(x̂|x_6) p(y'|x_6)] (56)
For clarity, we elaborate on each step needed to complete the calculation of the upper-left block (52), while providing only the main steps for the other blocks. To carry out the summations over the dummy variables x_1, x̂_2, x̂_3, x̂_4, x_5 and x_6 in (56), we carefully sum only over repeated dummy indices, as explained after (52). We carry out one summation at a time, starting with x̂_2. This yields (57). In the first equality there we carried out the summation over x_1, in the second over x̂_3, in the third over x̂_4, in the fourth over x_6, and in the fifth over x_5.
To simplify the notation, we replace summations over x with definitions as in Equation (14) (Section 3), and note the identities (58) and (59). The quantities A, B, and C involve two IB clusters; they are a scalar, a vector, and a matrix, respectively. The definition of D involves only one IB cluster, and coincides with C_Y in [Zaslavsky, 2019, 3.2 in Part III]. The relations to the right of (58) show that each of these quantities can be expressed in terms of C(x̂, x̂'; i)_{y,y'}. Equation (59) shows that the latter can be rewritten as a right-stochastic matrix, up to trivial manipulations. As seen below, the Jacobian matrix (51) of a BA-IB step in log-decoder coordinates can be computed in terms of the quantities in (58).
With the latter definitions (58), (57) can be rewritten as (60). The third equality there follows from (59), the identities to the right of (58), and simple algebra.
For the upper-right block (53),
· [δ_{x̂_4,x̂} + β δ_{x̂_4,x̂_8} p(y_7|x_5) (p(y_7|x_9) p_i(x_9|x̂_8) / p_i(y_7|x̂_8)) δ_{x̂_8,x̂_10} (−δ_{x̂_10,x̂'})]
In a manner similar to (57), summing over all ten dummy variables other than x_1 and x_5 yields (62). The two terms involving δ_{x̂,x̂'} cancel out when summing over x_1 and x_5 at the first equality. Rewriting with the definitions (58) of A and B further simplifies (62) to (63). For the lower-left block (54), summing over the dummy variables and simplifying, and then rewriting in terms of the definitions (58), yields (66). Finally, for the lower-right block (55), the analogous simplification with the definitions (58) yields (69). Collecting the results (60), (63), (66) and (69) back into (51), BA's Jacobian in these coordinates is (70). When evaluated at an IB root, this is Equation (13) of Section 3. Equivalently, it can be written in a form which is more convenient for implementation.

B.3 The partial β-derivatives of BA-IB in log-decoder coordinates

We calculate the vector D_β BA_β of partial derivatives of the BA_β operator in log-decoder coordinates (of Section 2), which appears at the right-hand side of the IB-ODE (16) (in Section 3).
To that end, we differentiate backward along the dependencies graph (32) (in Appendix B.1) with respect to β, starting at the output coordinates p_{i+1}(y|x̂) and p_{i+1}(x̂) of BA_β. After differentiating, we keep track of our independent variables: here, these are β and the input coordinates p_i(y|x̂) and p_i(x̂) of BA_β. The differentiation of the latter with respect to β vanishes (except for dβ/dβ = 1), as they are independent. Finally, we compose the differentiations to obtain the effect D_β BA_β of changing β on BA's output. We note that, in principle, one could differentiate the explicit formulae (27) of BA_β in decoder coordinates (Appendix A) with respect to β. However, we find that to be cumbersome and far more error-prone than our approach, and so proceed in the spirit of the previous Appendix B.2. The β-derivatives of the input coordinates vanish, since p_i(y|x̂) and p_i(x̂) are considered as independent variables. Therefore, summing over the three dummy variables as before, the latter simplifies accordingly.

B.4.2 Exchanging from decoder to encoder coordinates

In the other direction, a decoder p_i(y|x̂) and a marginal p_i(x̂) determine the subsequent encoder p_{i+1}(x̂|x). Using diagram (50), summing over the dummy variable x̂_1 yields the coordinates-exchange Jacobian J^enc_dec mentioned in Section 2. Next, for the derivative with respect to the marginal, summing over the five dummy variables yields the coordinates-exchange Jacobian J^enc_mrg from Section 2. Finally, note that the encoder Equation 1.8 depends on β explicitly, rather than only indirectly via its other variables. So, to calculate the partial derivative term ∂ log p_{i+1}(x̂|x)/∂β in (89), write out log Z_i(x, β) explicitly. From the encoder Equation 1.8 we then obtain the explicit β-derivative, where the term ∂ log p_i(x̂)/∂β vanishes, since p_i(x̂) is considered as an independent variable here.
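The way D_β BA_β feeds first-order root tracking, dp/dβ = −(D_p F)^{-1} D_β F with F := Id − BA_β, can be illustrated on a scalar toy fixed-point family. The tanh family below is a stand-in for BA_β, not an IB quantity:

```python
import math

# First-order root tracking for a scalar fixed-point family
#   p = g(p, β) := tanh(β·p),
# a toy stand-in for BA_β.  With F := p − g, the implicit ODE reads
#   dp/dβ = −(∂F/∂p)^{-1} ∂F/∂β = (∂g/∂β) / (1 − ∂g/∂p).

def g(p, b):      return math.tanh(b * p)
def dg_dp(p, b):  return b * (1 - math.tanh(b * p) ** 2)
def dg_db(p, b):  return p * (1 - math.tanh(b * p) ** 2)

def solve_root(b, p0=0.9):
    # Fixed-point iteration to high accuracy (a contraction here).
    p = p0
    for _ in range(200):
        p = g(p, b)
    return p

b0, db = 1.5, 0.05
p0 = solve_root(b0)
# One Euler step of the implicit ODE:
p_pred = p0 + db * dg_db(p0, b0) / (1 - dg_dp(p0, b0))
p_true = solve_root(b0 + db)
# The first-order prediction error is O(db²):
assert abs(p_pred - p_true) < 2 * db ** 2
```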
C Proof of Lemma 2, on the kernel of the Jacobian of the IB operator in log-decoder coordinates

We prove Lemma 2 from Section 3, using the results of Appendix B.
In the first direction, suppose that ((v_{y,x̂})_{y,x̂}, (u_x̂)_x̂) is a vector in the left kernel of the Jacobian of the IB operator (5) in log-decoder coordinates, I − D_{log p(y|x̂), log p(x̂)} BA_β, as in (16) in Section 3. Using the Jacobian's implicit form (51) (Appendix B.2), this is to say that Equations (100) and (101) hold, for every y and x̂. We spell out and manipulate these equations to obtain the desired result. By the Jacobian's explicit form (70), these take the explicit forms (102) and (103). Next, we expand and simplify each of the terms in (102) and (103), using the definitions (58) of A, B and C from Appendix B.2. For the first summand to the right of (102), expanding C(x̂, x̂'; i+1)_{y',y} by its definition (58) gives (104). We simplify each of the four addends to the right of (104), while temporarily ignoring the β coefficient: the δ_{x̂',x̂} · 1 term yields (105), the −δ_{x̂',x̂} · δ_{y',y}/p_{i+1}(y|x̂) term yields (106), the third addend yields (107), and the last −δ_{x̂',x̂} · (−δ_{y',y}/p_{i+1}(y|x̂)) term yields (108). Collecting (105), (106), (107) and (108) for the first summand to the right of (102) gives (109). The second summand to the right of (102) equals (110). Combining (109) and (110) yields (111), for any y and x̂. Summing (111) over y and simplifying, we obtain (112), for any x̂. Next, we expand and simplify Equation (103). Using the definition (58) of B, the first summand to its right can be written as (113). Similarly, the second summand to the right of (103) can be written as (114). Combining (113) and (114), Equation (103) can now be written explicitly as (115), for every x̂. Next, subtracting (115) from (112), we obtain (116), for any x̂.
Substituting (116) into (111) and using the decoder Equation 1.6 to expand p_{i+1}(y'|x̂') there yields (117). Next, inserting Σ_x̂ δ_{x̂,x̂'} into the sums on the last line yields (118). Finally, this simplifies to
v_{y',x̂'} = Σ_{y,x̂} v_{y,x̂} Σ_x p(y'|x) [δ_{x̂,x̂'} − p_{i+1}(x̂'|x)] p_{i+1}(x|x̂) [β · p(y|x)/p_{i+1}(y|x̂) + (1 − 2β)] (119)
The latter is to say that (v_{y,x̂})_{y,x̂} is a left-eigenvector of the eigenvalue 1 of the matrix to the right. At an IB root, this is precisely the matrix S (17) from the Lemma's statement, as desired.
As a side note, we comment that Equations (100) and (101) have simpler consequences, which can be seen by summing (111) and (115) respectively over x̂, and simplifying.
At the other direction, let v := (v_{y,x̂})_{y,x̂} be a left-eigenvector of the eigenvalue 1 of S (17); that is, assume that Equation (119) holds. Define a vector u := (u_x̂)_x̂ by Equation (116). Reversing the algebra, (119) is equivalent to (117). Substituting (116) into the latter yields back (111), which is equivalent to the explicit form (102) of Equation (100). Next, summing (111) over y and simplifying yields (112). Adding the latter to (116) yields back (115), which is equivalent to Equation (103), the explicit form of (101). To conclude, both of the Equations (100) and (101) hold, as desired.

Denote repeated BA iterations initialized at p_0 by p_k := BA_β^{∘k}(p_0). Linearizing around a fixed point p_β of BA_β,
BA_β(p) ≈ p_β + D BA_β|_{p_β} (p − p_β),
where D BA_β|_{p_β} denotes the Jacobian matrix of BA_β evaluated at p_β. Rewriting in terms of the error δp_k := p_k − p_β of the k-th iterate,
δp_{k+1} ≈ D BA_β|_{p_β} δp_k.
Thus, to first order, repeated applications of BA_β reduce the initial error according to
δp_k ≈ (D BA_β|_{p_β})^k δp_0. (124)
Next, consider k > 0 applications of BA_{β+Δβ} to a root p_β at β. This is similar to deterministic annealing, but with a capped number of BA iterations. Plugging the initial error δp_0 := p_β − p_{β+Δβ} ≈ −Δβ · (dp/dβ)|_β into Equation (124) shows that this method is of the first order,
δp_k ≈ |Δβ| · (D BA_{β+Δβ}|_{p_{β+Δβ}})^k (dp/dβ)|_β.
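The geometric error decay (124) can be illustrated with a generic linear contraction standing in for D BA_β at a root (the matrix, its spectral radius, and the sizes below are arbitrary choices, not IB quantities):

```python
import numpy as np

rng = np.random.default_rng(2)

# A random linear contraction J with spectral radius 0.6, standing in
# for the Jacobian D BA_β at a fixed point.
A = rng.standard_normal((5, 5))
J = 0.6 * A / np.abs(np.linalg.eigvals(A)).max()

# Iterate the linearized error dynamics  δp_{k+1} = J δp_k.
dp = rng.standard_normal(5)
errs = [np.linalg.norm(dp)]
for _ in range(59):
    dp = J @ dp
    errs.append(np.linalg.norm(dp))

# The error decays geometrically, governed by the spectral radius;
# compare against a slightly larger rate to absorb constants:
assert errs[-1] < errs[0] * 0.7 ** 59
```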

F Equivalent conditions for cluster-merging bifurcations
We briefly discuss the equivalent conditions for cluster-merging bifurcations in the IB (Subsection 5.2) found in the literature. [Rose et al., 1990b, Section 4] derive a condition for cluster-splitting phase transitions (Equation (17) there) in the context of fuzzy clustering. Following this, [Zaslavsky, 2019, 3.2 in Part III] derives an analogous condition for cluster splitting in the IB, which is Equation (12) there. Namely, for a cluster x̂ to split it is necessary that 1/β be an eigenvalue of an |X|-by-|X| matrix C_X(x̂; β) (condition (134)), whose entries at an IB root are given by (135), where I is the identity. While the coefficients matrix (135) for the IB differs from the one for fuzzy clustering, inter-cluster interactions are explicitly neglected in both derivations (see therein). Indeed, the definition (135) of C_X involves the coordinates of cluster x̂ alone, as one might expect when considering a root in either decoder or in inverse-encoder coordinates (Section 2). Reversing the dynamics in β, condition (134) characterizes cluster-merging bifurcations in the IB (Subsection 5.2). Zaslavsky [2019] notes that (134) is closely related to the bifurcation analysis of Gedeon et al. [2012]. The latter provides a condition to identify the critical β values of IB bifurcations, given in their Theorem 5.3. Indeed, their condition is equivalent to (134), and therefore it also characterizes cluster-merging bifurcations. To see this, note that the necessary condition they give for a phase transition at β is that 1/β must be an eigenvalue of a matrix V (Equation (21) there). When written in our notation, this matrix is given by
V(x̂; β)_{x,x'} := Σ_y p(x', y) p(x, y) p_β(x̂|x) / (p_β(y, x̂) p(x')). (136)
However, V (136) is readily seen to be the transpose of C X (135), and so they have the same eigenvalues.
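The fact used here — that a matrix and its transpose share a characteristic polynomial, hence the same eigenvalues — is easy to confirm numerically on a random stand-in matrix (not the IB's V):

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.standard_normal((6, 6))

# det(V − λI) = det(Vᵀ − λI): same characteristic polynomial,
# hence the same eigenvalues.
assert np.allclose(np.poly(V), np.poly(V.T))

# The eigenvalue moduli agree as well (sorted for comparison):
ev = np.sort(np.abs(np.linalg.eigvals(V)))
ev_t = np.sort(np.abs(np.linalg.eigvals(V.T)))
assert np.allclose(ev, ev_t)
```

So the condition "1/β is an eigenvalue" may be checked on whichever of the two matrices is more convenient to form.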

G Lyapunov-stability of an optimal IB root
We provide the essential parts of a proof that an optimal IB root is Lyapunov uniformly asymptotically stable, on closed intervals that do not contain a bifurcation, when following the flow dictated by the IB's ODE (16) in decreasing β. Definitions for the below are as in [Slotine et al., 1991] (see especially Section 4.2 there). See Subsection 6.3 for a discussion of the results below.
Let p*(β) be an optimal IB root. We start by rewriting it as an equilibrium of a non-autonomous ODE, as in [Slotine et al., 1991, Equation (4.1)]. Consider the implicit ODE (7), dp/dβ = −(D_p F)^{-1} D_β F, specialized to the IB by setting F := Id − BA_β (5). Denote δp := p − p*, for an arbitrary p. Subtracting the ODE at p from that at p* yields a non-autonomous ODE (137) in the error δp from the optimal root. This rewrites the given root p* as the equilibrium δp = 0 of the ODE (137), simplifying what follows.
Next, we define a Lyapunov function for the flow of the equilibrium δp = 0 along the ODE (137), when its dynamics in β is reversed. Consider the IB's Lagrangian L_β := I(X; X̂) − β · I(Y; X̂) as a functional in p, and let L*_β := L_β[p*] be its optimal value at β. Then,
V_β(δp) := L_β[p* + δp] − L*_β (138)
is the desired Lyapunov function. Specifically, (i) L_β − L*_β is positive definite, and (ii) d/dβ (L_β − L*_β) is negative definite, when the dynamics in β are reversed. Theorem 4.1 in [Slotine et al., 1991] then implies that δp = 0 is uniformly asymptotically stable, [Slotine et al., 1991, Definition 4.6]. For (i), L_β − L*_β (138) is immediately seen to be positive semi-definite from the definition of L*_β, up to technicalities ignored here 50. The results of Subsection 5.3 (after Proposition 12) imply that representing p in reduced log-decoder coordinates renders (138) strictly positive definite. Indeed, D(Id − BA_β) is non-singular in a reduced representation in these coordinates, as mentioned there, and so an optimal root p* is locally unique. As for condition (ii), from the definition of L_β we have
d/dβ L_β = d/dβ I(X; X̂) − I(Y; X̂) − β · d/dβ I(Y; X̂) = −I(Y; X̂), (139)
where d/dβ I(X; X̂) = β · d/dβ I(Y; X̂) in the last equality follows by direct calculations, similar to those in the Appendix of [Zaslavsky, 2019, Part III]. Thus, for the β-derivative of (138) we have
d/dβ (L_β − L*_β)(δp) = I(Y; X̂)|_{p*} − I(Y; X̂)|_p. (140)
The latter is always positive semi-definite around p * , since by definition (1) p * yields the maximal Y -information subject to a constraint on the X-information. The same argument as above shows that it is strictly positive definite. Finally, reversing the dynamics in β leaves the ODE (137) unaffected but flips the sign of (140), rendering it negative definite as required.
H Introducing degeneracies cannot increase the nullity of the IB operator in decoder coordinates

We show that evaluating the kernel of the IB operator at a degenerate representation cannot increase its nullity.
Let p ∈ Δ[Δ[Y]] be an IB root of effective cardinality T_1. A T-clustered representation of a root (e.g., in decoder coordinates) is a function π : Δ[Δ[Y]] → R^{(|Y|+1)·T}, defined on some neighborhood of the root. In the other direction, one can consider the inclusion i : R^{(|Y|+1)·T} → Δ[Δ[Y]], defined on normalized decoder coordinates in the obvious way. Let π be a representation of p in its effective cardinality T_1, and π̃ a degenerate one on T_2 > T_1 clusters. These satisfy (141), where reduc is the reduction map 51. In the other direction, one can pick a particular degenerating map degen (e.g., "split the third cluster into two copies of probability ratio 1:2"). Applying a particular degeneracy and then reducing is the identity, though not the other way around. Let i and ĩ be the inclusions corresponding to π and π̃, respectively. Similar to (141), introducing degeneracy to a root has no effect before including it in Δ[Δ[Y]], (142). Recall from Subsection 5.1 (before Conjecture 5) that BA_β in decoder coordinates may be considered as an operator on Δ[Δ[Y]]. To summarize, we have the diagram (143). Next, consider the representations of the IB operator Id − BA_β (5) on T_1 and T_2 clusters. These amount to pre-composing with the inclusions and post-composing with the representation maps. Denote by Id_i the identity operator on R^{(|Y|+1)·T_i}. By the identities (141), (142) and (143), we have (145). Differentiating (145), the chain rule yields a corresponding identity between the Jacobian matrices. Multiplying matrices can only enlarge the kernel, dim ker AB ≥ dim ker A, and so the desired inequality between the nullities follows. Thus, introducing degeneracies to the IB operator in decoder coordinates cannot increase its nullity.
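The kernel inequality invoked above, dim ker AB ≥ dim ker A for square matrices, can be checked numerically; the matrices below are random stand-ins, with a kernel forced into A:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
A[:, -1] = A[:, 0] + A[:, 1]        # force a nontrivial kernel in A
B = rng.standard_normal((n, n))

# Nullity via numerical rank: rank(AB) ≤ rank(A) implies
# dim ker(AB) ≥ dim ker(A) for square matrices of equal size.
nullity = lambda M: n - np.linalg.matrix_rank(M)
assert nullity(A) >= 1
assert nullity(A @ B) >= nullity(A)
```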