Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems

In this paper, we investigate the bifurcations of solutions to a class of degenerate constrained optimization problems. This study was motivated by the Information Bottleneck and Information Distortion problems, which have been used to successfully cluster data in many different applications. In the problems we discuss in this paper, the distortion function is not a linear function of the quantizer. This leads to a challenging annealing optimization problem, which we recast as a fixed-point problem for the gradient flow of a related dynamical system. The gradient system possesses an $S_N$ symmetry due to its invariance under relabeling of the representative classes. Its flow hence passes through a series of bifurcations with specific symmetry breaks. Here, we show that the dynamical system related to the Information Bottleneck problem has an additional spurious symmetry that requires a more challenging analysis of the symmetry-breaking bifurcation. For the Information Bottleneck, we determine that when bifurcations occur, they are only of pitchfork type, and we give conditions that determine the stability of the bifurcating branches. We relate the existence of subcritical bifurcations to the existence of first-order phase transitions in the corresponding distortion function as a function of the annealing parameter, and provide criteria with which to detect such transitions.


Introduction
This paper analyzes bifurcations of solutions to constrained optimization problems of the form

$$\max_{q \in \Delta} F(q, \beta) = \max_{q \in \Delta} \sum_{i=1}^{N} f(q_i, \beta) \qquad (1)$$

as a function of a scalar parameter $\beta$ and a quantizer or classifier $q = (q_1, \ldots, q_N)$ with $q_i \in \mathbb{R}^K$. The real-valued function $f$ is sufficiently smooth, and $\Delta$ is the constraint space of valid quantizers, a convex set of discrete probabilities (simplices). This type of problem arises in Rate Distortion Theory [1,2], Deterministic Annealing [3] and biclustering [4]. The specific motivations for the abstract problem formulation given in (1) are the Information Bottleneck [5] and Information Distortion [6] functions

$$\max_{q \in \Delta} F(q, \beta) = \max_{q \in \Delta} \left( D(q) - \beta I(Y; T) \right). \qquad (2)$$

These were proposed in [5,7] to analyze the Markov chain $X \to Y \to T$, in which $X \to Y$, described by a joint probability $p(X, Y)$ with mutual information $I(X; Y)$, is the original system of interest, and $T$ is a simplification (a quantized version) of $Y$. Here we work mainly with discrete versions of $Y$ and $T$, with cardinalities $|Y| = K$ and $|T| = N$. Typically $N \ll K$. $I(Y; T)$ is the mutual information between the $K$ objects in $Y$ and the $N$ clusters in $T$. The goal is to cluster the $K$ objects in $Y$ into $N$ clusters in $T$, given inputs $X$, such that the function $F$ is maximized over $[q_i]_j$, the probability that the $j$th element of $Y$ is classified as a member of the cluster with label $i \in T$. We call such a set of conditional probabilities a stochastic quantizer, or just a quantizer, to relate to the vector quantization literature [8]. The annealing parameter satisfies $\beta \in [0, \infty)$.
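As a concrete illustration of the quantities in this paragraph, the following minimal Python sketch evaluates $I(X;T)$ and $I(Y;T)$ for a small made-up joint distribution and two quantizers. The function names (`mutual_information`, `ib_terms`, `ib_cost`), the toy numbers, and the sign convention of the cost ($-I(Y;T) + \beta I(X;T)$, that is, an annealing form $G + \beta D$ with $G = -I(Y;T)$ and $D = I(X;T)$) are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                total += p * math.log(p / (pa[a] * pb[b]))
    return total

def ib_terms(p_xy, q):
    """Return (I(X;T), I(Y;T)) for a joint p_xy[x][y] and a quantizer q[t][y] = q(t|y)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(q)
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    # p(y,t) = p(y) q(t|y)  and  p(x,t) = sum_y p(x,y) q(t|y)
    p_yt = [[p_y[y] * q[t][y] for t in range(n_t)] for y in range(n_y)]
    p_xt = [[sum(p_xy[x][y] * q[t][y] for y in range(n_y)) for t in range(n_t)]
            for x in range(n_x)]
    return mutual_information(p_xt), mutual_information(p_yt)

def ib_cost(p_xy, q, beta):
    """Assumed IB-type cost G(q) + beta * D(q) with G = -I(Y;T), D = I(X;T)."""
    i_xt, i_yt = ib_terms(p_xy, q)
    return -i_yt + beta * i_xt

# Toy joint distribution (made-up numbers): 2 values of X, K = 4 objects in Y.
p_xy = [[0.20, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.20]]
uniform = [[0.5] * 4, [0.5] * 4]                      # q(t|y) = 1/N with N = 2
hard = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]   # a deterministic clustering
relabeled = [hard[1], hard[0]]                        # swap the two class labels

print(ib_terms(p_xy, uniform), ib_terms(p_xy, hard))
print(ib_cost(p_xy, hard, 3.0), ib_cost(p_xy, relabeled, 3.0))
```

In this sketch the uniform quantizer carries no information about either $X$ or $Y$, and swapping the two class labels leaves the cost unchanged, which is the relabeling invariance that the later symmetry analysis builds on.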
It has been shown that finding hard-clustering solutions to (2) is NP-complete (combinatorial search) when D(q) is the mutual information I(X; T) [9], as in the Information Bottleneck [5,10,11] and the Information Distortion [7,12,13] methods. Information Bottleneck (IB) approaches are increasingly used across multiple scientific and engineering domains [14-18]. As they typically involve the nonlinear optimization problem (2), there is a need for optimization methods that avoid the rise in complexity implied by the NP-complete hard-clustering solutions [9]. Originally, Tishby et al. [5] approached this problem with an algorithm inspired by the Blahut-Arimoto approach to solving Rate-Distortion types of problems ([2], Chapter 10). The "self-consistent" equations in [5] optimize both the quantizer and the "relevance" distribution p(x|t). However, unlike the classic Blahut-Arimoto algorithm, whose iterative scheme is guaranteed to converge to a unique solution because of the convex geometry of the two state spaces, the "self-consistent" equations have no such guarantee due to the more complicated geometry of the three convex sets over which the optimization is performed, as also noted in [5]. Accordingly, in this work, we use the original optimization problem (2) over a single variable: the quantizer (conditional probability) q(t|y). A related Blahut-Arimoto style optimization, coupled to the bifurcation structure of the gradient flow discussed here, may lead to additional insights into this problem, but we consider this beyond the scope of this manuscript.
We have investigated the structure of soft-clustering annealing-type methods that reach the hard-clustering solution in the limit of the annealing parameter [19,20] through a series of bifurcations. A bifurcation in this context is a point that is a solution (q * , β * ) to (2) such that the number of solutions to (2) changes in a small neighborhood of (q * , β * ). Because a bifurcation corresponds to a point at which some of the objects Y have just been classified, in the IB literature, a bifurcation is usually referred to as a phase transition. One of the goals of this and related work is to understand why annealing-type algorithms, such as the original optimization heuristics in [5,10], work as well as they do. This can help with designing further optimization heuristics and can assess how close those can get to the global solutions to IB problems. We believe that this amalgamation of optimization theory and dynamical systems theory, as stated in [19,20], can provide a solid foundation with which to address such optimization challenges.
Because of the form (1) of $F$, it possesses certain symmetries. That is, the value of $F(q, \beta)$ does not change (is invariant) under arbitrary permutations of the vectors $q_i$. In other words, $F$ is $S_N$-invariant. The form (1) further implies that the Hessian $d^2F$ is block diagonal, with one block $d^2 f(q_i)$ for each class $i$. These conditions are met by the Information Distortion function (3) [6], in which $H(T|Y)$ is the conditional entropy, and by the cost function (4) used in the original IB method [5], which is the focus of this manuscript. Both the Information Distortion and Information Bottleneck problems have the form given in (1) and (2). Importantly, $d^2F_{IB}(q)$ has a "perpetual kernel" since each block $d^2 f(q_i)$ has the eigenpair $(0, q_i)$ for every $q$ [20]. In other words, the Hessian $d^2F_{IB}$ is singular for every $q$ and every value of $\beta$. This makes bifurcation detection challenging because bifurcations are usually detected by identifying isolated singularities of $d^2F$. This degeneracy is a consequence of the translational symmetry of $F_{IB}$: if $\mathbf{k} \in \ker d^2_q F_{IB}(q^*)$, then $F_{IB}(q^*) = F_{IB}(q^* + t\mathbf{k})$ for all $t \in \mathbb{R}$ such that $q^* + t\mathbf{k} \in \Delta$. At bifurcations of solutions to (4), the translational symmetry never breaks.
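The eigenpair $(0, q_i)$ can be checked numerically. The sketch below assumes the IB cost has the form $-I(Y;T) + \beta I(X;T)$, so that the contribution of a single class, here called `f_ib` (a hypothetical name), is homogeneous of degree one in $q_i$; by Euler's theorem the Hessian block then annihilates $q_i$. The toy distribution and the helper `second_directional_derivative` are also illustrative assumptions, and the curvature is estimated with a central finite difference rather than an analytic Hessian.

```python
import math

def f_ib(q_i, p_xy, beta):
    """Contribution of one class to the assumed cost -I(Y;T) + beta * I(X;T).

    q_i[y] = q(t=i | y); p_xy[x][y] is the joint distribution of X and Y.
    """
    n_x, n_y = len(p_xy), len(q_i)
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    p_x = [sum(row) for row in p_xy]
    p_t = sum(p_y[y] * q_i[y] for y in range(n_y))            # p(t=i)
    i_yt = sum(p_y[y] * q_i[y] * math.log(q_i[y] / p_t)
               for y in range(n_y) if q_i[y] > 0.0)
    i_xt = 0.0
    for x in range(n_x):
        p_xt = sum(p_xy[x][y] * q_i[y] for y in range(n_y))   # p(x, t=i)
        if p_xt > 0.0:
            i_xt += p_xt * math.log(p_xt / (p_x[x] * p_t))
    return -i_yt + beta * i_xt

def second_directional_derivative(q_i, direction, p_xy, beta, h=1e-3):
    """Central-difference estimate of direction^T d^2 f(q_i) direction."""
    plus = [a + h * d for a, d in zip(q_i, direction)]
    minus = [a - h * d for a, d in zip(q_i, direction)]
    return (f_ib(plus, p_xy, beta) - 2.0 * f_ib(q_i, p_xy, beta)
            + f_ib(minus, p_xy, beta)) / h ** 2

# Toy data (made-up numbers): 2 values of X, K = 4 objects in Y.
p_xy = [[0.20, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.20]]
q_i = [0.5, 0.3, 0.6, 0.2]     # one strictly positive class row q(t=i | y)
beta = 2.0

curv_along_q = second_directional_derivative(q_i, q_i, p_xy, beta)
curv_generic = second_directional_derivative(q_i, [1.0, -1.0, 0.0, 0.0], p_xy, beta)
print(curv_along_q, curv_generic)
```

Under these assumptions the curvature along $q_i$ is zero up to rounding for any $\beta$, while the curvature along a generic direction is not, which is the numerical signature of a kernel direction that never goes away.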
To better understand bifurcations of solutions to problems of the form (1), which includes the problems (3) and (4), we consider the gradient flow $(\dot{q}, \dot{\lambda}) = \nabla_{q,\lambda} L(q, \lambda, \beta)$, where $L$ is the Lagrangian with respect to the constraints imposed by $\Delta$, and $\lambda$ is the vector of Lagrange multipliers. Equilibria of this flow correspond to critical points of (1). Previous work showed that when $d^2F$ is generically non-singular, as occurs for the Information Distortion (3), there are isolated singularities of $d^2L$ that indicate possible bifurcations of solutions to (1). In this case, an $M$-dimensional $\ker d^2F$ with $M > 1$ necessitates an $(M-1)$-dimensional $\ker d^2L$, which admits a bifurcation of solutions to (1) where symmetry breaks from $S_M$ to $S_m \times S_n$ for every $m, n > 0$ such that $m + n = M$ [21].
Here we allow $d^2F$ and $d^2L$ to be singular for every $q \in \Delta$, as occurs for the Information Bottleneck (4). That is, the perpetual kernel for $d^2F$ implies that $d^2L$ also has a perpetual kernel $\ker d^2L = K_p(q)$, which means that the eigenvalue crossing condition that must occur at a bifurcation (i.e., $d^2L$ must have a zero eigenvalue at a bifurcation) [20] is never satisfied in $K_p$. There are a few challenges due to the existence of the perpetual kernel (i.e., degeneracy) of the Information Bottleneck that we address in this paper. First, detecting bifurcations may be problematic because one cannot simply monitor the determinant of either $d^2F$ or $d^2L$. Second, the standard theory that assures the existence of bifurcating branches, the Equivariant Branching Lemma, cannot be applied directly. Lastly, the spaces that contain the bifurcating solutions are always at least two-dimensional, which makes tracking the bifurcating solutions problematic.
Here we address two of these three challenges. We show that at a bifurcation, new eigenvalue(s) of d 2 F IB and d 2 L must cross zero, causing ker d 2 L to expand so that ker d 2 L(q * ) = K p ∪ K * , where K * is the span of the eigenvectors with crossing eigenvalues. Instead of detecting bifurcations by the expensive process of monitoring the expansion of ker d 2 L (from K p to K p ∪ K * ), we give a simple way to check the eigenvalue crossing condition for annealing problems F = G(q) + βD(q) as in (2) [20]. We prove the existence of the bifurcating branches by adapting the standard proof for the Equivariant Branching Lemma. This newly developed theory guarantees that bifurcating branches exist in K * , are generically pitchforks, and that symmetry breaks from S M to S m × S n . Additionally, we give conditions to check whether the pitchforks are subcritical or supercritical, and how stability of the bifurcating branches relates to optimality in the optimization problem (1).

Equivariant Branching Lemma
The Equivariant Branching Lemma relates the subgroup structure of a symmetry group $\Gamma$ with the existence of symmetry-breaking bifurcating branches of equilibria of $\dot{x} = f(x, \beta)$. Observe that we present a version that does not require absolute irreducibility. For a proof, see [22], p. 83.
For an arbitrary $\Gamma$-equivariant system where bifurcation occurs at $(x^*, \beta^*)$, the requirement in Theorem 1 that the bifurcation occurs at the origin is accomplished by a translation. Assuring that the Jacobian vanishes, $d_x f(0, 0) = 0$, can be effected by restricting and projecting the system onto the kernel of the Jacobian. This transformation is called the Liapunov-Schmidt reduction (see [23]).
The Equivariant Branching Lemma does not directly apply to yield bifurcating branches for the problem (1) at $q$ for which $d^2F$ is singular, for the following reasons:
• $K_p$ and $K^*$ have independent bases, which implies that each is invariant under the action of $S_N$, and so the decomposition $\ker d^2L(q^*) = K_p \times K^*$ shows that $S_N$ does not act absolutely irreducibly on $\ker d^2F(q^*)$, but it does act absolutely irreducibly on each of these subspaces separately. This is why we present a version of the Equivariant Branching Lemma that does not require absolute irreducibility.
• The Liapunov-Schmidt reduction onto $\ker d^2L(q^*)$ is clear, but not onto $K^*$.
We address these issues in the manuscript and show that a small modification of the Equivariant Branching Lemma allows for similar analysis to be successfully applied to Information Bottleneck-style problems such as (2) with minimal modifications to the original algorithm from [20].

A Gradient Flow
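We now lay the groundwork necessary to determine the bifurcations of local solutions to (1),

$$\max_{q \in \Delta} F(q, \beta) = \max_{q \in \Delta} \sum_{i=1}^{N} f(q_i, \beta),$$

which includes as special cases the Information Distortion (3) and Information Bottleneck (4) problems. The convex set of discrete conditional probabilities is

$$\Delta := \left\{ q \in \mathbb{R}^{NK} : \sum_{i=1}^{N} q_{ik} = 1 \text{ and } q_{ik} \geq 0 \text{ for all } i, k \right\}.$$

Due to the form of $F$, it has the following properties:

1. $F(q, \beta)$ is an $S_N$-invariant, real-valued function of $q$, where the action of $S_N$ on $q$ permutes the component vectors $q_i$, $i = 1, \ldots, N$, of $q \in \Delta$.

2. The $NK \times NK$ Hessian $d^2F(q, \beta)$ is block diagonal, with $N$ blocks $d^2 f(q_i, \beta)$ of size $K \times K$.

The Lagrangian of (1) with respect to the equality constraints from $\Delta$ is

$$L(q, \lambda, \beta) = F(q, \beta) + \sum_{k=1}^{K} \lambda_k \left( \sum_{i=1}^{N} q_{ik} - 1 \right). \qquad (5)$$

The scalar $\lambda_k$ is the Lagrange multiplier for the constraint $\sum_{i=1}^{N} q_{ik} - 1 = 0$, and $\lambda \in \mathbb{R}^K$ is the vector of Lagrange multipliers $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_K)^T$. The gradient of the Lagrangian in (5) is

$$\nabla_{q,\lambda} L(q, \lambda, \beta) = \begin{pmatrix} \nabla_q F(q, \beta) + J^T \lambda \\ Jq - \mathbf{1} \end{pmatrix}, \qquad J = \begin{pmatrix} I_K & I_K & \cdots & I_K \end{pmatrix}, \qquad (6)$$

where $J$ is the $K \times NK$ Jacobian of the constraints, $I_K$ is the $K \times K$ identity, and $\mathbf{1} \in \mathbb{R}^K$ is the vector of ones. Observe that $J$ has full row rank. The Hessian of (5) with respect to the vector $(q, \lambda) \in \mathbb{R}^{NK+K}$ is

$$d^2 L(q, \lambda, \beta) = \begin{pmatrix} d^2 F(q, \beta) & J^T \\ J & \mathbf{0} \end{pmatrix}, \qquad (7)$$

where $\mathbf{0}$ is $K \times K$. The dynamical system whose equilibria are stationary points of (1) is the gradient flow of the Lagrangian,

$$\begin{pmatrix} \dot{q} \\ \dot{\lambda} \end{pmatrix} = \nabla_{q,\lambda} L(q, \lambda, \beta), \qquad (8)$$

for $L$ as defined in (5) and $\beta \in [0, \infty)$. The equilibria of (8) are points $(q^*, \lambda^*) \in \mathbb{R}^{NK+K}$ where $\nabla L(q^*, \lambda^*, \beta) = 0$.
The Jacobian of this system is the Hessian $d^2L(q, \lambda, \beta)$ from (7).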
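To make the role of the constraint Jacobian $J$ concrete, here is a small sketch that assembles the right-hand side of the flow for a flattened quantizer. It assumes the Lagrangian has the form $F(q,\beta) + \sum_k \lambda_k (\sum_i q_{ik} - 1)$, that $q$ is stored class by class as $(q_1, \ldots, q_N)$, and that $J = (I_K \; I_K \; \cdots \; I_K)$; the function names and the placeholder gradient values are hypothetical.

```python
def constraint_matrix(n_classes, n_objects):
    """J = (I_K I_K ... I_K): K rows and N*K columns (assumed storage order)."""
    return [[1.0 if col % n_objects == k else 0.0
             for col in range(n_classes * n_objects)]
            for k in range(n_objects)]

def lagrangian_gradient(grad_f, q_flat, lam, n_classes, n_objects):
    """Stack (grad_q F + J^T lam, J q - 1) for a flattened quantizer q_flat."""
    J = constraint_matrix(n_classes, n_objects)
    size = n_classes * n_objects
    # J^T lam simply repeats lam once per class block.
    top = [grad_f[c] + lam[c % n_objects] for c in range(size)]
    # J q - 1 measures how far each column of q is from summing to one.
    bottom = [sum(J[k][c] * q_flat[c] for c in range(size)) - 1.0
              for k in range(n_objects)]
    return top + bottom

# Toy check with N = 2 classes and K = 3 objects (placeholder gradient values).
N, K = 2, 3
q_uniform = [0.5] * (N * K)
grad_f = [0.3] * (N * K)      # stand-in for grad_q F at q_uniform
lam = [-0.3] * K              # multipliers that cancel the stand-in gradient
residual = lagrangian_gradient(grad_f, q_uniform, lam, N, K)
print(residual)
```

A point is an equilibrium of the flow exactly when this stacked vector vanishes; the lower block vanishes whenever every column of the quantizer sums to one.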

Equilibria with Symmetry
Next, we categorize the equilibria of (8) according to their symmetries, which allows us to determine when to expect symmetry-breaking bifurcations.
Let $q \in \mathrm{Fix}(S_M)$ for some $1 \leq M \leq N$. Then there exists a partition of $\{1, 2, \ldots, N\}$ into the sets $\mathcal{U}$ and $\mathcal{R}$, where $|\mathcal{U}| = M$, so that $q_i = q_j$ if and only if $i, j \in \mathcal{U}$. Clearly, $d^2F$ has $M$ identical blocks, $\{d^2 f(q_i)\}_{i \in \mathcal{U}}$.
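The partition into $\mathcal{U}$ and $\mathcal{R}$ can be read off a numerical quantizer by grouping identical class rows. The following sketch is one simple way to do this; the function name `symmetry_partition`, the tolerance, and the choice of taking the largest group of identical rows as $\mathcal{U}$ are assumptions for illustration (the text additionally requires the rows indexed by $\mathcal{R}$ to be distinct, which this helper does not enforce).

```python
def symmetry_partition(q, tol=1e-9):
    """Split class indices into U (largest set of identical rows q_i) and R (the rest)."""
    groups = []                       # each group collects indices of equal rows
    for i, row in enumerate(q):
        for g in groups:
            representative = q[g[0]]
            if all(abs(a - b) <= tol for a, b in zip(row, representative)):
                g.append(i)
                break
        else:
            groups.append([i])        # no existing group matched this row
    U = max(groups, key=len)
    R = [i for i in range(len(q)) if i not in U]
    return U, R

# Toy quantizer with N = 4 classes and K = 3 objects; every column sums to one.
q = [[0.2, 0.3, 0.1],
     [0.2, 0.3, 0.1],
     [0.4, 0.1, 0.7],
     [0.2, 0.3, 0.1]]
U, R = symmetry_partition(q)
M = len(U)
print(U, R, M)
```

Here three of the four classes are still unresolved and identical, so the quantizer is fixed by a copy of $S_3$ acting on those class labels.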
To distinguish between the blocks of $d^2F$, we write

$$B := d^2 f(q_i) \text{ for } i \in \mathcal{U}, \qquad R_i := d^2 f(q_i) \text{ for } i \in \mathcal{R}. \qquad (9)$$

As mentioned in the introduction, we assume that for each $q \in \Delta$, each block $d^2 f(q_i)$ always has at least a one-dimensional kernel with basis vector(s) which depend on $q$. Thus, $\dim \ker d^2F \geq N$. At an equilibrium $(q^*, \lambda^*, \beta^*)$ of (8) where $q^* \in \mathrm{Fix}(S_M)$, we consider three cases. We will show that the first case necessitates a symmetry-breaking bifurcation (Theorem 3). In the second case, there is no bifurcation (Corollary 1). Finally, in the third case, we expect a saddle node [21], a symmetry-preserving bifurcation.
We are able to distinguish between the three cases by considering which blocks of $d^2F(q^*)$ have kernels that have more than one dimension. This motivates the following definition.
2. For $B$, the $M$ block(s) of the Hessian defined in (9), $\ker B$ has dimension 2 with basis vectors $\mathbf{v}, \mathbf{y} \in \mathbb{R}^K$. The vector $\mathbf{v}$ is associated with the crossing eigenvalues, and $\mathbf{y}$ is associated with the constant zero eigenvalue of $B$.

3. The $N - M$ block(s) of the Hessian $\{R_i\}_{i \in \mathcal{R}}$, defined in (9), each have a one-dimensional kernel with basis vector $\mathbf{z}(i) \in \mathbb{R}^K$.

4. The vectors $\mathbf{v}$, $\mathbf{y}$ and $\{\mathbf{z}(i)\}$ are linearly independent.

5. The matrix $A$ is nonsingular, where $R_i^-$ is the Moore-Penrose inverse of $R_i$. When $M = N$, we define $A := N I_K$.
We wish to emphasize that we showed in [21] that requirements 2-5 in Definition 1 hold generically.
A straightforward calculation shows that every block of the Hessian $d^2F$ of the Information Bottleneck cost function (2) is singular for every $(q, \beta)$, and the basis for $\ker d^2 f(q_i)$ is $\mathbf{y} = q_i$ for $1 \leq i \leq M$ and $\mathbf{z}(i) = q_i$ for $M + 1 \leq i \leq N$ (Lemma 42 in [25]), which assures that these vectors are linearly independent, as in Definition 1.4. At a bifurcation, the kernels of the identical blocks $B$ expand by $\mathbf{v}$ as in Definition 1.2. Using the notation above, $\mathbf{y} = q_i$ for each $i \in \mathcal{U}$, and $\mathbf{z}(i) = q_i$ for each $i \in \mathcal{R}$.

The Kernel at a Bifurcation
The equilibria of (8) change their stability with β, and hence change the solutions to (1). The changes of stability are determined by the kernel of d 2 L(q * ) at a bifurcation point q * . In this section we show that for any q ∈ Fix(S M ) with M > 1, d 2 L(q * ) has a perpetual kernel K p that is at least M − 1 dimensional. The zero eigenvalues associated with the eigenvectors in K p remain constant, so that at a bifurcation point (q * , λ * , β * ) of (8) where q * is M-singular, new eigenvalues of d 2 L must cross zero. Thus, the kernel expands, and the bifurcating directions exist in an "expanded" kernel of d 2 L(q * ), ker d 2 L(q * ) = K * × K p .
We determine a basis for $\ker d^2L$ at an $M$-singular $q^*$ when $M > 1$. If $q^*$ is 1-singular with a trivial isotropy group (i.e., no symmetry), then $d^2L(q^*)$ is nonsingular and $K_p$ disappears. First, we ascertain a basis for $\ker d^2F(q^*)$.
Recall that in the preliminaries, when x x x ∈ NK , we defined x x x j ∈ R K to be the jth vector component of x x x. We now define the linearly independent vectors where 0 0 0 ∈ K , and v v v and y y y are defined in Definition 1.2. For example, if M = 2 and N = 3, Due to the block diagonal form of d 2 F(q * ), it is easy to see that the N + M vectors defined in (11) form a basis for ker d 2 F(q * ). Now, let (7), it is easy to see that these three sets of vectors are in ker d 2 L(q * ). The next theorem shows that are a basis for ker d 2 L(q * ). This natural partition of the basis vectors shows that ker d 2 L(q * ) can be written as ker d 2 L(q * ) = K p × K * . According to Definition 1, the "perpetual kernel" corresponding to constant zero eigenvalues of The part of the kernel that arises at a bifurcation corresponding to eigenvalues crossing zero is where k k k F is NK × 1, and k k k J is K × 1. Hence, Now, from (6) and the fact that We set and using the notation from (9), then (15) implies and v v v and y y y are the basis vectors of ker B from Definition 1.2. Thus,

Corollary 1.
If q * is 1-singular and has isotropy group equal to the identity, then d 2 L(q * ) is nonsingular.
Proof. If $q^*$ is 1-singular, then $d^2F(q^*)$ has a single block $B$ with a two-dimensional kernel. The other $N - 1$ blocks $\{R_i\}$ are distinct with one-dimensional kernels. By constructing the vectors as in (11), we see that $\dim \ker d^2F(q^*) = N + 1$ with basis vectors $\mathbf{v}_1, \mathbf{y}_1, \{\mathbf{z}_i\}_{i=2}^{N}$. Now, following the proof of Theorem 2, we take an arbitrary $\mathbf{k} \in \ker d^2L(q^*, \lambda, \beta)$, and then decompose $\mathbf{k}$ as in (13) and (16). The proof of Theorem 2 holds for the present case up to and including (18). Linear independence now shows that $d_i = e_i = c_i = 0$, which implies that $\mathbf{k} = \mathbf{0}$.

Remark 2.
The independent bases given for $K_p$ and $K^*$ in Theorem 2 imply that each is invariant under the action of $S_N$, and so the decomposition $\ker d^2L(q^*) = K_p \times K^*$ shows that $S_N$ does not act absolutely irreducibly on $\ker d^2F(q^*)$. The explicit bases show that $K_p, K^* \cong \{\mathbf{x} \in \mathbb{R}^M : \sum_i [\mathbf{x}]_i = 0\}$, which implies that $S_M$ acts absolutely irreducibly on $K_p$ and $K^*$ [26]. Thus, $K_p$ and $K^*$ are each $S_M$-irreducible.
To assure that the Jacobian vanishes, we restrict and project $F$ onto $\ker d^2L(q^*)$ in a neighborhood of $(\mathbf{0}, \mathbf{0}, 0)$. This is the Liapunov-Schmidt reduction (19) of $F$ [23]. The system defined by the Liapunov-Schmidt reduction, $\dot{\mathbf{x}} = r(\mathbf{x}, \beta)$, has a bifurcation of equilibria at $(\mathbf{x} = \mathbf{0}, \beta = 0)$, which are in one-to-one correspondence with equilibria of (8). However, the stability of these associated equilibria is not necessarily the same. It is straightforward to verify the derivatives of $r$ ([23], p. 32) that we will require in the sequel. In particular, the $(2M - 2) \times (2M - 2)$ Jacobian of (19) satisfies $d_{\mathbf{x}} r(\mathbf{0}, 0) = \mathbf{0}$, since $\ker(I - E) = \operatorname{range} d^2L(q^*)$.
Our crossing condition at a bifurcation depends on a matrix of derivatives in which the derivatives of $L$ are evaluated at $(q^*, \lambda^*, \beta^*)$, $L^-$ is the Moore-Penrose generalized inverse [27] of $d^2L(q^*)$, and $\{\mathbf{w}_i\}$ are the basis vectors of $\ker d^2L(q^*)$ from Theorem 2.
The (2M − 2) × (2M − 2) × (2M − 2) three-dimensional array of second derivatives is In [21], we showed that ∂ 2 r i ∂x j ∂x k  (12)). We now consider the case when i, j ≤ M − 1 and k > M − 1. All other cases are dealt with using a similar argument. Substituting in for w w w i we have The vectors v v v and y y y are defined in (2). An immediate consequence of this calculation is Thus, similar arguments show that ∂ 2 r i ∂x j ∂x k (0 0 0, 0) = 0 whenever: Further, we get four different "cubes" of identical entries in the 3-D array. They are: The points above will prove useful when proving that d 2 r(0 0 0, 0) = 0 0 0. The four-dimensional array of third derivatives of r is where the derivatives of L are evaluated at (q * , λ * , β * ), and L − is the Moore-Penrosegeneralized inverse [27] of d 2 L(q * ).
Since $\ker d^2L(q^*)$ is not absolutely irreducible, but $K^*$ is, one might try to define a Liapunov-Schmidt reduction by restricting and projecting $\nabla L$ onto $K^*$. One issue with projecting the reduction onto $K^*$ is how to define the projection matrix $E$ so that $EF = 0$ and $(I - E)F = 0$ if and only if $F = 0$, and so that $E\, d_{\mathbf{x}} r(\mathbf{0}, 0)$ is non-singular in $\operatorname{range}(E)$, so that the Implicit Function Theorem assures the restriction $(q, \lambda) = W\mathbf{x} + U(W\mathbf{x}, \beta)$, where $U(W\mathbf{x}) \in \operatorname{range}(d^2L(q^*))$ and $W\mathbf{x} \in K^*$ instead of $W\mathbf{x} \in \ker d^2L(q^*)$ as in (19) [23]. Simply ignoring the space $K_p$ by considering $U \in \operatorname{range}(d^2L(q^*))$ and $W\mathbf{x} \in K^*$ amounts to setting $W\mathbf{x} = k^* + k_p$ with $k_p = 0$. Since $W\mathbf{x} + U$ is still embedded in the larger space $\mathbb{R}^{NK+K}$, which contains $K_p$, derivatives are affected by the implicit $k_p = 0$ constraint. This constraint, $P_{K_p}(q, \lambda) = k^* + U$, is nonlinear (and may not even be tractable) since $K_p$ depends on $q$; here $P_{K_p}$ is a projection matrix that depends on $q$ (see Theorem 7).

Isotropy Subgroups S m × S n of S N
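The decomposition $\ker d^2L(q^*) = K_p \times K^*$ shows that $\mathrm{Fix}(S_m \times S_n) \cap \ker d^2L(q^*)$ is two-dimensional: one basis vector has blocks $n\mathbf{y}$ on $m$ of the classes in $\mathcal{U}$ and $-m\mathbf{y}$ on the remaining $n$ classes, and the other has the same form with $\mathbf{v}$ in place of $\mathbf{y}$. Restricted to $K^*$, these isotropy subgroups $S_m \times S_n$ of $S_M$ have one-dimensional fixed point spaces. This assures that we can use Theorem 1. We have the following Lemma.
Lemma 1. Let $\mathcal{U}_m$ and $\mathcal{U}_n$ partition $\mathcal{U}$ with $|\mathcal{U}_m| = m$ and $|\mathcal{U}_n| = n$, and let $\hat{\mathbf{u}}^{(m,n)} \in \mathbb{R}^{NK}$ be the vector (26) whose $i$th vector component is $n\mathbf{v}$ for $i \in \mathcal{U}_m$, $-m\mathbf{v}$ for $i \in \mathcal{U}_n$, and $\mathbf{0}$ otherwise, where $\mathbf{v}$ is defined as in Definition 1.2, and let $\mathbf{u}^{(m,n)} = (\hat{\mathbf{u}}^{(m,n)T}, \mathbf{0}^T)^T$ where $\mathbf{0} \in \mathbb{R}^K$. Then the isotropy subgroup of $\mathbf{u}^{(m,n)}$ is $\Sigma^{(m,n)} \subset \Gamma_{\mathcal{U}}$ such that $\Sigma^{(m,n)} \cong S_m \times S_n$, where $S_m$ permutes $\mathbf{u}_i$ when $i \in \mathcal{U}_m$, and $S_n$ permutes $\mathbf{u}_i$ when $i \in \mathcal{U}_n$. The fixed point space of $\Sigma^{(m,n)}$ restricted to $K^* \subset \ker d^2L(q^*)$ is one dimensional.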
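The block structure of such a symmetry-breaking direction can be sketched in a few lines. The construction below assumes that the direction has blocks $n\mathbf{v}$ on the $m$ classes in $\mathcal{U}_m$, $-m\mathbf{v}$ on the $n$ classes in $\mathcal{U}_n$, and zero elsewhere, mirroring the pattern of the $\mathbf{y}$-based basis vector described in the text; the function names and the example vector are hypothetical.

```python
def branch_direction(v, U_m, U_n, n_classes):
    """Blocks of u-hat^(m,n): n*v on U_m, -m*v on U_n, zero elsewhere (assumed form)."""
    m, n = len(U_m), len(U_n)
    blocks = []
    for i in range(n_classes):
        if i in U_m:
            blocks.append([n * x for x in v])
        elif i in U_n:
            blocks.append([-m * x for x in v])
        else:
            blocks.append([0.0] * len(v))
    return blocks

def column_sums(blocks):
    """Sum the class blocks; this is J applied to the flattened direction."""
    return [sum(col) for col in zip(*blocks)]

# Toy example: N = 4 classes, M = 3 unresolved classes split as m = 1, n = 2.
v = [1.0, -2.0, 0.5]
u_hat = branch_direction(v, U_m=[0], U_n=[1, 2], n_classes=4)
print(u_hat)
print(column_sums(u_hat))
```

Because the $m$ blocks $n\mathbf{v}$ and the $n$ blocks $-m\mathbf{v}$ cancel, moving along this direction preserves the normalization constraints, and permuting classes within $\mathcal{U}_m$ or within $\mathcal{U}_n$ leaves the direction unchanged.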

Bifurcating Branches
Theorem 3. Let $(q^*, \lambda^*, \beta^*)$ be an equilibrium of (8) such that $q^*$ is $M$-singular for $1 < M \leq N$, and the crossing condition is satisfied. Then there exist bifurcating solutions $\left( \binom{q^*}{\lambda^*} + t\mathbf{u}^{(m,n)},\ \beta^* + \beta(t) \right)$, where $\mathbf{u}^{(m,n)} \in K^*$ is defined in (26), for every pair $(m, n)$ such that $M = m + n$, each with an isotropy group isomorphic to $S_m \times S_n$.
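Proof. We mimic the proof of the Equivariant Branching Lemma. Let $\mathbf{u} := \mathbf{u}^{(m,n)} \in \mathrm{Fix}(S_m \times S_n) \cap K^*$ and let $V$ be a matrix with columns composed of the $M - 1$ vectors $\{V_i\}$. Thus, there exists $\mathbf{x}_0 \in \mathbb{R}^{M-1}$ so that $\mathbf{u} = V\mathbf{x}_0$. Since $r(\mathrm{Fix}(S_m \times S_n) \cap K^*) \subseteq \mathrm{Fix}(S_m \times S_n) \cap K^*$ (for every $\sigma \in S_m \times S_n$ and $V\mathbf{x} \in \mathrm{Fix}(S_m \times S_n)$, $r(V\mathbf{x}) = r(\sigma V\mathbf{x})$, which equals $\sigma r(V\mathbf{x})$ by equivariance), we have $r(t\mathbf{x}_0, \beta) = h(t, \beta)\mathbf{x}_0$, where $r$ is the Liapunov-Schmidt reduction (19), and $h$ is a polynomial in $t$. Since $K^*$ is $S_M$-irreducible, $\mathrm{Fix}(S_M) \cap K^* = \{\mathbf{0}\}$ (otherwise, $\sigma\mathbf{x} = \mathbf{x}$ for some nonzero $\mathbf{x} \in K^*$ and every $\sigma \in S_M$, which implies that $\mathrm{span}(\mathbf{x})$ is an invariant subspace of $K^*$). Now [22], p. 75, shows that $r(\mathbf{0}, \beta) = \mathbf{0}$, and so $h(0, \beta) = 0$, from which it follows that $h(t, \beta) = t\,k(t, \beta)$. Thus,

$$r(t\mathbf{x}_0, \beta) = t\,k(t, \beta)\,\mathbf{x}_0.$$

Differentiating with respect to $t$ yields

$$d_{\mathbf{x}} r(t\mathbf{x}_0, \beta)\,\mathbf{x}_0 = \left( k(t, \beta) + t\,d_t k(t, \beta) \right)\mathbf{x}_0,$$

from which it follows that

$$d_{\mathbf{x}} r(\mathbf{0}, 0)\,\mathbf{x}_0 = k(0, 0)\,\mathbf{x}_0,$$

and so $k(0, 0) = 0$, because $d_{\mathbf{x}} r(\mathbf{0}, 0) = \mathbf{0}$. Furthermore, we see that $d_\beta k(0, 0)\mathbf{x}_0 = d^2_{\mathbf{x},\beta} r(\mathbf{0}, 0)\mathbf{x}_0 \neq \mathbf{0}$ by assumption (see (23)). This shows that $d_\beta k(0, 0)$ is a non-zero eigenvalue of $d^2_{\mathbf{x},\beta} r(\mathbf{0}, 0)$ with associated eigenvector $\mathbf{x}_0$. By the Implicit Function Theorem, $k(t, \beta) = 0$ has a unique solution $\beta = \beta(t)$ in a neighborhood of $t = 0$.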

The Crossing Condition for Annealing Problems
We next determine how to check the crossing condition in Theorem 3 when $F$ is an annealing problem, as in (2): $F(q, \beta) = G(q) + \beta D(q)$.
First, we show that the crossing condition can be checked in terms of the Hessian of the function D. Furthermore, when G is strictly concave on span({v v v i }), then the crossing condition is always satisfied, and every singularity is a bifurcation. Proof. Let x x x 0 ∈ 2M−2 so that u u u = Wx x x 0 ∈ Fix(S m × S n ) ∩ K * . Multiplying Equation (21) on the left by x x x T 0 and on the right by x x x 0 yields By Theorem 2, an arbitrary u u u ∈ K * can be written as . Substituting this into (29) and observing that Differentiating with respect to β, evaluating at β = 0, and using (20) yields which must be non-zero since we assume that d 2 D(q) is either positive or negative definite on span({v v v i }).
From (30), we can get an expression for ξ, the eigenvalue of d 2 x x x,β r(0, 0) with eigenvector x x x 0 . Substituting d 2 x x x,β r(0 0 0, 0)x x x 0 = ξx x x 0 and observing that The requirement that d 2 D(q) is either positive or negative definite on span({v v v i }) holds when d 2 G(q * ) is either negative or positive definite, respectively, on span({v v v i }).
These results are important for the Information Bottleneck problem (2), where $d^2G(q) = -d^2 I(Y; T)$ is only non-positive definite on $\ker d^2F(q^*)$, but is negative definite on $\mathrm{span}(\{\mathbf{v}_i\})$. Thus, every singularity of the Information Bottleneck with $\ker d^2L(q^*) = K^* \times K_p$ is a bifurcation point. The space $K_p$ does not contain bifurcating branches since the crossing condition is never satisfied there: for $\mathbf{u} \in K_p$, $\hat{\mathbf{u}}^T d^2G(q)\hat{\mathbf{u}} + \beta\,\hat{\mathbf{u}}^T d^2D(q)\hat{\mathbf{u}} = 0 + 0$ (by Lemma 42 in [25]), and so $\xi = 0$ (Theorem 109, [25]).

Bifurcation Type
Suppose that a bifurcation occurs at $(q^*, \lambda^*, \beta^*)$, where $q^*$ is $M$-singular. This section examines the type of bifurcation from which emanate the branches $\left( \binom{q^*}{\lambda^*} + t\mathbf{u},\ \beta^* + \beta(t) \right)$, whose existence is guaranteed by Theorem 3.
This expression is similar to the one given in [22], p. 90. The numerator can be calculated via (24). In [21], we showed that $\beta'(0) = 0$. We have the same result in the present case. Proof. To show that the numerator of (33) vanishes, that is, $d^2_{\mathbf{x}} r(\mathbf{0}, 0) = \mathbf{0}$, expand $r_i$, the $i$th component of $r$, about $\mathbf{x} = \mathbf{0}$. Applying the equivariance relation $A r(\mathbf{x}, 0) = r(A\mathbf{x}, 0)$, where $A$ is any element of the group isomorphic to $S_M$ that acts on $r$ in $\mathbb{R}^{M-1}$, and equating the quadratic terms yields relations among the second derivatives. By (24), the diagonal entries $\frac{\partial^2 r_i}{\partial x_i \partial x_i}(\mathbf{0}, 0) = 0$ for each $i$, as do all of the "multi-diagonals".
If $\beta''(0) \neq 0$, which we expect to be true generically, then Theorem 5 shows that the bifurcation guaranteed by Theorem 3 is pitchfork-like.
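The distinction between pitchfork types can be illustrated on a toy branch. The sketch below estimates $\beta'(0)$ and $\beta''(0)$ by central differences for a user-supplied function $\beta(t)$ and labels the branch accordingly. The function name `classify_branch`, the tolerances, the toy branches, and the convention that $\beta''(0) > 0$ is called supercritical and $\beta''(0) < 0$ subcritical are assumptions for illustration; in practice $\beta''(0)$ would come from the derivative formulas of this section rather than from finite differences.

```python
def classify_branch(beta_of_t, h=1e-3, tol=1e-6):
    """Classify a branch beta* + beta(t) from finite-difference estimates at t = 0."""
    d1 = (beta_of_t(h) - beta_of_t(-h)) / (2.0 * h)
    d2 = (beta_of_t(h) - 2.0 * beta_of_t(0.0) + beta_of_t(-h)) / h ** 2
    if abs(d1) > tol:
        return "not a pitchfork (beta'(0) != 0)"
    if d2 > tol:
        return "supercritical pitchfork"      # branch exists for beta > beta*
    if d2 < -tol:
        return "subcritical pitchfork"        # branch initially bends back
    return "degenerate (higher-order terms decide)"

# Toy branches (made-up functions, not derived from an IB problem).
supercritical = classify_branch(lambda t: 2.0 * t ** 2)
subcritical = classify_branch(lambda t: -t ** 2 + t ** 4)
tilted = classify_branch(lambda t: 0.5 * t + t ** 2)
print(supercritical, subcritical, tilted, sep="\n")
```

The second toy branch bends back toward smaller $\beta$ before turning around, which is the shape associated with a subcritical pitchfork followed by a saddle-node.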

Stability and Optimality
The next theorem relates the stability of equilibria $(q^*, \lambda^*, \beta)$ in the flow (8) with optimality of $q^*$ in Problem (1). In particular, if a bifurcating branch corresponds to an eigenvalue of $d^2L(q^*)$ changing from negative to positive, then the branch consists of stationary points $(q^*, \beta^*)$ that are not solutions of (1). Positive eigenvalues of $d^2L(q^*)$ do not necessarily show that $q^*$ is not a solution of (1) (see Remark 1). For example, see page 668 of [21]. A proof of this theorem is given in [21].
Theorem 6. For each bifurcating branch guaranteed by Theorem 3, $\mathbf{u}$ is an eigenvector of $d^2L\left( \binom{q^*}{\lambda^*} + t\mathbf{u},\ \beta^* + \beta(t) \right)$ for sufficiently small $t$. Furthermore, if the corresponding eigenvalue is positive, then the branch consists of unstable stationary points that are not solutions to (1).

Structure of the Symmetry Projection
The matrix $P_R(q^*)$ that projects $(q, \lambda) \in \mathbb{R}^{NK+K}$ onto $\operatorname{range}(d^2L(q^*)) \times K^*$ by annihilating $K_p$ is important for numerical computations of equilibria of the IB, since we may want to take each equilibrium found by Newton's method and remove any component in $K_p$. $P_R$ is written as a function of $q$ since its constitutive vectors $\mathbf{y}$ (from Definition 1) depend on $q$.
The following theorems clarify the structure of this projection. Theorem 7. P R (q) = I − P K p (q), where P K p = A 0 0 0 0 0 0 0 0 0 . P R and P K p are (NK + K) × (NK + K). The matrix A is NK × NK with N 2 blocks, {A ij } N i,j=1 , of size K × K, defined by 2y y yy y y T −y y yy y y T −y y yy y y T 0 0 0 −y y yy y y T 2y y yy y y T −y y yy y y T 0 0 0 −y y yy y y T −y y yy y y T 2y y yy y y T 0 0 0 . Thus, the matrix that projects onto K p is with an appeal to Lemma 34 in [25] to compute the inverse, shows that Ny y y T y y y A 0 0 0 0 0 0 0 0 0 . Dropping the constant yields the result.
For the Information Bottleneck, the matrix $P_R$ is easy to calculate, since $\mathbf{y} = q_i$ for any $i \in \mathcal{U}$. For example, when $q = q_{1/N}$, then $\mathbf{y}^T\mathbf{y} = \frac{K}{N^2}$ and $\mathbf{y}\mathbf{y}^T = \frac{1}{N^2}\mathbf{1}$, where $\mathbf{1}$ is a $K \times K$ matrix of 1s.
Theorem 8. The symmetry group $S_M$ commutes with the matrix $P_R$, which projects onto $\mathbb{R}^{NK+K} \setminus K_p$.

Visualizations of Sample Results
We illustrate these structures numerically. In [7], we introduced the toy "Four-blob" probability distribution $p(x, y)$ shown in Figure 1. For the Information Distortion problem (3) [7,12,13] and the synthetic dataset composed of a mixture of four Gaussians (Figure 1), we determined the bifurcation structure of solutions to (3) by annealing in $\beta$ and finding the corresponding stationary points of (1). A typical run of the derived gradient dynamical system tends to follow the main bifurcation branch $S_K \to S_{K-1}$ from the fully symmetric uniform quantizer $q_{1/N}$ ($N = 4$ here) to the fully resolved deterministic quantizer (hard clustering) seen at the end in Figure 2. The permutation symmetry is also evident there: the value of the cost function does not change if the classes along the vertical axis in $T$ are permuted (relabeled). The uniform quantizer $q_{1/N}$ (Item 1 in the figure) plays a special role in the formulation (3), as it is the unique solution to the problem for $\beta = 0$, being the maximum entropy solution of $\max_q H(T|Y)$. Its loss of stability at the first bifurcation for increasing $\beta$ can hence be determined analytically and the first bifurcation structure characterized completely. Because of the "perpetual kernel" of the cost function in (4), the uniform quantizer is just one of a continuous set of "uninformative" quantizers for the IB problem (4): all quantizers $\{q(t|y) : q(t|y) = f(t)\}$, which assign every $y$ to class $t$ with the same probability, although that assignment weight can differ between classes. Such a structure does not change the value of the cost function in the IB problem (4) (but does change it for (3), which hence does not have this degeneracy). We address the degeneracy of the IB optimization by projecting onto the subspace that has the correct symmetry (i.e., just the uniform quantizer $q_{1/N}$ in this case), as outlined in Remark 2. A more thorough structure of the bifurcation diagram, using the analysis presented above, is shown in Figure 3.
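The contrast between the two cost functions on uninformative quantizers can be checked directly. The sketch below builds quantizers of the form $q(t|y) = f(t)$ and evaluates $I(Y;T)$ and $H(T|Y)$; the helper names, the marginal `p_y`, and the two weight vectors are made-up illustrations. Since $T$ is independent of $Y$ for such quantizers, $I(X;T)$ also vanishes along the chain $X \to Y \to T$, so an IB cost built from $I(Y;T)$ and $I(X;T)$ cannot tell these quantizers apart, whereas the entropy term $H(T|Y)$ can.

```python
import math

def uninformative_quantizer(f, n_objects):
    """q[t][y] = f[t] for every y: each object gets the same class weights."""
    return [[w] * n_objects for w in f]

def mi_y_t(p_y, q):
    """I(Y;T) in nats for marginal p_y[y] and quantizer q[t][y] = q(t|y)."""
    total = 0.0
    for row in q:
        p_t = sum(p * w for p, w in zip(p_y, row))
        total += sum(p * w * math.log(w / p_t)
                     for p, w in zip(p_y, row) if w > 0.0)
    return total

def cond_entropy_t_given_y(p_y, q):
    """H(T|Y) in nats."""
    return -sum(p * w * math.log(w)
                for row in q for p, w in zip(p_y, row) if w > 0.0)

p_y = [0.1, 0.2, 0.3, 0.4]                     # made-up marginal over K = 4 objects
q_uniform = uninformative_quantizer([0.25] * 4, len(p_y))
q_skewed = uninformative_quantizer([0.7, 0.1, 0.1, 0.1], len(p_y))

print(mi_y_t(p_y, q_uniform), mi_y_t(p_y, q_skewed))
print(cond_entropy_t_given_y(p_y, q_uniform), cond_entropy_t_given_y(p_y, q_skewed))
```

Both quantizers give zero mutual information up to rounding, while the conditional entropy is largest for the uniform weights, consistent with the uniform quantizer being singled out only in the entropy-based formulation.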
Similar to the results we presented in [28], the close-up of the bifurcation at β ≈ 1.038706 in Figure 3B shows a subcritical bifurcating branch (a first-order phase transition) that consists of stationary points of Problem (1). By projecting the Hessian ∆ q (G(q * ) + βD(q * )) onto each of the kernels referenced in Theorem 6, we determined that the points on this subcritical branch are not solutions of (1), and yet they are solutions of (2).
Furthermore, observe that Figure 3B indicates that a saddle-node bifurcation occurs at β ≈ 1.037479. That this is indeed the case was proved in [21]. In fact, for any problem of the form (2), these are the only two types of bifurcations to be expected: pitchfork and saddle-node.

Figure 2. Item 1 is the uniform quantizer $q_{1/N}$, assigning equal probability of each y ∈ Y to belong to one of the four clusters in T. Subsequent items 2-5 point to a set of partially resolved quantizations, in which subsets of Y are assigned with high probability to one (2) or more (3-5) classes (dark colors, close to 1), while other subsets are still unresolved (gray levels), albeit at a higher probability than $q_{1/N}$ (darker gray, as some of the classes are excluded after being resolved for another subset). Item 6 shows an almost fully resolved quantizer at sufficiently high β. The quantizers become fully resolved (deterministic; q(t|y) = 1 or 0) as β → ∞ (not shown).

Figure 3. Legend: a local solution of (4) and (7); a local solution of (4), not (7); not a solution of (4) nor (7); a bifurcation point. (A) Bifurcation diagram for (3), a problem of form (2). We found these points by annealing in β and finding stationary points for Problem (1) using the algorithm presented in [28]. A square indicates where a bifurcation occurs. (B) A close-up of the subcritical bifurcation at β ≈ 1.038706, indicated by a square. Observe the subcritical bifurcating branch, and the subsequent saddle-node bifurcation at β ≈ 1.037479, indicated by another square. We applied Theorem 6 to show that the subcritical bifurcating branch is composed of quantizers that are solutions of (3) but not of (1).

Conclusions and Discussion
The main goal of this contribution was to show that information-based distortion-annealing problems such as (2) have an interesting mathematical structure. The most interesting aspects of that mathematical structure are driven by the symmetries present in the cost functions, namely their invariance to actions of the permutation group S N , represented as relabeling of the reproduction classes. Such a structure would hold for any biclustering problem [4] that relies on the intrinsic interaction of a pair of variables for unsupervised clustering. The second mathematical structure that we used successfully was bifurcation theory, which allowed us to identify and study the discrete points at which the character of the cost function changed. The combination of those two tools in [20] allowed us to explicitly compute the value of the annealing parameter β at which the initial maximum of (1) at the uniform quantizer q_{1/N} loses stability. We concluded that for a fixed system X → Y characterized by p(X, Y), this value is the same for both problems, that it does not depend on the number of elements of the reproduction variable T, and that it is always greater than 1. We further introduced an eigenvalue problem that links the critical values of β and q for bifurcations, or phase transitions, branching off arbitrary intermediate solutions.
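As a minimal numerical sketch of how such a critical value can be obtained from p(X, Y) alone, the code below assumes the commonly cited spectral characterization of the first transition: β* = 1/λ₂, where λ₂ is the second-largest eigenvalue of the Markov matrix with entries Σ_x p(y|x) p(x|y'), equivalently the squared second singular value of p(x, y)/√(p(x)p(y)). This formula is not restated in the present section and is an assumption here. The function name `critical_beta` and the example joint distribution are hypothetical.

```python
import numpy as np

def critical_beta(p_xy):
    """Estimate the annealing value at which the uniform quantizer loses stability.

    Assumes beta* = 1 / lambda_2, with lambda_2 the squared second singular
    value of B[x, y] = p(x, y) / sqrt(p(x) * p(y)). All marginals must be
    strictly positive.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_xy = p_xy / p_xy.sum()                      # normalize to a joint pmf
    p_x = p_xy.sum(axis=1)                        # marginal of X (rows)
    p_y = p_xy.sum(axis=0)                        # marginal of Y (columns)
    B = p_xy / np.sqrt(np.outer(p_x, p_y))
    s = np.linalg.svd(B, compute_uv=False)        # sorted in descending order
    lam2 = s[1] ** 2                              # s[0] is always 1
    return np.inf if lam2 < 1e-15 else 1.0 / lam2

# Hypothetical example: binary symmetric channel with crossover 0.25
# and uniform input, for which the second singular value is 0.5.
p = [[0.375, 0.125],
     [0.125, 0.375]]
print(critical_beta(p))   # approximately 4
```

Under this assumption the result depends only on p(X, Y) and not on the number of classes N, and since λ₂ ≤ 1 it is never smaller than 1, in line with the properties stated above.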
Even though the cost functions F IB (4) and F H (3) have similar properties, they also differ in some important aspects. We have shown that the function F IB is degenerate since its constitutive functions I(X; Y) and I(X; T) are not strictly convex in q. That introduces additional invariances and singularities that are always preserved, which makes phase transitions more difficult to detect (e.g., at the "uninformative quantizers" q(t|y) = f(t), which depend on t only) and post-transition directions more difficult to determine. In contrast, F H is strictly convex except at points of phase transitions. The theory we developed here allows us to identify bifurcation directions and determine their stability. Despite the presence of a high-dimensional null space at bifurcations, the symmetries restrict the allowed transition dimensions to multiple co-dimension 1 transitions, all related by group transformations. We achieved that here with three main results. Theorem 8 extended the Equivariant Branching Lemma 1 to the Information Bottleneck case with additional translation invariance. Theorem 4 identified specific conditions under which a bifurcation of the gradient flow (8) occurs. These conditions are computable analytically for the initial bifurcation off the uniform quantizer q_{1/N} and by numerical continuation for subsequent bifurcations. Finally, in Section 2.9, we provided checks for the types of bifurcations that occur, giving conditions to detect saddle-node and pitchfork bifurcations and to determine whether pitchforks are supercritical (second-order phase transitions) or subcritical (leading to first-order phase transitions discontinuous in β). The combination of the three results, together with our previous results in [20], completely characterizes the local bifurcation structure of Information Bottleneck-type problems with or without the added translation symmetry.
Despite the further development of the bifurcation formalism for IB presented here, there are still open questions that this manuscript did not resolve. In particular, we still cannot confirm or reject the conjecture that the set of S K symmetric soft-clustering branches connected through symmetry-breaking bifurcations leads to the global hard-clustering optima at β → ∞ (multiple equivalent solutions connected by the permutation symmetry of the problem). We believe this is partially due to a discrepancy between practical observations and theoretical results. In particular, we and other practitioners [29,30] note that the only symmetry-breaking bifurcations observed during optimization are of the kind S M → S M−1 , while the theory allows for arbitrary S M → S m × S n bifurcations. The latter are known to occur and to be stable in other biological systems and circumstances [26,31]. This suggests a research approach of comparing and contrasting the different systems that possess the same S N symmetry and symmetry-breaking bifurcations, which may lead to breakthroughs in this application to optimization in the Information Bottleneck problem.
An additional open problem involves the use of continuous variables, already noted in [5] and explored further in [32,33]. This approach, while important for many real-world problems, requires additional mathematical tools, namely Calculus of Variations [34], which further increases the complexity of an already complex problem. These difficulties are illustrated in a pair of papers [35,36] that use the continuous formulation. They do present some significant results on conditions of learnability, but both papers obtain only bounds on β under which learnability (optimal solutions beyond the "uninformative" quantizer) can be achieved. This is possibly due to the presence of continuous spectra in covariance operators of continuous quantizers, something that we avoid by focusing on finite spaces. As a consequence, here and in prior work [20], we give specific values of β for the initial bifurcation from the uniform quantizer, beyond which nontrivial clustering is supported. We consider the formulation with continuous variables to be beyond the scope of this manuscript, but look forward to the development of additional techniques to incorporate this important case in the bifurcation framework presented here. Regardless of such developments, any practical problem with numeric optimization will involve discretization of the continuous variables, which effectively converts a continuous problem to the discrete setting discussed here.