Article

Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems

by Albert E. Parker 1 and Alexander G. Dimitrov 2,*
1 Center for Biofilm Engineering, Department of Mathematical Sciences, Montana State University, Bozeman, MT 59717, USA
2 Department of Mathematics and Statistics, Washington State University Vancouver, Vancouver, WA 98686, USA
* Author to whom correspondence should be addressed.
Entropy 2022, 24(9), 1231; https://doi.org/10.3390/e24091231
Submission received: 29 June 2022 / Revised: 22 August 2022 / Accepted: 29 August 2022 / Published: 2 September 2022
(This article belongs to the Special Issue Theory and Application of the Information Bottleneck Method)

Abstract

In this paper, we investigate the bifurcations of solutions to a class of degenerate constrained optimization problems. This study was motivated by the Information Bottleneck and Information Distortion problems, which have been used to successfully cluster data in many different applications. In the problems we discuss in this paper, the distortion function is not a linear function of the quantizer. This leads to a challenging annealing optimization problem, which we recast as a fixed-point dynamics problem of a gradient flow of a related dynamical system. The gradient system possesses an $S_N$ symmetry due to its invariance under relabeling of the representative classes. Its flow hence passes through a series of bifurcations with specific symmetry breaks. Here, we show that the dynamical system related to the Information Bottleneck problem has an additional spurious symmetry that requires a more challenging analysis of the symmetry-breaking bifurcation. For the Information Bottleneck, we determine that when bifurcations occur, they are only of pitchfork type, and we give conditions that determine the stability of the bifurcating branches. We relate the existence of subcritical bifurcations to the existence of first-order phase transitions in the corresponding distortion function as a function of the annealing parameter, and provide criteria with which to detect such transitions.

1. Introduction

This paper analyzes bifurcations of solutions to constrained optimization problems of the form
$$\max_{q \in \Delta} F(q, \beta) = \max_{q \in \Delta} \sum_{i=1}^{N} f(q_i, \beta) \tag{1}$$
as a function of a scalar parameter $\beta$ and a quantizer or classifier $q = (q_1, \dots, q_N)$ with $q_i \in \mathbb{R}^K$. The real-valued function $f$ is sufficiently smooth, and $\Delta$ is the constraint space of valid quantizers, a convex set of discrete probabilities (simplices).
This type of problem arises in Rate Distortion Theory [1,2], Deterministic Annealing [3] and biclustering [4]. The specific motivations for the abstract problem formulation given in (1) are the Information Bottleneck [5] and Information Distortion [6] functions
$$\max_{q \in \Delta} F(q, \beta) = \max_{q \in \Delta} \big( D(q) - \beta I(Y;T) \big). \tag{2}$$
These were proposed in [5,7] to analyze the Markov chain $X \to Y \to T$, in which $X \to Y$, characterized by a probability $p(X, Y)$, is the original system of interest, with mutual information $I(X;Y)$, and $T$ is a simplification (a quantized version) of $Y$. Here we work mainly with discrete versions of $Y$ and $T$, with cardinalities $|Y| = K$ and $|T| = N$. Typically $N \ll K$. $I(Y;T)$ is the mutual information between the $K$ objects in $Y$ and the $N$ clusters in $T$. The goal is to cluster the $K$ objects in $Y$ into $N$ clusters in $T$, given inputs $X$, such that the function $F$ is maximized in $[q_i]_j$, the probability that the $j$th element of $Y$ is classified as a member of the cluster with label $i \in T$. We call such a set of conditional probabilities a stochastic quantizer, or just a quantizer, to relate to the vector quantization literature [8]. The annealing parameter is $\beta \in [0, \infty)$.
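As a concrete illustration of the objects just introduced, the following minimal sketch builds a small stochastic quantizer and evaluates the mutual informations that enter (2). The joint distribution `p_xy`, the quantizer `q`, and the helper `mutual_information` are hypothetical choices made for this example only; the array layout `q[i, j]` for $[q_i]_j$ is likewise an assumption.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0  # 0 * log 0 is taken to be 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

# Hypothetical joint distribution p(x, y): 3 values of X (rows), K = 4 values of Y (columns).
p_xy = np.array([[0.10, 0.05, 0.05, 0.05],
                 [0.05, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.20]])
p_y = p_xy.sum(axis=0)

# Stochastic quantizer with N = 2 classes: q[i, j] = [q_i]_j = q(t = i | y = j).
# Each column sums to one, so q lies in the constraint set Delta.
q = np.array([[0.9, 0.8, 0.3, 0.1],
              [0.1, 0.2, 0.7, 0.9]])

p_xt = p_xy @ q.T      # p(x, t) = sum_y p(x, y) q(t | y), using the Markov chain
p_yt = (q * p_y).T     # p(y, t) = p(y) q(t | y)

I_XY = mutual_information(p_xy)
I_XT = mutual_information(p_xt)
I_YT = mutual_information(p_yt)
```

Because $T$ depends on $X$ only through $Y$, the data processing inequality guarantees that `I_XT` cannot exceed `I_XY`, which is a convenient sanity check on any implementation.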
It has been shown that finding hard-clustering solutions to (2) is NP-complete (a combinatorial search) when $D(q)$ is the mutual information $I(X;T)$ [9], as in the Information Bottleneck [5,10,11] and the Information Distortion [7,12,13] methods. Information Bottleneck (IB) approaches are being adopted in a growing number of scientific and engineering domains [14,15,16,17,18]. As they typically involve the nonlinear optimization problem (2), there is a need for optimization methods that can avoid the rise in complexity implied by the NP-complete hard-clustering solutions [9]. Originally, Tishby et al. [5] approached this problem with an algorithm inspired by the Blahut–Arimoto approach to solving Rate-Distortion types of problems ([2], Chapter 10). The “self-consistent” equations in [5] optimize both the quantizer and the “relevance” distribution $p(x|t)$. However, unlike the classic Blahut–Arimoto algorithm, which can guarantee convergence to a unique solution of its iterative scheme because of the convex geometry of the two state spaces, the “self-consistent” equations have no such guarantee due to the more complicated geometry of the three convex sets over which the optimization is performed, as also noted in [5]. Accordingly, in this work, we use the original optimization problem (2) over a single variable: the quantizer (conditional probability) $q(t|y)$. It may be possible that a related Blahut–Arimoto style optimization, coupled to the bifurcation structure of its gradient flow discussed here, can lead to additional insights into this problem, but we consider this beyond the scope of this manuscript.
We have investigated the structure of soft-clustering annealing-type methods that reach the hard-clustering solution in the limit of the annealing parameter [19,20] through a series of bifurcations. A bifurcation in this context is a solution $(q^*, \beta^*)$ to (2) such that the number of solutions to (2) changes in a small neighborhood of $(q^*, \beta^*)$. Because a bifurcation corresponds to a point at which some of the objects $Y$ have just been classified, in the IB literature, a bifurcation is usually referred to as a phase transition. One of the goals of this and related work is to understand why annealing-type algorithms, such as the original optimization heuristics in [5,10], work as well as they do. This can help with designing further optimization heuristics and can assess how close those can get to the global solutions to IB problems. We believe that this amalgamation of optimization theory and dynamical systems theory, as stated in [19,20], can provide a solid foundation with which to address such optimization challenges.
Because of the form (1) of $F$, it possesses certain symmetries. That is, the value of $F(q, \beta)$ does not change (is invariant) under arbitrary permutations of the vectors $q_i$. In other words, $F$ is $S_N$-invariant. The form (1) further implies that the Hessian $d_q^2 F(q)$ is block diagonal with blocks $\{d_{q_i}^2 f(q_i)\}_{i=1}^{N}$. These conditions are met by the Information Distortion function [6],
$$F_H(q, \beta) = H(T|Y) + \beta I(X;T), \tag{3}$$
where $H(T|Y)$ is the conditional entropy, and by the cost function used in the original IB method [5],
$$F_{IB}(q, \beta) = -I(Y;T) + \beta I(X;T), \tag{4}$$
which is the focus of this manuscript. Both the Information Distortion and Information Bottleneck problems have the form given in (1) and (2). Importantly, $d^2 F_{IB}(q)$ has a “perpetual kernel” since each block $d^2 f(q_i)$ has the eigenpair $(0, q_i)$ for every $q$ [20]. In other words, the Hessian $d^2 F$ is singular for every $q$ and every value of $\beta$. This makes bifurcation detection challenging because bifurcations can usually be detected by identifying isolated singularities of $d^2 F$. This degeneracy is a consequence of the translational symmetry of $F_{IB}$: if $k \in \ker d_q^2 F_{IB}(q^*)$, then $F_{IB}(q^*) = F_{IB}(q^* + t k)$ for all $t$ such that $q^* + t k \in \Delta$. At bifurcations of solutions to (4), the translational symmetry never breaks.
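The two structural properties stated here, $S_N$-invariance and the zero eigenpair $(0, q_i)$, can be checked numerically on a toy problem. The sketch below assumes the per-class summand $f$ is obtained by splitting $-I(Y;T) + \beta I(X;T)$ over the classes $t = i$; the helper names `f_ib` and `F_ib` and all numerical values are hypothetical. The zero eigenpair is probed through degree-one homogeneity, $f(c\,q_i) = c\,f(q_i)$, which by Euler's theorem for homogeneous functions implies $d^2 f(q_i)\, q_i = 0$.

```python
import numpy as np

def f_ib(q_i, p_xy, beta):
    """Contribution of one class to F_IB = -I(Y;T) + beta * I(X;T), in nats.

    q_i[j] = q(t = i | y = j) must be strictly positive in this sketch.
    """
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_t = p_y @ q_i                      # p(t = i)
    p_xt = p_xy @ q_i                    # p(x, t = i)
    i_yt = np.sum(p_y * q_i * np.log(q_i / p_t))
    i_xt = np.sum(p_xt * np.log(p_xt / (p_x * p_t)))
    return -i_yt + beta * i_xt

def F_ib(q, p_xy, beta):
    """F_IB(q, beta) as a sum over the rows q_i of the quantizer, as in form (1)."""
    return sum(f_ib(q_i, p_xy, beta) for q_i in q)

# Hypothetical data: 3 values of X, K = 4 values of Y, N = 3 classes.
p_xy = np.array([[0.10, 0.05, 0.05, 0.05],
                 [0.05, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.20]])
q = np.array([[0.5, 0.2, 0.3, 0.1],
              [0.3, 0.5, 0.3, 0.2],
              [0.2, 0.3, 0.4, 0.7]])
beta = 1.7

# S_N invariance: relabeling the classes (permuting rows) leaves F unchanged.
F_original = F_ib(q, p_xy, beta)
F_permuted = F_ib(q[[2, 0, 1]], p_xy, beta)

# Degree-one homogeneity of each summand: f(c * q_i) = c * f(q_i).
q_0 = q[0]
scaling_gap = f_ib(2.0 * q_0, p_xy, beta) - 2.0 * f_ib(q_0, p_xy, beta)

# A necessary consequence: the second directional derivative along q_i vanishes,
# i.e. q_i^T d^2 f(q_i) q_i = 0 (central second difference with step h).
h = 1e-3
second_diff = (f_ib((1 + h) * q_0, p_xy, beta) - 2 * f_ib(q_0, p_xy, beta)
               + f_ib((1 - h) * q_0, p_xy, beta)) / h**2
```

The homogeneity test evaluates $f$ at points outside $\Delta$; this is only a device for probing the Hessian and does not affect the constrained problem.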
To better understand bifurcations of solutions to problems of the form (1), which includes the problems (3) and (4), we consider the gradient flow
$$\begin{pmatrix} \dot{q} \\ \dot{\lambda} \end{pmatrix} = \nabla \mathcal{L}(q, \lambda, \beta).$$
Equilibria of this flow correspond to critical points of (1). Here $\mathcal{L}$ is the Lagrangian with respect to the constraints imposed by $\Delta$, and $\lambda$ is the vector of Lagrange multipliers.
Previous work showed that when $d^2 F$ is generically non-singular, as occurs for the Information Distortion (3), then there are isolated singularities of $d^2 \mathcal{L}$ that indicate possible bifurcations of solutions to (1). In this case, a kernel of $d^2 F$ of dimension $M > 1$ necessitates an $(M-1)$-dimensional $\ker d^2 \mathcal{L}$, which admits a bifurcation of solutions to (1) where symmetry breaks from $S_M$ to $S_m \times S_n$ for every $m, n > 0$ such that $m + n = M$ [21].
Here we allow $d^2 F$ and $d^2 \mathcal{L}$ to be singular for every $q \in \Delta$, as occurs for the Information Bottleneck (4). That is, the perpetual kernel for $d^2 F$ implies that $d^2 \mathcal{L}$ also has a perpetual kernel $\ker d^2 \mathcal{L} = K_p(q)$, which means that the eigenvalue crossing condition that must occur at a bifurcation (i.e., $d^2 \mathcal{L}$ must have a zero eigenvalue at a bifurcation) [20] is never satisfied in $K_p$. There are a few challenges due to the existence of the perpetual kernel (i.e., degeneracy) of the Information Bottleneck that we address in this paper. First, detecting bifurcations may be problematic because one cannot simply monitor the determinant of either $d^2 F$ or $d^2 \mathcal{L}$. Second, the standard theory that assures the existence of bifurcating branches, the Equivariant Branching Lemma, cannot be applied directly. Lastly, the spaces that contain the bifurcating solutions are always at least two-dimensional, which makes tracking the bifurcating solutions problematic.
Here we address two of these three challenges. We show that at a bifurcation, new eigenvalue(s) of $d^2 F_{IB}$ and $d^2 \mathcal{L}$ must cross zero, causing $\ker d^2 \mathcal{L}$ to expand so that $\ker d^2 \mathcal{L}(q^*) = K_p \oplus K_*$, where $K_*$ is the span of the eigenvectors with crossing eigenvalues. Instead of detecting bifurcations by the expensive process of monitoring the expansion of $\ker d^2 \mathcal{L}$ (from $K_p$ to $K_p \oplus K_*$), we give a simple way to check the eigenvalue crossing condition for annealing problems $F = G(q) + \beta D(q)$ as in (2) [20]. We prove the existence of the bifurcating branches by adapting the standard proof of the Equivariant Branching Lemma. This newly developed theory guarantees that bifurcating branches exist in $K_*$, are generically pitchforks, and that symmetry breaks from $S_M$ to $S_m \times S_n$. Additionally, we give conditions to check whether the pitchforks are subcritical or supercritical, and how stability of the bifurcating branches relates to optimality in the optimization problem (1).

2. Bifurcation Analysis

2.1. Equivariant Branching Lemma

The Equivariant Branching Lemma relates the subgroup structure of a symmetry group $\Gamma$ with the existence of symmetry-breaking bifurcating branches of equilibria of $\dot{x} = f(x, \beta)$. Observe that we present a version that does not require absolute irreducibility. For a proof see [22] p. 83.
Theorem 1.
(Equivariant Branching Lemma). Let $f$ be a smooth function $f : V \times \mathbb{R} \to V$ that is $\Gamma$-equivariant for a compact Lie group $\Gamma$ and a Banach space $V$. Let $\Sigma$ be an isotropy subgroup of $\Gamma$ with $\dim \operatorname{Fix}(\Sigma) = 1$. Suppose that $\operatorname{Fix}(\Gamma) = \{0\}$ and that the crossing condition $d^2_{\beta x} f(0, 0)\, x_0 \neq 0$ holds for $x_0 \in \operatorname{Fix}(\Sigma)$. Then there exists a unique smooth solution branch $(t x_0, \beta(t))$ to $f = 0$ with isotropy subgroup $\Sigma$.
For an arbitrary $\Gamma$-equivariant system where bifurcation occurs at $(x^*, \beta^*)$, the requirement in Theorem 1 that the bifurcation occurs at the origin is accomplished by a translation. Assuring that the Jacobian vanishes, $d_x f(0, 0) = 0$, can be effected by restricting and projecting the system onto the kernel of the Jacobian. This transform is called the Liapunov–Schmidt reduction (see [23]).
The Equivariant Branching Lemma does not directly apply to yield bifurcating branches for the problem (1) at $q$ for which $d^2 F$ is singular, for the following reasons:
  • $K_p$ and $K_*$ have independent bases, which implies that each is invariant to the action of $S_N$, and so the decomposition $\ker d^2 \mathcal{L}(q^*) = K_p \times K_*$ shows that $S_N$ does not act absolutely irreducibly on $\ker d^2 F(q^*)$, but it does act absolutely irreducibly on each of these disjoint subspaces separately. This is why we present a version of the Equivariant Branching Lemma that does not require absolute irreducibility.
  • The Liapunov–Schmidt reduction onto $\ker d^2 \mathcal{L}(q^*)$ is clear, but not onto $K_*$.
  • $\operatorname{Fix}(S_m \times S_n) \cap \ker d^2 \mathcal{L}(q^*)$ is two-dimensional with basis
    $$\{ (n v, \dots, n v, -m v, \dots, -m v),\ (n y, \dots, n y, -m y, \dots, -m y) \},$$
    where $v, y \in \mathbb{R}^K$.
We address these issues in the manuscript and show that a small modification of the Equivariant Branching Lemma allows for similar analysis to be successfully applied to Information Bottleneck-style problems such as (2) with minimal modifications to the original algorithm from [20].

2.2. A Gradient Flow

We now lay the groundwork necessary to determine the bifurcations of local solutions to (1),
$$\max_{q \in \Delta} F(q, \beta),$$
where $F = \sum_{i=1}^{N} f(q_i, \beta)$, which includes as special cases the Information Distortion (3) and Information Bottleneck (4) problems. The convex set of discrete conditional probabilities is
$$\Delta := \Big\{ q \in \mathbb{R}^{NK} \ \Big|\ \sum_{i=1}^{N} q_{ki} = 1 \ \ \forall k : 1 \le k \le K \ \text{ and } \ q_{ki} \ge 0 \ \ \forall i, k \Big\}.$$
Due to the form of F, it has the following properties:
  • $F(q, \beta)$ is an $S_N$-invariant, real-valued function of $q$, where the action of $S_N$ on $q$ permutes the component vectors $q_i$, $i = 1, \dots, N$, of $q \in \Delta$.
  • The $NK \times NK$ Hessian $d_q^2 F(q, \beta)$ is block diagonal, where the $i$th $K \times K$ block is $d^2 f(q_i)$.
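The Lagrangian of (1) with respect to the equality constraints from $\Delta$ is
$$\mathcal{L}(q, \lambda, \beta) = F(q, \beta) + \sum_{k=1}^{K} \lambda_k \Big( \sum_{i=1}^{N} q_{ki} - 1 \Big). \tag{5}$$
The scalar $\lambda_k$ is the Lagrange multiplier for the constraint $\sum_{i=1}^{N} q_{ki} - 1 = 0$, and $\lambda \in \mathbb{R}^K$ is the vector of Lagrange multipliers $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_K)^T$. The gradient of the Lagrangian in (5) is
$$\nabla \mathcal{L} := \nabla_{q, \lambda} \mathcal{L}(q, \lambda, \beta) = \begin{pmatrix} \nabla_q \mathcal{L} \\ \nabla_\lambda \mathcal{L} \end{pmatrix},$$
where $\nabla_q \mathcal{L} = \nabla F(q, \beta) + \Lambda$ and $\Lambda = (\lambda^T, \lambda^T, \dots, \lambda^T)^T \in \mathbb{R}^{NK}$. The gradient $\nabla_\lambda \mathcal{L}$ is a vector of $K$ constraints
$$\nabla_\lambda \mathcal{L} = \begin{pmatrix} \sum_i q_{1i} - 1 \\ \sum_i q_{2i} - 1 \\ \vdots \\ \sum_i q_{Ki} - 1 \end{pmatrix}.$$
Let $J$ be the Jacobian of $\nabla_\lambda \mathcal{L}$ with respect to $q$,
$$J := d_q \nabla_\lambda \mathcal{L} = \underbrace{\begin{pmatrix} I_K & I_K & \cdots & I_K \end{pmatrix}}_{N \text{ blocks}}. \tag{6}$$
Observe that $J$ has full row rank. The Hessian of (5) with respect to the vector $\begin{pmatrix} q \\ \lambda \end{pmatrix} \in \mathbb{R}^{NK + K}$ is
$$d^2 \mathcal{L}(q) := d^2 \mathcal{L}(q, \lambda, \beta) = \begin{pmatrix} d^2 F(q, \beta) & J^T \\ J & 0 \end{pmatrix}, \tag{7}$$
where $0$ is $K \times K$. The $NK \times NK$ matrix $d^2 F(q) := d_q^2 F(q, \beta)$ is the block diagonal Hessian of $F$ with $K \times K$ blocks $\{ d^2 f(q_i, \beta) \}_{i=1}^{N}$.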
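The block structure of $J$ and of the bordered Hessian in (7) is easy to assemble numerically. The following is a minimal sketch, assuming the $K \times K$ blocks $d^2 f(q_i)$ are already available as arrays; the function name `bordered_hessian` and the diagonal toy blocks are hypothetical and stand in for whatever Hessian blocks a particular cost function produces.

```python
import numpy as np

def bordered_hessian(blocks):
    """Assemble J and d^2 L from the K x K diagonal blocks d^2 f(q_i)."""
    N = len(blocks)
    K = blocks[0].shape[0]
    d2F = np.zeros((N * K, N * K))
    for i, block in enumerate(blocks):
        d2F[i * K:(i + 1) * K, i * K:(i + 1) * K] = block
    J = np.hstack([np.eye(K)] * N)              # K x NK: N identity blocks
    d2L = np.block([[d2F, J.T],
                    [J, np.zeros((K, K))]])     # (NK + K) x (NK + K)
    return J, d2L

# Toy blocks for N = 3 classes and K = 2 elements of Y (placeholder values).
blocks = [np.diag([-1.0, -2.0]), np.diag([-3.0, -4.0]), np.diag([-5.0, -6.0])]
J, d2L = bordered_hessian(blocks)

# With q stored as the stacked vector (q_1; q_2; ...; q_N), J q returns the
# column sums sum_i q_ki, so J q - 1 reproduces the K equality constraints.
q = np.array([[0.2, 0.5],
              [0.3, 0.1],
              [0.5, 0.4]])
constraint_residual = J @ q.reshape(-1) - 1.0
```

Storing $q$ row by row (one row per class) is a convention chosen here so that `q.reshape(-1)` matches the stacking $(q_1^T, \dots, q_N^T)^T$ used in the text.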
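The dynamical system whose equilibria are stationary points of (1) is the gradient flow of the Lagrangian
$$\begin{pmatrix} \dot{q} \\ \dot{\lambda} \end{pmatrix} = \nabla \mathcal{L}(q, \lambda, \beta) \tag{8}$$
for $\mathcal{L}$ as defined in (5) and $\beta \in [0, \infty)$. The equilibria of (8) are points $\begin{pmatrix} q^* \\ \lambda^* \end{pmatrix} \in \mathbb{R}^{NK + K}$ where
$$\nabla \mathcal{L}(q^*, \lambda^*, \beta) = 0.$$
The Jacobian of this system is the Hessian $d^2 \mathcal{L}(q, \lambda, \beta)$ from (7).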
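To make the equilibrium condition concrete, the sketch below checks numerically that the uniform quantizer $q_{ki} = 1/N$, which is fixed by all of $S_N$, satisfies $\nabla \mathcal{L} = 0$ for the Information Bottleneck cost. It assumes $F_{IB} = -I(Y;T) + \beta I(X;T)$ measured in nats, uses a central finite-difference gradient rather than an analytic one, and relies on a hypothetical joint distribution `p_xy`; the names `F_ib` and `grad_F` are illustrative only.

```python
import numpy as np

def F_ib(q, p_xy, beta):
    """F_IB = -I(Y;T) + beta * I(X;T) in nats; q[i, j] = q(t = i | y = j) > 0."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_t = q @ p_y                          # p(t), shape (N,)
    p_xt = p_xy @ q.T                      # p(x, t), shape (|X|, N)
    i_yt = np.sum(q * p_y * np.log(q / p_t[:, None]))
    i_xt = np.sum(p_xt * np.log(p_xt / np.outer(p_x, p_t)))
    return -i_yt + beta * i_xt

def grad_F(q, p_xy, beta, h=1e-6):
    """Central finite-difference gradient of F_IB with respect to every q[i, j]."""
    g = np.zeros_like(q)
    for idx in np.ndindex(*q.shape):
        e = np.zeros_like(q)
        e[idx] = h
        g[idx] = (F_ib(q + e, p_xy, beta) - F_ib(q - e, p_xy, beta)) / (2 * h)
    return g

# Hypothetical joint distribution: 3 values of X, K = 4 values of Y.
p_xy = np.array([[0.10, 0.05, 0.05, 0.05],
                 [0.05, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.20]])
N, K, beta = 3, 4, 2.5
q_uniform = np.full((N, K), 1.0 / N)

g = grad_F(q_uniform, p_xy, beta)
lam = -g[0]                                # candidate multipliers, one per element of Y
grad_q_L = g + lam                         # nabla_q L = nabla F + Lambda (lam added to every block)
grad_lam_L = q_uniform.sum(axis=0) - 1.0   # the K equality constraints
```

Because all $N$ blocks of the gradient coincide at a fully symmetric point, a single multiplier vector cancels every block at once; at less symmetric quantizers the same check would generally fail.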
Remark 1.
By the theory of constrained optimization [24], the equilibria $(q^*, \lambda^*, \beta)$ of (8) where $d^2 F(q^*, \beta)$ is negative definite on $\ker J$ are local solutions of (1). Conversely, if $(q^*, \beta)$ is a local solution of (1), then there exists a vector of Lagrange multipliers $\lambda^*$ so that $(q^*, \lambda^*, \beta)$ is an equilibrium of (8) (this necessary requirement is called the Karush–Kuhn–Tucker conditions) such that $d^2 F(q^*, \beta)$ is non-positive definite on $\ker J$.
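The second-order test in Remark 1 can be carried out by restricting the Hessian to a basis of $\ker J$. The following minimal sketch does this with an SVD-based null space; the helper names `null_space` and `is_negative_definite_on_ker`, the tolerances, and the toy Hessians are assumptions made for illustration.

```python
import numpy as np

def null_space(A, tol=1e-12):
    """Orthonormal basis for ker A from the SVD (columns of the returned array)."""
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

def is_negative_definite_on_ker(d2F, J, tol=1e-10):
    """Check that Z^T d2F Z has only negative eigenvalues for a basis Z of ker J."""
    Z = null_space(J)
    reduced = Z.T @ d2F @ Z
    return bool(np.all(np.linalg.eigvalsh(reduced) < -tol))

# N = 2 classes, K = 2 elements of Y, so J = (I_2  I_2).
K, N = 2, 2
J = np.hstack([np.eye(K)] * N)

# Indefinite on the whole space, yet negative definite on ker J.
d2F_ok = np.diag([1.0, -1.0, -3.0, -1.0])
# Positive curvature survives the restriction to ker J.
d2F_bad = np.diag([3.0, -1.0, -1.0, -1.0])

ok = is_negative_definite_on_ker(d2F_ok, J)
bad = is_negative_definite_on_ker(d2F_bad, J)
```

The first toy Hessian shows why the restriction matters: definiteness is required only along directions that preserve the constraints, not on all of $\mathbb{R}^{NK}$.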

2.3. Equilibria with Symmetry

Next, we categorize the equilibria of (8) according to their symmetries, which allows us to determine when to expect symmetry-breaking bifurcations.
Let $q \in \operatorname{Fix}(S_M)$ for some $1 \le M \le N$. Then there exists a partition of $\{1, 2, \dots, N\}$ into the sets $\mathcal{U}$ and $\mathcal{R}$, where $|\mathcal{U}| = M$, so that $q_i = q_j$ if and only if $i, j \in \mathcal{U}$. Clearly, $d^2 F$ has $M$ identical blocks, $\{ d^2 f(q_i) \}_{i \in \mathcal{U}}$.
To ease the notation, and without loss of generality, we set
$$\mathcal{U} := \{1, \dots, M\} \quad \text{and} \quad \mathcal{R} := \{M + 1, \dots, N\}.$$
To distinguish between the blocks of $d^2 F$, we write
$$B := d^2 f(q_i) \ \text{ for } \ 1 \le i \le M \quad \text{and} \quad R_i := d^2 f(q_i) \ \text{ for } \ M + 1 \le i \le N. \tag{9}$$
As mentioned in the introduction, we assume that for each $q \in \Delta$, each block $d^2 f(q_i)$ always has at least a one-dimensional kernel with basis vector(s) that depend on $q$. Thus, $\dim \ker d^2 F \ge N$. At an equilibrium $(q^*, \lambda^*, \beta^*)$ of (8) where $q^* \in \operatorname{Fix}(S_M)$, we consider the following three cases:
  • $\dim \ker d^2 F(q^*) > N + 1$;
  • $\dim \ker d^2 F(q^*) = N + 1$;
  • $\dim \ker d^2 F(q^*) = N$.
We will show that the first case necessitates a symmetry-breaking bifurcation (Theorem 3). In the second case, there is no bifurcation (Corollary 1). Finally, in the third case, we expect a saddle node [21], a symmetry-preserving bifurcation.
We are able to distinguish between the three cases above by considering which blocks of d 2 F ( q * ) have kernels that have more than one dimension. This motivates the following definition.
Definition 1.
An equilibrium $(q^*, \lambda^*, \beta^*)$ of (8) is $M$-singular (or, equivalently, $q^*$ is $M$-singular) if:
  • $q^* \in \operatorname{Fix}(S_M)$ so that $q^*_i = q^*_j$ for every $1 \le i, j \le M$.
  • For $B$, the $M$ block(s) of the Hessian defined in (9), $\ker B$ has dimension 2 with basis vectors $v, y \in \mathbb{R}^K$. $v$ is associated with the crossing eigenvalues, and $y$ is associated with the constant zero eigenvalue of $B$.
  • The $N - M$ block(s) of the Hessian $\{ R_i \}_i$, defined in (9), each have a one-dimensional kernel with basis vector $z^{(i)} \in \mathbb{R}^K$.
  • The vectors $v$, $y$ and $\{ z^{(i)} \}$ are linearly independent.
  • The matrix
    $$A := B \sum_{i = M + 1}^{N} R_i^{+} + M I_K \tag{10}$$
    is nonsingular. $R_i^{+}$ is the Moore–Penrose inverse of $R_i$. When $M = N$, we define $A := N I_K$.
We wish to emphasize that we showed in [21] that requirements 2–5 in Definition 1 hold generically.
A straightforward calculation shows that every block of the Hessian $d^2 F$ of the Information Bottleneck cost function (2) is singular for every $(q, \beta)$, and the basis for $\ker d^2 f(q_i)$ is $y = q_i$ for $1 \le i \le M$ and $z^{(i)} = q_i$ for $M + 1 \le i \le N$ (Lemma 42 in [25]), which assures that these vectors are linearly independent, as in Definition 1.4. At a bifurcation, the kernels of the identical blocks $B$ expand by $v$ as in Definition 1.2. Using the notation above, $y = q_i$ for each $i \in \mathcal{U}$, and $z^{(i)} = q_i$ for each $i \in \mathcal{R}$.

2.4. The Kernel at a Bifurcation

The equilibria of (8) change their stability with $\beta$, and hence change the solutions to (1). The changes of stability are determined by the kernel of $d^2 \mathcal{L}(q^*)$ at a bifurcation point $q^*$. In this section we show that for any $q \in \operatorname{Fix}(S_M)$ with $M > 1$, $d^2 \mathcal{L}(q^*)$ has a perpetual kernel $K_p$ that is at least $(M-1)$-dimensional. The zero eigenvalues associated with the eigenvectors in $K_p$ remain constant, so that at a bifurcation point $(q^*, \lambda^*, \beta^*)$ of (8) where $q^*$ is $M$-singular, new eigenvalues of $d^2 \mathcal{L}$ must cross zero. Thus, the kernel expands, and the bifurcating directions exist in an “expanded” kernel of $d^2 \mathcal{L}(q^*)$, $\ker d^2 \mathcal{L}(q^*) = K_* \times K_p$.
We determine a basis for $\ker d^2 \mathcal{L}$ at an $M$-singular $q^*$ when $M > 1$. If $q^*$ is 1-singular with a trivial isotropy group (i.e., no symmetry), then $d^2 \mathcal{L}(q^*)$ is non-singular, and $K_p$ disappears. First, we ascertain a basis for $\ker d^2 F(q^*)$.
Recall that in the preliminaries, when $x \in \mathbb{R}^{NK}$, we defined $x_j \in \mathbb{R}^K$ to be the $j$th vector component of $x$. We now define the linearly independent vectors $\{ v_i \}_{i=1}^{M}$, $\{ y_i \}_{i=1}^{M}$, and $\{ z_k \}_{k=M+1}^{N}$ in $\mathbb{R}^{NK}$ by
$$(v_i)_j := \begin{cases} v & \text{if } 1 \le i = j \le M \\ 0 & \text{otherwise,} \end{cases} \qquad (y_i)_j := \begin{cases} y & \text{if } 1 \le i = j \le M \\ 0 & \text{otherwise,} \end{cases} \qquad (z_k)_j := \begin{cases} z^{(k)} & \text{if } M + 1 \le j = k \le N \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$
where $0 \in \mathbb{R}^K$, and $v$ and $y$ are defined in Definition 1.2. For example, if $M = 2$ and $N = 3$, then $v_1 := (v^T, 0, 0)^T$ and $v_2 := (0, v^T, 0)^T$.
Due to the block diagonal form of $d^2 F(q^*)$, it is easy to see that the $N + M$ vectors defined in (11) form a basis for $\ker d^2 F(q^*)$.
Now, let
$$V_i = \begin{pmatrix} v_i \\ 0 \end{pmatrix} - \begin{pmatrix} v_M \\ 0 \end{pmatrix}, \qquad Y_i = \begin{pmatrix} y_i \\ 0 \end{pmatrix} - \begin{pmatrix} y_M \\ 0 \end{pmatrix}, \qquad Z_k = \begin{pmatrix} z_k \\ 0 \end{pmatrix} - \begin{pmatrix} z_N \\ 0 \end{pmatrix} \tag{12}$$
for $i = 1, \dots, M - 1$ and $M + 1 \le k \le N - 1$, where $0 \in \mathbb{R}^K$. From (7), it is easy to see that these three sets of vectors are in $\ker d^2 \mathcal{L}(q^*)$. The next theorem shows that $\{ V_i \}_{i=1}^{M-1} \cup \{ Y_i \}_{i=1}^{M-1}$ are a basis for $\ker d^2 \mathcal{L}(q^*)$. This natural partition of the basis vectors shows that $\ker d^2 \mathcal{L}(q^*)$ can be written as $\ker d^2 \mathcal{L}(q^*) = K_p \times K_*$. According to Definition 1, the “perpetual kernel” corresponding to constant zero eigenvalues of $d^2 \mathcal{L}(q^*)$ is generated by
$$K_p = \big\langle \{ Y_i \}_{i=1}^{M-1} \big\rangle.$$
The part of the kernel that arises at a bifurcation corresponding to eigenvalues crossing zero is
$$K_* = \big\langle \{ V_i \}_{i=1}^{M-1} \big\rangle.$$
The vectors $\{ Z_k \}$ do not contribute to $\ker d^2 \mathcal{L}(q^*)$.
Theorem 2.
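If $q^*$ is $M$-singular for $1 < M \le N$, then $\{ V_i \} \cup \{ Y_i \}$ from (12) are a basis for $\ker d^2 \mathcal{L}(q^*)$.
Proof.
To show that $\{ V_i \}_{i=1}^{M-1} \cup \{ Y_i \}_{i=1}^{M-1}$ span $\ker d^2 \mathcal{L}(q^*)$, let $k \in \ker d^2 \mathcal{L}(q^*)$ and decompose it as
$$k = \begin{pmatrix} k_F \\ k_J \end{pmatrix} \tag{13}$$
where $k_F$ is $NK \times 1$, and $k_J$ is $K \times 1$. Hence,
$$d^2 \mathcal{L}(q^*, \lambda^*, \beta^*)\, k = \begin{pmatrix} d^2 F(q^*, \beta^*) & J^T \\ J & 0 \end{pmatrix} \begin{pmatrix} k_F \\ k_J \end{pmatrix} = 0 \iff \begin{cases} d^2 F(q^*, \beta^*)\, k_F = -J^T k_J \\ J k_F = 0. \end{cases} \tag{14}$$
Now, from (6) and the fact that $d^2 F$ is block diagonal, we have
$$\begin{pmatrix} d^2 f(q_1) & 0 & \cdots & 0 \\ 0 & d^2 f(q_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & d^2 f(q_N) \end{pmatrix} k_F = - \begin{pmatrix} k_J \\ k_J \\ \vdots \\ k_J \end{pmatrix}. \tag{15}$$
We set
$$k_F := ( x_1^T \ \ x_2^T \ \ \cdots \ \ x_N^T )^T, \tag{16}$$
and using the notation from (9), then (15) implies
$$B x_i = -k_J \ \text{ for } \ 1 \le i \le M, \qquad R_i x_i = -k_J \ \text{ for } \ M + 1 \le i \le N. \tag{17}$$
It follows that $x_i = R_i^{+} B x_1$ for every $M + 1 \le i \le N$. By (14), we have that $\sum_{i=1}^{N} x_i = 0$, and so
$$\sum_{i=1}^{M} x_i + \sum_{i=M+1}^{N} x_i = 0 \implies \sum_{i=1}^{M} x_i + \sum_{i=M+1}^{N} R_i^{+} B x_1 = 0.$$
By (17), for every $1 \le i \le M$, $x_i$ can be written as $x_i = x_p + d_i v + e_i y$, where $x_p \in \operatorname{range}(B)$, $d_i \in \mathbb{R}$, $e_i \in \mathbb{R}$, and $v$ and $y$ are the basis vectors of $\ker B$ from Definition 1.2. Thus,
$$B \sum_{i=1}^{M} ( x_p + d_i v + e_i y ) + B \sum_{i=M+1}^{N} R_i^{+} B ( x_p + d_1 v + e_1 y ) = 0 \iff \Big( B \sum_{i=M+1}^{N} R_i^{+} + M I_K \Big) B x_p = 0 \iff B x_p = 0$$
since $A = B \sum_{i=M+1}^{N} R_i^{+} + M I_K$ is nonsingular. This shows that $x_p = 0$. Therefore, $x_i = d_i v + e_i y$ for every $1 \le i \le M$. Now (17) shows that $k_J = 0$, and so $x_i \in \ker R_i$ for $M + 1 \le i \le N$, which implies that
$$x_i = c_i z^{(i)} \ \text{ for } \ M + 1 \le i \le N.$$
Hence, $k = \begin{pmatrix} k_F \\ 0 \end{pmatrix}$, where $(k_F)_i = \begin{cases} d_i v + e_i y & \text{if } 1 \le i \le M \\ c_i z^{(i)} & \text{if } M + 1 \le i \le N, \end{cases}$ from which it follows that
$$J k_F = \sum_{i=1}^{N} x_i = \sum_{i=1}^{M} d_i v + \sum_{i=1}^{M} e_i y + \sum_{i=M+1}^{N} c_i z^{(i)} = 0.$$
Linear independence (Definition 1.4) implies that $\sum_{i=1}^{M} d_i = \sum_{i=1}^{M} e_i = 0$ and $c_i = 0$. Thus, $k_F = \sum_{i=1}^{M-1} d_i ( v_i - v_M ) + \sum_{i=1}^{M-1} e_i ( y_i - y_M )$. Therefore, the linearly independent vectors $\{ V_i \} = \Big\{ \begin{pmatrix} v_i - v_M \\ 0 \end{pmatrix} \Big\}$ and $\{ Y_i \} = \Big\{ \begin{pmatrix} y_i - y_M \\ 0 \end{pmatrix} \Big\}$ span $\ker d^2 \mathcal{L}(q^*)$. □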
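The dimension count in Theorem 2 can be checked on a deliberately small example. The sketch below uses hand-picked diagonal blocks with $K = 3$, $M = 2$, $N = 3$; these matrices are hypothetical and are not Hessians of any information-theoretic cost. They are chosen only so that the hypotheses of Definition 1 hold: $\ker B = \operatorname{span}\{v, y\}$, $\ker R = \operatorname{span}\{z\}$, the three vectors are independent, and $A$ is nonsingular.

```python
import numpy as np

K, M, N = 3, 2, 3
B = np.diag([0.0, 0.0, -1.0])      # the M identical blocks; ker B = span{v, y}
R = np.diag([-1.0, -1.0, 0.0])     # the remaining block; ker R = span{z}
v, y, z = np.eye(K)                # v = e_1, y = e_2, z = e_3 (linearly independent)

# A = B * sum_i R_i^+ + M * I_K must be nonsingular (Definition 1).
A = B @ np.linalg.pinv(R) + M * np.eye(K)

# Assemble d^2 F = diag(B, B, R), J = (I_K ... I_K), and the bordered Hessian (7).
d2F = np.zeros((N * K, N * K))
for i, block in enumerate([B, B, R]):
    d2F[i * K:(i + 1) * K, i * K:(i + 1) * K] = block
J = np.hstack([np.eye(K)] * N)
d2L = np.block([[d2F, J.T], [J, np.zeros((K, K))]])

# Candidate kernel vectors from (12): V_1 = (v_1 - v_M; 0), Y_1 = (y_1 - y_M; 0).
zero = np.zeros(K)
V1 = np.concatenate([v, -v, zero, zero])
Y1 = np.concatenate([y, -y, zero, zero])

# Nullity of d^2 L, counted from (numerically) zero singular values.
singular_values = np.linalg.svd(d2L, compute_uv=False)
nullity = int(np.sum(singular_values < 1e-10))
```

In this toy case the kernel of $d^2 \mathcal{L}$ is exactly two-dimensional, matching $2M - 2$, and is spanned by `V1` and `Y1`; the vector built from $z$ does not appear, in line with the remark that $\{Z_k\}$ does not contribute.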
Corollary 1.
If $q^*$ is 1-singular and has isotropy group equal to the identity, then $d^2 \mathcal{L}(q^*)$ is nonsingular.
Proof.
If $q^*$ is 1-singular, then $d^2 F(q^*)$ has a single block $B$ with a two-dimensional kernel. The other $N - 1$ blocks $\{ R_i \}$ are distinct with one-dimensional kernels. By constructing the vectors as in (11), we see that $\dim \ker d^2 F(q^*) = N + 1$ with basis vectors $v_1$, $y_1$, $\{ z_i \}_{i=2}^{N}$. Now, following the proof of Theorem 2, we take an arbitrary $k \in \ker d^2 \mathcal{L}(q^*, \lambda, \beta)$, and then decompose $k$ as in (13) and (16). The proof of Theorem 2 holds for the present case up until, and including, (18). Linear independence now shows that $d_i = e_i = c_i = 0$, which implies that $k = 0$. □
Remark 2. 
The independent bases given for $K_p$ and $K_*$ in Theorem 2 imply that each is invariant to the action of $S_N$, and so the decomposition $\ker d^2 \mathcal{L}(q^*) = K_p \times K_*$ shows that $S_N$ does not act absolutely irreducibly on $\ker d^2 F(q^*)$. That is, by definition,
$$d_x r(0, \beta) \neq c(\beta) I_{2M-2}.$$
The explicit bases show that $K_p, K_* \cong \{ x \in \mathbb{R}^M : \sum_i [x]_i = 0 \}$, which implies that $S_M$ acts absolutely irreducibly on $K_p$ and $K_*$ [26]. Thus, $K_p$ and $K_*$ are each $S_M$-irreducible.

2.5. Liapunov–Schmidt Reduction

To show the existence of bifurcating branches from a bifurcation point $(q^*, \lambda^*, \beta^*)$ of equilibria of (8), the Equivariant Branching Lemma requires that the bifurcation is translated to $(0, 0, 0)$ and that the Jacobian vanishes at bifurcation. To accomplish the former, consider
$$\mathcal{F}(q, \lambda, \beta) := \nabla \mathcal{L}(q + q^*, \lambda + \lambda^*, \beta + \beta^*).$$
To assure that the Jacobian vanishes, we restrict and project $\mathcal{F}$ onto $\ker d^2 \mathcal{L}(q^*)$ in a neighborhood of $(0, 0, 0)$. This is the Liapunov–Schmidt reduction of $\mathcal{F}$ [23],
$$r : \mathbb{R}^{2M-2} \times \mathbb{R} \to \mathbb{R}^{2M-2}, \qquad r(x, \beta) = W^T (I - E)\, \mathcal{F}( W x + U(W x, \beta), \beta ), \tag{19}$$
where $W x + U(W x, \beta) = \begin{pmatrix} q \\ \lambda \end{pmatrix}$. The $(NK + K) \times (NK + K)$ matrix $I - E$ is the projection matrix onto $\ker d\mathcal{F}(0, 0) = \ker d^2 \mathcal{L}(q^*)$ with $\ker (I - E) = \operatorname{range} d^2 \mathcal{L}(q^*)$. $W$ is the $(NK + K) \times (2M - 2)$ matrix whose columns are the basis vectors $\{ V_i \} \cup \{ Y_i \}$ of $\ker d^2 \mathcal{L}(q^*)$ from (12), so that $W x$ is a vector in $\ker d^2 \mathcal{L}(q^*)$. The vector function $U(W x, \beta)$ is the component of $(q, \lambda)$ that is in $\operatorname{range} d^2 \mathcal{L}(q^*)$ such that $E \mathcal{F}( W x + U(W x, \beta), \beta ) = 0$, $U(0, 0) = 0$, and
$$d_x U(0, 0) = 0. \tag{20}$$
The system defined by the Liapunov–Schmidt reduction, $\dot{x} = r(x, \beta)$, has a bifurcation of equilibria at $(x = 0, \beta = 0)$, which are in one-to-one correspondence with equilibria of (8). However, the stability of these associated equilibria is not necessarily the same.
It is straightforward to verify the following derivatives ([23] p. 32), which we will require in the sequel. The $(2M - 2) \times (2M - 2)$ Jacobian of (19) is
$$d_x r(x, \beta) = W^T (I - E)\, d^2_{q, \lambda} \mathcal{L}(q + q^*, \lambda + \lambda^*, \beta + \beta^*)\, ( W + d_x U(W x, \beta) ), \tag{21}$$
which shows that
$$d_x r(0, 0) = 0$$
since $\ker (I - E) = \operatorname{range} d^2 \mathcal{L}(q^*)$.
Our crossing condition at a bifurcation depends on the matrix of derivatives
$$\frac{\partial^2 r_i}{\partial \beta\, \partial x_j}(0, 0) = d_\beta d^2 \mathcal{L}[ w_i, w_j ] - d^3 \mathcal{L}[ w_i, w_j, L\, d_\beta \nabla \mathcal{L} ], \tag{23}$$
where the derivatives of $\mathcal{L}$ are evaluated at $(q^*, \lambda^*, \beta^*)$, and $L$ is the Moore–Penrose generalized inverse [27] of $d^2 \mathcal{L}(q^*)$. The vectors $\{ w_i \}_{i=1}^{2M-2}$ are the basis vectors of $\ker d^2 \mathcal{L}(q^*)$ from Theorem 2.
The $(2M - 2) \times (2M - 2) \times (2M - 2)$ three-dimensional array of second derivatives is
$$\frac{\partial^2 r_i}{\partial x_j\, \partial x_k}(0, 0) = d^3 \mathcal{L}(q^*, \lambda^*, \beta^*)[ w_i, w_j, w_k ].$$
In [21], we showed that $\frac{\partial^2 r_i}{\partial x_j \partial x_k}(0, 0) = 0$ whenever $i = j = k \le M - 1$. In the present case, there are more zero entries since now the basis vectors $\{ w_i \}$ are of two types: $w_i = V_i$ for $1 \le i \le M - 1$ (basis vectors of $K_*$); or $w_i = Y_{i - M + 1}$ for $M \le i \le 2M - 2$ (basis vectors of $K_p$, see (12)). We now consider the case when $i, j \le M - 1$ and $k > M - 1$. All other cases are dealt with using a similar argument. Substituting in for $w_i$ we have
$$\frac{\partial^2 r_i}{\partial x_j \partial x_k}(0, 0) = \sum_{\nu, \delta, \eta = 1}^{N} \sum_{l, m, n = 1}^{K} \frac{\partial^3 F(q^*, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\delta}\, \partial q_{n\eta}} [ v_i - v_M ]_{l\nu} [ v_j - v_M ]_{m\delta} [ y_{k - M + 1} - y_M ]_{n\eta} = \sum_{l, m, n = 1}^{K} \frac{\partial^3 f(q^*_\nu, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\nu}\, \partial q_{n\nu}} \big( \delta_{i j (k - M + 1)} [v]_l [v]_m [y]_n - [v]_l [v]_m [y]_n \big).$$
The vectors $v$ and $y$ are defined in Definition 1.2, and $\nu$ is any index with $1 \le \nu \le M$, since the first $M$ blocks are identical. An immediate consequence of this calculation is that $\frac{\partial^2 r_i}{\partial x_j \partial x_k}(0, 0) = 0$ whenever $i = j = k - M + 1$. Thus, similar arguments show that $\frac{\partial^2 r_i}{\partial x_j \partial x_k}(0, 0) = 0$ whenever:
  • $i = j = k$;
  • $i - M + 1 = j = k$, $i = j - M + 1 = k$, or $i = j = k - M + 1$;
  • $i - M + 1 = j - M + 1 = k$, $i - M + 1 = j = k - M + 1$, or $i = j - M + 1 = k - M + 1$.
Further, we get four different “cubes” of identical entries in the 3-D array. They are:
  • For $i, j, k \le M - 1$, not all equal, the value of the cube is
    $$- \sum_{l, m, n = 1}^{K} \frac{\partial^3 f(q^*_\nu, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\nu}\, \partial q_{n\nu}} [v]_l [v]_m [v]_n ;$$
  • For $i, j \le M - 1$, not both equal, and $k > M - 1$, the value of the cube is
    $$- \sum_{l, m, n = 1}^{K} \frac{\partial^3 f(q^*_\nu, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\nu}\, \partial q_{n\nu}} [v]_l [v]_m [y]_n ;$$
  • For $i \le M - 1$ and $j, k > M - 1$, not both equal, the value of the cube is
    $$- \sum_{l, m, n = 1}^{K} \frac{\partial^3 f(q^*_\nu, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\nu}\, \partial q_{n\nu}} [v]_l [y]_m [y]_n ;$$
  • For $i, j, k > M - 1$, not all equal, the value of the cube is
    $$- \sum_{l, m, n = 1}^{K} \frac{\partial^3 f(q^*_\nu, \beta^*)}{\partial q_{l\nu}\, \partial q_{m\nu}\, \partial q_{n\nu}} [y]_l [y]_m [y]_n .$$
The points above will prove useful when proving that $d^2 r(0, 0) = 0$.
The four-dimensional array of third derivatives of $r$ is
$$\frac{\partial^3 r_i}{\partial x_j\, \partial x_k\, \partial x_l}(0, 0) = d^4 \mathcal{L}[ w_i, w_j, w_k, w_l ] - d^3 \mathcal{L}[ w_i, w_j, L\, d^3 \mathcal{L}[ w_k, w_l ] ] - d^3 \mathcal{L}[ w_i, w_k, L\, d^3 \mathcal{L}[ w_j, w_l ] ] - d^3 \mathcal{L}[ w_i, w_l, L\, d^3 \mathcal{L}[ w_j, w_k ] ]$$
where the derivatives of $\mathcal{L}$ are evaluated at $(q^*, \lambda^*, \beta^*)$, and $L$ is the Moore–Penrose generalized inverse [27] of $d^2 \mathcal{L}(q^*)$.
Since $\ker d^2 \mathcal{L}(q^*)$ is not absolutely irreducible, but $K_*$ is, one might try to define a Liapunov–Schmidt reduction by restricting and projecting $\nabla \mathcal{L}$ onto $K_*$. One issue with projecting the reduction onto $K_*$ is how to define the projection matrix $E$ so that
$$E \mathcal{F} = 0 \ \text{ and } \ (I - E) \mathcal{F} = 0 \quad \text{if and only if} \quad \mathcal{F} = 0$$
holds and $E\, d_x r(0, 0)$ is non-singular in $\operatorname{range}(E)$, so that the Implicit Function Theorem assures the restriction $(q, \lambda) = W x + U(W x, \beta)$, where $U(W x) \in \operatorname{range}( d^2 \mathcal{L}(q^*) )$, and $W x \in K_*$ instead of $W x \in \ker d^2 \mathcal{L}(q^*)$ as in (19) [23]. Simply ignoring the space $K_p$ by considering $U \in \operatorname{range}( d^2 \mathcal{L}(q^*) )$ and $W x \in K_*$ amounts to setting $W x = k_* + k_p$ and $k_p = 0$. Since $W x + U$ is still embedded in the larger $\mathbb{R}^{NK + K}$, which contains $K_p$, derivatives are affected by the implicit $k_p = 0$ constraint. This constraint, $P_{K_p}(q, \lambda) = k_* + U$, is nonlinear (and may not even be tractable) since $K_p$ depends on $q$, where $P_{K_p}$ is a projection matrix that depends on $q$ (see Theorem 7).

2.6. Isotropy Subgroups S m × S n of S N

The decomposition $\ker d^2 \mathcal{L}(q^*) = K_p \times K_*$ shows that $\operatorname{Fix}(S_m \times S_n) \cap \ker d^2 \mathcal{L}(q^*)$ is two-dimensional with basis vectors
$$\{ ( n y^T, \dots, n y^T, -m y^T, \dots, -m y^T )^T,\ ( n v^T, \dots, n v^T, -m v^T, \dots, -m v^T )^T \}.$$
Restricted to $K_*$, these isotropy subgroups $S_m \times S_n$ of $S_M$ have one-dimensional fixed point spaces. This assures that we can use Theorem 1. We have the following Lemma.
Lemma 1.
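Let $M = m + n$ such that $M > 1$ and $m, n > 0$. Let $\mathcal{U}_m$ be a set of $m$ classes, and let $\mathcal{U}_n$ be a set of $n$ classes such that $\mathcal{U}_m \cap \mathcal{U}_n = \emptyset$ and $\mathcal{U}_m \cup \mathcal{U}_n = \{1, \dots, M\}$. Now define $\hat{u}(m, n) \in \mathbb{R}^{NK}$ such that
$$( \hat{u}(m, n) )_i = \begin{cases} n v & \text{if } i \in \mathcal{U}_m \\ -m v & \text{if } i \in \mathcal{U}_n \\ 0 & \text{otherwise} \end{cases}$$
where $v$ is defined as in Definition 1.2, and let
$$u(m, n) = \begin{pmatrix} \hat{u}(m, n) \\ 0 \end{pmatrix}$$
where $0 \in \mathbb{R}^K$. Then the isotropy subgroup of $u(m, n)$ is $\Sigma(m, n) \subseteq \Gamma_{\mathcal{U}}$ such that $\Sigma(m, n) \cong S_m \times S_n$, where $S_m$ permutes $u_i$ when $i \in \mathcal{U}_m$, and $S_n$ permutes $u_i$ when $i \in \mathcal{U}_n$. The fixed point space of $\Sigma(m, n)$ restricted to $K_* \subset \ker d^2 \mathcal{L}(q^*)$ is one dimensional.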
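The construction in Lemma 1 is straightforward to reproduce numerically. The sketch below builds $\hat{u}(m, n)$ for a hypothetical kernel vector `v`, with zero-based class labels (an implementation convention, not the paper's), and checks three properties: the blocks sum to zero so that $J \hat{u} = 0$, the vector is a combination of the differences $v_i - v_M$ that span $K_*$, and it is unchanged by permutations inside $\mathcal{U}_n$ but not by a permutation that mixes $\mathcal{U}_m$ with $\mathcal{U}_n$.

```python
import numpy as np

def u_hat(v, classes_m, classes_n, N):
    """Stack the K-vectors n*v (classes in U_m), -m*v (classes in U_n), 0 (otherwise)."""
    m, n = len(classes_m), len(classes_n)
    blocks = []
    for i in range(N):
        if i in classes_m:
            blocks.append(n * v)
        elif i in classes_n:
            blocks.append(-m * v)
        else:
            blocks.append(np.zeros_like(v))
    return np.concatenate(blocks)

# Hypothetical setting: K = 3, N = 4 classes, M = 3 symmetric classes, (m, n) = (1, 2).
K, N = 3, 4
v = np.array([1.0, -2.0, 0.5])
U_m, U_n = {0}, {1, 2}
u = u_hat(v, U_m, U_n, N)

J = np.hstack([np.eye(K)] * N)
constraint_direction = J @ u                     # sum of the blocks of u

# Coordinates in the basis {v_1 - v_M, v_2 - v_M} of the v-part of K_* (M = 3).
def embed(block_index):
    e = np.zeros(N * K)
    e[block_index * K:(block_index + 1) * K] = v
    return e

combo = 2.0 * (embed(0) - embed(2)) - 1.0 * (embed(1) - embed(2))

# Permutations act by reordering the K-blocks of u.
blocks = u.reshape(N, K)
swap_within_U_n = blocks[[0, 2, 1, 3]].reshape(-1)   # swaps classes 1 and 2
swap_across = blocks[[1, 0, 2, 3]].reshape(-1)       # swaps classes 0 and 1
```

The mixing permutation changes the vector, which is the sense in which the symmetry has broken from $S_M$ to $S_m \times S_n$ along this direction.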

2.7. Bifurcating Branches

Theorem 3.
Let $(q^*, \lambda^*, \beta^*)$ be an equilibrium of (8) such that $q^*$ is $M$-singular for $1 < M \le N$, and the crossing condition
$$d_\beta d^2 \mathcal{L}[ u, u ] - d^3 \mathcal{L}[ u, u, L\, d_\beta \nabla \mathcal{L} ] \neq 0$$
is satisfied. Then there exist bifurcating solutions, $\begin{pmatrix} q^* \\ \lambda^* \\ \beta^* \end{pmatrix} + \begin{pmatrix} t\, u(m, n) \\ \beta(t) \end{pmatrix}$, where $u(m, n) \in K_*$ is defined in (26), for every pair $(m, n)$ such that $M = m + n$, each with an isotropy group isomorphic to $S_m \times S_n$.
Proof. 
We mimic the proof of the Equivariant Branching Lemma. Let $u := u(m, n) \in \operatorname{Fix}(S_m \times S_n) \cap K_*$ and let $V$ be a matrix with columns composed of the $M - 1$ vectors $\{ V_i \}$. Thus, there exists $x_0 \in \mathbb{R}^{M-1}$ so that $u = V x_0$. Since $r( \operatorname{Fix}(S_m \times S_n) \cap K_* ) \subseteq \operatorname{Fix}(S_m \times S_n) \cap K_*$ (for every $\sigma \in S_m \times S_n$ and $V x \in \operatorname{Fix}(S_m \times S_n)$, $r(V x) = r(\sigma V x)$, which equals $\sigma r(V x)$ by equivariance), it follows that $r(t x_0, \beta) = h(t, \beta) x_0$, where $r$ is the Liapunov–Schmidt reduction (19), and $h$ is a polynomial in $t$.
Since $K_*$ is $S_M$-irreducible, then $\operatorname{Fix}(S_M) \cap K_* = \{0\}$ (otherwise, $\sigma x = x$ for some $x \in K_*$ for every $\sigma \in S_M$, which implies that $\operatorname{span}(x)$ is an invariant subspace of $K_*$). Now [22] p. 75 shows that $r(0, \beta) = 0$, and so $h(0, \beta) = 0$, from which it follows that $h(t, \beta) = t\, k(t, \beta)$. Thus,
$$r(t x_0, \beta) = t\, k(t, \beta)\, x_0.$$
Differentiating with respect to $t$ yields
$$d_x r(t x_0, \beta)\, x_0 = ( k(t, \beta) + t\, d_t k(t, \beta) )\, x_0,$$
from which it follows that
$$k(t, \beta)\, x_0 = d_x r(t x_0, \beta)\, x_0 - t\, d_t k(t, \beta)\, x_0,$$
and so $k(0, 0) = 0$. Furthermore, we see that $d_\beta k(0, 0)\, x_0 = d^2_{x, \beta} r(0, 0)\, x_0 \neq 0$ by assumption (see (23)). This shows that $d_\beta k(0, 0)$ is a non-zero eigenvalue of $d_x r(t x_0, \beta)$ with associated eigenvector $x_0$. By the Implicit Function Theorem, $k(t, \beta) = 0$ has a unique non-zero solution $\beta = \beta(t)$. □

2.8. The Crossing Condition for Annealing Problems

We next determine how to check the crossing condition in Theorem 3 when $F$ is an annealing problem, as in (2):
$$F(q, \beta) = H(q) + \beta D(q).$$
First, we show that the crossing condition can be checked in terms of the Hessian of the function $D$. Furthermore, when $G$ is strictly concave on $\mathrm{span}(\{v_i\})$, then the crossing condition is always satisfied, and every singularity is a bifurcation.
Theorem 4.
The crossing condition
$$d_\beta d^2 L[u, u] - d^3 L[u, u, L^{-} d_\beta \nabla L] \ne 0$$
given in Theorem 3 is satisfied for $M$-singular $q$ for $M > 1$ if $d^2 D(q)$ is either positive or negative definite on $\mathrm{span}(\{v_i\})$.
Proof. 
Let $x_0 \in \mathbb{R}^{2M-2}$ so that $u = W x_0 \in \mathrm{Fix}(S_m \times S_n) \cap K^*$. Multiplying Equation (21) on the left by $x_0^T$ and on the right by $x_0$ yields
$$x_0^T\, d_x r(0, \beta)\, x_0 = u^T\, d^2_{q,\lambda} L(q^*, \lambda^*, \beta + \beta^*) \left( I_{NK+K} + d_w U(0, \beta) \right) u. \tag{29}$$
By Theorem 2, an arbitrary $u \in K^*$ can be written as $u = \begin{pmatrix} \hat{u} \\ 0 \end{pmatrix}$, where $\hat{u} \in \mathrm{span}(\{v_i\}) \subset \ker d^2 F(q^*, \beta^*)$. Substituting this into (29) and observing that $d^2 F(q^*, \beta + \beta^*) = d^2 G(q^*) + (\beta + \beta^*)\, d^2 D(q^*) = d^2 F(q^*, \beta^*) + \beta\, d^2 D(q^*)$ yields
$$x_0^T\, d_x r(0, \beta)\, x_0 = \beta \begin{pmatrix} \hat{u}^T d^2 D(q^*) & 0^T \end{pmatrix} \left( I_{NK+K} + d_w U(0, \beta) \right) \begin{pmatrix} \hat{u} \\ 0 \end{pmatrix}.$$
Differentiating with respect to $\beta$, evaluating at $\beta = 0$, and using (20) yields
$$x_0^T\, d^2_{x,\beta} r(0, 0)\, x_0 = \hat{u}^T d^2 D(q^*)\, \hat{u}, \tag{30}$$
which must be non-zero since we assume that $d^2 D(q)$ is either positive or negative definite on $\mathrm{span}(\{v_i\})$. □
From (30), we can get an expression for $\xi$, the eigenvalue of $d^2_{x,\beta} r(0, 0)$ with eigenvector $x_0$. Substituting $d^2_{x,\beta} r(0, 0)\, x_0 = \xi x_0$ and observing that $x_0^T x_0 = x_0^T W^T W x_0 = \hat{u}^T \hat{u}$ yields
$$\xi = \frac{\hat{u}^T d^2 D(q^*)\, \hat{u}}{\|\hat{u}\|^2}. \tag{31}$$
The requirement that $d^2 D(q)$ is either positive or negative definite on $\mathrm{span}(\{v_i\})$ holds when $d^2 G(q^*)$ is either negative or positive definite, respectively, on $\mathrm{span}(\{v_i\})$.
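Expression (31) is a Rayleigh quotient, so checking the crossing condition for an annealing problem reduces to one quadratic form. The sketch below assumes that a Hessian of $D$ and a direction $\hat{u}$ are already available as arrays; the matrix `hess_D` is a made-up symmetric stand-in, not a Hessian from the paper, and `crossing_eigenvalue` is a hypothetical helper name.

```python
import numpy as np

def crossing_eigenvalue(hess_D, u_hat):
    """Rayleigh quotient xi = u^T (d^2 D) u / ||u||^2, following (31)."""
    u_hat = np.asarray(u_hat, dtype=float)
    return float(u_hat @ hess_D @ u_hat) / float(u_hat @ u_hat)

# Hypothetical symmetric stand-in for d^2 D(q*); a real one comes from the problem at hand.
hess_D = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 3.0]])
u = np.array([1.0, -1.0, 0.0])   # illustrative direction, playing the role of u-hat

xi = crossing_eigenvalue(hess_D, u)

# Theorem 4: if d^2 D is definite on the relevant subspace, xi is non-zero
# and the crossing condition holds.
crossing_ok = not np.isclose(xi, 0.0)
```

Because only the sign and non-vanishing of $\xi$ matter for the crossing condition, the normalization by $\|\hat{u}\|^2$ does not affect the test, although it does enter the later formulas for the derivatives of $\beta(t)$.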
Lemma 2.
Let $d^2 F(q^*, \beta^* \ne 0)$ be singular where $q^*$ is $M$-singular such that $d^2 G(q^*)$ is negative (or positive) definite on $\mathrm{span}(\{v_i\})$. Then $d^2 D(q^*)$ is positive (or negative) definite on $\mathrm{span}(\{v_i\})$.
Proof. 
If $u \in \mathrm{span}(\{v_i\}) \subset \ker d^2 F(q^*)$, then $u^T d^2 G(q^*)\, u + \beta^* u^T d^2 D(q^*)\, u = 0$. Since $u^T d^2 G(q^*)\, u < 0$, then $u^T d^2 D(q^*)\, u > 0$. □
These results are important for the Information Bottleneck problem (2), where $d^2 G(q) = -d^2 I(Y; Z)$ is only non-positive definite on $\ker d^2 F(q^*)$, but is negative definite on $\mathrm{span}(\{v_i\})$. Thus, every singularity of the Information Bottleneck with $\ker d^2 L(q^*) = K^* \times K_p$ is a bifurcation point. The space $K_p$ does not contain bifurcating branches since the crossing condition is never satisfied there: for $u \in K_p$, $\hat{u}^T d^2 G(q)\, \hat{u} + \beta\, \hat{u}^T d^2 D(q)\, \hat{u} = 0 + 0$ (by Lemma 42 in [25]), and so (Theorem 109, [25]) $\xi = \frac{\hat{u}^T d^2 D(q)\, \hat{u}}{\|\hat{u}\|^2} = 0$.

2.9. Bifurcation Type

Suppose that a bifurcation occurs at $(q^*, \lambda^*, \beta^*)$, where $q^*$ is $M$-singular. This section examines the type of bifurcation from which emanate the branches
$$\left( \begin{pmatrix} q^* \\ \lambda^* \end{pmatrix} + t u,\ \beta^* + \beta(t) \right),$$
whose existence is guaranteed by Theorem 3.
As we showed in [21], the derivative $\beta'(0) \ne 0$ indicates a transcritical bifurcation. If $\beta'(0) = 0$, then the bifurcation is degenerate, and if $\beta''(0) \ne 0$, then we have a pitchfork-like bifurcation. Further, $t \beta'(t) < 0$ for small $t$ indicates a subcritical bifurcating branch, and $t \beta'(t) > 0$ for small $t$ indicates a supercritical bifurcating branch.
Expressions for $\beta'(0)$ and $\beta''(0)$ are derived as follows. Differentiating $k(t, \beta) = 0$ from (27) yields
$$d_t k(t, \beta(t)) + d_\beta k(t, \beta(t))\, \beta'(t) = 0, \tag{32}$$
so that $\beta'(t) = -\frac{d_t k(t, \beta(t))}{d_\beta k(t, \beta(t))}$. Differentiating (28) with respect to $t$ and then evaluating at $t = 0$ shows that
$$\beta'(0) = -\frac{d_x^2 r(0, 0)[x_0, x_0, x_0]}{2 \|x_0\|^2\, \xi} \tag{33}$$
where $d_x^2 r(0, 0)[x_0, x_0, x_0] = \sum_{i,j,k} \frac{\partial^2 r_i}{\partial [x]_j\, \partial [x]_k}(0, 0)\, [x_0]_i [x_0]_j [x_0]_k$ (see (24)). As shown in the proof to Theorem 3, $\xi = d_\beta k(0, 0)$ is the non-zero eigenvalue of $d^2_{x,\beta} r(0, 0)$ with eigenvector $x_0$.
This expression is similar to the one given in [22] p. 90. The numerator can be calculated via (24). In [21], we showed that $\beta'(0) = 0$. We have the same result in the present case.
Theorem 5.
If $q^*$ is $M$-singular for $1 < M \le N$, then all of the bifurcating branches guaranteed by Theorem 3 are degenerate, i.e., $\beta'(0) = 0$.
Proof. 
To show that $d_x^2 r(0, 0) = 0$, so that the numerator of (33) vanishes, expand $r_i$, the $i$th component of $r$, about $x = 0$,
$$r_i(x, \beta) = r_i(0, \beta) + d_x r_i(0, \beta)^T x + x^T d_x^2 r_i(0, \beta)\, x + O(x^3) = d_x r_i(0, \beta)^T x + x^T d_x^2 r_i(0, \beta)\, x + O(x^3),$$
and so
$$r_i(x, 0) = x^T d_x^2 r_i(0, 0)\, x + O(x^3).$$
Applying the equivariance relation $A\, r(x, 0) = r(A x, 0)$, where $A$ is any element of the group isomorphic to $S_M$ that acts on $r$ in $\mathbb{R}^{M-1}$, and equating the quadratic terms yields
$$A \begin{pmatrix} x^T d_x^2 r_1\, x \\ x^T d_x^2 r_2\, x \\ \vdots \\ x^T d_x^2 r_{M-1}\, x \end{pmatrix} = \begin{pmatrix} x^T A^T d_x^2 r_1\, A x \\ x^T A^T d_x^2 r_2\, A x \\ \vdots \\ x^T A^T d_x^2 r_{M-1}\, A x \end{pmatrix}.$$
By (24), the diagonal terms $\frac{\partial^2 r_i}{\partial x_i\, \partial x_i}(0, 0) = 0$ for each $i$, and the same holds for all of the “multi-diagonals”. This shows that $\frac{\partial^2 r_i}{\partial x_j\, \partial x_k}(0, 0) = 0$ for every $i, j, k$ (see Theorem 124 in [25]). □
When $\beta'(0) = 0$, we need to compute $\beta''(0)$ to determine whether a branch is subcritical or supercritical. Differentiating (32) and setting $t = 0$ shows that $\beta''(0) = -\frac{d_t^2 k(0, 0)}{d_\beta k(0, 0)}$. Differentiating (28) twice and solving for $d_t^2 k(0, 0)$ shows that
$$\beta''(0) = -\frac{d_x^3 r(0, 0)[x_0, x_0, x_0, x_0]}{3 \|x_0\|^2\, \xi}$$
where $W x_0 = u = u_{(m,n)}$. The numerator can be calculated using Equation (25), and $\xi = d_\beta k(0, 0)$ is the non-zero eigenvalue of $d^2_{x,\beta} r(0, 0)$ with eigenvector $x_0$, for which we give an explicit expression in (31) when $F$ is an annealing problem.
If $\beta''(0) \ne 0$, which we expect to be true generically, then Theorem 5 shows that the bifurcation guaranteed by Theorem 3 is pitchfork-like.
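The two formulas for $\beta'(0)$ and $\beta''(0)$ are tensor contractions, which makes them easy to evaluate once the derivatives of the reduction $r$ are known. The following sketch assumes those derivative tensors are supplied as arrays; the function names are hypothetical, and the test case is the scalar normal form $r(x, \beta) = \beta x - x^3$, chosen for illustration rather than taken from the Information Bottleneck problem. The classification uses the fact that when $\beta'(0) = 0$, $t\beta'(t) \approx \beta''(0)\, t^2$ for small $t$, so the sign of $\beta''(0)$ decides between subcritical and supercritical.

```python
import numpy as np

def branch_derivatives(d2r, d3r, x0, xi):
    """Return (beta'(0), beta''(0)) from the tensors d_x^2 r(0,0) and d_x^3 r(0,0)."""
    norm2 = float(x0 @ x0)
    num1 = np.einsum("ijk,i,j,k->", d2r, x0, x0, x0)
    num2 = np.einsum("ijkl,i,j,k,l->", d3r, x0, x0, x0, x0)
    return float(-num1 / (2.0 * norm2 * xi)), float(-num2 / (3.0 * norm2 * xi))

def classify(beta1, beta2, tol=1e-12):
    """Classify the branch from the first two derivatives of beta(t) at t = 0."""
    if abs(beta1) > tol:
        return "transcritical"
    if abs(beta2) <= tol:
        return "degenerate: higher-order terms needed"
    return "subcritical pitchfork" if beta2 < 0 else "supercritical pitchfork"

# Illustrative normal form r(x, beta) = beta*x - x**3 in one dimension:
# d_x^2 r(0,0) = 0, d_x^3 r(0,0) = -6, and xi = d^2_{x,beta} r(0,0) = 1.
x0 = np.array([1.0])
d2r = np.zeros((1, 1, 1))
d3r = np.full((1, 1, 1, 1), -6.0)
b1, b2 = branch_derivatives(d2r, d3r, x0, xi=1.0)
kind = classify(b1, b2)

# Flipping the cubic sign, r(x, beta) = beta*x + x**3, gives the subcritical case.
b1_sub, b2_sub = branch_derivatives(d2r, -d3r, x0, xi=1.0)
kind_sub = classify(b1_sub, b2_sub)
```

For the first normal form the nontrivial branch is $\beta = x^2$, so $\beta''(0) = 2$, which agrees with the value the contraction returns; this is only a sanity check of the formula, not a statement about any particular quantizer.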

2.10. Stability and Optimality

The next Theorem relates the stability of equilibria $(q^*, \lambda^*, \beta)$ in the flow (8) with optimality of $q^*$ in Problem (1). In particular, if a bifurcating branch corresponds to an eigenvalue of $d^2 L(q^*)$ changing from negative to positive, then the branch consists of stationary points $(q^*, \beta^*)$ that are not solutions of (1). Positive eigenvalues of $d^2 L(q^*)$ do not necessarily show that $q^*$ is not a solution of (1) (see Remark 1). For example, see page 668 of [21]. A proof of this theorem is given in [21].
Theorem 6.
For each bifurcating branch guaranteed by Theorem 3, $u$ is an eigenvector of $d^2 L\left( \begin{pmatrix} q^* \\ \lambda^* \end{pmatrix} + t u,\ \beta^* + \beta(t) \right)$ for sufficiently small $t$. Furthermore, if the corresponding eigenvalue is positive, then the branch consists of unstable stationary points that are not solutions to (1).

2.11. Structure of the Symmetry Projection

The matrix $P_R(q^*)$ that projects $(q, \lambda) \in \mathbb{R}^{NK+K}$ onto $\mathrm{range}(d^2 L(q^*)) \times K^*$ by annihilating $K_p$ is important for numerical computations for equilibria of IB, since we may want to take each equilibrium found by Newton’s method and remove any component in $K_p$. $P_R$ is written as a function of $q$ since its constitutive vectors $y$ (from Definition 1) depend on $q$. The following theorems clarify the structure of this projection.
Theorem 7.
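$P_R(q) = I - P_{K_p}(q)$, where $P_{K_p} = \begin{pmatrix} A & 0 \\ 0 & 0 \end{pmatrix}$. $P_R$ and $P_{K_p}$ are $(NK + K) \times (NK + K)$. The matrix $A$ is $NK \times NK$ with $N^2$ blocks, $\{A_{ij}\}_{i,j=1}^{N}$, of size $K \times K$, defined by
$$A_{i,j} = \begin{cases} (M - 1)\, y y^T & \text{if } 1 \le i = j \le M \\ -y y^T & \text{if } 1 \le i \ne j \le M \\ 0 & \text{otherwise} \end{cases}$$
For example, if $M = N = 3$, then
$$P_R = I - \begin{pmatrix} 2 y y^T & -y y^T & -y y^T & 0 \\ -y y^T & 2 y y^T & -y y^T & 0 \\ -y y^T & -y y^T & 2 y y^T & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} = I - \begin{pmatrix} (N-1) & -1 & -1 & 0 \\ -1 & (N-1) & -1 & 0 \\ -1 & -1 & (N-1) & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \otimes y y^T.$$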
Proof. 
Theorem 2 gives the basis of $K_p$ as $\{Y_i\}_{i=1}^{M-1}$. Let $Y$ be the $(NK + K) \times (M - 1)$ matrix whose columns are the vectors $\{Y_i\}$. For example, if $M = 3$ and $N = 4$, then $Y = \begin{pmatrix} y & 0 \\ 0 & y \\ -y & -y \\ 0 & 0 \\ 0 & 0 \end{pmatrix}$. Thus, the matrix that projects onto $K_p$ is $P_{K_p} = Y (Y^T Y)^{-1} Y^T$, and the projection matrix onto $\mathrm{range}(d^2 L(q^*))$ is $P_R = I - P_{K_p}$. Direct multiplication of $Y (Y^T Y)^{-1} Y^T$, with an appeal to Lemma 34 in [25] to compute the inverse, shows that $P_{K_p} = \frac{1}{N\, y^T y} \begin{pmatrix} A & 0 \\ 0 & 0 \end{pmatrix}$. Dropping the constant yields the result. □
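For the Information Bottleneck, the matrix $P_R$ is easy to calculate, since $y = q^i$ for any $i \in U$. For example, when $q = q_{1/N}$, then $y^T y = \frac{K}{N^2}$ and $y y^T = \frac{1}{N^2} \mathbb{1}$, and so
$$P_{K_p} = \frac{1}{NK} \begin{pmatrix} (N-1) & -1 & -1 & 0 \\ -1 & (N-1) & -1 & 0 \\ -1 & -1 & (N-1) & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \otimes \mathbb{1}$$
where $\mathbb{1}$ is a $K \times K$ matrix of 1s. Thus,
$$P_R = I_{NK+K} - \begin{pmatrix} (N-1) & -1 & -1 & 0 \\ -1 & (N-1) & -1 & 0 \\ -1 & -1 & (N-1) & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \otimes \mathbb{1}.$$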
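The projector from the proof of Theorem 7 can be assembled directly from $P_{K_p} = Y (Y^T Y)^{-1} Y^T$ and compared with the block form quoted for the uniform quantizer. The sketch below is an illustration under stated assumptions: the basis vectors $Y_i$ are taken to carry $y$ in class block $i$ and $-y$ in the last unresolved class block (a sign convention assumed here; only the span matters for the projector), the helper names are hypothetical, and the normalizing constant $1/(NK)$ is kept so that the matrices are true projections.

```python
import numpy as np

def kp_basis(N, K, M, y):
    """Columns Y_i (i = 0..M-2): y in class block i, -y in class block M-1, zeros elsewhere.

    The last K rows (the Lagrange-multiplier block) stay zero. The sign
    convention is an assumption; the projector depends only on the span.
    """
    Y = np.zeros((N + 1, K, M - 1))
    for i in range(M - 1):
        Y[i, :, i] = y
        Y[M - 1, :, i] = -y
    return Y.reshape((N + 1) * K, M - 1)

def projectors(Y):
    """Orthogonal projector onto span(Y) and the complementary projector."""
    P_Kp = Y @ np.linalg.solve(Y.T @ Y, Y.T)
    return P_Kp, np.eye(Y.shape[0]) - P_Kp

N = M = 3
K = 2
y = np.full(K, 1.0 / N)          # uniform quantizer: every entry of y equals 1/N
Y = kp_basis(N, K, M, y)
P_Kp, P_R = projectors(Y)

# Block form for q = q_{1/N}: (1/(N K)) * C kron ones(K, K),
# with (M-1) on the diagonal of C and -1 off the diagonal among the M classes.
C = np.zeros((N + 1, N + 1))
C[:M, :M] = M * np.eye(M) - np.ones((M, M))
closed_form = np.kron(C, np.ones((K, K))) / (N * K)

matches = np.allclose(P_Kp, closed_form)
idempotent = np.allclose(P_R @ P_R, P_R)
annihilates_Kp = np.allclose(P_R @ Y, 0.0)
```

In a Newton-type solver, applying `P_R` to an iterate would strip its component along the columns of `Y`, which is the use described at the start of this section.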
Theorem 8.
The symmetry group $S_M$ commutes with the matrix $P_R$, which projects onto $\mathbb{R}^{NK+K} \setminus K_p$.
Proof. 
Let $P := P_R$ be the matrix that projects onto $\mathrm{range}\, d^2 L(q^*) \times K^* = \mathbb{R}^{NK+K} \setminus K_p$. Since $\mathbb{R}^{NK+K} = \mathrm{range}\, d^2 L(q^*) \times K^* \times K_p$, any $x \in \mathbb{R}^{NK+K}$ can be decomposed in the respective subspaces as $x = r + v + z$, with $r \in \mathrm{range}\, d^2 L(q^*)$, $v \in K^*$ and $z \in K_p$. Let $\sigma$ be an arbitrary permutation matrix in $S_M$. Then $\sigma P x = \sigma P (r + v + z) = \sigma (r + v)$. Since $\mathrm{range}\, d^2 L(q^*)$, $K^*$ and $K_p$ are all $S_M$-invariant, $\sigma (r + v) \in \mathrm{range}\, d^2 L(q^*) \times K^*$ implies that $\sigma (r + v) = P \sigma (r + v)$, and $\sigma z \in K_p$ implies that $P \sigma (r + v) = P \sigma (r + v + z)$. Thus, $\sigma P x = P \sigma x$. □
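Theorem 8 can also be observed numerically. The sketch below assumes the same basis convention for $K_p$ as in the proof of Theorem 7 (with an assumed sign choice), a randomly generated positive vector `y` in place of an actual quantizer block, and a hypothetical helper `class_permutation` that relabels class blocks while leaving the multiplier block alone. It checks that every relabeling of the $M$ unresolved classes commutes with $P_R$, and that a relabeling mixing an unresolved class with a resolved one does not.

```python
import itertools
import numpy as np

def class_permutation(perm, N, K):
    """Matrix relabeling the N class blocks by perm; the multiplier block is fixed."""
    S = np.zeros((N + 1, N + 1))
    for new, old in enumerate(perm):
        S[new, old] = 1.0
    S[N, N] = 1.0
    return np.kron(S, np.eye(K))

N, M, K = 4, 3, 2
rng = np.random.default_rng(0)
y = rng.random(K) + 0.1                  # hypothetical positive vector y

# Basis of K_p: y in block i, -y in block M-1 (assumed sign convention).
B = np.zeros((N + 1, M - 1))
B[:M - 1] = np.eye(M - 1)
B[M - 1] = -1.0
Y = np.kron(B, y[:, None])

P_R = np.eye((N + 1) * K) - Y @ np.linalg.solve(Y.T @ Y, Y.T)

# Every permutation of the M unresolved classes commutes with P_R.
all_commute = all(
    np.allclose(S @ P_R, P_R @ S)
    for p in itertools.permutations(range(M))
    for S in [class_permutation(list(p) + list(range(M, N)), N, K)]
)

# Swapping unresolved class 2 with resolved class 3 is outside S_M and fails to commute.
S_out = class_permutation([0, 1, 3, 2], N, K)
outside_commutes = np.allclose(S_out @ P_R, P_R @ S_out)
```

The failing case is included only to show that the commutation property is specific to the group acting on the unresolved classes.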

2.12. Visualizations of Sample Results

We illustrate these structures numerically. In [7], we introduced the toy “Four-blob” probability distribution $p(x, y)$ shown in Figure 1.
For the Information Distortion problem (3) [7,12,13] and the synthetic dataset composed of a mixture of four Gaussians (Figure 1), we determined the bifurcation structure of solutions to (3) by annealing in $\beta$ and finding the corresponding stationary points to (1). A typical run of the derived gradient dynamical system tends to follow the main bifurcation branch $S_K \to S_{K-1}$ from the fully symmetric uniform quantizer $q_{1/N}$ ($N = 4$ here) to the fully resolved deterministic quantizer (hard clustering) seen at the end in Figure 2. The permutation symmetry is also obvious there: the value of the cost function does not change if the classes along the vertical axis in $T$ are permuted/relabeled. The uniform quantizer $q_{1/N}$ (Item 1 in the figure) plays a special role in the formulation (3), as it is the unique solution to the problem for $\beta = 0$, namely the maximum entropy solution of $\max_q H(T|Y)$. Its loss of stability at the first bifurcation for increasing $\beta$ can hence be determined analytically and the first bifurcation structure characterized completely. Because of the “perpetual kernel” of the cost function in (4), the uniform quantizer is just one of a continuous set of “uninformative” quantizers for the IB problem (4): all $\{q(t|y) : q(t|y) = f(t)\}$, which assign each $y$ to class $t$ with a probability that does not depend on $y$, although the assignment weight can differ between classes. Such a structure does not change the value of the cost function in the IB problem (4) (but does change it for (3), which hence does not have this degeneracy). We address the degeneracy of the IB optimization by projecting onto the subspace that has the correct symmetry (i.e., just the uniform quantizer $q_{1/N}$ in this case), as outlined in Remark 2.
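The symmetry and degeneracy statements above can be reproduced on a small example. The sketch below is not the authors' code or data: `four_blob` is a hypothetical stand-in for the distribution in Figure 1 (grid size and blob width are made up), and the cost is taken to be $H(T|Y) + \beta I(X;T)$, following the description of problem (3) in the text. It checks that relabeling classes leaves the cost unchanged and that any quantizer of the form $q(t|y) = f(t)$ carries no information about $X$.

```python
import numpy as np

def four_blob(size=20, width=1.5):
    """Hypothetical joint p(x, y): four Gaussian bumps along the diagonal of a grid."""
    idx = np.arange(size)
    centers = (np.arange(4) + 0.5) * size / 4
    p = np.zeros((size, size))
    for c in centers:
        p += np.exp(-((idx[:, None] - c) ** 2 + (idx[None, :] - c) ** 2) / (2 * width ** 2))
    return p / p.sum()

def safe_log(a):
    """log(a) with the convention 0 * log(0) = 0 handled by the caller's product."""
    return np.log(np.where(a > 0, a, 1.0))

def cost_terms(q, p_xy, beta):
    """Return (H(T|Y) + beta * I(X;T), H(T|Y), I(X;T)) for q[t, y] = q(t|y)."""
    p_y = p_xy.sum(axis=0)
    H = -np.sum(p_y * np.sum(q * safe_log(q), axis=0))
    p_xt = p_xy @ q.T                       # p(x, t) = sum_y p(x, y) q(t|y)
    p_x = p_xt.sum(axis=1, keepdims=True)
    p_t = p_xt.sum(axis=0, keepdims=True)
    I = np.sum(p_xt * (safe_log(p_xt) - safe_log(p_x * p_t)))
    return float(H + beta * I), float(H), float(I)

N, beta = 4, 1.2
p_xy = four_blob()
K = p_xy.shape[1]
rng = np.random.default_rng(1)

q_uniform = np.full((N, K), 1.0 / N)
q_random = rng.dirichlet(np.ones(N), size=K).T      # columns sum to 1
f = np.array([0.1, 0.2, 0.3, 0.4])
q_flat = np.repeat(f[:, None], K, axis=1)           # q(t|y) = f(t), non-uniform

F_rand, _, I_rand = cost_terms(q_random, p_xy, beta)
F_perm, _, _ = cost_terms(q_random[[2, 0, 3, 1]], p_xy, beta)   # relabel the classes
_, H_unif, I_unif = cost_terms(q_uniform, p_xy, beta)
_, H_flat, I_flat = cost_terms(q_flat, p_xy, beta)
```

In this toy setting both `q_uniform` and `q_flat` give zero mutual information with $X$, while only the uniform quantizer attains the maximal conditional entropy $\log N$; this mirrors the remark that the family $q(t|y) = f(t)$ is degenerate for (4) but not for (3).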
A more-thorough structure of the bifurcation diagram, using the analysis presented above, is shown in Figure 3.
Similar to the results we presented in [28], the close-up of the bifurcation at $\beta \approx 1.038706$ in Figure 3B shows a subcritical bifurcating branch (a first-order phase transition) that consists of stationary points of Problem (1). By projecting the Hessian $\Delta_q (G(q^*) + \beta D(q^*))$ onto each of the kernels referenced in Theorem 6, we determined that the points on this subcritical branch are not solutions of (1), and yet they are solutions of (2).
Furthermore, observe that Figure 3B indicates that a saddle-node bifurcation occurs at $\beta \approx 1.037479$. That this is indeed the case was proved in [21]. In fact, for any problem of the form (2), these are the only two types of bifurcations to be expected: pitchfork and saddle-node.

3. Conclusions and Discussion

The main goal of this contribution was to show that information-based distortion-annealing problems such as (2) have an interesting mathematical structure. The most interesting aspects of that mathematical structure are driven by the symmetries present in the cost functions: their invariance to actions of the permutation group $S_N$, represented as relabeling of the reproduction classes. Such a structure would hold for any biclustering problem [4] that relies on the intrinsic interaction of a pair of variables for unsupervised clustering. The second mathematical structure that we used successfully was bifurcation theory, which allowed us to identify and study the discrete points at which the character of the cost function changed. The combination of those two tools in [20] allowed us to explicitly compute the value of the annealing parameter $\beta$ at which the initial maximum at the uniform quantizer $q_{1/N}$ of (1) loses stability. We concluded that for a fixed system C Y characterized by $p(X, Y)$, this value is the same for both problems, that it does not depend on the number of elements of the reproduction variable $T$, and that it is always greater than 1. We further introduced an eigenvalue problem that links the critical values of $\beta$ and $q$ for bifurcations, or phase transitions, branching off arbitrary intermediate solutions.
Even though the cost functions $F_{IB}$ (4) and $F_H$ (3) have similar properties, they also differ in some important aspects. We have shown that the function $F_{IB}$ is degenerate since its constitutive functions $I(X; Y)$ and $I(X; T)$ are not strictly convex in $q$. That introduces additional invariances and singularities that are always preserved, which makes phase transitions more difficult to detect (e.g., the “uninformative quantizers” $q(t|y) = f(t)$ only) and post-transition directions more difficult to determine. In contrast, $F_H$ is strictly convex except at points of phase transitions. The theory we developed here allows us to identify bifurcation directions and determine their stability. Despite the presence of a high-dimensional null space at bifurcations, the symmetries restrict the allowed transition dimensions to multiple co-dimension 1 transitions, all related by group transformations. We achieved that here with three main results. Theorem 3 extended the Equivariant Branching Lemma (Theorem 1) to the Information Bottleneck case with additional translation invariance. Theorem 4 identified specific conditions at which a bifurcation of the gradient flow (8) occurs. This condition is computable analytically for the initial bifurcation off the uniform quantizer $q_{1/N}$ and with numeric continuation for subsequent bifurcations. Finally, in Section 2.9, we provided checks for the types of bifurcations that occur, giving conditions to detect saddle-node and pitchfork bifurcations and to determine whether pitchforks are supercritical (second-order phase transitions) or subcritical (leading to first-order phase transitions discontinuous in $\beta$). The combination of the three results, together with our previous results in [20], completely characterizes the local bifurcation structure of Information Bottleneck-type problems with or without the added translation symmetry.
Despite the further development of the bifurcation formalism for IB presented here, there are still open questions that this manuscript did not resolve. In particular, we still cannot confirm or reject the conjecture that the set of $S_K$-symmetric soft-clustering branches connected through symmetry-breaking bifurcations leads to the global hard-clustering optima as $\beta \to \infty$ (multiple equivalent solutions connected by the permutation symmetry of the problem). We believe this is partially due to a discrepancy between practical observations and theoretical results. In particular, we and other practitioners [29,30] note that the only observed symmetry-breaking bifurcations during optimization are of the kind $S_M \to S_{M-1}$, while the theory allows for arbitrary $S_M \to S_m \times S_n$ bifurcations. The latter are known to happen and be stable in other biological systems and circumstances [26,31]. This suggests a research approach of comparing and contrasting the different systems that possess the same $S_N$ symmetry and symmetry-breaking bifurcations, which may lead to breakthroughs in this application to optimization in the Information Bottleneck problem.
An additional open problem involves the use of continuous variables, already noted in [5] and explored further in [32,33]. This approach, while important for many real-world problems, involves the application of additional mathematical tools, namely Calculus of Variations [34], which further increases the complexity of an already complex problem. These difficulties are illustrated in a pair of papers [35,36] that use the continuous formulation. They do present some significant results on conditions of learnability, but both papers obtain only bounds on $\beta$ under which learnability (optimal solutions beyond the “uninformative” quantizer) can be achieved. This is possibly due to the presence of continuous spectra in covariance operators of continuous quantizers, something that we avoid by focusing on finite spaces. As a consequence, here and in prior work [20], we show specific values for $\beta$ for the initial bifurcation from the uniform quantizer, which supports nontrivial clustering. We consider the formulation with continuous variables to be beyond the scope of this manuscript, but look forward to the development of additional techniques to incorporate this important case in the bifurcation framework presented here. Regardless of such developments, any practical problem with numeric optimization will involve discretization of the continuous variables, which effectively converts a continuous problem to the discrete case discussed here.

Author Contributions

Conceptualization, A.G.D.; Formal analysis, A.G.D. and A.E.P.; Investigation, A.G.D. and A.E.P.; Writing–original draft, A.G.D. and A.E.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gray, R.M. Entropy and Information Theory; Springer: Berlin/Heidelberg, Germany, 1990.
  2. Cover, T.; Thomas, J. Elements of Information Theory; Wiley Series in Communication; Wiley: New York, NY, USA, 1991.
  3. Rose, K. Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems. Proc. IEEE 1998, 86, 2210–2239.
  4. Madeira, S.C.; Oliveira, A.L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004, 1, 24–45.
  5. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing; University of Illinois: Champaign, IL, USA, 1999.
  6. Dimitrov, A.G.; Miller, J.P.; Aldworth, Z.; Gedeon, T.; Parker, A.E. Analysis of neural coding through quantization with an information-based distortion measure. Netw. Comput. Neural Syst. 2003, 14, 151–176.
  7. Dimitrov, A.G.; Miller, J.P. Neural coding and decoding: Communication channels and quantization. Netw. Comput. Neural Syst. 2001, 12, 441–472.
  8. Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Kluwer Academic Publishers: New York, NY, USA, 1992.
  9. Mumey, B.; Gedeon, T. Optimal mutual information quantization is NP-complete. In Proceedings of the Neural Information Coding (NIC) Workshop, Snowbird, UT, USA, 1–4 March 2003.
  10. Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Advances in Neural Information Processing Systems; Solla, S.A., Leen, T.K., Müller, K.R., Eds.; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 617–623.
  11. Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University, Jerusalem, Israel, 2002.
  12. Dimitrov, A.G.; Miller, J.P. Analyzing sensory systems with the information distortion function. In Proceedings of the Pacific Symposium on Biocomputing 2001; Altman, R.B., Ed.; World Scientific Publishing Co.: Singapore, 2000.
  13. Gedeon, T.; Parker, A.E.; Dimitrov, A.G. Information Distortion and Neural Coding. Can. Appl. Math. Q. 2003, 10, 33–70.
  14. Slonim, N.; Somerville, R.; Tishby, N.; Lahav, O. Objective classification of galaxy spectra using the information bottleneck method. Mon. Not. R. Astron. Soc. 2001, 323, 270–284.
  15. Bardera, A.; Rigau, J.; Boada, I.; Feixas, M.; Sbert, M. Image segmentation using information bottleneck method. IEEE Trans. Image Process. 2009, 18, 1601–1612.
  16. Aldworth, Z.N.; Dimitrov, A.G.; Cummins, G.I.; Gedeon, T.; Miller, J.P. Temporal encoding in a nervous system. PLoS Comput. Biol. 2011, 7, e1002041.
  17. Buddha, S.K.; So, K.; Carmena, J.M.; Gastpar, M.C. Function identification in neuron populations via information bottleneck. Entropy 2013, 15, 1587–1608.
  18. Lewandowsky, J.; Bauch, G. Information-optimum LDPC decoders based on the information bottleneck method. IEEE Access 2018, 6, 4054–4071.
  19. Parker, A.E.; Dimitrov, A.G.; Gedeon, T. Symmetry breaking in soft clustering decoding of neural codes. IEEE Trans. Inf. Theory 2010, 56, 901–927.
  20. Gedeon, T.; Parker, A.E.; Dimitrov, A.G. The mathematical structure of information bottleneck methods. Entropy 2012, 14, 456–479.
  21. Parker, A.E.; Gedeon, T. Bifurcations of a class of SN-invariant constrained optimization problems. J. Dyn. Differ. Equ. 2004, 16, 629–678.
  22. Golubitsky, M.; Stewart, I.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory II; Springer: New York, NY, USA, 1988.
  23. Golubitsky, M.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory I; Springer: New York, NY, USA, 1985.
  24. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2000.
  25. Parker, A.E. Symmetry Breaking Bifurcations of the Information Distortion. Ph.D. Thesis, Montana State University, Bozeman, MT, USA, 2003.
  26. Golubitsky, M.; Stewart, I. The Symmetry Perspective: From Equilibrium to Chaos in Phase Space and Physical Space; Birkhauser Verlag: Boston, MA, USA, 2002.
  27. Schott, J.R. Matrix Analysis for Statistics; John Wiley and Sons: New York, NY, USA, 1997.
  28. Parker, A.; Gedeon, T.; Dimitrov, A. Annealing and the rate distortion problem. In Advances in Neural Information Processing Systems 15; Becker, S.T., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 15, pp. 969–976.
  29. Dimitrov, A.G.; Cummins, G.I.; Baker, A.; Aldworth, Z.N. Characterizing the fine structure of a neural sensory code through information distortion. J. Comput. Neurosci. 2011, 30, 163–179.
  30. Schneidman, E.; Slonim, N.; Tishby, N.; de Ruyter van Steveninck, R.R.; Bialek, W. Analyzing neural codes using the information bottleneck method. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 15.
  31. Stewart, I. Self-Organization in evolution: A mathematical perspective. Philos. Trans. R. Soc. 2003, 361, 1101–1123.
  32. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. In Proceedings of the Advances in Neural Information Processing Systems 16 (NIPS 2003), Vancouver, BC, Canada, 8–13 December 2003.
  33. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188.
  34. Gelfand, I.M.; Fomin, S.V. Calculus of Variations; Dover Publications: Mineola, NY, USA, 2000.
  35. Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the information bottleneck. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Virtual, 3–6 August 2020; pp. 1050–1060.
  36. Ngampruetikorn, V.; Schwab, D.J. Perturbation theory for the information bottleneck. Adv. Neural Inf. Process. Syst. 2021, 34, 21008–21018.
Figure 1. The probability distribution $p(x, y)$ for the “Four-blob” toy problem for a system of interest X Y. We use this probability to illustrate some results of the bifurcation analysis reported here.
Figure 2. The bifurcations of the solutions $(q^*, \beta)$ to the Information Distortion problem (3). For the mixture of 4 well-separated Gaussians shown in Figure 1, the behavior of $D(q) = I(X; T)$ as a function of $\beta$ is shown in the top panel, and some of the solutions $q^*(T|Y)$ are shown in the bottom panels. Item 1 shows the uniform quantizer $q_{1/N}$, assigning equal probability of each $y \in Y$ to belong to one of the four clusters in $T$. Subsequent items 2–5 point to a set of partially resolved quantizations, in which subsets of $Y$ are assigned with high probability to one (2) or more (3–5) classes (dark colors, close to 1), while other subsets are still unresolved (gray levels), albeit at a higher probability than $q_{1/N}$ (darker gray, as some of the classes are excluded after being resolved for another subset). Item 6 shows an almost fully resolved quantizer at sufficiently high $\beta$. They become fully resolved (deterministic; $q(t|y) = 1$ or $0$) as $\beta \to \infty$ (not shown).
Figure 3. (A) The bifurcation structure of stationary points of the Information Distortion problem (3), a problem of form (2). We found these points by annealing in $\beta$ and finding stationary points for Problem (1) using the algorithm presented in [28]. A square indicates where a bifurcation occurs. (B) A close-up of the subcritical bifurcation at $\beta \approx 1.038706$, indicated by a square. Observe the subcritical bifurcating branch, and the subsequent saddle-node bifurcation at $\beta \approx 1.037479$, indicated by another square. We applied Theorem 6 to show that the subcritical bifurcating branch is composed of quantizers that are solutions of (3) but not of (1).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Parker, A.E.; Dimitrov, A.G. Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems. Entropy 2022, 24, 1231. https://doi.org/10.3390/e24091231


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.