Abstract
In this paper, we investigate the bifurcations of solutions to a class of degenerate constrained optimization problems. This study was motivated by the Information Bottleneck and Information Distortion problems, which have been used to successfully cluster data in many different applications. In the problems we discuss in this paper, the distortion function is not a linear function of the quantizer. This leads to a challenging annealing optimization problem, which we recast as a fixed-point dynamics problem of a gradient flow of a related dynamical system. The gradient system possesses a permutation symmetry due to its invariance under relabeling of the representative classes. Its flow hence passes through a series of bifurcations with specific symmetry breaks. Here, we show that the dynamical system related to the Information Bottleneck problem has an additional spurious symmetry that requires more-challenging analysis of the symmetry-breaking bifurcation. For the Information Bottleneck, we determine that when bifurcations occur, they are only of pitchfork type, and we give conditions that determine the stability of the bifurcating branches. We relate the existence of subcritical bifurcations to the existence of first-order phase transitions in the corresponding distortion function as a function of the annealing parameter, and provide criteria with which to detect such transitions.
1. Introduction
This paper analyzes bifurcations of solutions to constrained optimization problems of the form
as a function of a scalar parameter and a quantizer or classifier with . The real-valued function f is sufficiently smooth, and is the constraint space of valid quantizers, a convex set of discrete probabilities (simplices).
This type of problem arises in Rate Distortion Theory [,], Deterministic Annealing [] and biclustering []. The specific motivations for the abstract problem formulation given in (1) are the Information Bottleneck [] and Information Distortion [] functions
These were proposed in [,] to analyze the Markov chain in which , characterized by a probability , is the original system of interest, characterized by its mutual information , and T is a simplification (quantized version of) Y. Here we work mainly with discrete versions of Y and T, with cardinalities and . Typically . is the mutual information between the K objects in Y and the N clusters in T. The goal is to cluster K objects in Y into N clusters in T given inputs X such that the function F is maximized in ; the probability that the jth element of Y is classified as being a member of the cluster with label . We call such a set of conditional probabilities a stochastic quantizer, or just a quantizer, to relate to the vector quantization literature []. The annealing parameter .
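As a concrete illustration of the objects described above, the following Python sketch computes the mutual informations I(X;T) and I(Y;T) induced by a stochastic quantizer q(t|y) along the Markov chain X, Y, T. The toy joint distribution `p_xy`, the helper names, and the use of natural logarithms are assumptions made for illustration only; they are not taken from the original work.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                total += p * math.log(p / (pa[a] * pb[b]))
    return total

def quantized_joints(p_xy, q):
    """Return p(x,t) and p(y,t) induced by the Markov chain X -> Y -> T.

    p_xy[x][y] is the joint of X and Y; q[t][y] = q(t | y) is the quantizer.
    """
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(q)
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    p_xt = [[sum(p_xy[x][y] * q[t][y] for y in range(n_y)) for t in range(n_t)]
            for x in range(n_x)]
    p_yt = [[p_y[y] * q[t][y] for t in range(n_t)] for y in range(n_y)]
    return p_xt, p_yt

# Hypothetical toy data: K = 4 objects in Y, two values of X.
p_xy = [[0.20, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.20]]

# Hard clustering of the four objects into N = 2 classes.
q_hard = [[1.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 1.0]]
# Uniform quantizer: every object is assigned to each class with probability 1/N.
q_uniform = [[0.5] * 4, [0.5] * 4]

for name, q in (("hard", q_hard), ("uniform", q_uniform)):
    p_xt, p_yt = quantized_joints(p_xy, q)
    print(name, "I(X;T) =", mutual_information(p_xt),
          "I(Y;T) =", mutual_information(p_yt))
```

In this toy setting the hard quantizer retains information about X, whereas the uniform quantizer carries none, which is the trade-off that the annealing parameter controls.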
It has been shown that finding hard-clustering solutions to (2) is NP-complete (combinatorial search) when is the mutual information [], as in the Information Bottleneck [,,] and the Information Distortion [,,] methods. Information Bottleneck (IB) approaches are gaining traction in multiple scientific and engineering domains [,,,,]. As they typically involve the nonlinear optimization problem (2), there is a need for optimization methods for such problems that can avoid the rise in complexity implied by the NP-complete hard-clustering solutions []. Originally, Tishby et al. [] approached this problem with an algorithm inspired by the Blahut–Arimoto approach to solving Rate-Distortion types of problems ([], Chapter 10). The “self-consistent” equations in [] optimize both the quantizer and the “relevance” distribution . However, unlike the classic Blahut–Arimoto algorithm, which can guarantee convergence to a unique solution to its iterative scheme because of the convex geometry of the two state spaces, the “self-consistent” equations have no such guarantee due to the more-complicated geometry of three convex sets over which the optimization is performed, as also noted in []. Accordingly, in this work, we use the original optimization problem (2) over a single variable: the quantizer (conditional probability) . It may be possible that a related Blahut–Arimoto style optimization coupled to the bifurcation structure of its gradient flow discussed here can lead to additional insights into this problem, but we consider this beyond the scope of this particular manuscript.
We have investigated the structure of soft-clustering annealing-type methods that reach the hard-clustering solution in the limit of the annealing parameter [,] through a series of bifurcations. A bifurcation in this context is a point that is a solution to (2) such that the number of solutions to (2) changes in a small neighborhood of . Because a bifurcation corresponds to a point at which some of the objects Y have just been classified, in the IB literature, a bifurcation is usually referred to as a phase transition. One of the goals of this and related work is to understand why annealing-type algorithms, such as the original optimization heuristics in [,], work as well as they do. This can help with designing further optimization heuristics and can assess how close those can get to the global solutions to IB problems. We believe that this amalgamation of optimization theory and dynamical systems theory, as stated in [,], can provide a solid foundation with which to address such optimization challenges.
Because of the form (1) of F, it possesses certain symmetries. That is, the value of does not change (is invariant) under arbitrary permutations of the vectors . In other words, F is -invariant. The form (1) further implies that the Hessian is block diagonal with blocks . These conditions are met by the Information Distortion function [],
where is the entropy, and by the cost function used in the original IB method [],
which is the focus of this manuscript. Both the Information Distortion and Information Bottleneck problems have the form given in (1) and (2). Importantly, has a “perpetual kernel” since each block has the eigenpair (0, ) for every q []. In other words, the Hessian is singular for every q and every value of . This makes bifurcation detection challenging because bifurcations can usually be detected by identifying isolated singularities of . This degeneracy is a consequence of the translational symmetry of : if , then for all such that . At bifurcations of solutions to (4), the translational symmetry never breaks.
To better understand bifurcations of solutions to problems of the form (1), which includes the problems (3) and (4), we consider the gradient flow
Equilibria of this flow correspond to critical points of (1), where is the Lagrangian with respect to the constraints imposed by , and is the vector of Lagrange multipliers.
Previous work showed that when is generically non-singular, as occurs for the Information Distortion (3), then there are isolated singularities of that indicate possible bifurcations of solutions to (1). In this case, an -dimensional necessitates an -dimensional , which admits a bifurcation of solutions to (1) where symmetry breaks from to for every such that [].
Here we allow and to be singular for every , as occurs for the Information Bottleneck (4). That is, the perpetual kernel for implies that also has a perpetual kernel , which means that the eigenvalue crossing condition that must occur at a bifurcation (i.e., must have a zero eigenvalue at a bifurcation) [] is never satisfied in . There are a few challenges due to the existence of the perpetual kernel (i.e., degeneracy) of the Information Bottleneck that we address in this paper. First, detecting bifurcations may be problematic because one cannot simply monitor the determinant of either or . Second, the standard theory that assures the existence of bifurcating branches, the Equivariant Branching Lemma, cannot be applied directly. Lastly, the spaces that contain the bifurcating solutions are always at least two-dimensional, which makes tracking the bifurcating solutions problematic.
Here we address two of these three challenges. We show that at a bifurcation, new eigenvalue(s) of and must cross zero, causing to expand so that , where is the span of the eigenvectors with crossing eigenvalues. Instead of detecting bifurcations by the expensive process of monitoring the expansion of (from to ), we give a simple way to check the eigenvalue crossing condition for annealing problems as in (2) []. We prove the existence of the bifurcating branches by adapting the standard proof for the Equivariant Branching Lemma. This newly developed theory guarantees that bifurcating branches exist in , are generically pitchforks, and that symmetry breaks from to . Additionally, we give conditions to check whether the pitchforks are subcritical or supercritical, and how stability of the bifurcating branches relates to optimality in the optimization problem (1).
2. Bifurcation Analysis
2.1. Equivariant Branching Lemma
The Equivariant Branching Lemma relates the subgroup structure of a symmetry group with the existence of symmetry-breaking bifurcating branches of equilibria of . Observe that we present a version that does not require absolute irreducibility. For a proof see [] p. 83.
Theorem 1.
(Equivariant Branching Lemma). Let f be a smooth function that is Γ-equivariant for a compact Lie group Γ and a Banach space V. Let Σ be an isotropy subgroup of with . Suppose that and the crossing condition for . Then there exists a unique smooth solution branch to with isotropy subgroup Σ.
For an arbitrary -equivariant system where bifurcation occurs at , the requirement in Theorem 1 that the bifurcation occurs at the origin is accomplished by a translation. Assuring that the Jacobian vanishes, , can be effected by restricting and projecting the system onto the kernel of the Jacobian. This transform is called the Liapunov–Schmidt reduction (see []).
The Equivariant Branching Lemma does not directly apply to yield bifurcating branches for the problem (1) at q for which is singular for the following reasons:
- and have independent bases, which implies that each is invariant to the action of , and so the decomposition shows that does not act absolutely irreducibly on , but it does act absolutely irreducibly on each of these disjoint subspaces separately. This is why we present a version of the Equivariant Branching Lemma that does not require absolute irreducibility.
- The Liapunov–Schmidt reduction onto is clear, but not onto .
- is two-dimensional with basis where .
We address these issues in the manuscript and show that a small modification of the Equivariant Branching Lemma allows for similar analysis to be successfully applied to Information Bottleneck-style problems such as (2) with minimal modifications to the original algorithm from [].
2.2. A Gradient Flow
We now lay the groundwork necessary to determine the bifurcations of local solutions to (1)
where , which includes as a special case the Information Distortion (3) and Information Bottleneck (4) problems. The convex set of discrete conditional probabilities is
Due to the form of F, it has the following properties:
- is an -invariant, real-valued function of q, where the action of on q permutes the component vectors , , of .
- The Hessian is block diagonal, where the ith block is .
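The two properties above can be checked numerically on a small example. The sketch below assumes that F is a sum of one and the same smooth function f over the class vectors, and uses a hypothetical choice of f (an entropy-like term plus a coupling term inside each class); neither the specific f nor the helper names come from the original work.

```python
import itertools
import math

def f(v):
    """Hypothetical per-class term; any smooth function of one class vector would do."""
    return -sum(x * math.log(x) for x in v) + sum(v) ** 2

def F(q):
    """Separable cost: a sum of the same function f over the class vectors q[nu]."""
    return sum(f(v) for v in q)

# N = 3 classes, K = 2 objects; the entries for each object sum to 1 over the classes.
q = [[0.5, 0.2],
     [0.3, 0.3],
     [0.2, 0.5]]

# Permutation invariance: relabeling the classes leaves F unchanged.
values = [F([q[i] for i in perm]) for perm in itertools.permutations(range(len(q)))]
spread = max(values) - min(values)
print(spread)  # zero up to rounding

def mixed_second_derivative(q, a, b, h=1e-4):
    """Central finite-difference estimate of d^2 F / (dq[a] dq[b]).

    a and b are (class, object) index pairs and must be different entries.
    """
    def shifted(da, db):
        p = [row[:] for row in q]
        p[a[0]][a[1]] += da
        p[b[0]][b[1]] += db
        return F(p)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

# Entries coupling different classes vanish, so the Hessian is block diagonal.
cross_class = mixed_second_derivative(q, (0, 0), (1, 1))
# Entries inside one class block are generally nonzero (here the coupling term gives 2).
within_class = mixed_second_derivative(q, (0, 0), (0, 1))
print(cross_class, within_class)
```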
The Lagrangian of (1) with respect to the equality constraints from is
The scalar is the Lagrange multiplier for the constraint , and is the vector of Lagrange multipliers. The gradient of the Lagrangian in (5) is
where and . The gradient is a vector of K constraints
Let J be the Jacobian of
Observe that J has full row rank. The Hessian of (5) with respect to the vector is
where is . The matrix is the block diagonal Hessian of F with blocks .
The dynamical system whose equilibria are stationary points of (1) is the gradient flow of the Lagrangian
for as defined in (5) and . The equilibria of (8) are points where
The Jacobian of this system is the Hessian from (7).
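To make the notion of an equilibrium concrete, the sketch below finds a stationary point of a Lagrangian for a tiny problem with N = 2 classes and K = 1 object. It applies Newton's method to the gradient of the Lagrangian, using the bordered Hessian as the Jacobian, rather than integrating the flow itself. The per-class function f(x) = -x log x, the starting point, and the helper names are assumptions chosen for illustration.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def grad_lagrangian(z):
    """Gradient of L(q, lam) = f(q1) + f(q2) + lam * (q1 + q2 - 1) with f(x) = -x log x."""
    q1, q2, lam = z
    return [-math.log(q1) - 1.0 + lam, -math.log(q2) - 1.0 + lam, q1 + q2 - 1.0]

def hessian_lagrangian(z):
    """Bordered Hessian: f'' on the diagonal, the constraint Jacobian on the border."""
    q1, q2, _ = z
    return [[-1.0 / q1, 0.0, 1.0],
            [0.0, -1.0 / q2, 1.0],
            [1.0, 1.0, 0.0]]

z = [0.3, 0.7, 0.0]  # initial guess (q1, q2, lambda)
for _ in range(20):
    step = solve(hessian_lagrangian(z), [-g for g in grad_lagrangian(z)])
    z = [zi + si for zi, si in zip(z, step)]

print(z)  # approaches q1 = q2 = 0.5 with lambda = 1 + log(0.5)
```

The lower-right zero block and the border of ones in `hessian_lagrangian` reflect the equality constraint; for larger problems the same structure appears with one border row per object.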
Remark 1.
By the theory of constrained optimization [], the equilibria of (8) where is negative definite on are local solutions of (1). Conversely, if is a local solution of (1), then there exists a vector of Lagrange multipliers so that is an equilibrium of (8) (this necessary requirement is called the Karush–Kuhn–Tucker conditions) such that is non-positive definite on .
2.3. Equilibria with Symmetry
Next, we categorize the equilibria of (8) according to their symmetries, which allows us to determine when to expect symmetry-breaking bifurcations.
Let for some . Then there exists a partition of into the sets and , where , so that if and only if . Clearly, has M identical blocks, .
To ease the notation, and without loss of generality, we set
To distinguish between the blocks of , we write
As mentioned in the introduction, we assume that for each , each block always has at least a one-dimensional kernel with basis vector(s) which depend on q. Thus, . At an equilibrium of of (8) where , we consider the following three cases:
- ;
- ;
- .
We will show that the first case necessitates a symmetry-breaking bifurcation (Theorem 3). In the second case, there is no bifurcation (Corollary 1). Finally, in the third case, we expect a saddle node [], a symmetry-preserving bifurcation.
We are able to distinguish between the three cases above by considering which blocks of have kernels that have more than one dimension. This motivates the following definition.
Definition 1.
An equilibrium of (8) is M-singular (or, equivalently, is M-singular) if:
- so that for every .
- For B, the M block(s) of the Hessian defined in (9), has dimension 2 with basis vectors . is associated with the crossing eigenvalues, and is associated with the constant zero eigenvalue of B.
- The block(s) of the Hessian , defined in (9), each have a one-dimensional kernel with basis vector .
- The vectors , and are linearly independent.
- The matrix is nonsingular. is the Moore–Penrose inverse of . When , we define .
We wish to emphasize that we showed in [] that requirements 2–5 in Definition 1 hold generically.
A straightforward calculation shows that every block of the Hessian of the Information Bottleneck cost function (2) is singular for every , and the basis for is for and for (Lemma 42 in []), which assures that these vectors are linearly independent, as in Definition 1.4. At a bifurcation, the kernels of the identical blocks B expand by as in Definition 1.2. Using the notation above, for each , and for each .
2.4. The Kernel at a Bifurcation
The equilibria of (8) change their stability with , and hence change the solutions to (1). The changes of stability are determined by the kernel of at a bifurcation point . In this section we show that for any with , has a perpetual kernel that is at least dimensional. The zero eigenvalues associated with the eigenvectors in remain constant, so that at a bifurcation point of (8) where is M-singular, new eigenvalues of must cross zero. Thus, the kernel expands, and the bifurcating directions exist in an “expanded” kernel of , .
We determine a basis for at an M-singular when . If q is 1-singular with a trivial isotropy group (i.e., no symmetry), then is non-singular— disappears. First, we ascertain a basis for .
Recall that in the preliminaries, when , we defined to be the jth vector component of . We now define the linearly independent vectors , , and in by
where , and and are defined in Definition 1.2. For example, if and , then and .
Due to the block diagonal form of , it is easy to see that the vectors defined in (11) form a basis for .
Now, let
for and where . From (7), it is easy to see that these three sets of vectors are in . The next theorem shows that are a basis for . This natural partition of the basis vectors shows that can be written as . According to Definition 1, the “perpetual kernel” corresponding to constant zero eigenvalues of is generated by
The part of the kernel that arises at a bifurcation corresponding to eigenvalues crossing zero is
The vectors do not contribute to .
Theorem 2.
If is M-singular for , then from (12) are a basis for .
Proof.
To show that span , let and decompose it as
where is , and is . Hence,
Now, from (6) and the fact that is block diagonal, we have
We set
and using the notation from (9), then (15) implies
It follows that for every . By (14), we have that , and so
By (17), for every can be written as , where , , and and are the basis vectors of from Definition 1.2. Thus,
since is nonsingular. This shows that . Therefore, for every . Now (17) shows that , and so for , which implies that
Hence, , where , from which it follows that
Linear independence (Definition 1.4) implies that . Thus, Therefore, the linearly independent vectors and span . □
Corollary 1.
If is 1-singular and has isotropy group equal to the identity, then is nonsingular.
Proof.
If q is 1-singular, then has a single block B with a two-dimensional kernel. The other blocks are distinct with one-dimensional kernels. By constructing the vectors as in (11), we see that with basis vectors . Now, following the proof of Theorem 2, we take an arbitrary and then decompose as in (13) and (16). The proof of Theorem 2 holds for the present case up to and including (18). Linear independence now shows that , which implies that . □
Remark 2.
The independent bases given for and in Theorem 2 imply that each is invariant to the action of , and so the decomposition shows that does not act absolutely irreducibly on . That is, by definition,
The explicit bases show that , which implies that acts absolutely irreducibly on and []. Thus, and are each -irreducible.
2.5. Liapunov–Schmidt Reduction
To show the existence of bifurcating branches from a bifurcation point of equilibria of (8), the Equivariant Branching Lemma requires that the bifurcation is translated to and that the Jacobian vanishes at bifurcation. To accomplish the former, consider
To assure that the Jacobian vanishes, we restrict and project onto in a neighborhood of . This is the Liapunov–Schmidt reduction of [],
where . The matrix is the projection matrix onto with . W is the matrix whose columns are the basis vectors of from (12) so that is a vector in . The vector function is the component of that is in range such that , , and
The system defined by the Liapunov–Schmidt reduction, , has a bifurcation of equilibria at , which are in correspondence with equilibria of (8). However, the stability of these associated equilibria is not necessarily the same.
It is straightforward to verify the following derivatives ([] p. 32), which we will require in the sequel. The Jacobian of (19) is
which shows that
since .
Our crossing condition at a bifurcation depends on the matrix of derivatives
where the derivatives of are evaluated at , and is the Moore–Penrose-generalized inverse [] of . The vectors are the basis vectors of from Theorem 2.
The three-dimensional array of second derivatives is
In [], we showed that whenever . In the present case, there are more zero entries since now the basis vectors are of two types: for (basis vectors of ); or for (basis vectors of , see (12)). We now consider the case when and . All other cases are dealt with using a similar argument. Substituting in for we have
The vectors and are defined in (2). An immediate consequence of this calculation is that whenever . Thus, similar arguments show that whenever:
- ;
- , , ;
- , , .
Further, we get four different “cubes” of identical entries in the 3-D array. They are:
- For , not all equal, the value of the cube is
- For , not both equal, and , the value of the cube is
- For and , not both equal, the value of the cube is
- For , not all equal, the value of the cube is
The points above will prove useful when proving that .
The four-dimensional array of third derivatives of r is
where the derivatives of are evaluated at , and is the Moore–Penrose-generalized inverse [] of .
Since is not absolutely irreducible, but is, one might try to define a Liapunov–Schmidt reduction by restricting and projecting onto . One issue with projecting the reduction onto is how to define the projection matrix E so that
holds and is non-singular in so that the Implicit Function Theorem assures the restriction , where , and instead of as in (19) []. Simply ignoring the space by considering and amounts to setting and . Since is still embedded in the larger , which contains , then derivatives are affected by the implicit constraint. This constraint is nonlinear (and may not even be tractable) since depends on q, where is a projection matrix that depends on q (see Theorem 7).
2.6. Isotropy Subgroups of
The decomposition shows that is two-dimensional with basis vectors
Restricted to , these isotropy subgroups of have one-dimensional fixed point spaces. This assures that we can use Theorem 1. We have the following Lemma.
Lemma 1.
Let such that and . Let be a set of m classes, and let be a set of n classes such that and . Now define such that
where is defined as in Definition 1.2, and let
where . Then the isotropy subgroup of is such that , where permutes when , and permutes when . The fixed point space of restricted to is one dimensional.
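The group-theoretic content of Lemma 1 can be illustrated by brute force on a small case. The sketch below assumes that a bifurcating direction is described by one scalar coefficient per class among the M identical classes, equal to n on a set of m classes and to -m on the remaining n classes, so that the coefficients sum to zero. This coefficient pattern and the helper names are assumptions for illustration; the sketch only counts the class permutations that leave such a pattern unchanged.

```python
import itertools
import math

def stabilizer(coeffs):
    """All permutations of positions that leave the coefficient tuple unchanged."""
    idx = range(len(coeffs))
    return [p for p in itertools.permutations(idx)
            if all(coeffs[p[i]] == coeffs[i] for i in idx)]

M, m = 4, 1
n = M - m
# Hypothetical coefficients: n on the m classes of one set, -m on the n classes
# of the other set, so that the entries sum to zero.
coeffs = (n,) * m + (-m,) * n

group = stabilizer(coeffs)
print(len(group), math.factorial(m) * math.factorial(n))  # 6 6

# A vector fixed by every element of this stabilizer is constant on each of the
# two sets; the zero-sum condition then leaves a single free parameter.
```

The count m! n! matches the order of a product of two symmetric groups acting separately on the two sets of classes.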
2.7. Bifurcating Branches
Theorem 3.
Proof.
We mimic the proof of the Equivariant Branching Lemma. Let and let V be a matrix with columns composed of the vectors . Thus, there exists so that . Since (for every , () that equals (by equivariance)), then , where r is the Liapunov–Schmidt reduction (19), and h is a polynomial in t.
Since is -irreducible, then (otherwise, for some for every , which implies that is an invariant subspace of ). Now [] p. 75 shows that , and so , from which it follows that . Thus,
Differentiating with respect to t yields
from which it follows that
and so . Furthermore, we see that by assumption (see (23)). This shows that is a non-zero eigenvalue of with associated eigenvector . By the Implicit Function Theorem, has a non-zero unique solution for . □
2.8. The Crossing Condition for Annealing Problems
We next determine how to check the crossing condition in Theorem 3 when F is an annealing problem, as in (2)
First, we show that the crossing condition can be checked in terms of the Hessian of the function D. Furthermore, when G is strictly concave on , then the crossing condition is always satisfied, and every singularity is a bifurcation.
Theorem 4.
The crossing condition
given in Theorem 3 is satisfied for M-singular q for if is either positive or negative definite on .
Proof.
Let so that Multiplying Equation (21) on the left by and on the right by yields
By Theorem 2, an arbitrary can be written as , where . Substituting this into (29) and observing that yields
Differentiating with respect to , evaluating at , and using (20) yields
which must be non-zero since we assume that is either positive or negative definite on . □
From (30), we can get an expression for , the eigenvalue of with eigenvector . Substituting and observing that yields
The requirement that is either positive or negative definite on holds when is either negative or positive definite, respectively, on .
Lemma 2.
Let be singular where is M-singular such that is negative (or positive) definite on . Then is positive (or negative) definite on .
Proof.
If , then Since , then . □
These results are important for the Information Bottleneck problem (2), where is only non-positive definite on , but is negative definite on . Thus, every singularity of the Information Bottleneck with is a bifurcation point. The space does not contain bifurcating branches since the crossing condition is never satisfied there: for , (by Lemma 42 in []), and so (Theorem 109, []) .
2.9. Bifurcation Type
Suppose that a bifurcation occurs at , where is M-singular. This section examines the type of bifurcation from which emanate the branches
whose existence is guaranteed by Theorem 3.
As we showed in [], the derivative indicates a transcritical bifurcation. If , then the bifurcation is degenerate, and if , then we have a pitchfork-like bifurcation. Further, for small t indicates a subcritical bifurcating branch, and for small t indicates a supercritical bifurcating branch.
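The classification described above can be summarized in a few lines of Python. The sketch assumes a branch parametrized by t, with the annealing parameter written as a function beta(t), and adopts the common convention that a supercritical branch exists for beta above the bifurcation value and a subcritical branch below it. The function name, the tolerance, and the normal-form check are assumptions for illustration.

```python
def classify_branch(d_beta, dd_beta, tol=1e-12):
    """Classify a bifurcating branch from beta'(0) and beta''(0).

    Assumed convention: supercritical branches exist for beta > beta*, that is,
    t * beta'(t) > 0 for small t; subcritical branches exist for beta < beta*.
    """
    if abs(d_beta) > tol:
        return "transcritical"
    if abs(dd_beta) <= tol:
        return "degenerate, higher-order terms needed"
    return "supercritical pitchfork" if dd_beta > 0 else "subcritical pitchfork"

# Normal-form check: nontrivial equilibria of x' = (beta - beta*) x - s x^3 satisfy
# beta(t) = beta* + s t^2 along the branch x = t, so beta'(0) = 0 and beta''(0) = 2 s.
for s in (1.0, -1.0):
    print(s, classify_branch(0.0, 2.0 * s))
```

With s = 1 the cubic term is stabilizing and the branch is reported as supercritical; with s = -1 it is reported as subcritical.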
Expressions for and are derived as follows. Differentiating from (27) yields
so that Differentiating (28) with respect to t and then evaluating at shows that
where (see (24)). As shown in the proof to Theorem 3, is the non-zero eigenvalue of with eigenvector .
This expression is similar to the one given in [] p. 90. The numerator can be calculated via (24). In [], we showed that . We have the same result in the present case.
Theorem 5.
If is M-singular for , then all of the bifurcating branches guaranteed by Theorem 3 are degenerate, i.e., .
Proof.
To show that the numerator of (33) , expand , the ith component of r, about ,
and so
Applying the equivariance relation , where A is any element of the group isomorphic to that acts on r in , and equating the quadratic terms yields
By (24), the diagonal for each i as well as for all of the “multi-diagonals”. This shows that for every (see Theorem 124 in []). □
When , we need to compute to determine whether a branch is subcritical or supercritical. Differentiating (32) and setting shows that Differentiating (28) twice and solving for shows that
where . Use Equation (25) to calculate the numerator, and is the non-zero eigenvalue of with eigenvector , for which we give an explicit expression in (31) when F is an annealing problem.
If , which we expect to be true generically, then Theorem 5 shows that the bifurcation guaranteed by Theorem 3 is pitchfork-like.
2.10. Stability and Optimality
The next theorem relates the stability of equilibria in the flow (8) with optimality of in Problem (1). In particular, if a bifurcating branch corresponds to an eigenvalue of changing from negative to positive, then the branch consists of stationary points that are not solutions of (1). Positive eigenvalues of do not necessarily show that is not a solution of (1) (see Remark 1). For example, see page 668 of []. A proof of this theorem is given in [].
Theorem 6.
For each bifurcating branch guaranteed by Theorem 3, is an eigenvector of for sufficiently small t. Furthermore, if the corresponding eigenvalue is positive, then the branch consists of unstable stationary points that are not solutions to (1).
2.11. Structure of the Symmetry Projection
The matrix that projects onto by annihilating is important for numerical computations for equilibria of IB, since we may want to take each equilibrium found by Newton’s method and take out any part in . is written as a function of q since its constitutive vectors (from Definition 1) depend on q. The following theorems clarify the structure of this projection.
Theorem 7.
, where and are . The matrix A is with blocks, , of size , defined by
For example, if , then
Proof.
Theorem 2 gives the basis of as . Let Y be the matrix whose columns are the vectors . For example, if and , then . Thus, the matrix that projects onto is , and the projection matrix onto is . Direct multiplication of , with an appeal to Lemma 34 in [] to compute the inverse, shows that . Dropping the constant yields the result. □
For the Information Bottleneck, the matrix is easy to calculate, since for any . For example, when , then and , and so
where 1 is a matrix of 1s. Thus,
Theorem 8.
The symmetry group commutes with the matrix , which projects onto .
Proof.
Let be the matrix that projects onto . Since , then any can be decomposed in the respective subspaces as . Let be an arbitrary permutation matrix in . Then . Since , and are all invariant; then implies that , and implies that Thus, . □
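The commutation property in Theorem 8 can be seen in a toy analog. The sketch below takes the span of the all-ones vector as a stand-in for an invariant subspace, builds the orthogonal projector onto its complement, and checks that this projector commutes with every permutation matrix. The choice of subspace, the dimension, and the helper names are assumptions for illustration; the actual subspaces in the text depend on q.

```python
import itertools
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def permutation_matrix(perm):
    """Matrix with a one in column perm[i] of row i and zeros elsewhere."""
    n = len(perm)
    return [[Fraction(int(perm[i] == j)) for j in range(n)] for i in range(n)]

n = 3
# Toy invariant subspace: the span of the all-ones vector, which every permutation fixes.
# Orthogonal projector onto its complement: P = I - (1/n) * ones * ones^T.
P = [[Fraction(int(i == j)) - Fraction(1, n) for j in range(n)] for i in range(n)]

commutes = all(matmul(P, A) == matmul(A, P)
               for A in map(permutation_matrix, itertools.permutations(range(n))))
idempotent = matmul(P, P) == P
print(commutes, idempotent)  # True True
```

Exact rational arithmetic is used so that the equality tests are not affected by rounding.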
2.12. Visualizations of Sample Results
We illustrate these structures numerically. In [], we introduced the toy “Four-blob” probability distribution shown in Figure 1.
Figure 1.
The probability distribution for the “Four-blob” toy problem for a system of interest . We use this probability to illustrate some results of the bifurcation analysis reported here.
For the Information Distortion problem (3) [,,] and the synthetic dataset composed of a mixture of four Gaussians (Figure 1), we determined the bifurcation structure of solutions to (3) by annealing in and finding the corresponding stationary points to (1). A typical run of the derived gradient dynamical system tends to follow the main bifurcation branch from the fully symmetric uniform quantizer ( here) to the fully resolved deterministic quantizer (hard clustering) seen at the end in Figure 2. The permutation symmetry is also obvious there—the value of the cost function does not change if the classes along the vertical axis in T are permuted/relabeled. The uniform quantizer (Item 1 in the figure) plays a special role in the formulation (3), as it is the unique solution to the problem for as the maximum entropy solution of . Its loss of stability at the first bifurcation for increasing can hence be determined analytically and the first bifurcation structure characterized completely. Because of the “perpetual kernel” of the cost function in (4), the uniform quantizer is just one of a continuous set of “uninformative” quantizers for the IB problem (4): all , having constant probability of assignment of each y to class t, but the assignment weight can be different for different classes. Such a structure does not change the value of the cost function in the IB problem (4) (but does change it for (3), which hence does not have this degeneracy). We address the degeneracy of the IB optimization by projecting onto the subspace that has the correct symmetry (i.e., just the uniform quantizer in this case), as outlined in Remark 2.
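The degeneracy of the “uninformative” quantizers can be checked on a small example. The sketch below assumes the standard forms F_H = H(T|Y) + beta I(X;T) for the Information Distortion cost and F_I = -I(Y;T) + beta I(X;T) for the Information Bottleneck cost, with natural logarithms. These forms, the toy distribution `p_xy`, the value of `beta`, and the helper names are assumptions for illustration.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0.0)

def costs(p_xy, q, beta):
    """Return (F_H, F_I) under the assumed forms
    F_H = H(T|Y) + beta * I(X;T) and F_I = -I(Y;T) + beta * I(X;T)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(q)
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    p_xt = [[sum(p_xy[x][y] * q[t][y] for y in range(n_y)) for t in range(n_t)]
            for x in range(n_x)]
    p_yt = [[p_y[y] * q[t][y] for t in range(n_t)] for y in range(n_y)]
    h_t_given_y = -sum(p_y[y] * q[t][y] * math.log(q[t][y])
                       for y in range(n_y) for t in range(n_t) if q[t][y] > 0.0)
    i_xt = mutual_information(p_xt)
    i_yt = mutual_information(p_yt)
    return h_t_given_y + beta * i_xt, -i_yt + beta * i_xt

# Hypothetical toy data: K = 4 objects in Y, two values of X, N = 2 classes.
p_xy = [[0.20, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.20]]
beta = 1.5

results = {}
for w in ((0.5, 0.5), (0.8, 0.2), (0.99, 0.01)):
    # Uninformative quantizer: q(t | y) = w[t] for every y.
    q = [[w_t] * 4 for w_t in w]
    results[w] = costs(p_xy, q, beta)
    print(w, results[w])
```

In this toy case the second value (the assumed Information Bottleneck cost) stays at zero for every choice of class weights, while the first (the assumed Information Distortion cost) changes with the weights and is largest for the uniform quantizer.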
Figure 2.
The bifurcations of the solutions to the Information Distortion problem (3). For the mixture of 4 well-separated Gaussians shown in Figure 1, the behavior of as a function of is shown in the top panel, and some of the solutions are shown in the bottom panels. Item 1 shows the uniform quantizer , assigning equal probability of each to belong to one of the four clusters in T. Subsequent items 2–5 point to a set of partially resolved quantizations, in which subsets of Y are assigned with high probability to one (2) or more (3–5) classes (dark colors, close to 1), while other subsets are still unresolved (gray levels), albeit at a higher probability than (darker gray, as some of the classes are excluded after being resolved for another subset). Item 6 shows an almost fully resolved quantizer at sufficiently high . They become fully resolved (deterministic; or 0) as (not shown).
A more-thorough structure of the bifurcation diagram, using the analysis presented above, is shown in Figure 3.
Figure 3.
(A) The bifurcation structure of stationary points of the Information Distortion problem (3), a problem of form (2). We found these points by annealing in and finding stationary points for Problem (1) using the algorithm presented in []. A square indicates where a bifurcation occurs. (B) A close-up of the subcritical bifurcation at , indicated by a square. Observe the subcritical bifurcating branch, and the subsequent saddle-node bifurcation at , indicated by another square. We applied Theorem 6 to show that the subcritical bifurcating branch is composed of quantizers that are solutions of (3) but not of (1).
Similar to the results we presented in [], the close-up of the bifurcation at in Figure 3B shows a subcritical bifurcating branch (a first-order phase transition) that consists of stationary points of Problem (1). By projecting the Hessian onto each of the kernels referenced in Theorem 6, we determined that the points on this subcritical branch are not solutions of (1), and yet they are solutions of (2).
3. Conclusions and Discussion
The main goal of this contribution was to show that information-based distortion-annealing problems such as (2) have an interesting mathematical structure. The most interesting aspects of that mathematical structure are driven by the symmetries present in the cost functions—their invariance to actions of the permutation group , represented as relabeling of the reproduction classes. Such a structure would hold for any biclustering problem [] that relies on the intrinsic interaction of a pair of variables for unsupervised clustering. The second mathematical structure that we used successfully was bifurcation theory, which allowed us to identify and study the discrete points at which the character of the cost function changed. The combination of those two tools in [] allowed us to explicitly compute the value of the annealing parameter at which the initial maximum at the uniform quantizer of (1) loses stability. We concluded that for a fixed system characterized by , this value is the same for both problems, that it does not depend on the number of elements of the reproduction variable T, and that it is always greater than 1. We further introduced an eigenvalue problem that links the critical values of and q for bifurcations, or phase transitions, branching off arbitrary intermediate solutions.
Even though the cost functions (4) and (3) have similar properties, they also differ in some important aspects. We have shown that the function is degenerate since its constitutive functions and are not strictly convex in q. That introduces additional invariances and singularities that are always preserved, which makes phase transitions more difficult to detect (e.g., the “uninformative quantizers” only) and post-transition directions more difficult to determine. In contrast, is strictly convex except at points of phase transitions. The theory we developed here allows us to identify bifurcation directions and determine their stability. Despite the presence of a high-dimensional null space at bifurcations, the symmetries restrict the allowed transition dimensions to multiple co-dimension 1 transitions, all related by group transformations. We achieved that here with three main results. Theorem 8 extended the Equivariant Branching Lemma 1 to the Information Bottleneck case with additional translation invariance. Theorem 4 identified the specific condition under which a bifurcation of the gradient flow (8) occurs. This condition is computable analytically for the initial bifurcation off the uniform quantizer and by numerical continuation for subsequent bifurcations. Finally, in Section 2.9, we provided checks for the types of bifurcations that occur, giving conditions to detect saddle-node and pitchfork bifurcations and to determine whether pitchforks are supercritical (second-order phase transitions) or subcritical (leading to first-order phase transitions discontinuous in ). The combination of the three results, together with our previous results in [], completely characterizes the local bifurcation structure of Information Bottleneck-type problems with or without the added translation symmetry.
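The distinction between supercritical and subcritical pitchforks can be made concrete with the textbook one-dimensional normal form dx/dt = mu*x + c*x^3. This is a generic illustration, not the conditions of Section 2.9; the parameter names `mu` and `c` and the function names are assumptions of the sketch. For c < 0 the nontrivial branches appear after the trivial solution loses stability and are stable (supercritical); for c > 0 they exist before the loss of stability and are unstable (subcritical).

```python
import math

def stability(derivative):
    """Linear stability of an equilibrium from the sign of d(rhs)/dx there."""
    if derivative < 0:
        return "stable"
    if derivative > 0:
        return "unstable"
    return "degenerate"

def pitchfork_type(c):
    """Type of the pitchfork in dx/dt = mu*x + c*x**3 (textbook normal form)."""
    if c == 0:
        raise ValueError("c must be nonzero for a pitchfork normal form")
    return "supercritical" if c < 0 else "subcritical"

def pitchfork_equilibria(mu, c):
    """Equilibria of dx/dt = mu*x + c*x**3 with their linear stability."""
    if c == 0:
        raise ValueError("c must be nonzero for a pitchfork normal form")
    points = [(0.0, stability(mu))]           # trivial branch x = 0
    if mu * c < 0:                            # nontrivial branches x^2 = -mu/c
        x = math.sqrt(-mu / c)
        slope = -2.0 * mu                     # mu + 3*c*x^2 evaluated on the branch
        points += [(x, stability(slope)), (-x, stability(slope))]
    return points

print(pitchfork_type(-1.0), pitchfork_equilibria(1.0, -1.0))
print(pitchfork_type(1.0), pitchfork_equilibria(-1.0, 1.0))
```

In the subcritical case the unstable branches coexist with the still-stable trivial solution, so a system following the trivial branch must jump to a distant state once that branch loses stability. This is the mechanism behind a first-order transition.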
Despite the further development of the bifurcation formalism for IB presented here, there are still open questions that this manuscript did not resolve. In particular, we still cannot confirm or reject the conjecture that the set of symmetric soft-clustering branches connected through symmetry-breaking bifurcations leads to the global hard-clustering optima at (multiple equivalent solutions connected by the permutation symmetry of the problem). We believe this is partially due to a discrepancy between practical observations and theoretical results. In particular, we and other practitioners [,] note that the only symmetry-breaking bifurcations observed during optimization are of the kind , while the theory allows for arbitrary bifurcations. The latter are known to occur and to be stable in other biological systems and circumstances [,]. This suggests a research approach of comparing and contrasting the different systems that possess the same symmetry and symmetry-breaking bifurcations, which may lead to breakthroughs in this application to optimization in the Information Bottleneck problem.
An additional open problem involves the use of continuous variables, already noted in [] and explored further in [,]. This approach, while important for many real-world problems, involves the application of additional mathematical tools, namely Calculus of Variations [], which further increases the complexity of an already complex problem. These difficulties are illustrated in a pair of papers [,] that use the continuous formulation. They do present some significant results on conditions of learnability, but both papers obtain only bounds on under which learnability (optimal solutions beyond the “uninformative” quantizer) can be achieved. This is possibly due to the presence of continuous spectra in covariance operators of continuous quantizers, something that we avoid by focusing on finite spaces. As a consequence, here and in prior work [], we show specific values for for the initial bifurcation from the uniform quantizer, which supports nontrivial clustering. We consider the formulation with continuous variables beyond the scope of this manuscript, but look forward to the development of additional techniques to incorporate this important case in the bifurcation framework presented here. Regardless of such developments, any practical problem with numeric optimization will involve discretization of the continuous variables, which effectively converts a continuous problem to the discrete case discussed here.
Author Contributions
Conceptualization, A.G.D.; Formal analysis, A.G.D. and A.E.P.; Investigation, A.G.D. and A.E.P.; Writing–original draft, A.G.D. and A.E.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Gray, R.M. Entropy and Information Theory; Springer: Berlin/Heidelberg, Germany, 1990.
- Cover, T.; Thomas, J. Elements of Information Theory; Wiley Series in Communication; Wiley: New York, NY, USA, 1991.
- Rose, K. Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems. Proc. IEEE 1998, 86, 2210–2239.
- Madeira, S.C.; Oliveira, A.L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004, 1, 24–45.
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing; University of Illinois: Champaign, IL, USA, 1999.
- Dimitrov, A.G.; Miller, J.P.; Aldworth, Z.; Gedeon, T.; Parker, A.E. Analysis of neural coding through quantization with an information-based distortion measure. Netw. Comput. Neural Syst. 2003, 14, 151–176.
- Dimitrov, A.G.; Miller, J.P. Neural coding and decoding: Communication channels and quantization. Netw. Comput. Neural Syst. 2001, 12, 441–472.
- Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Kluwer Academic Publishers: New York, NY, USA, 1992.
- Mumey, B.; Gedeon, T. Optimal mutual information quantization is NP-complete. In Proceedings of the Neural Information Coding (NIC) Workshop, Snowbird, UT, USA, 1–4 March 2003.
- Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Advances in Neural Information Processing Systems; Solla, S.A., Leen, T.K., Müller, K.R., Eds.; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 617–623.
- Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University, Jerusalem, Israel, 2002.
- Dimitrov, A.G.; Miller, J.P. Analyzing sensory systems with the information distortion function. In Proceedings of the Pacific Symposium on Biocomputing 2001; Altman, R.B., Ed.; World Scientific Publishing Co.: Singapore, 2000.
- Gedeon, T.; Parker, A.E.; Dimitrov, A.G. Information Distortion and Neural Coding. Can. Appl. Math. Q. 2003, 10, 33–70.
- Slonim, N.; Somerville, R.; Tishby, N.; Lahav, O. Objective classification of galaxy spectra using the information bottleneck method. Mon. Not. R. Astron. Soc. 2001, 323, 270–284.
- Bardera, A.; Rigau, J.; Boada, I.; Feixas, M.; Sbert, M. Image segmentation using information bottleneck method. IEEE Trans. Image Process. 2009, 18, 1601–1612.
- Aldworth, Z.N.; Dimitrov, A.G.; Cummins, G.I.; Gedeon, T.; Miller, J.P. Temporal encoding in a nervous system. PLoS Comput. Biol. 2011, 7, e1002041.
- Buddha, S.K.; So, K.; Carmena, J.M.; Gastpar, M.C. Function identification in neuron populations via information bottleneck. Entropy 2013, 15, 1587–1608.
- Lewandowsky, J.; Bauch, G. Information-optimum LDPC decoders based on the information bottleneck method. IEEE Access 2018, 6, 4054–4071.
- Parker, A.E.; Dimitrov, A.G.; Gedeon, T. Symmetry breaking in soft clustering decoding of neural codes. IEEE Trans. Inf. Theory 2010, 56, 901–927.
- Gedeon, T.; Parker, A.E.; Dimitrov, A.G. The mathematical structure of information bottleneck methods. Entropy 2012, 14, 456–479.
- Parker, A.E.; Gedeon, T. Bifurcations of a class of SN-invariant constrained optimization problems. J. Dyn. Differ. Equ. 2004, 16, 629–678.
- Golubitsky, M.; Stewart, I.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory II; Springer: New York, NY, USA, 1988.
- Golubitsky, M.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory I; Springer: New York, NY, USA, 1985.
- Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2000.
- Parker, A.E. Symmetry Breaking Bifurcations of the Information Distortion. Ph.D. Thesis, Montana State University, Bozeman, MT, USA, 2003.
- Golubitsky, M.; Stewart, I. The Symmetry Perspective: From Equilibrium to Chaos in Phase Space and Physical Space; Birkhauser Verlag: Boston, MA, USA, 2002.
- Schott, J.R. Matrix Analysis for Statistics; John Wiley and Sons: New York, NY, USA, 1997.
- Parker, A.; Gedeon, T.; Dimitrov, A. Annealing and the rate distortion problem. In Advances in Neural Information Processing Systems 15; Becker, S.T., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 15, pp. 969–976.
- Dimitrov, A.G.; Cummins, G.I.; Baker, A.; Aldworth, Z.N. Characterizing the fine structure of a neural sensory code through information distortion. J. Comput. Neurosci. 2011, 30, 163–179.
- Schneidman, E.; Slonim, N.; Tishby, N.; de Ruyter van Steveninck, R.R.; Bialek, W. Analyzing neural codes using the information bottleneck method. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 15.
- Stewart, I. Self-Organization in evolution: A mathematical perspective. Philos. Trans. R. Soc. 2003, 361, 1101–1123.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. In Proceedings of the Advances in Neural Information Processing Systems 16 (NIPS 2003), Vancouver, BC, Canada, 8–13 December 2003.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188.
- Gelfand, I.M.; Fomin, S.V. Calculus of Variations; Dover Publications: Mineola, NY, USA, 2000.
- Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the information bottleneck. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Virtual, 3–6 August 2020; pp. 1050–1060.
- Ngampruetikorn, V.; Schwab, D.J. Perturbation theory for the information bottleneck. Adv. Neural Inf. Process. Syst. 2021, 34, 21008–21018.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).