Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems

Parker, Albert E.; Dimitrov, Alexander G.

doi:10.3390/e24091231


by Albert E. Parker ¹ and Alexander G. Dimitrov ²,*

¹ Center for Biofilm Engineering, Department of Mathematical Sciences, Montana State University, Bozeman, MT 59717, USA

² Department of Mathematics and Statistics, Washington State University Vancouver, Vancouver, WA 98686, USA

* Author to whom correspondence should be addressed.

Entropy 2022, 24(9), 1231; https://doi.org/10.3390/e24091231

Submission received: 29 June 2022 / Revised: 22 August 2022 / Accepted: 29 August 2022 / Published: 2 September 2022

(This article belongs to the Special Issue Theory and Application of the Information Bottleneck Method)


Abstract

In this paper, we investigate the bifurcations of solutions to a class of degenerate constrained optimization problems. This study was motivated by the Information Bottleneck and Information Distortion problems, which have been used to successfully cluster data in many different applications. In the problems we discuss in this paper, the distortion function is not a linear function of the quantizer. This leads to a challenging annealing optimization problem, which we recast as a fixed-point dynamics problem of a gradient flow of a related dynamical system. The gradient system possesses an

S_{N}

symmetry due to its invariance in relabeling representative classes. Its flow hence passes through a series of bifurcations with specific symmetry breaks. Here, we show that the dynamical system related to the Information Bottleneck problem has an additional spurious symmetry that requires more-challenging analysis of the symmetry-breaking bifurcation. For the Information Bottleneck, we determine that when bifurcations occur, they are only of pitchfork type, and we give conditions that determine the stability of the bifurcating branches. We relate the existence of subcritical bifurcations to the existence of first-order phase transitions in the corresponding distortion function as a function of the annealing parameter, and provide criteria with which to detect such transitions.

Keywords:

information bottleneck; optimization; annealing; gradient flow; bifurcations; symmetry

1. Introduction

This paper analyzes bifurcations of solutions to constrained optimization problems of the form

\begin{matrix} max_{q \in Δ} F (q, β) = max_{q \in Δ} (\sum_{i = 1}^{N} f (q^{i}, β)) \end{matrix}

(1)

as a function of a scalar parameter

β

and a quantizer or classifier

q = (q^{1}, \dots, q^{N})

with

q^{i} \in ℜ^{K}

. The real-valued function f is sufficiently smooth, and

Δ

is the constraint space of valid quantizers, a convex set of discrete probabilities (simplices).

This type of problem arises in Rate Distortion Theory [1,2], Deterministic Annealing [3] and biclustering [4]. The specific motivations for the abstract problem formulation given in (1) are the Information Bottleneck [5] and Information Distortion [6] functions

\begin{matrix} max_{q \in Δ} F (q, β) = max_{q \in Δ} (D (q) - β I (Y; T)) . \end{matrix}

(2)

These were proposed in [5,7] to analyze the Markov chain

X \to Y \to T

in which

X \to Y

, characterized by a probability

p (X, Y)

, is the original system of interest, characterized by its mutual information

I (X; Y)

, and T is a simplification (quantized version of) Y. Here we work mainly with discrete versions of Y and T, with cardinalities

| Y | = K

and

| T | = N

. Typically

N \ll K

.

I (Y; T)

is the mutual information between the K objects in Y and the N clusters in T. The goal is to cluster K objects in Y into N clusters in T given inputs X such that the function F is maximized in

{[q^{i}]}_{j}

, the probability that the jth element of Y is classified as a member of the cluster with label

i \in T

. We call such a set of conditional probabilities a stochastic quantizer, or just a quantizer, to relate to the vector quantization literature [8]. The annealing parameter

β \in [0, \infty)

.

It has been shown that finding hard-clustering solutions to (2) is NP-complete (combinatorial search) when

D (q)

is the mutual information

I (X; T)

[9], as in the Information Bottleneck [5,10,11] and the Information Distortion [7,12,13] methods. Information Bottleneck (IB) approaches are being adopted in a growing number of scientific and engineering domains [14,15,16,17,18]. As they typically involve the nonlinear optimization problem (2), there is a need for optimization methods that can avoid the rise in complexity implied by the NP-complete hard-clustering solutions [9]. Originally, Tishby et al. [5] approached this problem with an algorithm inspired by the Blahut–Arimoto approach to solving Rate-Distortion types of problems ([2], Chapter 10). The “self-consistent” equations in [5] optimize both the quantizer and the “relevance” distribution

p (x | t)

. However, unlike the classic Blahut–Arimoto algorithm, which can guarantee convergence to a unique solution to its iterative scheme because of the convex geometry of the two state spaces, the “self-consistent” equations have no such guarantee due to the more-complicated geometry of three convex sets over which the optimization is performed, as also noted in [5]. Accordingly, in this work, we use the original optimization problem (2) over a single variable: the quantizer (conditional probability)

q (t | y)

. It may be possible that a related Blahut–Arimoto style optimization coupled to the bifurcation structure of its gradient flow discussed here can lead to additional insights into this problem, but we consider this beyond the scope of this particular manuscript.

We have investigated the structure of soft-clustering annealing-type methods that reach the hard-clustering solution in the limit of the annealing parameter [19,20] through a series of bifurcations. A bifurcation in this context is a point that is a solution

(q^{*}, β^{*})

to (2) such that the number of solutions to (2) changes in a small neighborhood of

(q^{*}, β^{*})

. Because a bifurcation corresponds to a point at which some of the objects Y have just been classified, in the IB literature, a bifurcation is usually referred to as a phase transition. One of the goals of this and related work is to understand why annealing-type algorithms, such as the original optimization heuristics in [5,10], work as well as they do. This can help with designing further optimization heuristics and can assess how close those can get to the global solutions to IB problems. We believe that this amalgamation of optimization theory and dynamical systems theory, as stated in [19,20], can provide a solid foundation with which to address such optimization challenges.

Because of the form (1) of F, it possesses certain symmetries. That is, the value of

F (q, β)

does not change (is invariant) under arbitrary permutations of the vectors

q^{i}

. In other words, F is

S_{N}

-invariant. The form (1) further implies that the Hessian

d_{q}^{2} F (q)

is block diagonal with blocks

{d_{q^{i}}^{2} f (q^{i})}_{i = 1}^{N}

. These conditions are met by the Information Distortion function [6],

\begin{matrix} F_{H} (q, β) = H (T | Y) + β I (X; T), \end{matrix}

(3)

where

H (T | Y)

is the conditional entropy, and by the cost function used in the original IB method [5],

\begin{matrix} F_{I B} (q, β) = - I (Y; T) + β I (X; T), \end{matrix}

(4)

which is the focus of this manuscript. Both the Information Distortion and Information Bottleneck problems have the form given in (1) and (2). Importantly,

d^{2} F_{I B} (q)

has a “perpetual kernel” since each block

d^{2} f (q^{i})

has the eigenpair (0,

q^{i}

) for every q [20]. In other words, the Hessian

d^{2} F

is singular for every q and every value of

β

. This makes bifurcation detection challenging because bifurcations can usually be detected by identifying isolated singularities of

d^{2} F

. This degeneracy is a consequence of the translational symmetry of

F_{I B}

: if

k \in ker d_{q}^{2} F_{I B} (q^{*})

, then

F_{I B} (q^{*}) = F_{I B} (q^{*} + t k)

for all

t \in ℜ

such that

q^{*} + t k \in Δ

. At bifurcations of solutions to (4), the translational symmetry never breaks.
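The perpetual kernel can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the convention that the kth entry of q^i is q(t = i | y = k), natural logarithms, and a closed-form Hessian block that we derived from (4) by direct differentiation (the formula is our own working, not quoted from the paper). The function name `ib_block_hessian` and the random test data are hypothetical.

```python
import numpy as np

def ib_block_hessian(q_i, p_xy, beta):
    """Hessian block d^2 f(q^i) of F_IB = -I(Y;T) + beta * I(X;T).

    Assumed convention: q_i[k] = q(t=i | y=k); p_xy[x, k] = p(x, y_k).
    Derived by differentiating (4) twice; this closed form is an assumption
    of the sketch, not a formula quoted from the source.
    """
    p_y = p_xy.sum(axis=0)            # marginal p(y), length K
    p_t = p_y @ q_i                   # p(t = i)
    s = p_xy @ q_i                    # p(x, t = i) for every x
    return (-np.diag(p_y / q_i)
            + (1.0 - beta) * np.outer(p_y, p_y) / p_t
            + beta * p_xy.T @ (p_xy / s[:, None]))

rng = np.random.default_rng(0)
n_x, K = 5, 6
p_xy = rng.random((n_x, K))
p_xy /= p_xy.sum()                    # joint distribution p(x, y)
q_i = rng.uniform(0.1, 0.9, size=K)   # strictly positive quantizer component

H = ib_block_hessian(q_i, p_xy, beta=2.3)
residual = np.linalg.norm(H @ q_i)    # should vanish: (0, q^i) is an eigenpair
```

Under these assumptions the residual is zero up to rounding for any positive q^i and any β, which is the eigenpair (0, q^i) described above.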

To better understand bifurcations of solutions to problems of the form (1), which includes the problems (3) and (4), we consider the gradient flow

(\begin{matrix} \dot{q} \\ \dot{λ} \end{matrix}) = \nabla L (q, λ, β)

Equilibria of this flow correspond to critical points of (1), where

L

is the Lagrangian with respect to the constraints imposed by

Δ

, and

λ

is the vector of Lagrange multipliers.

Previous work showed that when

d^{2} F

is generically non-singular, as occurs for the Information Distortion (3), then there are isolated singularities of

d^{2} L

that indicate possible bifurcations of solutions to (1). In this case, an

M > 1

-dimensional

ker d^{2} F

necessitates an

M - 1

-dimensional

ker d^{2} L

, which admits a bifurcation of solutions to (1) where symmetry breaks from

S_{M}

to

S_{m} \times S_{n}

for every

m, n > 0

such that

m + n = M

[21].

Here we allow

d^{2} F

and

d^{2} L

to be singular for every

q \in Δ

, as occurs for the Information Bottleneck (4). That is, the perpetual kernel for

d^{2} F

implies that

d^{2} L

also has a perpetual kernel

ker d^{2} L = K_{p} (q)

, which means that the eigenvalue crossing condition that must occur at a bifurcation (i.e.,

d^{2} L

must have a zero eigenvalue at a bifurcation) [20] is never satisfied in

K_{p}

. There are a few challenges due to the existence of the perpetual kernel (i.e., degeneracy) of the Information Bottleneck that we address in this paper. First, detecting bifurcations may be problematic because one cannot simply monitor the determinant of either

d^{2} F

or

d^{2} L

. Second, the standard theory that assures the existence of bifurcating branches, the Equivariant Branching Lemma, cannot be applied directly. Lastly, the spaces that contain the bifurcating solutions are always at least two-dimensional, which makes tracking the bifurcating solutions problematic.

Here we address two of these three challenges. We show that at a bifurcation, new eigenvalue(s) of

d^{2} F_{I B}

and

d^{2} L

must cross zero, causing

ker d^{2} L

to expand so that

ker d^{2} L (q^{*}) = K_{p} \times K^{*}

, where

K^{*}

is the span of the eigenvectors with crossing eigenvalues. Instead of detecting bifurcations by the expensive process of monitoring the expansion of

ker d^{2} L

(from

K_{p}

to

K_{p} \times K^{*}

), we give a simple way to check the eigenvalue crossing condition for annealing problems

F = G (q) + β D (q)

as in (2) [20]. We prove the existence of the bifurcating branches by adapting the standard proof for the Equivariant Branching Lemma. This newly developed theory guarantees that bifurcating branches exist in

K^{*}

, are generically pitchforks, and that symmetry breaks from

S_{M}

to

S_{m} \times S_{n}

. Additionally, we give conditions to check whether the pitchforks are subcritical or supercritical, and how stability of the bifurcating branches relates to optimality in the optimization problem (1).

2. Bifurcation Analysis

2.1. Equivariant Branching Lemma

The Equivariant Branching Lemma relates the subgroup structure of a symmetry group

Γ

with the existence of symmetry-breaking bifurcating branches of equilibria of

\dot{x} = f (x, β)

. Observe that we present a version that does not require absolute irreducibility. For a proof see [22] p. 83.

Theorem 1.

(Equivariant Branching Lemma). Let f be a smooth function

f : V \times ℜ \to V

that is Γ-equivariant for a compact Lie group Γ and a Banach space V. Let Σ be an isotropy subgroup of

Γ

with

dim Fix (Σ) = 1

. Suppose that

Fix (Γ) = {0}

and the crossing condition

d_{β x}^{2} f (0, 0) x_{0} \neq 0

for

x_{0} \in Fix (Σ)

. Then there exists a unique smooth solution branch

(t x_{0}, β (t))

to

f = 0

with isotropy subgroup Σ.

For an arbitrary

Γ

-equivariant system where bifurcation occurs at

(x^{*}, β^{*})

, the requirement in Theorem 1 that the bifurcation occurs at the origin is accomplished by a translation. Assuring that the Jacobian vanishes,

d_{x} f (0, 0) = 0

, can be effected by restricting and projecting the system onto the kernel of the Jacobian. This transform is called the Liapunov–Schmidt reduction (see [23]).

The Equivariant Branching Lemma does not directly apply to yield bifurcating branches for the problem (1) at q for which

d^{2} F

is singular for the following reasons:

$K_{p}$ and $K^{*}$ have independent bases, which implies that each is invariant to the action of $S_{N}$ , and so the decomposition $ker d^{2} L (q^{*}) = K_{p} \times K^{*}$ shows that $S_{N}$ does not act absolutely irreducibly on $ker d^{2} F (q^{*})$ , but it does act absolutely irreducibly on each of these disjoint subspaces separately. This is why we present a version of the Equivariant Branching Lemma that does not require absolute irreducibility.
The Liapunov–Schmidt reduction onto $ker d^{2} L (q^{*})$ is clear, but not onto $K^{*}$ .
$Fix (S_{m} \times S_{n}) \cap ker d^{2} L (q^{*})$ is two-dimensional with basis

${(n v, \dots, n v, - m v, \dots, - m v), (n y, \dots, n y, - m y, \dots, - m y)},$

where $v, y \in ℜ^{K}$ .

We address these issues in the manuscript and show that a small modification of the Equivariant Branching Lemma allows for similar analysis to be successfully applied to Information Bottleneck-style problems such as (2) with minimal modifications to the original algorithm from [20].

2.2. A Gradient Flow

We now lay the groundwork necessary to determine the bifurcations of local solutions to (1)

max_{q \in Δ} F (q, β),

where

F = \sum_{i = 1}^{N} f (q^{i}, β)

, which includes as a special case the Information Distortion (3) and Information Bottleneck (4) problems. The convex set of discrete conditional probabilities is

Δ : = \{q \in ℜ^{N K} | \sum_{i = 1}^{N} q_{k}^{i} = 1 \forall k : 1 \leq k \leq K and q_{k}^{i} \geq 0 \forall i, k\} .

Due to the form of F, it has the following properties:

$F (q, β)$ is an $S_{N}$ -invariant, real-valued function of q, where the action of $S_{N}$ on q permutes the component vectors $q^{i}$ , $i = 1, \dots, N$ , of $q \in Δ$ .
The $N K \times N K$ Hessian $d_{q}^{2} F (q, β)$ is block diagonal, where the ith $K \times K$ block is $d^{2} f (q^{i})$ .
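
The S_N-invariance in the first property can be illustrated for the Information Bottleneck cost (4). This is a minimal sketch under stated assumptions: the quantizer is stored as an N × K array with `q[i, k]` = q(t = i | y = k), logarithms are natural, and the helper names `mutual_information` and `f_ib` are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # row marginal
    pb = joint.sum(axis=0, keepdims=True)   # column marginal
    mask = joint > 0                        # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def f_ib(q, p_xy, beta):
    """F_IB(q, beta) = -I(Y;T) + beta * I(X;T), as in (4)."""
    p_y = p_xy.sum(axis=0)      # marginal p(y), length K
    p_yt = (q * p_y).T          # joint p(y, t), shape (K, N)
    p_xt = p_xy @ q.T           # joint p(x, t), shape (|X|, N)
    return -mutual_information(p_yt) + beta * mutual_information(p_xt)

rng = np.random.default_rng(0)
n_x, K, N = 5, 6, 3
p_xy = rng.random((n_x, K))
p_xy /= p_xy.sum()
q = rng.random((N, K))
q /= q.sum(axis=0)              # each column sums to one, so q lies in Delta

perm = rng.permutation(N)       # relabel the N classes
value = f_ib(q, p_xy, beta=1.7)
value_permuted = f_ib(q[perm], p_xy, beta=1.7)
```

Relabeling the classes only reorders the summands in (1), so `value` and `value_permuted` agree up to floating-point rounding.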

The Lagrangian of (1) with respect to the equality constraints from

Δ

is

\begin{matrix} L (q, λ, β) = F (q, β) + \sum_{k = 1}^{K} λ_{k} (\sum_{i = 1}^{N} q_{k}^{i} - 1) . \end{matrix}

(5)

The scalar

λ_{k}

is the Lagrange multiplier for the constraint

\sum_{i = 1}^{N} q_{k}^{i} - 1 = 0

, and

λ \in ℜ^{K}

is the vector of Lagrange multipliers

λ = {(λ_{1}, λ_{2}, \dots, λ_{K})}^{T} .

The gradient of the Lagrangian in (5) is

\nabla L : = \nabla_{q, λ} L (q, λ, β) = (\begin{matrix} \nabla_{q} L \\ \nabla_{λ} L \end{matrix}),

where

\nabla_{q} L = \nabla F (q, β) + Λ

and

Λ = {(λ^{T}, λ^{T}, \dots λ^{T})}^{T} \in R^{N K}

. The gradient

\nabla_{λ} L

is a vector of K constraints

\begin{matrix} \nabla_{λ} L = (\begin{matrix} \sum_{i} q_{1}^{i} - 1 \\ \sum_{i} q_{2}^{i} - 1 \\ ⋮ \\ \sum_{i} q_{K}^{i} - 1 \end{matrix}) . \end{matrix}

Let J be the Jacobian of

d_{q} \nabla_{λ} L

\begin{matrix} J : = d_{q} \nabla_{λ} L = \underset{N blocks}{\underset{︸}{(\begin{matrix} I_{K} & I_{K} & \dots & I_{K} \end{matrix})}} . \end{matrix}

(6)

Observe that J has full row rank. The Hessian of (5) with respect to the vector

(\begin{matrix} q \\ λ \end{matrix}) \in ℜ^{N K + K}

is

\begin{matrix} d^{2} L (q) : = d^{2} L (q, λ, β) = (\begin{matrix} d^{2} F (q, β) & J^{T} \\ J & 0 \end{matrix}), \end{matrix}

(7)

where

0

is

K \times K

. The

N K \times N K

matrix

d^{2} F (q) : = d_{q}^{2} F (q, β)

is the block diagonal Hessian of F with

K \times K

blocks

{d^{2} f (q^{i}, β)}_{i = 1}^{N}

.

The dynamical system whose equilibria are stationary points of (1) is the gradient flow of the Lagrangian

\begin{matrix} (\begin{matrix} \dot{q} \\ \dot{λ} \end{matrix}) = \nabla L (q, λ, β) \end{matrix}

(8)

for

L

as defined in (5) and

β \in [0, \infty)

. The equilibria of (8) are points

(\begin{matrix} q^{*} \\ λ^{*} \end{matrix}) \in R^{N K + K}

where

\nabla L (q^{*}, λ^{*}, β) = 0 .

The Jacobian of this system is the Hessian

d^{2} L (q, λ, β)

from (7).
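
The block structure of (6) and (7) can be assembled directly. The sketch below is illustrative only: the function names are hypothetical, and the random symmetric blocks stand in for the actual blocks d²f(q^i, β).

```python
import numpy as np

def constraint_jacobian(N, K):
    """J from (6): N copies of the K x K identity, side by side."""
    return np.hstack([np.eye(K)] * N)

def lagrangian_hessian(blocks):
    """d^2 L from (7), built from the N diagonal blocks of d^2 F."""
    N, K = len(blocks), blocks[0].shape[0]
    d2F = np.zeros((N * K, N * K))
    for i, blk in enumerate(blocks):
        d2F[i * K:(i + 1) * K, i * K:(i + 1) * K] = blk
    J = constraint_jacobian(N, K)
    return np.block([[d2F, J.T], [J, np.zeros((K, K))]])

rng = np.random.default_rng(0)
N, K = 3, 4
blocks = []
for _ in range(N):
    a = rng.standard_normal((K, K))
    blocks.append(a + a.T)          # symmetric placeholder for d^2 f(q^i)

J = constraint_jacobian(N, K)
d2L = lagrangian_hessian(blocks)
rank_J = np.linalg.matrix_rank(J)   # J has full row rank K
```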

Remark 1.

By the theory of constrained optimization [24], the equilibria

(q^{*}, λ^{*}, β)

of (8) where

d^{2} F (q^{*}, β)

is negative definite on

ker J

are local solutions of (1). Conversely, if

(q^{*}, β)

is a local solution of (1), then there exists a vector of Lagrange multipliers

λ^{*}

so that

(q^{*}, λ^{*}, β)

is an equilibrium of (8) (this necessary requirement is called the Karush–Kuhn–Tucker conditions) such that

d^{2} F (q^{*}, β)

is non-positive definite on

ker J

.
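
The sufficient condition in Remark 1 can be tested by restricting d²F to ker J. The following is a minimal sketch, assuming an orthonormal basis of ker J obtained from the SVD of J; the function name and tolerance are hypothetical choices.

```python
import numpy as np

def is_negative_definite_on_ker_J(d2F, N, K, tol=1e-10):
    """Check whether d^2 F is negative definite on ker J, with J as in (6)."""
    J = np.hstack([np.eye(K)] * N)
    _, _, vt = np.linalg.svd(J)     # full SVD; J has rank K
    Z = vt[K:].T                    # columns: orthonormal basis of ker J
    reduced = Z.T @ d2F @ Z         # Hessian restricted to ker J
    eigs = np.linalg.eigvalsh((reduced + reduced.T) / 2)
    return bool(eigs.max() < -tol)

N, K = 3, 4
J = np.hstack([np.eye(K)] * N)
identity = np.eye(N * K)

case_negative = is_negative_definite_on_ker_J(-identity, N, K)
case_positive = is_negative_definite_on_ker_J(identity, N, K)
# Indefinite on the whole space, yet negative definite on ker J:
case_constrained = is_negative_definite_on_ker_J(J.T @ J - 0.5 * identity, N, K)
```

The third case shows why the restriction matters: the matrix has positive eigenvalues on the full space, but those directions violate the constraints. For the Information Bottleneck the perpetual kernel means this strict test fails at equilibria with symmetry, which is the degeneracy discussed in the introduction.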

2.3. Equilibria with Symmetry

Next, we categorize the equilibria of (8) according to their symmetries, which allows us to determine when to expect symmetry-breaking bifurcations.

Let

q \in Fix (S_{M})

for some

1 \leq M \leq N

. Then there exists a partition of

{1, 2, \dots, N}

into the sets

U

and

R

, where

| U | = M

, so that

q^{i} = q^{j}

if and only if

i, j \in U

. Clearly,

d^{2} F

has M identical blocks,

{d^{2} f (q^{i})}_{i \in U}

.

To ease the notation, and without loss of generality, we set

U : = {1, \dots, M} and R : = {M + 1, \dots, N} .

To distinguish between the blocks of

d^{2} F

, we write

\begin{matrix} B : = d^{2} f (q^{i}) for 1 \leq i \leq M and R_{i} : = d^{2} f (q^{i}) for M + 1 \leq i \leq N . \end{matrix}

(9)

As mentioned in the introduction, we assume that for each

q \in Δ

, each block

d^{2} f (q^{i})

always has at least a one-dimensional kernel with basis vector(s) which depend on q. Thus,

dim ker d^{2} F \geq N

. At an equilibrium

(q^{*}, λ^{*}, β^{*})

of (8) where

q^{*} \in Fix (S_{M})

, we consider the following three cases:

1. $dim ker d^{2} F (q^{*}) > N + 1$ ;
2. $dim ker d^{2} F (q^{*}) = N + 1$ ;
3. $dim ker d^{2} F (q^{*}) = N$ .

We will show that the first case necessitates a symmetry-breaking bifurcation (Theorem 3). In the second case, there is no bifurcation (Corollary 1). Finally, in the third case, we expect a saddle node [21], a symmetry-preserving bifurcation.

We are able to distinguish between the three cases above by considering which blocks of

d^{2} F (q^{*})

have kernels that have more than one dimension. This motivates the following definition.

Definition 1.

An equilibrium

(q^{*}, λ^{*}, β^{*})

of (8) is M-singular (or, equivalently,

q^{*}

is M-singular) if:

1. $q \in Fix (S_{M})$ so that $q^{i} = q^{j}$ for every $1 \leq i, j \leq M$ .
2. For B, the M block(s) of the Hessian defined in (9), $ker B$ has dimension 2 with basis vectors $v, y \in ℜ^{K}$ . $v$ is associated with the crossing eigenvalues, and $y$ is associated with the constant zero eigenvalue of B.
3. The $N - M$ block(s) of the Hessian ${R_{i}}_{i \in R}$ , defined in (9), each have a one-dimensional kernel with basis vector $z (i) \in ℜ^{K}$ .
4. The vectors $v$ , $y$ and ${z (i)}$ are linearly independent.
5. The matrix

$\begin{matrix} A : = B \sum_{i = M + 1}^{N} R_{i}^{-} + M I_{K} \end{matrix}$

(10)

is nonsingular. $R_{i}^{-}$ is the Moore–Penrose inverse of $R_{i}$ . When $M = N$ , we define $A : = N I_{K}$ .

We wish to emphasize that we showed in [21] that requirements 2–5 in Definition 1 hold generically.

A straightforward calculation shows that every block of the Hessian

d^{2} F

of the Information Bottleneck cost function (4) is singular for every

(q, β)

, and the basis for

ker d^{2} f (q^{i})

is

y = q^{i}

for

1 \leq i \leq M

and

z (i) = q^{i}

for

M + 1 \leq i \leq N

(Lemma 42 in [25]), which assures that these vectors are linearly independent, as in Definition 1.4. At a bifurcation, the kernels of the identical blocks B expand by

v

as in Definition 1.2. Using the notation above,

y = q^{i}

for each

i \in U

, and

z (i) = q^{i}

for each

i \in R

.

2.4. The Kernel at a Bifurcation

The equilibria of (8) change their stability with

β

, and hence change the solutions to (1). The changes of stability are determined by the kernel of

d^{2} L (q^{*})

at a bifurcation point

q^{*}

. In this section we show that for any

q \in Fix (S_{M})

with

M > 1

,

d^{2} L (q^{*})

has a perpetual kernel

K_{p}

that is at least

M - 1

dimensional. The zero eigenvalues associated with the eigenvectors in

K_{p}

remain constant, so that at a bifurcation point

(q^{*}, λ^{*}, β^{*})

of (8) where

q^{*}

is M-singular, new eigenvalues of

d^{2} L

must cross zero. Thus, the kernel expands, and the bifurcating directions exist in an “expanded” kernel of

d^{2} L (q^{*})

,

ker d^{2} L (q^{*}) = K^{*} \times K_{p}

.

We determine a basis for

ker d^{2} L

at an M-singular

q^{*}

when

M > 1

. If q is 1-singular with a trivial isotropy group (i.e., no symmetry), then

d^{2} L (q^{*})

is non-singular—

K_{p}

disappears. First, we ascertain a basis for

ker d^{2} F (q^{*})

.

Recall that in the preliminaries, when

x \in ℜ^{N K}

, we defined

x^{j} \in R^{K}

to be the jth vector component of

x

. We now define the linearly independent vectors

{v_{i}}_{i = 1}^{M}

,

{y_{i}}_{i = 1}^{M}

, and

{z_{k}}_{k = M + 1}^{N}

in

ℜ^{N K}

by

\begin{matrix} v_{i}^{j} : = \{\begin{matrix} v if 1 \leq i = j \leq M \\ 0 otherwise \end{matrix}, & y_{i}^{j} : = \{\begin{matrix} y if 1 \leq i = j \leq M \\ 0 otherwise \end{matrix}, & z_{k}^{j} : = \{\begin{matrix} z (k) if M + 1 \leq j = k \leq N \\ 0 otherwise \end{matrix} \end{matrix}

(11)

where

0 \in ℜ^{K}

, and

v

and

y

are defined in Definition 1.2. For example, if

M = 2

and

N = 3

, then

v_{1} : = {(v^{T}, 0, 0)}^{T}

and

v_{2} : = {(0, v^{T}, 0)}^{T}

.

Due to the block diagonal form of

d^{2} F (q^{*})

, it is easy to see that the

N + M

vectors defined in (11) form a basis for

ker d^{2} F (q^{*})

.

Now, let

\begin{matrix} V_{i} = (\begin{matrix} v_{i} \\ 0 \end{matrix}) - (\begin{matrix} v_{M} \\ 0 \end{matrix}), & Y_{i} = (\begin{matrix} y_{i} \\ 0 \end{matrix}) - (\begin{matrix} y_{M} \\ 0 \end{matrix}), & Z_{k} = (\begin{matrix} z_{k} \\ 0 \end{matrix}) - (\begin{matrix} z_{N} \\ 0 \end{matrix}) \end{matrix}

(12)

for

i = 1, \dots, M - 1

and

M + 1 \leq k \leq N - 1

where

0 \in ℜ^{K}

. From (7), it is easy to see that the first two sets of vectors are in

ker d^{2} L (q^{*})

. The next theorem shows that

{V_{i}}_{i = 1}^{M - 1} ⋃ {Y_{i}}_{i = 1}^{M - 1}

are a basis for

ker d^{2} L (q^{*})

. This natural partition of the basis vectors shows that

ker d^{2} L (q^{*})

can be written as

ker d^{2} L (q^{*}) = K_{p} \times K^{*}

. According to Definition 1, the “perpetual kernel” corresponding to constant zero eigenvalues of

d^{2} L (q^{*})

is generated by

K_{p} = < {Y_{i}}_{i = 1}^{M - 1} > .

The part of the kernel that arises at a bifurcation corresponding to eigenvalues crossing zero is

K^{*} = < {V_{i}}_{i = 1}^{M - 1} > .

The vectors

{Z_{k}}

do not contribute to

ker d^{2} L (q^{*})

.

Theorem 2.

If

q^{*}

is M-singular for

1 < M \leq N

, then

{V_{i}} ⋃ {Y_{i}}

from (12) are a basis for

ker d^{2} L (q^{*})

.

Proof.

To show that

{V_{i}}_{i = 1}^{M - 1} ⋃ {Y_{i}}_{i = 1}^{M - 1}

span

ker d^{2} L (q^{*})

, let

k \in ker d^{2} L (q^{*})

and decompose it as

\begin{matrix} k = (\begin{matrix} k_{F} \\ k_{J} \end{matrix}) \end{matrix}

(13)

where

k_{F}

is

N K \times 1

, and

k_{J}

is

K \times 1

. Hence,

\begin{matrix} d^{2} L (q^{*}, λ^{*}, β^{*}) k = (\begin{matrix} d^{2} F (q^{*}, β^{*}) & J^{T} \\ J & 0 \end{matrix}) (\begin{matrix} k_{F} \\ k_{J} \end{matrix}) = 0 \\ \Rightarrow d^{2} F (q^{*}, β^{*}) k_{F} = - J^{T} k_{J} \\ J k_{F} = 0 . \end{matrix}

(14)

Now, from (6) and the fact that

d^{2} F

is block diagonal, we have

\begin{matrix} (\begin{matrix} d^{2} f (q^{1}) & 0 & \dots & 0 \\ 0 & d^{2} f (q^{2}) & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & d^{2} f (q^{N}) \end{matrix}) k_{F} = - (\begin{matrix} k_{J} \\ k_{J} \\ ⋮ \\ k_{J} \end{matrix}) . \end{matrix}

(15)

We set

\begin{matrix} k_{F} : = {(x_{1}^{T} x_{2}^{T} \dots x_{N}^{T})}^{T}, \end{matrix}

(16)

and using the notation from (9), then (15) implies

\begin{matrix} B x_{i} & = & - k_{J} for 1 \leq i \leq M \\ R_{i} x_{i} & = & - k_{J} for M + 1 \leq i \leq N . \end{matrix}

(17)

It follows that

x_{i} = R_{i}^{-} B x_{1}

for every

M + 1 \leq i \leq N

. By (14), we have that

\sum_{i = 1}^{N} x_{i} = 0

, and so

\begin{matrix} \sum_{i = 1}^{M} x_{i} + \sum_{i = M + 1}^{N} x_{i} = 0 \\ \Rightarrow \sum_{i = 1}^{M} x_{i} + \sum_{i = M + 1}^{N} R_{i}^{-} B x_{1} = 0 . \end{matrix}

By (17), for every

1 \leq i \leq M, x_{i}

can be written as

x_{i} = x_{p} + d_{i} v + e_{i} y

, where

x_{p} \in range (B)

,

d_{i}, e_{i} \in ℜ

, and

v

and

y

are the basis vectors of

ker B

from Definition 1.2. Thus,

\begin{matrix} B \sum_{i = 1}^{M} (x_{p} + d_{i} v + e_{i} y) & + & B \sum_{i = M + 1}^{N} R_{i}^{-} B (x_{p} + d_{1} v + e_{1} y) = 0 \\ \Leftrightarrow & (B \sum_{i = M + 1}^{N} R_{i}^{-} + M I_{K}) B x_{p} = 0 \\ \Leftrightarrow & B x_{p} = 0 \end{matrix}

since

A = B \sum_{i = M + 1}^{N} R_{i}^{-} + M I_{K}

is nonsingular. This shows that

x_{p} = 0

. Therefore,

x_{i} = d_{i} v + e_{i} y

for every

1 \leq i \leq M

. Now (17) shows that

k_{J} = 0

, and so

x_{i} \in ker R_{i}

for

M + 1 \leq i \leq N

, which implies that

\begin{matrix} x_{i} = c_{i} z (i) for M + 1 \leq i \leq N . \end{matrix}

Hence,

k = (\begin{matrix} k_{F} \\ 0 \end{matrix})

, where

k_{F}^{i} = \{\begin{matrix} d_{i} v + e_{i} y if 1 \leq i \leq M \\ c_{i} z (i) if M + 1 \leq i \leq N \end{matrix}

, from which it follows that

\begin{matrix} J k_{F} = \sum_{i = 1}^{N} x_{i} = \sum_{i = 1}^{M} d_{i} v + \sum_{i = 1}^{M} e_{i} y + \sum_{i = M + 1}^{N} c_{i} z (i) = 0 . \end{matrix}

(18)

Linear independence (Definition 1.4) implies that

\sum d_{i} = \sum e_{i} = c_{i} = 0

. Thus,

k_{F} = \sum_{i = 1}^{M - 1} d_{i} (v_{i} - v_{M}) + \sum_{i = 1}^{M - 1} e_{i} (y_{i} - y_{M}) .

Therefore, the linearly independent vectors

{V_{i}} = {(\begin{matrix} v_{i} - v_{M} \\ 0 \end{matrix})}

and

{Y_{i}} = {(\begin{matrix} y_{i} - y_{M} \\ 0 \end{matrix})}

span

ker d^{2} L (q^{*})

. □
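
Theorem 2 can be sanity-checked on a synthetic example. This is a minimal sketch, not the Information Bottleneck itself: the blocks B and R are random symmetric matrices with prescribed eigenvalues (our choice) so that B has a two-dimensional kernel and R a one-dimensional kernel, with M = 2, N = 3 and K = 4. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, M = 4, 3, 2

def symmetric_with_spectrum(eigenvalues, rng):
    """Random symmetric matrix with the given eigenvalues, plus its eigenvectors."""
    n = len(eigenvalues)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(eigenvalues) @ Q.T, Q

B, QB = symmetric_with_spectrum([0.0, 0.0, -1.0, -2.0], rng)
v, y = QB[:, 0], QB[:, 1]                  # basis of ker B (Definition 1.2)
R, _ = symmetric_with_spectrum([0.0, -1.0, -2.0, -3.0], rng)

# d^2 F has M = 2 identical blocks B and one distinct block R.
d2F = np.zeros((N * K, N * K))
for i, blk in enumerate([B, B, R]):
    d2F[i * K:(i + 1) * K, i * K:(i + 1) * K] = blk
J = np.hstack([np.eye(K)] * N)
d2L = np.block([[d2F, J.T], [J, np.zeros((K, K))]])

# V_1 and Y_1 from (12): (v_1 - v_M, 0) and (y_1 - y_M, 0).
zero = np.zeros(K)
V1 = np.concatenate([v, -v, zero, zero])
Y1 = np.concatenate([y, -y, zero, zero])

singular_values = np.linalg.svd(d2L, compute_uv=False)
nullity = int(np.sum(singular_values < 1e-9))   # expected: 2M - 2
```

In this setting the computed nullity should equal 2M − 2 = 2, and both V1 and Y1 are annihilated by d²L, in line with the basis given in the theorem.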

Corollary 1.

If

q^{*}

is 1-singular and has isotropy group equal to the identity, then

d^{2} L (q^{*})

is nonsingular.

Proof.

If q is 1-singular, then

d^{2} F (q^{*})

has a single block B with a two-dimensional kernel. The other

N - 1

blocks

{R_{i}}

are distinct with one-dimensional kernels. By constructing the vectors as in (11), we see that

dim ker d^{2} F (q^{*}) = N + 1

with basis vectors

v_{1}, y_{1},

{z_{i}}_{i = 2}^{N}

. Now, following the proof of Theorem 2, we take an arbitrary

k \in ker d^{2} L (q^{*}, λ, β),

and then decompose

k

as in (13) and (16). The proof to Theorem 2 holds for the present case up until, and including (18). Linear independence now shows that

d_{i} = e_{i} = c_{i} = 0

, which implies that

k = 0

. □

Remark 2.

The independent bases given for

K_{p}

and

K^{*}

in Theorem 2 imply that each is invariant to the action of

S_{N}

, and so the decomposition

ker d^{2} L (q^{*}) = K_{p} \times K^{*}

shows that

S_{N}

does not act absolutely irreducibly on

ker d^{2} F (q^{*})

. That is, by definition,

d_{x} r (0, β) \neq c (β) I_{2 M - 2} .

The explicit bases show that

K_{p}, K^{*} ≅ {x \in R^{M} : \sum {[x]}_{i} = 0}

, which implies that

S_{M}

acts absolutely irreducibly on

K_{p}

and

K^{*}

[26]. Thus,

K_{p}

and

K^{*}

are each

S_{M}

-irreducible.

2.5. Liapunov–Schmidt Reduction

To show the existence of bifurcating branches from a bifurcation point

(q^{*}, λ^{*}, β^{*})

of equilibria of (8), the Equivariant Branching Lemma requires that the bifurcation is translated to

(0, 0, 0)

and that the Jacobian vanishes at bifurcation. To accomplish the former, consider

\begin{matrix} F (q, λ, β) : = \nabla L (q + q^{*}, λ + λ^{*}, β + β^{*}) . \end{matrix}

To assure that the Jacobian vanishes, we restrict and project

F

onto

ker d^{2} L (q^{*})

in a neighborhood of

(0, 0, 0)

. This is the Liapunov–Schmidt reduction of

F

[23],

\begin{matrix} r & : & R^{2 M - 2} \times R \to R^{2 M - 2} \\ r (x, β) & = & W^{T} (I - E) F (W x + U (W x, β), β) \end{matrix}

(19)

where

W x + U (W x, β) = (\begin{matrix} q \\ λ \end{matrix})

. The

(N K + K) \times (N K + K)

matrix

I - E

is the projection matrix onto

ker F (0, 0) = ker d^{2} L (q^{*})

with

ker (I - E) = range d^{2} L (q^{*})

. W is the

(N K + K) \times (2 M - 2)

matrix whose columns are the basis vectors

{V_{i}} \cup {Y_{i}}

of

ker d^{2} L (q^{*})

from (12) so that

W x

is a vector in

ker d^{2} L (q^{*})

. The vector function

U (W x, β)

is the component of

(q, λ)

that is in range

d^{2} L (q^{*})

such that

E F (W x + U (W x, β), β) = 0

,

U (0, 0) = 0

, and

\begin{matrix} d_{x} U (0, 0) = 0 . \end{matrix}

(20)

The system defined by the Liapunov–Schmidt reduction,

\dot{x} = r (x, β)

, has a bifurcation of equilibria at

(x = 0, β = 0)

, which are in one-to-one correspondence with equilibria of (8). However, the stability of these associated equilibria is not necessarily the same.

It is straightforward to verify the following derivatives ([23] p. 32), which we will require in the sequel. The

(2 M - 2) \times (2 M - 2)

Jacobian of (19) is

\begin{matrix} d_{x} r (x, β) = W^{T} (I - E) d_{q, λ}^{2} L (q + q^{*}, λ + λ^{*}, β + β^{*}) (W + d_{x} U (W x, β)), \end{matrix}

(21)

which shows that

\begin{matrix} d_{x} r (0, 0) = 0 \end{matrix}

(22)

since

ker (I - E) = range d^{2} L (q^{*})

.
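
The vanishing Jacobian (22) can be illustrated with a small stand-in matrix. This is a sketch under an assumption we introduce: because d²L is symmetric, its range is the orthogonal complement of its kernel, so the orthogonal projector onto the kernel is one valid choice for I − E. The matrix `A` below is a random symmetric placeholder for d²L(q*), and the function name is hypothetical.

```python
import numpy as np

def kernel_projector(A, tol=1e-9):
    """Orthogonal projector onto ker A for symmetric A, plus a kernel basis W."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    W = eigenvectors[:, np.abs(eigenvalues) < tol]
    return W @ W.T, W

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([0.0, 0.0, 1.0, -2.0, 3.0]) @ Q.T   # symmetric, 2-dim kernel

P, W = kernel_projector(A)            # P plays the role of I - E
# Equation (21) at x = 0, using d_x U(0, 0) = 0 from (20):
reduced_jacobian = W.T @ P @ A @ W
```

Here `P @ A` is zero because the range of `A` lies in the kernel of the projector, and `reduced_jacobian` is the zero matrix, matching (22).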

Our crossing condition at a bifurcation depends on the matrix of derivatives

\begin{matrix} \frac{\partial^{2} r_{i}}{\partial β \partial x_{j}} (0, 0) = d_{β} d^{2} L [w_{i}, w_{j}] - d^{3} L [w_{i}, w_{j}, L^{-} d_{β} \nabla L] \end{matrix}

(23)

where the derivatives of

L

are evaluated at

(q^{*}, λ^{*}, β^{*})

, and

L^{-}

is the Moore–Penrose-generalized inverse [27] of

d^{2} L (q^{*})

. The vectors

{w_{i}}_{i = 1}^{2 M - 2}

are the basis vectors of

ker d^{2} L (q^{*})

from Theorem 2.

The

(2 M - 2) \times (2 M - 2) \times (2 M - 2)

three-dimensional array of second derivatives is

\begin{matrix} \frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) & = & d^{3} L (q^{*}, λ^{*}, β^{*}) [w_{i}, w_{j}, w_{k}] . \end{matrix}

In [21], we showed that

\frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) = 0

whenever

i = j = k \leq M - 1

. In the present case, there are more zero entries since now the basis vectors

{w_{i}}

are of two types:

w_{i} = V_{i}

for

1 \leq i \leq M - 1

(basis vectors of

K^{*}

); or

w_{i} = Y_{i - M + 1}

for

M \leq i \leq 2 M - 2

(basis vectors of

K_{p}

, see (12)). We now consider the case when

i, j \leq M - 1

and

k > M - 1

. All other cases are dealt with using a similar argument. Substituting in for

w_{i}

we have

\begin{matrix} \frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) & = & \sum_{ν, δ, η = 1}^{N} \sum_{l, m, n = 1}^{K} \frac{\partial^{3} F (q^{*}, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{δ} \partial q_{n}^{η}} {[v_{i} - v_{M}]}_{l}^{ν} {[v_{j} - v_{M}]}_{m}^{δ} {[y_{k - M + 1} - y_{M}]}_{n}^{η} \\ = & \sum_{l, m, n = 1}^{K} \frac{\partial^{3} f (q^{ν} *, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{ν} \partial q_{n}^{ν}} (δ_{i j (k - M + 1)} {[v]}_{l} {[v]}_{m} {[y]}_{n} - {[v]}_{l} {[v]}_{m} {[y]}_{n}) . \end{matrix}

(24)

The vectors

v

and

y

are defined in Definition 1.2. An immediate consequence of this calculation is that

\frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) = 0

whenever

i = j = k - M + 1

. Thus, similar arguments show that

\frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) = 0

whenever:

$i = j = k$ ;
$i - M + 1 = j = k$ , $i = j - M + 1 = k$ , $i = j = k - M + 1$ ;
$i - M + 1 = j - M + 1 = k$ , $i - M + 1 = j = k - M + 1$ , $i = j - M + 1 = k - M + 1$ .

Further, we get four different “cubes” of identical entries in the 3-D array. They are:

For $i, j, k \leq M - 1$ , not all equal, the value of the cube is

$- \sum_{l, m, n = 1}^{K} \frac{\partial^{3} f (q^{ν} *, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{ν} \partial q_{n}^{ν}} {[v]}_{l} {[v]}_{m} {[v]}_{n};$
For $i, j \leq M - 1$ , not both equal, and $k > M - 1$ , the value of the cube is

$- \sum_{l, m, n = 1}^{K} \frac{\partial^{3} f (q^{ν} *, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{ν} \partial q_{n}^{ν}} {[v]}_{l} {[v]}_{m} {[y]}_{n};$
For $i \leq M - 1$ and $j, k > M - 1$ , not both equal, the value of the cube is

$- \sum_{l, m, n = 1}^{K} \frac{\partial^{3} f (q^{ν} *, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{ν} \partial q_{n}^{ν}} {[v]}_{l} {[y]}_{m} {[y]}_{n};$
For $i, j, k > M - 1$ , not all equal, the value of the cube is

$- \sum_{l, m, n = 1}^{K} \frac{\partial^{3} f (q^{ν} *, β^{*})}{\partial q_{l}^{ν} \partial q_{m}^{ν} \partial q_{n}^{ν}} {[y]}_{l} {[y]}_{m} {[y]}_{n} .$

The points above will prove useful when proving that

d^{2} r (0, 0) = 0

.
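The index bookkeeping in the two lists above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not code from the original work: the helper names `reduced_index`, `vanishes_by_eq24`, `cube_label`, and `summarize` are hypothetical, indices are 1-based as in the text, and the sketch assumes that the value of an entry depends only on how many of its three indices point into the second block of basis vectors (which is how the four cubes are described).

```python
from itertools import product


def reduced_index(i, M):
    """Map a 1-based kernel index i in 1..2M-2 to its position within its block.

    Indices 1..M-1 label the first type of basis vector (w_i = V_i);
    indices M..2M-2 label the second type (w_i = Y_{i-M+1}).
    Both blocks are mapped to positions 1..M-1.
    """
    return i if i <= M - 1 else i - M + 1


def vanishes_by_eq24(i, j, k, M):
    """True when the listed index conditions force entry (i, j, k) to be zero."""
    return reduced_index(i, M) == reduced_index(j, M) == reduced_index(k, M)


def cube_label(i, j, k, M):
    """Number of indices (0, 1, 2 or 3) that point into the second block."""
    return sum(1 for idx in (i, j, k) if idx > M - 1)


def summarize(M):
    """Count the forced zeros and the number of entries in each cube."""
    n = 2 * M - 2
    zeros = 0
    cubes = {0: 0, 1: 0, 2: 0, 3: 0}
    for i, j, k in product(range(1, n + 1), repeat=3):
        if vanishes_by_eq24(i, j, k, M):
            zeros += 1
        else:
            cubes[cube_label(i, j, k, M)] += 1
    return zeros, cubes
```

For example, `summarize(3)` enumerates the 4 × 4 × 4 array for M = 3 and separates the entries covered by the zero conditions from those that fall into one of the four cubes.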

The four-dimensional array of third derivatives of r is

\begin{matrix} \frac{\partial^{3} r_{i}}{\partial x_{j} \partial x_{k} \partial x_{l}} (0, 0) & = & d^{4} L [w_{i}, w_{j}, w_{k}, w_{l}] - d^{3} L [w_{i}, w_{j}, L^{-} d^{3} L [w_{k}, w_{l}]] \\ & & - d^{3} L [w_{i}, w_{k}, L^{-} d^{3} L [w_{j}, w_{l}]] \\ & & - d^{3} L [w_{i}, w_{l}, L^{-} d^{3} L [w_{j}, w_{k}]] \end{matrix}

(25)

where the derivatives of

L

are evaluated at

(q^{*}, λ^{*}, β^{*})

, and

L^{-}

is the Moore–Penrose generalized inverse [27] of

d^{2} L (q^{*})

.

Since

ker d^{2} L (q^{*})

is not absolutely irreducible, but

K^{*}

is, one might try to define a Liapunov–Schmidt reduction by restricting and projecting

\nabla L

onto

K^{*}

. One issue with projecting the reduction onto

K^{*}

is how to define the projection matrix E so that

E F = 0 and (I - E) F = 0 if and only if F = 0

holds and

E d_{x} r (0, 0)

is non-singular in

range (E)

so that the Implicit Function Theorem assures the restriction

(q, λ) = W x + U (W x, β)

, where

U (W x) \in range (d^{2} L (q^{*}))

, and

W x \in K^{*}

instead of

W x \in ker d^{2} L (q^{*})

as in (19) [23]. Simply ignoring the space

K_{p}

by considering

U \in range (d^{2} L (q^{*}))

and

W x \in K^{*}

amounts to setting

W x = k^{*} + k_{p}

and

k_{p} = 0

. Since

W x + U

is still embedded in the larger

ℜ^{N K + K}

, which contains

K_{p}

, then derivatives are affected by the implicit

k_{p} = 0

constraint. This constraint

P_{K_{p}} (q, λ) = k^{*} + U

is nonlinear (and may not even be tractable) since

K_{p}

depends on q, where

P_{K_{p}}

is a projection matrix that depends on q (see Theorem 7).

2.6. Isotropy Subgroups $S_{m} \times S_{n}$ of $S_{N}$

The decomposition

ker d^{2} L (q^{*}) = K_{p} \times K^{*}

shows that

Fix (S_{m} \times S_{n}) \cap ker d^{2} L (q^{*})

is two-dimensional with basis vectors

{{(n y^{T}, \dots, n y^{T}, - m y^{T}, \dots, - m y^{T})}^{T}, {(n v^{T}, \dots, n v^{T}, - m v^{T}, \dots, - m v^{T})}^{T}} .

Restricted to

K^{*}

, these isotropy subgroups

S_{m} \times S_{n}

of

S_{M}

have one-dimensional fixed point spaces. This assures that we can use Theorem 1. We have the following Lemma.

Lemma 1.

Let

M = m + n

such that

M > 1

and

m, n > 0

. Let

U_{m}

be a set of m classes, and let

U_{n}

be a set of n classes such that

U_{m} \cap U_{n} = \emptyset

and

U_{m} \cup U_{n} = {1, \dots, M}

. Now define

{\hat{u}}_{(m, n)} \in ℜ^{N K}

such that

\begin{matrix} {\hat{u}}_{(m, n)}^{i} = \{\begin{matrix} n v & i f i \in U_{m} \\ - m v & i f i \in U_{n} \\ 0 & o t h e r w i s e \end{matrix} \end{matrix}

where

v

is defined as in Definition 1.2, and let

\begin{matrix} u_{(m, n)} = (\begin{matrix} {\hat{u}}_{(m, n)} \\ 0 \end{matrix}) \end{matrix}

(26)

where

0 \in R^{K}

. Then the isotropy subgroup of

u_{(m, n)}

is

Σ_{(m, n)} \subset Γ_{U}

such that

Σ_{(m, n)} ≅ S_{m} \times S_{n}

, where

S_{m}

permutes

u^{i}

when

i \in U_{m}

, and

S_{n}

permutes

u^{i}

when

i \in U_{n}

. The fixed point space of

Σ_{(m, n)}

restricted to

K^{*} \subset ker d^{2} L (q^{*})

is one dimensional.
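A small sketch may help make the construction in Lemma 1 concrete. It builds the blocks of the vector û for given class sets and lists, by brute force, the relabelings of the first M classes that leave it unchanged. The vector `v` below is a hypothetical placeholder (the actual v comes from the definition referenced in the lemma), classes are 1-based as in the text, and the function names are ours.

```python
from itertools import permutations


def build_u_hat(v, U_m, U_n, N):
    """Return the N blocks of u-hat: n*v on U_m, -m*v on U_n, zero elsewhere."""
    m, n = len(U_m), len(U_n)
    blocks = []
    for i in range(1, N + 1):  # classes are labeled 1..N as in the text
        if i in U_m:
            blocks.append([n * x for x in v])
        elif i in U_n:
            blocks.append([-m * x for x in v])
        else:
            blocks.append([0.0 for _ in v])
    return blocks


def permute_blocks(blocks, sigma):
    """Relabel classes: block i of the input moves to position sigma[i]."""
    out = [None] * len(blocks)
    for i, block in enumerate(blocks):
        out[sigma[i]] = block
    return out


def isotropy_subgroup(blocks, M):
    """All permutations of the first M classes that leave the blocks unchanged."""
    N = len(blocks)
    fixed = []
    for p in permutations(range(M)):
        sigma = list(p) + list(range(M, N))  # classes beyond M are not moved
        if permute_blocks(blocks, sigma) == blocks:
            fixed.append(p)
    return fixed
```

With `v = [1.0, -1.0]`, `U_m = {1, 2}`, `U_n = {3}` and `N = 4`, the subgroup returned by `isotropy_subgroup(build_u_hat(v, U_m, U_n, N), 3)` has 2! · 1! elements, as expected for a group isomorphic to S_2 × S_1. The blocks also sum to zero, since n·m·v − m·n·v = 0.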

2.7. Bifurcating Branches

Theorem 3.

Let

(q^{*}, λ^{*}, β^{*})

be an equilibrium of (8) such that

q^{*}

is M-singular for

1 < M \leq N

, and the crossing condition

d_{β} d^{2} L [u, u] - d^{3} L [u, u, L^{-} d_{β} \nabla L] \neq 0

is satisfied. Then there exist bifurcating solutions,

(\begin{matrix} q^{*} \\ λ^{*} \\ β^{*} \end{matrix}) + (\begin{matrix} t u_{(m, n)} \\ β (t) \end{matrix})

, where

u_{(m, n)} \in K^{*}

is defined in (26), for every pair

(m, n)

such that

M = m + n

, each with an isotropy group isomorphic to

S_{m} \times S_{n}

.

Proof.

We mimic the proof of the Equivariant Branching Lemma. Let

u : = u_{(m, n)} \in Fix (S_{m} \times S_{n}) \cap K^{*}

and let V be a matrix with columns composed of the

M - 1

vectors

{V_{i}}

. Thus, there exists

x_{0} \in ℜ^{M - 1}

so that

u = V x_{0}

. Since

r (Fix (S_{m} \times S_{n}) \cap K^{*}) \subseteq Fix (S_{m} \times S_{n}) \cap K^{*}

(for every

σ \in S_{m} \times S_{n}

,

r (V x) = r (σ V x)

(

u \in Fix (S_{m} \times S_{n})

) that equals

σ r (V x)

(by equivariance)), then

r (t x_{0}, β) = h (t, β) x_{0}

, where r is the Liapunov–Schmidt reduction (19), and h is a polynomial in t.

Since

K^{*}

is

S_{M}

-irreducible, then

Fix (S_{M}) \cap K^{*} = {0}

(otherwise,

σ x = x

for some

x \in K^{*}

for every

σ \in S_{M}

, which implies that

span (x)

is an invariant subspace of

K^{*}

). Now [22] p. 75 shows that

r (0, β) = 0

, and so

h (0, β) = 0

, from which it follows that

h (t, β) = t k (t, β)

. Thus,

\begin{matrix} r (t x_{0}, β) = t k (t, β) x_{0} . \end{matrix}

(27)

Differentiating with respect to t yields

\begin{matrix} d_{x} r (t x_{0}, β) x_{0} = (k (t, β) + t d_{t} k (t, β)) x_{0}, \end{matrix}

(28)

from which it follows that

k (t, β) x_{0} = d_{x} r (t x_{0}, β) x_{0} - t d_{t} k (t, β) x_{0},

and so

k (0, 0) = 0

. Furthermore, we see that

d_{β} k (0, 0) x_{0} = d_{x, β}^{2} r (0, 0) x_{0} \neq 0

by assumption (see (23)). This shows that

d_{β} k (0, 0)

is a non-zero eigenvalue of

d_{x, β}^{2} r (0, 0)

with associated eigenvector

x_{0}

. By the Implicit Function Theorem,

k (t, β) = 0

has a non-zero unique solution for

β = β (t)

. □

2.8. The Crossing Condition for Annealing Problems

We next determine how to check the crossing condition in Theorem 3 when F is an annealing problem, as in (2)

F (q, β) = G (q) + β D (q) .

First, we show that the crossing condition can be checked in terms of the Hessian of the function D. Furthermore, when G is strictly concave on

span ({v_{i}})

, then the crossing condition is always satisfied, and every singularity is a bifurcation.

Theorem 4.

The crossing condition

d_{β} d^{2} L [u, u] - d^{3} L [u, u, L^{-} d_{β} \nabla L] \neq 0

given in Theorem 3 is satisfied for M-singular q for

M > 1

if

d^{2} D (q)

is either positive or negative definite on

span ({v_{i}})

.

Proof.

Let

x_{0} \in ℜ^{2 M - 2}

so that

u = W x_{0} \in Fix (S_{m} \times S_{n}) \cap K^{*} .

Multiplying Equation (21) on the left by

x_{0}^{T}

and on the right by

x_{0}

yields

\begin{matrix} x_{0}^{T} d_{x} r (0, β) x_{0} = u^{T} d_{q, λ}^{2} L (q^{*}, λ^{*}, β + β^{*}) (I_{N K + K} + d_{w} U (0, β)) u . \end{matrix}

(29)

By Theorem 2, an arbitrary

u \in K^{*}

can be written as

u = (\begin{matrix} \hat{u} \\ 0 \end{matrix})

, where

\hat{u} \in span ({v_{i}}) \subset ker d^{2} F (q^{*}, β^{*})

. Substituting this into (29) and observing that

d^{2} F (q^{*}, β + β^{*}) = d^{2} G (q^{*}) + (β + β^{*}) d^{2} D (q^{*}) = d^{2} F (q^{*}, β^{*}) + β d^{2} D (q^{*})

yields

\begin{matrix} x_{0}^{T} d_{x} r (0, β) x_{0} = β (\begin{matrix} {\hat{u}}^{T} d^{2} D (q^{*}) & 0^{T} \end{matrix}) (I_{N K + K} + \partial_{w} U (0, β)) (\begin{matrix} \hat{u} \\ 0 \end{matrix}) . \end{matrix}

Differentiating with respect to

β

, evaluating at

β = 0

, and using (20) yields

\begin{matrix} x_{0}^{T} d_{x, β}^{2} r (0, 0) x_{0} = {\hat{u}}^{T} d^{2} D (q^{*}) \hat{u}, \end{matrix}

(30)

which must be non-zero since we assume that

d^{2} D (q)

is either positive or negative definite on

span ({v_{i}})

. □

From (30), we can get an expression for

ξ

, the eigenvalue of

d_{x, β}^{2} r (0, 0)

with eigenvector

x_{0}

. Substituting

d_{x, β}^{2} r (0, 0) x_{0} = ξ x_{0}

and observing that

x_{0}^{T} x_{0} = x_{0}^{T} W^{T} W x_{0} = {\hat{u}}^{T} \hat{u}

yields

\begin{matrix} ξ = \frac{{\hat{u}}^{T} d^{2} D (q^{*}) \hat{u}}{| | \hat{u} {| |}^{2}} . \end{matrix}

(31)
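Expression (31) is a Rayleigh quotient of the Hessian of D, so it is straightforward to evaluate once that Hessian and a kernel direction are available. The following is a minimal sketch under the assumption that both are supplied as plain Python lists; the names `quadratic_form` and `crossing_eigenvalue` and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def quadratic_form(A, x):
    """Return x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def crossing_eigenvalue(hess_D, u_hat):
    """Evaluate xi = (u^T d2D u) / ||u||^2, following (31)."""
    norm_sq = sum(c * c for c in u_hat)
    if norm_sq == 0.0:
        raise ValueError("u_hat must be a non-zero direction")
    return quadratic_form(hess_D, u_hat) / norm_sq
```

For instance, `crossing_eigenvalue([[2.0, 0.0], [0.0, 3.0]], [1.0, 1.0])` evaluates to 2.5. A non-zero value along the direction û means the crossing condition of Theorem 4 holds for that direction.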

The requirement that

d^{2} D (q)

is either positive or negative definite on

span ({v_{i}})

holds when

d^{2} G (q^{*})

is either negative or positive definite, respectively, on

span ({v_{i}})

.

Lemma 2.

Let

d^{2} F (q^{*}, β^{*} \neq 0)

be singular where

q^{*}

is M-singular such that

d^{2} G (q^{*})

is negative (or positive) definite on

span ({v_{i}})

. Then

d^{2} D (q^{*})

is positive (or negative) definite on

span ({v_{i}})

.

Proof.

If

u \in span ({V_{i}}) \subset ker d^{2} F (q^{*})

, then

u^{T} d^{2} G (q^{*}) u + β^{*} u^{T} d^{2} D (q^{*}) u = 0 .

Since

u^{T} d^{2} G (q^{*}) u < 0

, then

u^{T} d^{2} D (q^{*}) u > 0

. □

These results are important for the Information Bottleneck problem (2), where

d^{2} G (q) = - d^{2} I (Y; Z)

is only non-positive definite on

ker d^{2} F (q^{*})

, but is negative definite on

span ({v_{i}})

. Thus, every singularity of the Information Bottleneck with

ker d^{2} L (q^{*}) = K^{*} \times K_{p}

is a bifurcation point. The space

K_{p}

does not contain bifurcating branches since the crossing condition is never satisfied there: for

u \in K_{p}

,

{\hat{u}}^{T} d^{2} G (q) \hat{u} + β {\hat{u}}^{T} d^{2} D (q) \hat{u} = 0 + 0

(by Lemma 42 in [25]), and so (Theorem 109, [25])

ξ = \frac{{\hat{u}}^{T} d^{2} D (q) \hat{u}}{∥ \hat{u} ∥^{2}} = 0

.

2.9. Bifurcation Type

Suppose that a bifurcation occurs at

(q^{*}, λ^{*}, β^{*})

, where

q^{*}

is M-singular. This section examines the type of bifurcation from which emanate the branches

\begin{matrix} ((\begin{matrix} q^{*} \\ λ^{*} \end{matrix}) + t u, β^{*} + β (t)), \end{matrix}

whose existence is guaranteed by Theorem 3.

As we showed in [21], the derivative

β^{'} (0) \neq 0

indicates a transcritical bifurcation. If

β^{'} (0) = 0

, then the bifurcation is degenerate, and if

β^{″} (0) \neq 0

, then we have a pitchfork-like bifurcation. Further,

t β^{'} (t) < 0

for small t indicates a subcritical bifurcating branch, and

t β^{'} (t) > 0

for small t indicates a supercritical bifurcating branch.
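The classification just described can be written as a small decision rule. This is a sketch, not the authors' code: the function name `classify_branch` and the tolerance `tol` are assumptions, and the sign rule for the pitchfork-like case uses the first-order approximation β′(t) ≈ β″(0) t, so that t β′(t) ≈ β″(0) t² for small t.

```python
def classify_branch(beta_prime0, beta_double_prime0, tol=1e-10):
    """Classify a bifurcating branch from beta'(0) and beta''(0).

    tol is a hypothetical numerical threshold for treating a value as zero.
    """
    if abs(beta_prime0) > tol:
        return "transcritical"
    if abs(beta_double_prime0) <= tol:
        return "degenerate (higher-order terms needed)"
    # With beta'(0) = 0, t * beta'(t) is approximately beta''(0) * t**2,
    # so the sign of beta''(0) decides between the two pitchfork-like cases.
    if beta_double_prime0 < 0:
        return "subcritical pitchfork-like"
    return "supercritical pitchfork-like"
```

In practice the two inputs would come from expressions such as (33) and (34); the rule itself only encodes the sign conventions stated in the paragraph above.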

Expressions for

β^{'} (0)

and

β^{″} (0)

are derived as follows. Differentiating

k (t, β) = 0

from (27) yields

\begin{matrix} d_{t} k (t, β (t)) + d_{β} k (t, β (t)) β^{'} (t) = 0, \end{matrix}

(32)

so that

β^{'} (t) = - \frac{d_{t} k (t, β (t))}{d_{β} k (t, β (t))} .

Differentiating (28) with respect to t and then evaluating at

t = 0

shows that

\begin{matrix} β^{'} (0) = \frac{- d_{x}^{2} r (0, 0) [x_{0}, x_{0}, x_{0}]}{2 | | x_{0} {| |}^{2} ξ} \end{matrix}

(33)

where

d_{x}^{2} r (0, 0) [x_{0}, x_{0}, x_{0}] = \sum_{i, j, k} \frac{\partial^{2} r_{i}}{\partial {[x]}_{j} \partial {[x]}_{k}} (0, 0) {[x_{0}]}_{i} {[x_{0}]}_{j} {[x_{0}]}_{k}

(see (24)). As shown in the proof of Theorem 3,

ξ = d_{β} k (0, 0)

is the non-zero eigenvalue of

d_{x, β}^{2} r (0, 0)

with eigenvector

x_{0}

.

This expression is similar to the one given in [22] p. 90. The numerator can be calculated via (24). In [21], we showed that

β^{'} (0) = 0

. We have the same result in the present case.

Theorem 5.

If

q^{*}

is M-singular for

1 < M \leq N

, then all of the bifurcating branches guaranteed by Theorem 3 are degenerate, i.e.,

β^{'} (0) = 0

.

Proof.

To show that the numerator of (33) vanishes, i.e., that

d_{x}^{2} r (0, 0) = 0

, expand

r_{i}

, the ith component of r, about

x = 0

,

\begin{matrix} r_{i} (x, β) & = & r_{i} (0, β) + d_{x} r_{i} {(0, β)}^{T} x + x^{T} d_{x}^{2} r_{i} (0, β) x + O (x^{3}) \\ & = & d_{x} r_{i} {(0, β)}^{T} x + x^{T} d_{x}^{2} r_{i} (0, β) x + O (x^{3}), \end{matrix}

and so

\begin{matrix} r_{i} (x, 0) & = & x^{T} d_{x}^{2} r_{i} (0, 0) x + O (x^{3}) . \end{matrix}

Applying the equivariance relation

A r (x, 0) = r (A x, 0)

, where A is any element of the group isomorphic to

S_{M}

that acts on r in

R^{M - 1}

, and equating the quadratic terms yields

A (\begin{matrix} x^{T} d_{x}^{2} r_{1} x \\ x^{T} d_{x}^{2} r_{2} x \\ ⋮ \\ x^{T} d_{x}^{2} r_{M - 1} x \end{matrix}) = (\begin{matrix} x^{T} A^{T} d_{x}^{2} r_{1} A x \\ x^{T} A^{T} d_{x}^{2} r_{2} A x \\ ⋮ \\ x^{T} A^{T} d_{x}^{2} r_{M - 1} A x \end{matrix}) .

By (24), the diagonal

\frac{\partial^{2} r_{i}}{\partial x_{i} \partial x_{i}} (0, 0) = 0

for each i as well as for all of the “multi-diagonals”. This shows that

\frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}} (0, 0) = 0

for every

i, j, k

(see Theorem 124 in [25]). □
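As a sanity check on the equivariance argument, it may help to look at the smallest case, M = 2, restricted to the one-dimensional space K^{*}. This special case is our illustration and is not part of the proof above. It assumes that the non-trivial element of S_{2} acts on K^{*} by x \mapsto -x, which follows from swapping the two classes in V_{1} = v_{1} - v_{2}; the coefficient a_{2} is a placeholder for the quadratic Taylor coefficient of r at the origin.

```latex
% Illustrative special case M = 2: S_2 acts on K^* \cong \Re by A = -1.
\begin{aligned}
A\, r(x, 0) = r(A x, 0), \quad A = -1
  &\;\Longrightarrow\; -r(x, 0) = r(-x, 0), \\
r(x, 0) = a_{2} x^{2} + O(x^{3})
  &\;\Longrightarrow\; -a_{2} x^{2} = a_{2} x^{2}
  \;\Longrightarrow\; a_{2} = 0 .
\end{aligned}
```

The reduced map is therefore odd in x, its quadratic term vanishes, and the numerator of (33) is zero in this case.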

When

β^{'} (0) = 0

, we need to compute

β^{″} (0)

to determine whether a branch is subcritical or supercritical. Differentiating (32) and setting

t = 0

shows that

β^{″} (0) = - \frac{d_{t}^{2} k (0, 0)}{d_{β} k (0, 0)} .

Differentiating (28) twice and solving for

d_{t}^{2} k (0, 0)

shows that

\begin{matrix} β^{″} (0) = \frac{- d_{x}^{3} r (0, 0) [x_{0}, x_{0}, x_{0}, x_{0}]}{3 | | x_{0} {| |}^{2} ξ} \end{matrix}

(34)

where

W x_{0} = u = u_{(m, n)}

. The numerator can be calculated using Equation (25), and

ξ = d_{β} k (0, 0)

is the non-zero eigenvalue of

d_{x, β}^{2} r (0, 0)

with eigenvector

x_{0}

, for which we give an explicit expression in (31) when F is an annealing problem.

If

β^{″} (0) \neq 0

, which we expect to be true generically, then Theorem 5 shows that the bifurcation guaranteed by Theorem 3 is pitchfork-like.
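A short worked example shows how (32) produces a pitchfork-like branch. The truncated form of k below is hypothetical and chosen only for illustration, with ξ = d_{β} k (0, 0) \neq 0 and c an arbitrary constant standing in for the leading t-dependence; it is not derived from the Information Bottleneck cost function.

```latex
% Hypothetical truncated form of k, for illustration only.
\begin{aligned}
k(t, \beta) &= \xi \beta + c\, t^{2}, \qquad \xi \neq 0, \\
k(t, \beta(t)) = 0 &\;\Longrightarrow\; \beta(t) = -\frac{c}{\xi}\, t^{2}, \\
\beta'(t) &= -\frac{d_{t} k}{d_{\beta} k} = -\frac{2 c\, t}{\xi},
  \qquad \beta'(0) = 0, \qquad \beta''(0) = -\frac{2 c}{\xi}, \\
t\, \beta'(t) &= -\frac{2 c}{\xi}\, t^{2} .
\end{aligned}
```

Under this assumed form the branch is subcritical when c / ξ > 0 and supercritical when c / ξ < 0, in line with the sign criterion on t β′(t) given in Section 2.9.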

2.10. Stability and Optimality

The next Theorem relates the stability of equilibria

(q^{*}, λ^{*}, β)

in the flow (8) with optimality of

q^{*}

in Problem (1). In particular, if a bifurcating branch corresponds to an eigenvalue of

d^{2} L (q^{*})

changing from negative to positive, then the branch consists of stationary points

(q^{*}, β^{*})

that are not solutions of (1). Positive eigenvalues of

d^{2} L (q^{*})

do not necessarily show that

q^{*}

is not a solution of (1) (see Remark 1). For example, see page 668 of [21]. A proof of this theorem is given in [21].

Theorem 6.

For each bifurcating branch guaranteed by Theorem 3,

u

is an eigenvector of

d^{2} L ((\begin{matrix} q^{*} \\ λ^{*} \end{matrix}) + t u, β^{*} + β (t))

for sufficiently small t. Furthermore, if the corresponding eigenvalue is positive, then the branch consists of unstable stationary points that are not solutions to (1).

2.11. Structure of the Symmetry Projection

The matrix

P_{R} (q^{*})

that projects

(q, λ) \in ℜ^{N K + K}

onto

range (d^{2} L (q^{*})) \times K^{*}

by annihilating

K_{p}

is important for numerical computation of equilibria of IB, since we may want to take each equilibrium found by Newton’s method and remove any component in

K_{p}

.

P_{R}

is written as a function of q since its constitutive vectors

y

(from Definition 1) depend on q. The following theorems clarify the structure of this projection.

Theorem 7.

P_{R} (q) = I - P_{K_{p}} (q)

, where

P_{K_{p}} = (\begin{matrix} A & 0 \\ 0 & 0 \end{matrix}) .

P_{R}

and

P_{K_{p}}

are

(N K + K) \times (N K + K)

. The matrix A is

N K \times N K

with

N^{2}

blocks,

{A_{i j}}_{i, j = 1}^{N}

, of size

K \times K

, defined by

\begin{matrix} A_{i, j} = \{\begin{matrix} (M - 1) y y^{T} & i f 1 \leq i = j \leq M \\ - y y^{T} & i f 1 \leq i \neq j \leq M \\ 0 & o t h e r w i s e \end{matrix} \end{matrix}

For example, if

M = N = 3

, then

\begin{matrix} P_{R} = I - (\begin{matrix} 2 y y^{T} & - y y^{T} & - y y^{T} & 0 \\ - y y^{T} & 2 y y^{T} & - y y^{T} & 0 \\ - y y^{T} & - y y^{T} & 2 y y^{T} & 0 \\ 0 & 0 & 0 & 0 \end{matrix}) = I - (\begin{matrix} (N - 1) & - 1 & - 1 & 0 \\ - 1 & (N - 1) & - 1 & 0 \\ - 1 & - 1 & (N - 1) & 0 \\ 0 & 0 & 0 & 0 \end{matrix}) \otimes y y^{T} . \end{matrix}

Proof.

Theorem 2 gives the basis of

K_{p}

as

{Y_{i}}_{i = 1}^{M - 1}

. Let Y be the

(N K + K) \times (M - 1)

matrix whose columns are the vectors

{Y_{i}}

. For example, if

M = 3

and

N = 4

, then

Y = (\begin{matrix} y & 0 \\ 0 & y \\ - y & - y \\ 0 & 0 \\ 0 & 0 \end{matrix})

. Thus, the matrix that projects onto

K_{p}

is

P_{K_{p}} = Y {(Y^{T} Y)}^{- 1} Y^{T}

, and the projection matrix onto

range (d^{2} L (q^{*}))

is

P_{R} = I - P_{K_{p}}

. Direct multiplication of

Y {(Y^{T} Y)}^{- 1} Y^{T}

, with an appeal to Lemma 34 in [25] to compute the inverse, shows that

P_{K_{p}} = \frac{1}{N y^{T} y} (\begin{matrix} A & 0 \\ 0 & 0 \end{matrix})

. Dropping the constant yields the result. □
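The block structure in Theorem 7 can be checked numerically on a small example. The sketch below builds only the upper-left NK × NK block (the remaining rows and columns are zero) and verifies the defining properties of a projection onto the span of the vectors Y_{i}. Two assumptions should be flagged: the vector `y` in the usage note is a hypothetical placeholder, and the sketch keeps a normalization constant 1/(M yᵀy), which coincides with the 1/(N yᵀy) of the proof when M = N. That choice is ours and is validated by the checks rather than taken from the text.

```python
def kp_projector_block(y, M, N):
    """Upper-left NK x NK block of P_{K_p}, built as A / (M * y^T y)."""
    K = len(y)
    scale = 1.0 / (M * sum(c * c for c in y))  # assumed normalization
    P = [[0.0] * (N * K) for _ in range(N * K)]
    for i in range(M):
        for j in range(M):
            coeff = (M - 1) if i == j else -1  # block pattern of A
            for a in range(K):
                for b in range(K):
                    P[i * K + a][j * K + b] = scale * coeff * y[a] * y[b]
    return P


def kp_basis(y, M, N):
    """q-part of the basis vectors Y_i: y in block i, -y in block M."""
    K = len(y)
    basis = []
    for i in range(M - 1):
        vec = [0.0] * (N * K)
        for a in range(K):
            vec[i * K + a] = y[a]
            vec[(M - 1) * K + a] = -y[a]
        basis.append(vec)
    return basis


def matvec(P, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * c for p, c in zip(row, x)) for row in P]


def close(u, w, tol=1e-12):
    """Componentwise comparison of two vectors up to a tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, w))
```

With `y = [0.5, 0.25, 0.25]`, `M = 3` and `N = 4`, each basis vector returned by `kp_basis` is left unchanged by the matrix, a vector with the same block `y` in all of the first M classes is annihilated, and the trace equals M − 1, the dimension of the space being projected onto.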

For the Information Bottleneck, the matrix

P_{R}

is easy to calculate, since

y = q^{i}

for any

i \in U

. For example, when

q = q_{\frac{1}{N}}

, then

y^{T} y = \frac{K}{N^{2}}

and

y y^{T} = \frac{1}{N^{2}} 1

, and so

P_{K_{p}} = \frac{1}{N K} (\begin{matrix} (N - 1) & - 1 & \dots & - 1 & 0 \\ - 1 & (N - 1) & \dots & - 1 & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ \\ - 1 & - 1 & \dots & (N - 1) & 0 \\ 0 & 0 & 0 & 0 & 0 \end{matrix}) \otimes 1

where 1 is a

K \times K

matrix of 1s. Thus,

P_{R} = I_{N K + K} - \frac{1}{N K} (\begin{matrix} (N - 1) & - 1 & \dots & - 1 & 0 \\ - 1 & (N - 1) & \dots & - 1 & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ \\ - 1 & - 1 & \dots & (N - 1) & 0 \\ 0 & 0 & 0 & 0 & 0 \end{matrix}) \otimes 1 .

Theorem 8.

The symmetry group

S_{M}

commutes with the matrix

P_{R}

, which projects onto

ℜ^{N K + K} ∖ K_{p}

.

Proof.

Let

P : = P_{R}

be the matrix that projects onto

range d^{2} L (q^{*}) \times K^{*} = ℜ^{N K + K} ∖ K_{p}

. Since

ℜ^{N K + K} = range d^{2} L (q^{*}) \times K^{*} \times K_{p}

, then any

x \in ℜ^{N K + K}

can be decomposed in the respective subspaces as

x = r + v + z, \quad r \in range d^{2} L (q^{*}), \quad v \in K^{*}, \quad z \in K_{p}

. Let

σ

be an arbitrary permutation matrix in

S_{M}

. Then

σ P (\begin{matrix} q \\ λ \end{matrix}) = σ P (r + v + z) = σ (r + v)

. Since

range d^{2} L (q^{*})

,

K^{*}

and

K_{p}

are all

S_{M}

invariant; then

σ (r + v) \in range d^{2} L (q^{*}) \times K^{*}

implies that

σ (r + v) = P σ (r + v)

, and

σ z \in K_{p}

implies that

P σ (r + v) = P σ (r + v + z) .

Thus,

σ P x = P σ x

. □
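The commutation property of Theorem 8 can also be confirmed by brute force on a small example. The sketch below works on the q-part of the vector only, assumes the same normalization 1/(M yᵀy) for the K_{p} projector (our choice, equal to 1/(N yᵀy) when M = N), and uses hypothetical names and test data throughout.

```python
from itertools import permutations


def symmetry_projector(y, M, N):
    """q-part of P_R = I - P_{K_p}, with P_{K_p} = A / (M * y^T y) (assumed)."""
    K = len(y)
    scale = 1.0 / (M * sum(c * c for c in y))
    size = N * K
    P = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    for i in range(M):
        for j in range(M):
            coeff = (M - 1) if i == j else -1
            for a in range(K):
                for b in range(K):
                    P[i * K + a][j * K + b] -= scale * coeff * y[a] * y[b]
    return P


def matvec(P, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * c for p, c in zip(row, x)) for row in P]


def permute_classes(x, sigma, K):
    """Move block i of x to block sigma[i]; blocks beyond len(sigma) stay put."""
    out = list(x)
    for i, target in enumerate(sigma):
        out[target * K:(target + 1) * K] = x[i * K:(i + 1) * K]
    return out


def commutes_with_S_M(y, M, N, x, tol=1e-12):
    """Check sigma P x == P sigma x for every permutation of the first M classes."""
    K = len(y)
    P = symmetry_projector(y, M, N)
    for sigma in permutations(range(M)):
        left = permute_classes(matvec(P, x), sigma, K)
        right = matvec(P, permute_classes(x, sigma, K))
        if any(abs(a - b) > tol for a, b in zip(left, right)):
            return False
    return True
```

For `y = [0.5, 0.25, 0.25]`, `M = 3`, `N = 4` and an arbitrary test vector `x` of length 12, `commutes_with_S_M(y, M, N, x)` checks all six relabelings of the first three classes.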

2.12. Visualizations of Sample Results

We illustrate these structures numerically. In [7], we introduced the toy “Four-blob” probability distribution

p (x, y)

shown in Figure 1.

For the Information Distortion problem (3) [7,12,13] and the synthetic dataset composed of a mixture of four Gaussians (Figure 1), we determined the bifurcation structure of solutions to (3) by annealing in

β

and finding the corresponding stationary points to (1). A typical run of the derived gradient dynamical system tends to follow the main bifurcation branch

S_{K} \to S_{K - 1}

from the fully symmetric uniform quantizer

q_{\frac{1}{N}}

(

N = 4

here) to the fully resolved deterministic quantizer (hard clustering) seen at the end in Figure 2. The permutation symmetry is also obvious there—the value of the cost function does not change if the classes along the vertical axis in T are permuted/relabeled. The uniform quantizer

q_{\frac{1}{N}}

(Item 1 in the figure) plays a special role in the formulation (3), as it is the unique solution to the problem for

β = 0

as the maximum entropy solution of

{max}_{q} H (T | Y)

. Its loss of stability at the first bifurcation for increasing

β

can hence be determined analytically and the first bifurcation structure characterized completely. Because of the “perpetual kernel” of the cost function in (4), the uniform quantizer is just one of a continuous set of “uninformative” quantizers for the IB problem (4): all

{q (t | y) : q (t | y) = f (t)}

, which assign each y to class t with the same probability, although the assignment weight may differ between classes. Such a structure does not change the value of the cost function in the IB problem (4) (but does change it for (3), which hence does not have this degeneracy). We address the degeneracy of the IB optimization by projecting onto the subspace that has the correct symmetry (i.e., just the uniform quantizer

q_{\frac{1}{N}}

in this case), as outlined in Remark 2.
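The degeneracy described above can be seen directly on a toy example. The sketch below assumes the usual Markov chain X – Y – T with quantizer q(t | y), and assumes that the IB-type cost is built from the mutual-information terms I(Y; T) and I(X; T). The joint distribution `p_xy`, the weights `f`, and all function names are hypothetical.

```python
from math import log


def mutual_information(joint):
    """I(A; B) in nats for a joint distribution given as a list of rows."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                total += p * log(p / (pa[a] * pb[b]))
    return total


def joint_y_t(p_xy, q):
    """p(y, t) = p(y) q(t | y), with p_xy[x][y] and q[t][y] = q(t | y)."""
    p_y = [sum(col) for col in zip(*p_xy)]
    return [[p_y[y] * q[t][y] for t in range(len(q))] for y in range(len(p_y))]


def joint_x_t(p_xy, q):
    """p(x, t) = sum_y p(x, y) q(t | y)."""
    n_y = len(p_xy[0])
    return [[sum(p_xy[x][y] * q[t][y] for y in range(n_y))
             for t in range(len(q))] for x in range(len(p_xy))]
```

Taking `p_xy = [[0.3, 0.1], [0.1, 0.5]]` and a quantizer with rows `[w, w]` for each weight `w` in `f = [0.7, 0.2, 0.1]`, both mutual-information terms vanish up to rounding, whatever the weights are. A deterministic quantizer such as `[[1.0, 0.0], [0.0, 1.0]]` gives a strictly positive I(X; T) for the same `p_xy`.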

A more-thorough structure of the bifurcation diagram, using the analysis presented above, is shown in Figure 3.

Similar to the results we presented in [28], the close-up of the bifurcation at

β \approx 1.038706

in Figure 3B shows a subcritical bifurcating branch (a first-order phase transition) that consists of stationary points of Problem (1). By projecting the Hessian

Δ_{q} (G (q^{*}) + β D (q^{*}))

onto each of the kernels referenced in Theorem 6, we determined that the points on this subcritical branch are not solutions of (1), and yet they are solutions of (2).

Furthermore, observe that Figure 3B indicates that a saddle-node bifurcation occurs at

β \approx 1.037479

. That this is indeed the case was proved in [21]. In fact, for any problem of the form (2), these are the only two types of bifurcations to be expected: pitchfork and saddle-node.

3. Conclusions and Discussion

The main goal of this contribution was to show that information-based distortion-annealing problems such as (2) have an interesting mathematical structure. The most interesting aspects of that mathematical structure are driven by the symmetries present in the cost functions—their invariance to actions of the permutation group

S_{N}

, represented as relabeling of the reproduction classes. Such a structure would hold for any biclustering problem [4] that relies on the intrinsic interaction of a pair of variables for unsupervised clustering. The second mathematical structure that we used successfully was bifurcation theory, which allowed us to identify and study the discrete points at which the character of the cost function changed. The combination of those two tools in [20] allowed us to explicitly compute the value of the annealing parameter

β

at which the initial maximum at the uniform quantizer

q_{\frac{1}{N}}

of (1) loses stability. We concluded that for a fixed system

X \to Y

characterized by

p (X, Y)

, this value is the same for both problems, that it does not depend on the number of elements of the reproduction variable T, and that it is always greater than 1. We further introduced an eigenvalue problem that links the critical values of

β

and q for bifurcations, or phase transitions, branching off arbitrary intermediate solutions.

Even though the cost functions

F_{I B}

(4) and

F_{H}

(3) have similar properties, they also differ in some important aspects. We have shown that the function

F_{I B}

is degenerate since its constitutive functions

I (X; Y)

and

I (X; T)

are not strictly convex in q. That introduces additional invariances and singularities that are always preserved, which makes phase transitions more difficult to detect (e.g., the “uninformative quantizers”

q (t | y) = f (t)

only) and post-transition directions more difficult to determine. In contrast,

F_{H}

is strictly convex except at points of phase transitions. The theory we developed here allows us to identify bifurcation directions and determine their stability. Despite the presence of a high-dimensional null space at bifurcations, the symmetries restrict the allowed transition dimensions to multiple co-dimension 1 transitions, all related by group transformations. We achieved that here with three main results. Theorem 3 extended the Equivariant Branching Lemma (Theorem 1) to the Information Bottleneck case with additional translation invariance. Theorem 4 identified specific conditions under which a bifurcation of the gradient flow (8) occurs. This condition is computable analytically for the initial bifurcation off the uniform quantizer

q_{\frac{1}{N}}

and with numerical continuation for subsequent bifurcations. Finally, in Section 2.9, we provided checks for the types of bifurcations that occur, giving conditions to detect saddle-node and pitchfork bifurcations and to determine whether pitchforks are supercritical (second-order phase transitions) or subcritical (leading to first-order phase transitions discontinuous in

β

). The combination of the three results, together with our previous results in [20], completely characterizes the local bifurcation structure of Information Bottleneck-type problems with or without the added translation symmetry.

Despite the further development of the bifurcation formalism for IB presented here, there are still open questions that this manuscript did not resolve. In particular, we still cannot confirm or reject the conjecture that the set of

S_{K}

symmetric soft-clustering branches connected through symmetry-breaking bifurcations leads to the global hard-clustering optima at

β \to \infty

(multiple equivalent solutions connected by the permutation symmetry of the problem). We believe this is partially due to a discrepancy between practical observations and theoretical results. In particular, we and other practitioners [29,30] note that the only observed symmetry-breaking bifurcations during optimization are of the kind

S_{M} \to S_{M - 1}

, while the theory allows for arbitrary

S_{M} \to S_{m} \times S_{n}

bifurcations. The latter are known to happen and be stable in other biological systems and circumstances [26,31]. This suggests a research approach of comparing and contrasting the different systems that possess the same

S_{N}

symmetry and symmetry-breaking bifurcations, which may lead to breakthroughs in the application to optimization of the Information Bottleneck problem.

An additional open problem involves the use of continuous variables, already noted in [5] and explored further in [32,33]. This approach, while important for many real-world problems, involves the application of additional mathematical tools, namely Calculus of Variations [34], which further increases the complexity of an otherwise already complex problem. These difficulties are illustrated in a pair of papers [35,36] that use the continuous formulation. They do present some significant results on conditions of learnability, but both papers manage to only get bounds on

β

under which learnability (optimal solutions beyond the “uninformative” quantizer) can be achieved. This is possibly due to the presence of continuous spectra in covariance operators of continuous quantizers, something that we avoid by focusing on finite spaces. As a consequence, here and in prior work [20], we show specific values for

β

for the initial bifurcation from the uniform quantizer, which supports nontrivial clustering. We consider the formulation with continuous variables beyond the scope of this manuscript, but look forward to the development of additional techniques to incorporate this important case in the bifurcation framework presented here. Regardless of such developments, any practical problem with numerical optimization will involve discretization of the continuous variables, which effectively converts a continuous problem to the discrete setting discussed here.

Author Contributions

Conceptualization, A.G.D.; Formal analysis, A.G.D. and A.E.P.; Investigation, A.G.D. and A.E.P.; Writing–original draft, A.G.D. and A.E.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Gray, R.M. Entropy and Information Theory; Springer: Berlin/Heidelberg, Germany, 1990.
Cover, T.; Thomas, J. Elements of Information Theory; Wiley Series in Communication; Wiley: New York, NY, USA, 1991.
Rose, K. Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems. Proc. IEEE 1998, 86, 2210–2239.
Madeira, S.C.; Oliveira, A.L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004, 1, 24–45.
Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing; University of Illinois: Champaign, IL, USA, 1999.
Dimitrov, A.G.; Miller, J.P.; Aldworth, Z.; Gedeon, T.; Parker, A.E. Analysis of neural coding through quantization with an information-based distortion measure. Netw. Comput. Neural Syst. 2003, 14, 151–176.
Dimitrov, A.G.; Miller, J.P. Neural coding and decoding: Communication channels and quantization. Netw. Comput. Neural Syst. 2001, 12, 441–472.
Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Kluwer Academic Publishers: New York, NY, USA, 1992.
Mumey, B.; Gedeon, T. Optimal mutual information quantization is NP-complete. In Proceedings of the Neural Information Coding (NIC) Workshop, Snowbird, UT, USA, 1–4 March 2003.
Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Advances in Neural Information Processing Systems; Solla, S.A., Leen, T.K., Müller, K.R., Eds.; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 617–623.
Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University, Jerusalem, Israel, 2002.
Dimitrov, A.G.; Miller, J.P. Analyzing sensory systems with the information distortion function. In Proceedings of the Pacific Symposium on Biocomputing 2001; Altman, R.B., Ed.; World Scientific Publishing Co.: Singapore, 2000.
Gedeon, T.; Parker, A.E.; Dimitrov, A.G. Information Distortion and Neural Coding. Can. Appl. Math. Q. 2003, 10, 33–70.
Slonim, N.; Somerville, R.; Tishby, N.; Lahav, O. Objective classification of galaxy spectra using the information bottleneck method. Mon. Not. R. Astron. Soc. 2001, 323, 270–284.
Bardera, A.; Rigau, J.; Boada, I.; Feixas, M.; Sbert, M. Image segmentation using information bottleneck method. IEEE Trans. Image Process. 2009, 18, 1601–1612.
Aldworth, Z.N.; Dimitrov, A.G.; Cummins, G.I.; Gedeon, T.; Miller, J.P. Temporal encoding in a nervous system. PLoS Comput. Biol. 2011, 7, e1002041.
Buddha, S.K.; So, K.; Carmena, J.M.; Gastpar, M.C. Function identification in neuron populations via information bottleneck. Entropy 2013, 15, 1587–1608.
Lewandowsky, J.; Bauch, G. Information-optimum LDPC decoders based on the information bottleneck method. IEEE Access 2018, 6, 4054–4071.
Parker, A.E.; Dimitrov, A.G.; Gedeon, T. Symmetry breaking in soft clustering decoding of neural codes. IEEE Trans. Inf. Theory 2010, 56, 901–927.
Gedeon, T.; Parker, A.E.; Dimitrov, A.G. The mathematical structure of information bottleneck methods. Entropy 2012, 14, 456–479.
Parker, A.E.; Gedeon, T. Bifurcations of a class of S_N-invariant constrained optimization problems. J. Dyn. Differ. Equ. 2004, 16, 629–678.
Golubitsky, M.; Stewart, I.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory II; Springer: New York, NY, USA, 1988.
Golubitsky, M.; Schaeffer, D.G. Singularities and Groups in Bifurcation Theory I; Springer: New York, NY, USA, 1985.
Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2000.
Parker, A.E. Symmetry Breaking Bifurcations of the Information Distortion. Ph.D. Thesis, Montana State University, Bozeman, MT, USA, 2003.
Golubitsky, M.; Stewart, I. The Symmetry Perspective: From Equilibrium to Chaos in Phase Space and Physical Space; Birkhauser Verlag: Boston, MA, USA, 2002.
Schott, J.R. Matrix Analysis for Statistics; John Wiley and Sons: New York, NY, USA, 1997.
Parker, A.; Gedeon, T.; Dimitrov, A. Annealing and the rate distortion problem. In Advances in Neural Information Processing Systems 15; Becker, S.T., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 15, pp. 969–976.
Dimitrov, A.G.; Cummins, G.I.; Baker, A.; Aldworth, Z.N. Characterizing the fine structure of a neural sensory code through information distortion. J. Comput. Neurosci. 2011, 30, 163–179.
Schneidman, E.; Slonim, N.; Tishby, N.; de Ruyter van Steveninck, R.R.; Bialek, W. Analyzing neural codes using the information bottleneck method. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 15.
Stewart, I. Self-Organization in evolution: A mathematical perspective. Philos. Trans. R. Soc. 2003, 361, 1101–1123.
Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. In Proceedings of the Advances in Neural Information Processing Systems 16 (NIPS 2003), Vancouver, BC, Canada, 8–13 December 2003.
Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188.
Gelfand, I.M.; Fomin, S.V. Calculus of Variations; Dover Publications: Mineola, NY, USA, 2000. [Google Scholar]
Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the information bottleneck. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Virtual, 3–6 August 2020; pp. 1050–1060. [Google Scholar]
Ngampruetikorn, V.; Schwab, D.J. Perturbation theory for the information bottleneck. Adv. Neural Inf. Process. Syst. 2021, 34, 21008–21018. [Google Scholar]

Figure 1. The probability distribution $p(x, y)$ for the “Four-blob” toy problem for a system of interest $X \to Y$. We use this probability to illustrate some results of the bifurcation analysis reported here.

Figure 2. The bifurcations of the solutions $(q^{*}, \beta)$ to the Information Distortion problem (3). For the mixture of 4 well-separated Gaussians shown in Figure 1, the behavior of $D(q) = I(X; T)$ as a function of $\beta$ is shown in the top panel, and some of the solutions $q^{*}(T | Y)$ are shown in the bottom panels. Item 1 shows the uniform quantizer $q_{\frac{1}{N}}$, which assigns each $y \in Y$ equal probability of belonging to each of the four clusters in $T$. Subsequent items 2–5 point to a set of partially resolved quantizations, in which subsets of $Y$ are assigned with high probability to one (2) or more (3–5) classes (dark colors, close to 1), while other subsets are still unresolved (gray levels), albeit at a higher probability than $q_{\frac{1}{N}}$ (darker gray, as some of the classes are excluded after being resolved for another subset). Item 6 shows an almost fully resolved quantizer at sufficiently high $\beta$. The quantizers become fully resolved (deterministic; $q(t | y) = 1$ or 0) as $\beta \to \infty$ (not shown).

Figure 3. (A) The bifurcation structure of stationary points of the Information Distortion problem (3), a problem of form (2). We found these points by annealing in $\beta$ and finding stationary points for Problem (1) using the algorithm presented in [28]. A square indicates where a bifurcation occurs. (B) A close-up of the subcritical bifurcation at $\beta \approx 1.038706$, indicated by a square. Observe the subcritical bifurcating branch, and the subsequent saddle-node bifurcation at $\beta \approx 1.037479$, indicated by another square. We applied Theorem 6 to show that the subcritical bifurcating branch is composed of quantizers that are solutions of (3) but not of (1).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Parker, A.E.; Dimitrov, A.G. Symmetry-Breaking Bifurcations of the Information Bottleneck and Related Problems. Entropy 2022, 24, 1231. https://doi.org/10.3390/e24091231
