1. Introduction
For a convex set $\Omega \subseteq \mathbb{R}^k$ with relative interior $\mathrm{ri}(\Omega)$ and a strictly convex function $\phi : \Omega \to \mathbb{R}$ differentiable on $\mathrm{ri}(\Omega)$, the Bregman divergence induced by $\phi$ is the function $d_\phi : \Omega \times \mathrm{ri}(\Omega) \to \mathbb{R}$ defined by
$$ d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle. $$
Two common examples of Bregman divergences are:
The squared Mahalanobis distance $d_A(x, y) = (x - y)^\top A (x - y)$, where $A$ is a positive-definite matrix. The inducing function is given by $\phi(x) = x^\top A x$. The special case $A = I$ gives the squared Euclidean distance $\|x - y\|^2$. This divergence may be defined on $\Omega = \mathbb{R}^k$.
The Kullback–Leibler (KL) divergence $d_{\mathrm{KL}}(x, y) = \sum_i x_i \log(x_i / y_i)$, where $x$ and $y$ are probability vectors. Here, $\Omega$ is the simplex $\Delta^k = \{x \in \mathbb{R}^k_{\geq 0} : \sum_i x_i = 1\}$. The KL divergence is induced by the negative entropy function $\phi(x) = \sum_i x_i \log x_i$. This divergence can be extended to general convex subsets of $\mathbb{R}^k_{\geq 0}$ with formula $d(x, y) = \sum_i \bigl[x_i \log(x_i / y_i) - x_i + y_i\bigr]$ for $x, y \in \mathbb{R}^k_{\geq 0}$. When computing the KL divergence, we use the convention $0 \log 0 = 0$.
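As a quick concreteness check, the following sketch (ours, for illustration only; the helper name `bregman` and the test values are not from the original text) verifies numerically that the general formula for $d_\phi$ reproduces both closed forms above.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

rng = np.random.default_rng(0)

# Squared Euclidean distance: phi(x) = ||x||^2 induces d(x, y) = ||x - y||^2.
x, y = rng.normal(size=3), rng.normal(size=3)
d_euc = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
assert np.isclose(d_euc, np.sum((x - y) ** 2))

# KL divergence: phi(x) = sum_i x_i log x_i induces d(x, y) = sum_i x_i log(x_i / y_i)
# when x and y are probability vectors.
p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
d_kl = bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1.0, p, q)
assert np.isclose(d_kl, np.sum(p * np.log(p / q)))
```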
While it is possible to extend the definition of Bregman divergences to Banach spaces [1], in this note we focus on divergences whose domains are convex subsets of $\mathbb{R}^k$. In this setting, it is possible to interpret the Bregman divergence as a comparison between the difference $\phi(x) - \phi(y)$ on the one hand and the linearized approximation of this difference about $y$, given by $\langle \nabla \phi(y),\, x - y \rangle$, on the other.
Like metrics, Bregman divergences are positive-definite: $d_\phi(x, y) \geq 0$, with equality if and only if $x = y$. Unlike metrics, Bregman divergences are not in general symmetric and do not in general satisfy a triangle inequality, though they do satisfy a “law of cosines” and a generalized Pythagorean theorem [2]. Bregman divergences are locally distance-like in that they induce a Riemannian metric on $\mathrm{ri}(\Omega)$ obtained by the small-$\epsilon$ expansion
$$ d_\phi(x + \epsilon v, x) = \tfrac{1}{2}\epsilon^2\, v^\top H_\phi(x)\, v + O(\epsilon^3), $$
where $v$ is a small perturbation vector and $H_\phi(x)$ is the Hessian of $\phi$ at $x$. Because $\phi$ is strictly convex, $H_\phi(x)$ is positive-definite and defines a Riemannian metric on $\mathrm{ri}(\Omega)$ [3]; much work in information geometry [4] pursues the geometry induced by this metric and its connections to statistical inference. Bregman divergences [5] also play fundamental roles in machine learning, optimization, and information theory. They are the unique class of distance-like losses for which iterative, centroid-based clustering algorithms (such as k-means) always reduce the global loss [2,6]. Bregman divergences are central in the formulation of mirror-descent methods for convex optimization [7] and have a connection via convex duality to Fenchel–Young loss functions [4,8]. See Reem et al. [9] for a more detailed review of Bregman divergences.
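To make this expansion concrete, here is a small numerical check (our own illustrative sketch, not part of the original text): for the negative entropy $\phi(x) = \sum_i x_i \log x_i$, whose Hessian is $\mathrm{diag}(1/x)$, the divergence $d_\phi(x + \epsilon v, x)$ and the quadratic form $\tfrac{1}{2}\epsilon^2 v^\top H_\phi(x) v$ agree increasingly well as $\epsilon$ shrinks.

```python
import numpy as np

# Negative entropy and its induced (generalized KL) Bregman divergence.
phi = lambda x: np.sum(x * np.log(x))
grad = lambda x: np.log(x) + 1.0
hess = lambda x: np.diag(1.0 / x)                 # Hessian of the negative entropy
d = lambda x, y: phi(x) - phi(y) - grad(y) @ (x - y)

x = np.array([0.2, 0.3, 0.5])
v = np.array([0.05, -0.02, -0.03])                # small perturbation direction

for eps in [1e-1, 1e-2, 1e-3]:
    exact = d(x + eps * v, x)
    quadratic = 0.5 * eps**2 * (v @ hess(x) @ v)
    print(eps, exact, quadratic)                  # the two values agree up to O(eps^3)
```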
Bregman divergences provide one natural route through which to generalize Shannon information theory, with the differentiable function $\phi$ taking on the role of the Shannon entropy. Indeed, generalized entropies play a role in describing the asymptotic performance of learning algorithms; there exist a number of inequalities relating Bregman divergences to these generalized entropies [10,11]. Multiple characterization theorems exist for many information-theoretic quantities, including entropy [12,13,14], mutual information [15,16], and the Kullback–Leibler divergence [17,18]. This author, however, is aware of only one extant characterization of the class of Bregman divergences, due to Banerjee et al. [6]: Bregman divergences are the unique class of loss functions that render conditional expectations uniquely loss-minimizing in stochastic prediction problems. This characterization is the foundation of the connection between Bregman divergences and iterative centroid-based clustering algorithms noted above.
In this short note, we prove a new characterization of the class of Bregman divergences. This characterization is based on an equality of two common formulations of information content in weighted collections of finite-dimensional vectors.
2. Bregman Divergences Relate Two Informations
Let $w \in \Delta^n$ be a probability measure over $n$ points $x_1, \ldots, x_n \in \Omega$. We collect these points into a matrix $X$ whose rows are $x_1, \ldots, x_n$, and in a small abuse of notation, we consider this matrix to be an element of $\Omega^n$. We now define two standard formulations of information, each of which we consider as a function $\Omega^n \times \Delta^n \to \mathbb{R}$. The first formulation compares a weighted sum of strictly convex loss function evaluations on data points to the same loss function evaluated at the data centroid.
Definition 1 (Jensen Gap Information). Let $\phi$ be a strictly convex function on $\Omega$. The Jensen gap information is the function $I_\phi : \Omega^n \times \Delta^n \to \mathbb{R}$ given by
$$ I_\phi(X, w) = \sum_{i=1}^n w_i\, \phi(x_i) - \phi(\bar{x}), $$
where $\bar{x} = \sum_{i=1}^n w_i x_i$. If we define X to be a random vector that takes value $x_i$ with probability $w_i$, Jensen's inequality states that $\mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X])$, with equality holding only if X is constant (i.e., if there exists $i$ such that $\mathbb{P}(X = x_i) = 1$). The Jensen gap information is a measure of the difference of the two sides of this inequality; indeed, $I_\phi(X, w) = \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$ [2,6]. This formulation makes clear that $I_\phi$ is non-negative and that $I_\phi(X, w) = 0$ if and only if $x_i = x_j$ whenever $w_i > 0$ and $w_j > 0$.
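For concreteness, the following sketch (illustrative only; the function name `jensen_gap_information` and the test data are ours) computes $I_\phi(X, w)$ directly from Definition 1, here with the negative entropy as $\phi$ and the rows of a stochastic matrix as data points.

```python
import numpy as np

def jensen_gap_information(phi, X, w):
    # I_phi(X, w) = sum_i w_i phi(x_i) - phi(xbar), with xbar = sum_i w_i x_i
    xbar = w @ X
    return np.sum(w * np.array([phi(x) for x in X])) - phi(xbar)

phi = lambda x: np.sum(x * np.log(x))              # negative entropy
X = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])                    # data points (rows)
w = np.array([0.5, 0.3, 0.2])                      # weights

print(jensen_gap_information(phi, X, w))           # non-negative; zero iff all weighted rows coincide
```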
Another standard formulation expresses information content as a weighted mean of divergences of data points from their centroid.
Definition 2 (Divergence). A function $d : \Omega \times \mathrm{ri}(\Omega) \to \mathbb{R}$ is a divergence if $d(x, y) \geq 0$ for any $x \in \Omega$ and $y \in \mathrm{ri}(\Omega)$, with equality if and only if $x = y$.
Definition 3 (Divergence Information). Let d be a divergence. The divergence information is the function $I_d : \Omega^n \times \Delta^n \to \mathbb{R}$ given by
$$ I_d(X, w) = \sum_{i=1}^n w_i\, d(x_i, \bar{x}), \qquad (1) $$
where $\bar{x} = \sum_{i=1}^n w_i x_i$. In this definition, we assume that $\bar{x} \in \mathrm{ri}(\Omega)$; as noted by Banerjee et al. [2], this assumption is not restrictive since the set $\Omega$ can be replaced with the convex hull of the data $x_1, \ldots, x_n$ without loss of generality. The divergence information measures the $w$-weighted average divergence of the data points $x_i$ from the centroid $\bar{x}$. This divergence information is related to the characterization result for Bregman divergences by Banerjee et al. [6]: a divergence d is a Bregman divergence if and only if the vector $\bar{x}$ is the unique minimizer of the function $y \mapsto \sum_{i=1}^n w_i\, d(x_i, y)$ appearing on the righthand side of Equation (1) for any choice of $X$ and $w$.
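The following sketch (ours, for illustration; the helper names are not from the original text) computes the divergence information for the KL divergence and checks numerically that, consistent with the Banerjee et al. characterization, the centroid $\bar{x}$ is not beaten by other candidate centers $y$.

```python
import numpy as np

d_kl = lambda x, y: np.sum(x * np.log(x / y))      # KL divergence, a Bregman divergence

def divergence_information(d, X, w):
    # I_d(X, w) = sum_i w_i d(x_i, xbar), with xbar = sum_i w_i x_i
    xbar = w @ X
    return sum(w_i * d(x_i, xbar) for w_i, x_i in zip(w, X))

X = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
w = np.array([0.5, 0.3, 0.2])
xbar = w @ X
print(divergence_information(d_kl, X, w))

# The centroid should (weakly) minimize y -> sum_i w_i d(x_i, y) for a Bregman divergence.
loss = lambda y: sum(w_i * d_kl(x_i, y) for w_i, x_i in zip(w, X))
rng = np.random.default_rng(1)
for _ in range(5):
    y = rng.dirichlet(np.ones(3))
    assert loss(xbar) <= loss(y) + 1e-12
```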
There are several important cases in which the Jensen gap information and the divergence information coincide.
Definition 4 (Information Equivalence). We say that a pair $(\phi, d)$ comprising a strictly convex function $\phi$ and a divergence $d$ satisfies the information equivalence property if, for all $X \in \Omega^n$ and $w \in \Delta^n$, it holds that
$$ I_\phi(X, w) = I_d(X, w). \qquad (2) $$
A graphical illustration of information equivalence is shown in Figure 1.
Lemma 1 (Information Equivalence with Bregman Divergences [2,6]). The pair $(\phi, d_\phi)$ satisfies the information equivalence property.

The proof is a direct calculation and is provided by Banerjee et al. [2]. When $\phi(x) = \|x\|^2$ and $d_\phi$ is the squared Euclidean distance, the information equivalence property (2) is the identity
$$ \sum_{i=1}^n w_i\, \|x_i\|^2 - \|\bar{x}\|^2 = \sum_{i=1}^n w_i\, \|x_i - \bar{x}\|^2. \qquad (3) $$
The righthand side of (3) is the weighted sum-of-squares loss of the data points $x_1, \ldots, x_n$ with respect to their centroid $\bar{x}$, which is often used in statistical tests and clustering algorithms. Equation (3) asserts that this loss may also be computed from a weighted average of the norms of the data points.
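A direct numerical confirmation of identity (3) (our own sketch, with arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))                        # five data points in R^3
w = rng.dirichlet(np.ones(5))                      # weights summing to one
xbar = w @ X

lhs = np.sum(w * np.sum(X**2, axis=1)) - xbar @ xbar       # Jensen gap for phi(x) = ||x||^2
rhs = np.sum(w * np.sum((X - xbar)**2, axis=1))            # weighted sum of squares about xbar
assert np.isclose(lhs, rhs)
```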
When $\Omega$ is the probability simplex, $\phi$ is the negative entropy, and $d_\phi$ is the KL divergence, the information equivalence property (2) expresses the equality of two equivalent formulations of the mutual information for discrete random variables. Let A and B be discrete random variables on alphabets $\mathcal{A}$ of size n and $\mathcal{B}$ of size ℓ, respectively. Suppose that their joint distribution is $p_{AB}$. For each $a \in \mathcal{A}$, let $x_a \in \Delta^\ell$ be the vector with entries $(x_a)_b = p_{B \mid A}(b \mid a)$, weighted by $w_a = p_A(a)$; then, $\bar{x} = \sum_{a \in \mathcal{A}} w_a x_a = p_B$ is the marginal distribution of B. The Jensen gap information $I_\phi(X, w)$ is
$$ I_\phi(X, w) = \sum_{a \in \mathcal{A}} p_A(a) \sum_{b \in \mathcal{B}} p_{B \mid A}(b \mid a)\, \log p_{B \mid A}(b \mid a) \;-\; \sum_{b \in \mathcal{B}} p_B(b)\, \log p_B(b) = H(B) - H(B \mid A), $$
which expresses the mutual information $I(A; B)$ between random variables A and B in the entropy-reduction formulation, $I(A; B) = H(B) - H(B \mid A)$ [19]. On the other hand, the divergence information $I_{d_{\mathrm{KL}}}(X, w)$ is
$$ I_{d_{\mathrm{KL}}}(X, w) = \sum_{a \in \mathcal{A}} p_A(a)\, d_{\mathrm{KL}}\!\bigl(p_{B \mid A}(\cdot \mid a),\; p_B\bigr), $$
which expresses the mutual information $I(A; B)$ instead as the weighted sum of KL divergences of the conditional distributions $p_{B \mid A}(\cdot \mid a)$ from the marginal $p_B$.
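The agreement of these two formulations can be checked numerically; the sketch below (ours, illustrative only) draws a random joint distribution and computes the mutual information both ways.

```python
import numpy as np

rng = np.random.default_rng(3)
p_ab = rng.dirichlet(np.ones(12)).reshape(3, 4)    # joint distribution of (A, B), |A| = 3, |B| = 4
p_a = p_ab.sum(axis=1)                             # marginal of A (the weights w)
p_b = p_ab.sum(axis=0)                             # marginal of B (the centroid xbar)
p_b_given_a = p_ab / p_a[:, None]                  # rows are the data points x_a

H = lambda p: -np.sum(p * np.log(p))

# Entropy-reduction formulation (the Jensen gap information): I(A;B) = H(B) - H(B|A).
mi_entropy = H(p_b) - np.sum(p_a * np.array([H(row) for row in p_b_given_a]))

# Divergence formulation (the divergence information): weighted KL divergences from the marginal.
mi_kl = np.sum(p_a * np.array([np.sum(row * np.log(row / p_b)) for row in p_b_given_a]))

assert np.isclose(mi_entropy, mi_kl)
```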
Our contribution in this paper is to prove a converse to Lemma 1: the Bregman divergence $d_\phi$ is the only divergence that satisfies information equivalence with $\phi$.
3. Main Result
Theorem 1. If the pair $(\phi, d)$ satisfies the information equivalence property (2), then d is the Bregman divergence induced by $\phi$: $d(x, y) = d_\phi(x, y)$ for any $x \in \Omega$ and $y \in \mathrm{ri}(\Omega)$.

Let $(\phi, d)$ satisfy information equivalence (2). For any $x \in \Omega$ and $y \in \mathrm{ri}(\Omega)$, we can write
$$ d(x, y) = \phi(x) + f(x, y) \qquad (4) $$
for some unknown function f. We aim to show that $f(x, y) = -\phi(y) - \langle \nabla\phi(y),\, x - y \rangle$ for all $x$ and $y$.
Our first step is to show that f is an affine function of its first argument $x$ on $\Omega$. To do so, we observe that if $X \in \Omega^n$ and $w \in \Delta^n$ are such that $\bar{x} = \sum_{i=1}^n w_i x_i = y$, then we have
$$ \begin{aligned} \sum_{i=1}^n w_i\, d(x_i, y) &= \sum_{i=1}^n w_i\, \phi(x_i) - \phi(y), \\ \sum_{i=1}^n w_i\, \bigl[\phi(x_i) + f(x_i, y)\bigr] &= \sum_{i=1}^n w_i\, \phi(x_i) - \phi(y), \end{aligned} $$
where the first line follows from information equivalence. It follows that
$$ \sum_{i=1}^n w_i\, f(x_i, y) = -\phi(y). \qquad (5) $$
Fix $y \in \mathrm{ri}(\Omega)$. Let $V = \{x - y : x \in \Omega\}$, and for any $\epsilon > 0$ let $B_\epsilon = \{z \in \mathrm{span}(V) : \|z\| \leq \epsilon\}$. Pick $\epsilon > 0$ sufficiently small that, for all $z \in B_\epsilon$, it holds that both $y + z \in \Omega$ and $y - z \in \Omega$; this is possible due to the relative openness of $\mathrm{ri}(\Omega)$ within the affine hull of $\Omega$. For notational compactness, let $B = B_\epsilon$. Since $B$ is the intersection of a Euclidean ball with the convex set $\mathrm{span}(V)$, it is also convex.

Consider the function $g : V \to \mathbb{R}$ given by $g(z) = f(y + z, y) - f(y, y)$. Taking $x_i = y$ for all $i$ in (5) shows that $f(y, y) = -\phi(y)$, and so the condition (5) implies that
$$ \sum_{i=1}^n w_i\, g(z_i) = 0 \qquad (6) $$
for any $w \in \Delta^n$ and $z_1, \ldots, z_n \in V$ such that $\sum_{i=1}^n w_i z_i = 0$.

To show that $f(\cdot, y)$ is affine, it suffices to show that the function $g$ is linear on $V$. We do this through two short lemmas. In each, we characterize the behavior of $g$ on the relative ball $B$ before extending this characterization to the entire domain $V$.
Lemma 2. For any vector $z$ and scalar $\alpha$ such that $z, \alpha z \in V$, we have $g(\alpha z) = \alpha\, g(z)$.
Proof. We will first prove the lemma in the restricted case that $z \in B$ and $\alpha \in [0, 1]$. By Equation (6), applied to the vectors $\alpha z$ and $-z$ (both of which lie in $B$) with weights $\tfrac{1}{1+\alpha}$ and $\tfrac{\alpha}{1+\alpha}$, we have that
$$ \frac{1}{1+\alpha}\, g(\alpha z) + \frac{\alpha}{1+\alpha}\, g(-z) = 0, $$
so that $g(\alpha z) = -\alpha\, g(-z)$; taking $\alpha = 1$ gives $g(-z) = -g(z)$, from which it follows that $g(\alpha z) = \alpha\, g(z)$. Let us now assume that $z \in B$ but that $\alpha$ is general, subject only to $\alpha z \in V$; we will then use this to prove the more general setting $z, \alpha z \in V$. We proceed by cases.

Case $\alpha \in [0, 1]$. The previous argument implies that $g(\alpha z) = \alpha\, g(z)$.

Case $\alpha > 1$. Since $\alpha z \in V$ and $-z \in B$, an application of Equation (6) with weights $\tfrac{1}{1+\alpha}$ and $\tfrac{\alpha}{1+\alpha}$ gives $g(\alpha z) + \alpha\, g(-z) = 0$; isolating $g(\alpha z)$ and applying the previous argument (which gives $g(-z) = -g(z)$) proves the case.

Case $\alpha < 0$. This case follows by applying the proof of the previous cases, replacing $z$ with $-z$.

Now, assume only that $z, \alpha z \in V$, with $z \neq 0$ (the case $z = 0$ is immediate, since $g(0) = 0$). Choose $\beta > 0$ so that $\beta z \in B$ and $\beta \alpha z \in B$; $\beta = \epsilon / \bigl(\|z\|(1 + |\alpha|)\bigr)$ is one sufficient choice. Then, by our previous argument, we have $g(\beta z) = \beta\, g(z)$ and $g(\beta \alpha z) = \alpha\, g(\beta z)$, from which we infer $g(\beta \alpha z) = \alpha \beta\, g(z)$. Using this, we can compute $g(\alpha z) = \tfrac{1}{\beta}\, g(\beta \alpha z) = \alpha\, g(z)$, which proves the lemma. □
Lemma 3. The function $g$ is linear on $V$: for any scalars $\alpha_1, \ldots, \alpha_m$ and vectors $z_1, \ldots, z_m \in V$ such that $\sum_{i=1}^m \alpha_i z_i \in V$, it holds that
$$ g\!\left(\sum_{i=1}^m \alpha_i z_i\right) = \sum_{i=1}^m \alpha_i\, g(z_i). $$

Proof. Let us first assume that $\alpha_i \geq 0$ for each $i$ and that $z_1, \ldots, z_m$ and $u = \sum_{i=1}^m \alpha_i z_i$ all lie in $B$. Applying Equation (6) to the vectors $z_1, \ldots, z_m, -u$ with weights proportional to $(\alpha_1, \ldots, \alpha_m, 1)$ gives
$$ \sum_{i=1}^m \alpha_i\, g(z_i) + g(-u) = 0, $$
from which applying Lemma 2 (with $\alpha = -1$) gives the result under these hypotheses.

We now consider the general case, in which the scalars $\alpha_i$ may have arbitrary signs and $z_1, \ldots, z_m, u \in V$. For each $i$, choose $\beta_i > 0$ so that $\beta_i z_i \in B$ and $-\beta_i z_i \in B$. Let $t = \min\{\beta_0, \beta_1, \ldots, \beta_m\}$, where $\beta_0 > 0$ is chosen so that $\beta_0 u \in B$. Define the vectors $\hat{z}_1, \ldots, \hat{z}_m$ with $\hat{z}_i = t\, \mathrm{sign}(\alpha_i)\, z_i$. Then, by construction, $\hat{z}_i \in B$ and $|\alpha_i|\, \hat{z}_i = t\, \alpha_i z_i$ for each $i$, so that $\sum_{i=1}^m |\alpha_i|\, \hat{z}_i = t u \in B$. Applying Lemma 2 and the restricted case above, we can then compute
$$ t\, g(u) = g(t u) = g\!\left(\sum_{i=1}^m |\alpha_i|\, \hat{z}_i\right) = \sum_{i=1}^m |\alpha_i|\, g(\hat{z}_i) = t \sum_{i=1}^m \alpha_i\, g(z_i), $$
and dividing by $t$ completes the proof. □
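As a numerical illustration of the structure established by Lemmas 2 and 3 (our own sketch; it is not part of the proof), consider the squared Euclidean case $\phi(x) = \|x\|^2$, $d(x, y) = \|x - y\|^2$: the function $g(z) = f(y + z, y) - f(y, y)$, with $f(x, y) = d(x, y) - \phi(x)$ as in (4), is indeed linear in $z$.

```python
import numpy as np

phi = lambda x: x @ x
d = lambda x, y: (x - y) @ (x - y)                 # the Bregman divergence induced by phi
f = lambda x, y: d(x, y) - phi(x)                  # the remainder from Equation (4)
g = lambda z, y: f(y + z, y) - f(y, y)             # the function studied in Lemmas 2 and 3

rng = np.random.default_rng(4)
y = rng.normal(size=3)
z1, z2 = rng.normal(size=3), rng.normal(size=3)
a, b = 1.7, -0.4

assert np.isclose(g(a * z1, y), a * g(z1, y))                            # homogeneity (Lemma 2)
assert np.isclose(g(a * z1 + b * z2, y), a * g(z1, y) + b * g(z2, y))    # linearity (Lemma 3)
```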
Proof of Theorem 1. Fix $y \in \mathrm{ri}(\Omega)$. The preceding lemmas prove that the function $g$ is linear on $V$. Since, for constant $y$, the function f in (4) is a translation of $g$ in its first argument ($f(x, y) = g(x - y) + f(y, y)$), it follows that f is affine as a function of its first argument $x$. We may therefore write, for all $x \in \Omega$ and $y \in \mathrm{ri}(\Omega)$,
$$ d(x, y) = \phi(x) + \langle a(y),\, x \rangle + b(y) \qquad (7) $$
for some functions $a : \mathrm{ri}(\Omega) \to \mathbb{R}^k$ and $b : \mathrm{ri}(\Omega) \to \mathbb{R}$.

We now determine these functions. First, since $\phi$ is differentiable on $\mathrm{ri}(\Omega)$ and $\langle a(y), x \rangle + b(y)$ is affine in $x$, d is differentiable in its first argument on $\mathrm{ri}(\Omega)$. Since d is a divergence, $x = y$ is a critical point of the function $x \mapsto d(x, y)$ on $\Omega$. It follows that $\nabla_1 d(y, y)$, the gradient of d with respect to its first argument, is orthogonal to $\Omega$ at $y$:
$$ \langle \nabla_1 d(y, y),\, x - y \rangle = 0 \qquad (8) $$
for any $x \in \Omega$. We can compute $\nabla_1 d(y, y)$ explicitly; it is $\nabla\phi(y) + a(y)$, which combined with (8) gives
$$ \langle \nabla\phi(y) + a(y),\, x - y \rangle = 0 \qquad (9) $$
for any $x \in \Omega$ and $y \in \mathrm{ri}(\Omega)$.

Now, the condition that $d(y, y) = 0$ implies that $b(y) = -\phi(y) - \langle a(y),\, y \rangle$. Using Equations (9) and (7), we then compute
$$ f(x, y) = \langle a(y),\, x \rangle + b(y) = \langle a(y),\, x - y \rangle - \phi(y) = -\langle \nabla\phi(y),\, x - y \rangle - \phi(y). $$
Recalling the definition of f in (4), we conclude that
$$ d(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle = d_\phi(x, y), $$
which is the Bregman divergence induced by $\phi$. This completes the proof. □
4. Discussion
We have shown that the class of Bregman divergences is the unique class of divergences that induce agreement between the Jensen gap and divergence informations. This result offers some further perspective on the role for Bregman divergences in data clustering and quantization [2]. The Jensen gap information $I_\phi$ is a natural loss function for such tasks, with one motivation as follows. Suppose that we wish to measure the complexity of a set of data points $x_1, \ldots, x_n$ with weights $w$ using a weighted per-observation loss function $\phi$ and a term that depends only on the centroid $\bar{x}$ of the data:
$$ L(X, w) = \sum_{i=1}^n w_i\, \phi(x_i) + c(\bar{x}). $$
A natural stipulation for the loss function L is that replacing two data points $x_i$ and $x_j$ with their weighted mean $\tfrac{w_i x_i + w_j x_j}{w_i + w_j}$ (carrying weight $w_i + w_j$) should strictly decrease the loss when $x_i \neq x_j$; this requirement is equivalent to strict convexity of the function $\phi$. If we further require that $L(X, w) = 0$ when each row of $X$ is identical, we find that $c = -\phi$ and that our loss function is the Jensen gap information: $L(X, w) = I_\phi(X, w)$.
. The present result shows that this natural formulation fully determines the choice of how to perform pairwise comparisons between individual data points; only the corresponding Bregman divergence can serve as a positive-definite comparator that is consistent with the Jensen gap information.
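The merging stipulation above can also be checked directly; the sketch below (ours, illustrative only, with $\phi(x) = \|x\|^2$ as the per-observation loss) verifies that replacing two distinct points by their weighted mean leaves the centroid unchanged and strictly decreases the Jensen gap loss.

```python
import numpy as np

phi = lambda x: x @ x                              # a strictly convex per-observation loss

def jensen_gap(X, w):
    # L(X, w) = sum_i w_i phi(x_i) - phi(xbar)
    xbar = w @ X
    return np.sum(w * np.array([phi(x) for x in X])) - phi(xbar)

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 2))
w = rng.dirichlet(np.ones(4))

# Merge points 0 and 1 into their weighted mean, carrying the combined weight.
merged = (w[0] * X[0] + w[1] * X[1]) / (w[0] + w[1])
X_merged = np.vstack([merged, X[2:]])
w_merged = np.concatenate([[w[0] + w[1]], w[2:]])

assert np.allclose(w @ X, w_merged @ X_merged)              # the centroid is unchanged
assert jensen_gap(X_merged, w_merged) < jensen_gap(X, w)    # the loss strictly decreases
```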
An extension of this result to the setting of Bregman divergences defined on more general spaces, such as Banach spaces [1], would be of considerable interest for problems in functional data clustering [20].