Article

Bregman–Hausdorff Divergence: Strengthening the Connections Between Computational Geometry and Machine Learning

by Tuyen Pham 1,*, Hana Dal Poz Kouřimská 2 and Hubert Wagner 1,*

1 Department of Mathematics, University of Florida, Gainesville, FL 32611, USA
2 Applied Geometry and Topology, University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany
* Authors to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(2), 48; https://doi.org/10.3390/make7020048
Submission received: 12 March 2025 / Revised: 7 May 2025 / Accepted: 13 May 2025 / Published: 26 May 2025

Abstract:
The purpose of this paper is twofold. On the technical side, we propose an extension of the Hausdorff distance from metric spaces to spaces equipped with asymmetric distance measures. Specifically, we focus on extending it to the family of Bregman divergences, which includes the popular Kullback–Leibler divergence (also known as relative entropy). The resulting dissimilarity measure is called a Bregman–Hausdorff divergence and compares two collections of vectors—without assuming any pairing or alignment between their elements. We propose new algorithms for computing Bregman–Hausdorff divergences based on a recently developed Kd-tree data structure for nearest neighbor search with respect to Bregman divergences. The algorithms are surprisingly efficient even for large inputs with hundreds of dimensions. As a benchmark, we use the new divergence to compare two collections of probabilistic predictions produced by different machine learning models trained using the relative entropy loss. In addition to the introduction of this technical concept, we provide a survey. It outlines the basics of Bregman geometry and motivates the Kullback–Leibler divergence using concepts from information theory. We also describe computational geometric algorithms that have been extended to this geometry, focusing on algorithms relevant for machine learning.

1. Introduction

Various kinds of data—from sounds to images to text corpora—are routinely represented as finite sets of vectors. These vectors can be processed using a wide range of algorithms, often based on linear algebra. The intermediate representations, as well as the final outcomes of such processing, are, similarly, sets of vectors.
Conveniently, the above setup allows for an intuitive geometric interpretation. Indeed, it is usual to equip the vector space $\mathbb{R}^d$ in which such representations live with the Euclidean metric. Geometric objects in the geometry induced by this metric, such as the distance itself, balls and their intersections, bisectors, etc., help explain such algorithms in intuitive terms.
In recent years, however, other notions of distance have started to play an important role. One popular distance is the Kullback–Leibler divergence, often referred to as the relative entropy. A form of this divergence, the cross-entropy loss, is commonly used for training deep learning models in particular.
Compared to the once popular mean squared error loss (based on the Euclidean metric), the cross-entropy loss (based on the Kullback–Leibler divergence) provides significantly better performance. While the Kullback–Leibler divergence is often viewed as a distance between probability vectors, it lacks standard features of a metric. In particular, it is typically non-symmetric and never satisfies the triangle inequality. As such, its behavior can be less intuitive.
It may therefore be surprising that the Kullback–Leibler divergence induces a well-behaved geometry. Moreover, there exists an infinite family of distance measures, so-called Bregman divergences, that induce similar geometries. The aforementioned Kullback–Leibler divergence is one of its most prominent members, along with the squared Euclidean distance.
There is a significant overlap between algorithms in machine learning and computational geometry. Nevertheless, computational geometry tends to focus on the Euclidean distance (and other metric distances). In contrast, the non-metric aspects of the Kullback–Leibler divergence (and other Bregman divergences) prevent computational geometry algorithms from working—at least at first glance.
It turns out that several popular algorithms can be extended to the Bregman setting—despite the lack of symmetry and triangle inequality, which are often deemed crucial. While this is an ongoing direction, there have been efforts to extend popular algorithms to operate within this framework.
In the first part of the paper, we offer a geometric perspective on Bregman divergences. We hope this perspective will streamline further development and analysis of algorithms at the intersection of machine learning and computational geometry; in particular, in the context of data measured using relative entropy, such as probabilistic predictions returned by a classifier trained using cross-entropy.
In the second part, we develop a crucial geometric tool in the context of Bregman geometry. The idea is simple: where a Bregman divergence provides a comparison between two vectors, we propose a natural way of comparing two sets of vectors. This idea is analogous to the Hausdorff distance between two sets and we therefore call it a Bregman–Hausdorff divergence. Notably, this measurement does not rely on any pairing or alignment between elements of the sets, and the sets may differ in size. This contrasts with the computation of a classifier’s loss during training, where each prediction is compared with the corresponding correct label.
Interestingly, the lack of symmetry characteristic to Bregman divergences allows for several different definitions—we select three, guided by the geometric interpretation of the original Hausdorff distance. Additionally, we propose first algorithms for computing these new divergences. These algorithms are enabled by recent developments in Bregman nearest-neighbor search, and we experimentally show they are efficient in practice.
Our contribution extends the arsenal of tools capable of handling data living in Bregman geometries. One crucial example of such data is the set of probabilistic predictions of modern classifiers trained with the cross-entropy loss.
Paper outline. In the first part of this paper, we introduce concepts from information theory and a geometric interpretation for the relative entropy (Section 2). This interpretation connects the relative entropy to a larger family of distance measures known as Bregman divergences. After a brief introduction to this family and the geometry its members induce (Section 3), we explain why the asymmetry is a beneficial property in the context of machine learning, and highlight computational tools that have been extended to this setting (Section 4).
The second part of this paper introduces three new measurements based on Bregman divergences. We provide definitions as well as interpretations in the context of comparing sets of discrete probability distributions (Section 5). We then provide efficient algorithms for these measurements (Section 6). In Section 7, we experimentally show that the new measurements can be efficiently computed in practical situations. In particular, we combine the theory and tools from the two previous sections to provide quantitative ways to analyze and compare machine learning models. Section 8 concludes the paper.

2. Information Theory and Relative Entropy

We begin by highlighting certain concepts from information theory, with the goal of providing an interpretation of the relative entropy. We will use this interpretation to develop intuition for the geometry induced by the relative entropy in Section 3. (More details on information theory can be found in [1,2].) In particular, we emphasize the inherent asymmetry of the relative entropy, which will inform our decision to focus on asymmetric versions of the Hausdorff distance later. We also provide a geometric interpretation of relative entropy. This interpretation is shared among all Bregman divergences, and will be our focus afterwards.
Setup. We first set up a running example to guide us through each definition in this section. Let $E_1, E_2, E_3$, and $E_4$ be events occurring with probabilities $\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}$, and $\tfrac{1}{8}$, respectively.
More generally, for d events we encode these probabilities as a probability vector or, in other words, a discrete probability distribution. Geometrically, the space of all such vectors is the $(d-1)$-dimensional open probability simplex contained in $\mathbb{R}^d$, namely
$$\Delta^{d-1} = \Big\{ x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ 0 < x_i < 1 \Big\}.$$
Going back to our example, we now plan to transmit information on sequences of observed events. To this end, we first encode each event as a finite sequence of bits, called a codeword. We aim to minimize the expected length of a codeword with the restriction that sequences of codewords be uniquely decodable.
Consider Table 1, in which p i is the probability of event E i . The three rightmost columns provide three different codes.
Given a code and a discrete probability distribution p, we can compute the expected code length for the transmission of information about the events. Specifically, the expected length, $E[i]$, for each code i is as follows:
$$E[1] = 2 \times \tfrac{1}{2} + 2 \times \tfrac{1}{4} + 2 \times \tfrac{1}{8} + 2 \times \tfrac{1}{8} = 2, \qquad E[2] = 1 \times \tfrac{1}{2} + 1 \times \tfrac{1}{4} + 2 \times \tfrac{1}{8} + 2 \times \tfrac{1}{8} = \tfrac{5}{4}, \qquad E[3] = 1 \times \tfrac{1}{2} + 2 \times \tfrac{1}{4} + 3 \times \tfrac{1}{8} + 3 \times \tfrac{1}{8} = \tfrac{7}{4}.$$
While Code1 may be the most straightforward way to encode these four events, we can find a more efficient code. Although Code2 has a shorter expected code length, it is not uniquely decodable. Indeed, the sequence 0111 can describe both $E_1 E_2 E_4$ and $E_3 E_2 E_2$. In contrast, Code3 is decodable and has a shorter expected code length than Code1.
Shannon’s entropy. In his seminal paper [3], Shannon introduced a formula to compute the lower bound for the expected length of a code for a discrete probability distribution. Specifically, given a discrete probability distribution p, the Shannon entropy is defined as
$$H(p) = \sum_i p_i \log_2 \frac{1}{p_i},$$
with the convention that $p_i \log_2 \frac{1}{p_i} = 0$ for $p_i = 0$.
Returning to our example: for the discrete probability distribution p from Table 1, $H(p) = \tfrac{7}{4} = E[3]$. Thus, Code3 in Table 1 is the optimal way to encode events $E_1, \dots, E_4$.
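As a quick sanity check on the running example, the following Python sketch recomputes the three expected code lengths and the entropy bound. The codeword lengths are read off from the expected-length computation above (Table 1 itself is not reproduced here), and the helper names are ours.

import math

# Illustrative sketch; not from the paper's implementation.
p = [1/2, 1/4, 1/8, 1/8]                 # probabilities of events E1..E4
code_lengths = {
    "Code1": [2, 2, 2, 2],
    "Code2": [1, 1, 2, 2],               # shorter on average, but not uniquely decodable
    "Code3": [1, 2, 3, 3],
}

for name, lengths in code_lengths.items():
    expected = sum(pi * li for pi, li in zip(p, lengths))
    print(f"{name}: expected length = {expected}")   # 2.0, 1.25, 1.75

# Shannon entropy H(p) = sum_i p_i * log2(1/p_i): the lower bound on the expected length.
H = sum(pi * math.log2(1 / pi) for pi in p if pi > 0)
print(f"H(p) = {H}")                     # 1.75, matching Code3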
Cross-entropy. Suppose we (erroneously) assume that the probability of the events is given by a probability distribution q, while in reality it is p. How inefficient is the code optimized for q compared to the code optimized for the true distribution p?
In this situation, we would assign longer codewords to less probable events, as measured by q. However, to compute the expected code length we must use the true probabilities of events, given by p.
The cross-entropy is an extension of Shannon’s entropy that provides a lower bound on the length of such codes:
$$H(p, q) = \sum_i p_i \log_2 \frac{1}{q_i}.$$
In other words, the cross-entropy gives the lower bound for the expected code length for events with probabilities represented by p, assuming they occur with probabilities represented by q.
Given distributions p and q, their cross-entropy has a geometric interpretation. It is the approximation of H ( p ) by the best affine approximation of H centered at q. Indeed, we can write the cross-entropy as
$$H(p, q) = H(q) + \langle \nabla H(q),\, p - q \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the standard dot product.
Relative entropy. Relative entropy is the difference between cross-entropy and entropy, H ( p , q ) H ( p ) . It therefore measures the expected loss of coding efficiency incurred by using the ‘approximate’ probability q instead of the ‘true’ probability p.
The relative entropy is often viewed as a distance measure between two discrete probability distributions. However, unlike proper metric distances, it is generally not symmetric and does not satisfy the triangle inequality. Given $p, q \in \Delta^{d-1} \subset \mathbb{R}^d$, we write
$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = \sum_{i=1}^d p_i \log_2 \frac{1}{q_i} - \sum_{i=1}^d p_i \log_2 \frac{1}{p_i} = \sum_{i=1}^d p_i \log_2 \frac{p_i}{q_i}$$
to denote the relative entropy. We provide a further explanation of this notation at the end of this section, and expand on it in Section 3.
Usage. In machine learning models, relative entropy is often used as a loss function. Let us consider a multiclass classification task, with $X \subset \mathbb{R}^d$ being a data set and Y the collection of correct labels encoded as probability vectors. For a model M dependent on a parameter vector $\theta$ and $x \in X$, we write $M(x; \theta)$ for the predicted probability vector, which is compared to the corresponding correct label in Y. Then, minimizing $\sum_{(x, y) \in X \times Y} D_{KL}(y \,\|\, M(x; \theta))$ can be interpreted as penalizing predictions that are poor approximations of the true distribution. Relative entropy has also been used in the training of Variational Autoencoders [4]. Outside of machine learning, relative entropy has seen use in statistical physics [5] and information geometry [6].
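As a small illustration of this usage (a sketch under our own naming, not the training code used later in the paper), the fragment below evaluates the relative entropy loss for a single one-hot label and confirms that it coincides with the cross-entropy, since H(y) = 0 for one-hot labels.

import numpy as np

# Illustrative sketch; helper names are ours.
def kl(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (q[mask] + eps))))

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(1.0 / (q[mask] + eps))))

# A one-hot label y and a model's probabilistic prediction for one sample.
y = [0.0, 1.0, 0.0]
pred = [0.2, 0.7, 0.1]

# For one-hot labels H(y) = 0, so the relative entropy loss equals the cross-entropy loss.
print(kl(y, pred))              # ~0.515 bits
print(cross_entropy(y, pred))   # same value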
Asymmetry. The asymmetry of relative entropy has a tangible interpretation. Let $p = (\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8})$ as in Table 1, and choose $q = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, 0)$. If we attempt to approximate p using q, then $D_{KL}(p \,\|\, q) = +\infty$. This value reflects the fact that $E_4$ is an impossible event assuming q, and therefore no code was prepared for it (along with all other events that occur with zero probability). Hence, there exists no codeword of finite length that we could use to encode this event.
If, on the other hand, we approximate q using p, then $D_{KL}(q \,\|\, p) \approx 0.415$. This value reflects the fact that while $E_4$ cannot occur, the coding efficiency is decreased by accounting for a codeword that is never used. (We remark that this asymmetry also occurs when the outputs of relative entropy are finite in both directions.)
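Both values can be reproduced directly. The sketch below (with helper names of our choosing) follows the usual conventions for zero probabilities: terms with $p_i = 0$ are dropped, and a zero entry of q paired with a positive $p_i$ yields $+\infty$.

import math

# Illustrative sketch; not from the paper's implementation.
def kl(p, q):
    """D_KL(p || q) in bits; 0*log(0/..) = 0, and a positive p_i against q_i = 0 gives inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

p = (1/2, 1/4, 1/8, 1/8)
q = (1/3, 1/3, 1/3, 0)

print(kl(p, q))   # inf: q assigns probability 0 to E4, so no codeword exists for it
print(kl(q, p))   # ~0.415 bits of expected efficiency loss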
Recall that $H(p)$ is Shannon’s entropy of p, while the cross-entropy, $H(p, q)$, is an approximation of $H(p)$ based on the best affine approximation of H at q. Thus, the relative entropy can be interpreted as the vertical distance between the graph of this affine approximation and the graph of H, evaluated at p. See Figure 1 for an illustration.
This geometric construction of relative entropy can be generalized. Indeed, the family of distance measures arising from this type of construction is known as Bregman divergences [7]. Like relative entropy, these distance measures generally lack symmetry and do not fulfill the triangle inequality. The relative entropy is a member of this family and is often referred to as the Kullback–Leibler (KL) divergence in this context. We will introduce this family next.

3. Background on Bregman Geometry

In this section, we introduce Bregman divergences [7], portray the K L divergence as an important instance, and give an overview of Bregman geometry. We will put emphasis on the geometry induced by the K L divergence.
Bregman divergences—definition and basic properties. Each Bregman divergence is generated by a function of Legendre type. Given an open convex set $\Omega \subseteq \mathbb{R}^d$, a function $F : \Omega \to \mathbb{R}$ is of Legendre type [8,9] if the following hold:
  1. F is differentiable;
  2. F is strictly convex;
  3. if the boundary $\partial\Omega$ of $\Omega$ is nonempty, then $\lim_{x \to \partial\Omega} \lVert \nabla F(x) \rVert = \infty$.
The last condition is often omitted, but it enables a correct application of the Legendre transformation, a useful tool from convex optimization [9] that we will review later.
Given a function F of Legendre type, the Bregman divergence [7] generated by F is the function
$$D_F : \Omega \times \Omega \to [0, \infty], \qquad D_F(x \,\|\, y) = F(x) - \big( F(y) + \langle \nabla F(y),\, x - y \rangle \big).$$
In other words, the divergence in the direction from x to y is the difference between F ( x ) and the best affine approximation of F at y, evaluated at x. We illustrate this in Figure 2. The construction also mirrors the geometric interpretation of the K L divergence from Figure 1.
Just like metrics, Bregman divergences are always non-negative: in fact, $D_F(x \,\|\, y) \ge 0$, with equality if and only if $x = y$. Unlike metrics, however, they are generally not symmetric, and do not satisfy the triangle inequality. To emphasize the lack of these two properties, we refrain from referring to the output of $D_F(x \,\|\, y)$ as a ‘distance’. Similarly, to emphasize the lack of symmetry, it is customary to write $D_F(x \,\|\, y)$ and not simply $D_F(x, y)$ for the divergence computed from x to y.
Relative entropy as a Bregman divergence. Given the function $F : \mathbb{R}^d_+ \to \mathbb{R}$, $F(x) = E(x) = \sum_i x_i \log_2 x_i$, we have $\nabla F(x) = \frac{1}{\ln 2} \big( 1 + \ln x_1, \dots, 1 + \ln x_d \big)$. Substituting into the formula for the Bregman divergence, we obtain
$$D_{KL}(x \,\|\, y) = D_E(x \,\|\, y) = \sum_{i=1}^d x_i \log_2 x_i - \sum_{i=1}^d y_i \log_2 y_i - \sum_{i=1}^d \frac{1 + \ln y_i}{\ln 2}\,(x_i - y_i) = \sum_{i=1}^d \Big( x_i \log_2 x_i - x_i \log_2 y_i + \frac{y_i - x_i}{\ln 2} \Big).$$
This divergence is often called a generalized Kullback–Leibler divergence. Restricting it to the probability simplex, $\Delta^{d-1}$, we obtain a divergence that coincides with the relative entropy defined in the previous section. Following the statistics and information theory literature, we will refer to it as the Kullback–Leibler divergence and denote it by $D_{KL}$. One can easily check that the generator of this divergence is indeed a function of Legendre type, also when restricted to the probability simplex.
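This computation can also be checked numerically: a generic implementation of the Bregman construction, instantiated with the generator $\sum_i x_i \log_2 x_i$, agrees with the direct relative-entropy formula on the probability simplex. The sketch below (with helper names of our choosing) draws random strictly positive probability vectors for the comparison.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
LOG2 = np.log(2.0)

def F(x):
    """Generator of the (generalized) KL divergence: F(x) = sum_i x_i * log2(x_i)."""
    return float(np.sum(x * np.log2(x)))

def grad_F(x):
    """Gradient of F: (1/ln 2) * (1 + ln x_i)."""
    return (1.0 + np.log(x)) / LOG2

def bregman(f, grad_f, x, y):
    """Generic Bregman divergence D_f(x || y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - float(np.dot(grad_f(y), x - y))

def kl(x, y):
    """Relative entropy in bits (x and y strictly positive here)."""
    return float(np.sum(x * np.log2(x / y)))

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()   # two random points on the open simplex
y = rng.random(5); y /= y.sum()

print(bregman(F, grad_F, x, y))   # agrees with kl(x, y) up to floating point error
print(kl(x, y))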
Other Bregman divergences. Among other examples of Bregman divergences, the most prominent one is the squared Euclidean distance (SE), generated by the square of the Euclidean norm:
$$F : \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \sum_{i=1}^d x_i^2, \qquad\qquad D_{SE}(x \,\|\, y) = \sum_{i=1}^d (x_i - y_i)^2.$$
We remark that restricting the domain to any bounded subset of $\mathbb{R}^d$ violates Condition 3 for a Legendre-type function. Consequently, under such a restriction the square of the Euclidean distance does not fulfill the above definition of a Bregman divergence.
The Itakura–Saito (IS) divergence [10], generated by the function
$$F : \mathbb{R}^d_+ \to \mathbb{R}, \quad x \mapsto -\sum_{i=1}^d \log x_i, \qquad\qquad D_{IS}(x \,\|\, y) = \sum_{i=1}^d \Big( \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \Big),$$
has seen success as a loss function in machine learning models analyzing speech and sound data [11].
All of the above divergences are often classified as decomposable Bregman divergences, since each of them decomposes as a sum of 1-dimensional divergences.
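Decomposability can be made explicit in code: a one-dimensional generator and its derivative determine the whole divergence coordinate-wise. The sketch below, with the generators given above (and $-\log t$ for Itakura–Saito), instantiates the three decomposable examples; the function names are ours.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def decomposable_divergence(f, df):
    """Build a decomposable Bregman divergence from a 1-D generator f and its derivative df."""
    def D(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(np.sum(f(x) - f(y) - df(y) * (x - y)))
    return D

# Squared Euclidean distance: generator t^2.
D_SE = decomposable_divergence(lambda t: t**2, lambda t: 2*t)
# Generalized KL divergence: generator t*log2(t).
D_KL = decomposable_divergence(lambda t: t*np.log2(t), lambda t: (1 + np.log(t))/np.log(2))
# Itakura-Saito divergence: generator -log(t).
D_IS = decomposable_divergence(lambda t: -np.log(t), lambda t: -1.0/t)

x = np.array([0.5, 0.25, 0.25])
y = np.array([0.4, 0.4, 0.2])
print(D_SE(x, y), D_KL(x, y), D_IS(x, y))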
The Mahalanobis divergence [12] is commonly used in statistics as a distance notion between probability distributions [13]. It is generated by the convex quadratic form
$$F : \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto x^\top Q\, x, \qquad\qquad \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty], \quad (x, y) \mapsto (x - y)^\top Q\, (x - y),$$
where Q is the inverse of a variance–covariance matrix. Unlike the other examples, it is generally not a decomposable divergence. The Mahalanobis divergence has also seen success in machine learning—examples include supervised clustering [14] and the classification of hyperspectral images [15].
Bregman Geometry. Similarly to a metric, each Bregman divergence induces a geometry. We provide a brief overview of some key objects and features of Bregman geometries, and provide their information–theoretic interpretations in the case of the K L divergence.
A fundamental object in geometry is the ball. Due to the asymmetry of Bregman divergences, one can define two types of Bregman balls [16]. The primal Bregman ball of radius $r \ge 0$ centered at the point $q \in \Omega$ is defined as
$$B_F(q; r) = \{ y \in \Omega : D_F(q \,\|\, y) \le r \},$$
and is the collection of those points whose divergence from q does not exceed r. See Figure 3 for various illustrations. Primal Bregman balls have a particularly nice geometric interpretation: given a light source at the point $(q, F(q) - r)$, the primal ball $B_F(q; r)$ is the illuminated part of the graph of F, projected vertically onto $\Omega$. We illustrate this in Figure 4, on the left.
The dual Bregman ball of radius $r \ge 0$ centered at the point $q \in \Omega$ is defined as
$$B_F'(q; r) = \{ y \in \Omega : D_F(y \,\|\, q) \le r \},$$
and is the collection of points whose divergence to q does not exceed r. The dual ball has a geometric interpretation similar to the primal ball. We first shift the tangent plane of the graph of F at $(q, F(q))$ up by r. The dual Bregman ball $B_F'(q; r)$ is the portion of the graph of F below this plane, projected vertically onto $\Omega$. We illustrate this in Figure 4, on the right.
As seen in Figure 3, and observed in [17], primal Bregman balls can be non-convex when viewed as subsets of Euclidean space. In contrast, dual balls are always convex since $D_F(x \,\|\, y)$ is convex in x.
It may be tempting to consider the two geometries induced by a Bregman divergence to and from a point, separately. However, there is a strong connection between the two, given by the Legendre transformation [9].
Legendre transform. The Legendre transform $F^*$ of a function F of Legendre type is defined on a conjugate domain
$$\Omega^* = \nabla F(\Omega) = \{ \nabla F(x) : x \in \Omega \}$$
as
$$F^*(x^*) = \sup_{x \in \Omega} \big( \langle x^*, x \rangle - F(x) \big).$$
This construction induces a canonical map
$$\Omega \to \Omega^*, \qquad x \mapsto x^* = \nabla F(x).$$
It turns out that $F^*$ is a convex function of Legendre type, and thus it also generates a Bregman divergence, $D_{F^*}$. This divergence satisfies
$$D_{F^*}(p^* \,\|\, q^*) = D_F(q \,\|\, p),$$
implying that the map above sends primal balls in $\Omega$ to dual balls in $\Omega^*$ [17].
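For a concrete check of this identity, the sketch below uses the negative entropy with the natural logarithm, $F(x) = \sum_i x_i \ln x_i$ on the positive orthant, whose conjugate has the closed form $F^*(x^*) = \sum_i e^{x_i^* - 1}$. The natural logarithm is chosen only to keep the conjugate simple; the base-2 generator used elsewhere in the paper behaves analogously. Helper names are ours.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def F(x):           return float(np.sum(x * np.log(x)))       # negative entropy (natural log)
def grad_F(x):      return 1.0 + np.log(x)
def F_conj(y):      return float(np.sum(np.exp(y - 1.0)))      # Legendre conjugate, closed form
def grad_F_conj(y): return np.exp(y - 1.0)

def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x || y) generated by f."""
    return f(x) - f(y) - float(np.dot(grad_f(y), x - y))

rng = np.random.default_rng(1)
p = rng.random(4) + 0.1
q = rng.random(4) + 0.1
p_star, q_star = grad_F(p), grad_F(q)     # the canonical map x -> x* = grad F(x)

# Duality: D_{F*}(p* || q*) equals D_F(q || p), with the arguments swapped.
print(bregman(F_conj, grad_F_conj, p_star, q_star))
print(bregman(F, grad_F, q, p))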
Chernoff point. Another connection between the two geometries is given by the Chernoff point [18]. Following [19], we say a set $X \subset \Omega$ has an enclosing sphere if there exists a dual Bregman ball containing X. The enclosing sphere is then the boundary of this ball.
Moreover, every finite set $X \subset \Omega$ has a unique smallest enclosing Bregman sphere [16,19,20]. The center of this smallest enclosing sphere is known as the Chernoff point for X.
We are interested in a simple situation in which the set X consists of two points only, $X = \{p, q\}$. In this case, the Chernoff point minimizes the divergence $D_F(p \,\|\, c)$ subject to $D_F(p \,\|\, c) = D_F(q \,\|\, c)$; one can view it as lying midway between p and q with respect to the chosen Bregman divergence. Indeed, in the squared Euclidean case, it is the usual midpoint (arithmetic mean), namely $\frac{p+q}{2}$. However, for other Bregman divergences this point will generally not be the midpoint, but some other point on the segment joining the two points. We also remark that the midpoint does play a special role for all Bregman divergences. We elaborate on this when discussing the Bregman k-means clustering algorithm. Additionally, the popular Jensen–Shannon divergence, and its variants, rely on the midpoint.
The Chernoff point can be visualized: for each point $p \in X$, consider the primal ball $B_F(p; r)$ growing about p, parameterized by $r \ge 0$. Then, the Chernoff point is the point where all the balls $B_F(p; r)$ intersect for the first time. In the case when $X = \{p, q\}$, let us denote this radius by $r^*$, and the Chernoff point by c. Then, c lies on the boundary of both $B_F(p; r^*)$ and $B_F(q; r^*)$, and p and q lie on the boundary of $B_F'(c; r^*)$. A visualization of this interaction can be seen in Figure 5.
We refer to works by Nielsen [18,21] for more information on Chernoff points and their applications.
Information-theoretic interpretation of the geometric objects stemming from the KL divergence. In the language of information theory, the primal and dual KL balls have the following interpretation: Let $p \in \Delta^{d-1}$ be a probability vector, and $r \ge 0$ a radius, expressed in bits. Note that, since the KL divergence measures the expected efficiency loss in bits, the radius r can indeed be any (non-negative) real value, and is not limited to being an integer. The primal ball $B_{KL}(p; r)$ contains all probability vectors q that can approximate p with an expected loss of at most r bits. In contrast, the dual ball $B_{KL}'(p; r)$ contains all probability vectors q that can be approximated by p with an expected loss of at most r bits. Consequently, the Chernoff point for two probability vectors p and q is the vector that approximates both p and q with the least loss of expected efficiency (as usual, counted in bits).

4. Algorithms in Bregman Geometry

The development of algorithms for Bregman divergences is relatively young. We survey computational geometry algorithms that were adapted to the Bregman setting from the common Euclidean and metric settings, and highlight the necessary modifications.
k-means clustering. The first algorithm to be adapted was the k-means clustering algorithm. It was originally proposed by Lloyd [22] in 1957, and worked with the Euclidean distance. It was extended to arbitrary Bregman divergences by Banerjee and collaborators [23] in 2005.
In Euclidean space, the k-means clustering partitions a data set into k clusters. Each cluster is identified by a unique point, called a cluster centroid. Here, the centroid is chosen as the arithmetic mean of all the data points in the cluster. This choice is motivated by the fact that the mean is the unique point that minimizes the sum of the squares of the Euclidean distances (i.e., the variance) to all data points in a given cluster.
Surprisingly, the k-means algorithm still works when the square of the Euclidean distance is replaced with a Bregman divergence computed from data points. Indeed, Banerjee and collaborators show that for $X = \{x_i\}_{i=1,\dots,n} \subset \Omega$, the sum $\sum_{i=1}^n D_F(x_i \,\|\, p)$ is uniquely minimized by the point p being the arithmetic mean of X—independently of the choice of a Bregman divergence. Consequently, apart from the requirement of computing the divergence towards the centroid, the original algorithm works in the Bregman setting without modification. This remarkable fact was the first hint that other geometric algorithms may work in the Bregman setting. (See Figure 6 for a comparison of k-means clustering in different geometries.)
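This property is easy to probe numerically. The sketch below perturbs the arithmetic mean of a random point cloud and checks that the total generalized KL divergence toward the candidate centroid never drops below its value at the mean; it is only a sanity check of the statement, not the clustering code referenced above, and the helper names are ours.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def kl_bits(x, y):
    """Generalized KL divergence (in bits) between positive vectors."""
    return float(np.sum(x * np.log2(x / y) + (y - x) / np.log(2)))

rng = np.random.default_rng(2)
X = rng.random((20, 6)) + 0.05          # a small point cloud in the positive orthant
mean = X.mean(axis=0)

def total_divergence(p):
    return sum(kl_bits(x, p) for x in X)

# Perturbing the candidate centroid away from the mean never decreases the sum,
# illustrating Banerjee et al.'s result that the arithmetic mean is the minimizer.
best = total_divergence(mean)
for _ in range(1000):
    candidate = mean + 0.01 * rng.standard_normal(6)
    if np.all(candidate > 0):
        assert total_divergence(candidate) >= best - 1e-9
print("mean minimizes the total divergence:", best)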
Voronoi diagrams. Voronoi diagrams were first formally defined for two- and three-dimensional Euclidean space by Dirichlet [24] in 1850, and for general Euclidean spaces by Voronoi [25,26] in 1908. Given a finite set $S \subset \mathbb{R}^d$ of points, called sites, we partition the space $\mathbb{R}^d$ into Voronoi cells, such that the Voronoi cell of $s \in S$ contains all points in the space for which s is the closest site. More formally, it is the set $\{ x \in \mathbb{R}^d : d(x, s) \le d(x, y) \text{ for all } y \in S \}$.
This definition for Voronoi diagrams was extended to the Bregman setting by Boissonnat, Nielsen, and Nock [17] in 2007. In the Euclidean space, Voronoi cells are convex polytopes (possibly unbounded). However, when another distance measure is used, the shape of these cells can change drastically. See Figure 7 for an illustration of Voronoi diagrams for various distance measures.
In the Euclidean case, the bisector of a pair of sites is a hyperplane, and the Voronoi cells arise as intersections of the resulting half-spaces. One method of computing these cells is by Chazelle’s half-space intersection algorithm [27]. In the Bregman version, Boissonnat, Nielsen, and Nock show that calculating the divergence to the sites yields Bregman bisectors that are also hyperplanes. Chazelle’s algorithm can therefore be used to compute the Bregman Voronoi diagram. However, when computing the divergence from a site, the Bregman bisectors are more general hypersurfaces. Consequently, the Bregman Voronoi cells may have curved faces. As a side note, the Legendre transform can be used to handle this harder scenario by mapping the input to the conjugate space, performing the computations there using hyperplanes, and mapping the result back.
k-Nearest neighbor search. In a metric space $(M, d)$, given a finite subset $S \subseteq M$ and a query $q \in M$, a k-nearest neighbor search finds the k elements of S closest to q. A common strategy to answer these queries is to spatially decompose the set S. This decomposition is often encoded as a tree data structure. Such data structures include ball trees [28], vantage point trees [29], and Kd-trees [30]. Collectively, these are referred to as metric trees—although many of them can be extended to the non-metric setting of Bregman divergences.
In the Bregman setting, due to the asymmetry, we consider two types of nearest neighbor search: namely, $\arg\min_{s \in S} D_F(q \,\|\, s)$ and $\arg\min_{s \in S} D_F(s \,\|\, q)$. Queries are performed by searching subtrees while updating a nearest neighbor candidate. Subtrees may be ignored (pruned) to accelerate the search if certain conditions, which depend on the type of tree, are met.
In 2008, Cayton extended the concept of ball trees from metric spaces to the Bregman setting, and created software for Bregman ball trees [31]. The software works for the squared Euclidean distance and the KL divergence, with experimental support for the IS divergence. Cayton’s Bregman ball trees are constructed with the help of the aforementioned Bregman k-means clustering. The pruning decision involves a projection of a candidate point onto the surface of the Bregman ball. This is performed via a bisection search.
Nielsen, Piro, and Barlaud improved on Cayton’s ball tree algorithm [32]. In particular, they improved the construction by altering the initial points for the Bregman k-means algorithm, and introduced a branching factor to allow for more splits at each internal node. They also implemented a priority-based traversal instead of the standard recursive ball tree traversal. The same authors also adapted another data structure called a vantage point tree in 2009 [32,33]. Specifically, they replaced the metric ball with a dual Bregman ball, and in the pruning decision, they performed a bisection search to check intersections of Bregman balls.
Kd-trees were introduced by Bentley [30] in 1975, and extended to Bregman divergences by Pham and Wagner [34] in 2025. Unlike ball trees and vantage point trees, the construction of the Bregman Kd-tree is independent of the Bregman divergence. Indeed, the choice of divergence can be deferred to the time of performing each query. The query algorithm, surprisingly, is the same as in the Euclidean case. For decomposable Bregman divergences in particular, it allows each pruning decision to be made in effectively O(1) time, independently of the dimension of the data.
Cayton’s Bregman ball trees algorithm is specialized for the K L divergence, with experimental implementation for the I S divergence. Nielsen, Piro, and Barlaud’s implementations show results for the K L and I S divergences. However, for both Bregman ball trees and vantage point trees, adding implementations for new divergences is nontrivial. In contrast, although Kd-trees work for a subfamily of Bregman divergences, a further extension of the implementation to new Bregman divergences is straightforward.
Apart from the data structures and algorithms described above, other exact searches have been extended to work in the Bregman setting. These include R-trees and VA-files (extended by Zhang and collaborators [35] in 2009), and BrePartition (introduced by Song, Gu, Zhang, and Yu [36] in 2020). The listed algorithms provide exact nearest neighbor queries in the Bregman setting [37]. We remark that there exist other nearest neighbor algorithms that work in the Bregman setting [37,38]. However, these algorithms focus on approximations, without guarantees to find the exact nearest neighbors.
Computational topology. Topological concepts, such as homology groups, have been imported into computational geometry. One key concept is persistent homology [39], a stable geometric–topological descriptor of data, including point cloud data. It is the basis of the field called topological data analysis [40].
Building on geometric results of Boissonnat, Nielsen, and Nock [17], Edelsbrunner and Wagner [41] extended concepts of computational topology to the Bregman setting in 2017. In particular, basic concepts that allow for computation of persistent homology were extended to the Bregman setting. These include generalizations of the alpha and Čech constructions [42]. One key result is the proof of contractibility of nonempty intersection of Bregman balls. Intuitively speaking, this result ensures that these constructions correctly capture the topology of data. More recent work focuses on implementation and experimental aspects [43].
The development of computational tools for Bregman divergences motivates the extension of other geometric concepts from metric spaces to the Bregman setting. In the following sections we concentrate on one of the most commonly used—the Hausdorff distance. In short, we aim to compare two sets of vectors embedded in a Bregman geometry.
In Section 5, we recall the definition of the Hausdorff distance and some of its properties. We then introduce the extension to the Bregman setting. Here, we offer two separate variants: the Bregman–Hausdorff and Chernoff–Bregman–Hausdorff divergence. Finally, we offer an interpretation of both divergences based on the K L divergence, through the lens of information theory. In Section 6 and Section 7, we demonstrate how nearest neighbor algorithms can be used to compute the two divergences.
We hope that extending the basic concept to the Bregman setting may also open the door for further development. This would not be unprecedented: as mentioned above, the development of Bregman k-means enabled the development of efficient Bregman ball trees and Bregman vantage point trees.

5. Bregman–Hausdorff Divergence

Hausdorff distance is a very natural concept: introduced by Hausdorff [44] in 1914, it has since become the standard distance measure for comparing sets of points, used ubiquitously across multiple fields of mathematics.
Recently, Hausdorff distance has also been used in applications. Indeed, it is a natural choice whenever two shapes need to be compared. For example, in computer vision, Hausdorff distance has been implemented as a measurement for the largest segmentation error in image segmentation [45,46] and to compare 3D shapes, such as meshes [47,48].
We start this section by providing the definition of the Hausdorff distance in a metric space. We then extend this concept to the Bregman setting. The inherent asymmetry of Bregman divergences leads to a number of distinct definitions, which would all coincide in the metric setting. In devising the definitions, we are guided by geometric and information-theoretic considerations. We elaborate on situations in which each of these new definitions finds a natural application.
Hausdorff distance in metric spaces. Given two sets P and Q in a metric space $(M, d)$, the one-sided Hausdorff distance from P to Q is defined as
$$d(P, Q) = \sup_{p \in P} \inf_{q \in Q} d(p, q).$$
Similarly to the Bregman divergence, this measurement is not symmetric: in general, $d(P, Q) \ne d(Q, P)$. The Hausdorff distance between P and Q is the symmetrization of the two one-sided Hausdorff distances, and is given by the maximum,
$$H_d(P, Q) = \max \{ d(P, Q),\, d(Q, P) \}.$$
Equivalently, the one-sided Hausdorff distance can be defined using a so-called thickening. A thickening—sometimes also called an offset—of the set Q of size r consists of all those points in the ambient space M, whose distance to Q is at most r. In other words, it is the union of all balls of radius r centered at a point in Q.
The one-sided Hausdorff distance from the set P to Q is the radius of the smallest thickening of Q that contains P:
$$d(P, Q) = \inf \Big\{ r \ge 0 : P \subseteq \bigcup_{q \in Q} B_d(q; r) \Big\},$$
where $B_d(x; r)$ is the ball of radius r (with respect to the metric d) centered at x. (We illustrate the two one-sided Hausdorff distances in Figure 8.) The Hausdorff distance between P and Q is—as before—the maximum of the two radii, $d(P, Q)$ and $d(Q, P)$. It is a well-known fact that the Hausdorff distance defines a metric on the collection of closed, bounded, and nonempty subsets of the metric space $(M, d)$.
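For reference, a direct quadratic-time implementation of these definitions in the Euclidean case can look as follows (the point sets are arbitrary toy examples, and the function names are ours).

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def one_sided_hausdorff(P, Q):
    """sup_{p in P} inf_{q in Q} ||p - q||, computed by brute force."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    dists = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)   # all pairwise distances
    return float(dists.min(axis=1).max())      # nearest neighbor per p, then the worst case

def hausdorff(P, Q):
    """Symmetric Hausdorff distance: the maximum of the two one-sided distances."""
    return max(one_sided_hausdorff(P, Q), one_sided_hausdorff(Q, P))

P = [[0.0, 0.0], [1.0, 0.0]]
Q = [[0.0, 0.1], [3.0, 0.0]]
print(one_sided_hausdorff(P, Q))   # ~1.005
print(one_sided_hausdorff(Q, P))   # 2.0
print(hausdorff(P, Q))             # 2.0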
Bregman–Hausdorff divergence. To extend the notion of the one-sided Hausdorff distance to the Bregman setting, we use the geometric perspective of thickenings to help us select viable definitions.
Let F be a function of Legendre type, defined on the domain $\Omega$, and let P and Q be two nonempty subsets of $\Omega$. The primal (resp., dual) thickening of Q of size $r \ge 0$ is the union of primal (resp., dual) balls of radius r, centered at the points in Q, with respect to the divergence $D_F$. We define the primal (resp., dual) (one-sided) Bregman–Hausdorff divergence from P to Q, with respect to the divergence $D_F$, as
$$H_{D_F}(P \,\|\, Q) = \inf \Big\{ r \ge 0 : P \subseteq \bigcup_{q \in Q} B_F(q; r) \Big\}, \qquad H_{D_F}'(P \,\|\, Q) = \inf \Big\{ r \ge 0 : P \subseteq \bigcup_{q \in Q} B_F'(q; r) \Big\},$$
respectively. See Figure 9 for a visualization. We will refer to this new measurement as a divergence, rather than calling it a ‘one-sided distance’, to emphasize that it is generated by a Bregman divergence, and does not satisfy the triangle inequality.
Similarly to the one-sided Hausdorff distances, we have equivalent expressions for both the primal and the dual Bregman–Hausdorff divergence:
$$H_{D_F}(P \,\|\, Q) = \sup_{p \in P} \inf_{q \in Q} D_F(q \,\|\, p), \qquad H_{D_F}'(P \,\|\, Q) = \sup_{p \in P} \inf_{q \in Q} D_F(p \,\|\, q).$$
These expressions will be useful for computations. It is worth noting that the asymmetry of Bregman divergences allows for more definitions, which however, deviate from the natural geometric interpretation of the original Hausdorff distance.
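These sup–inf expressions translate directly into a brute-force reference implementation, quadratic in the input sizes; the efficient Kd-tree based algorithms are described in Section 6. The sketch below is generic in the divergence and in the argument order, so both variants can be evaluated; the Dirichlet sampling merely stands in for real prediction sets, and the function names are ours.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def kl_bits(x, y):
    """Generalized KL divergence D_KL(x || y) in bits, for strictly positive vectors."""
    return float(np.sum(x * np.log2(x / y) + (y - x) / np.log(2)))

def bregman_hausdorff(P, Q, div, swap=False):
    """Brute-force sup over p in P of the minimal div(p, q) (or div(q, p) if swap=True)
    over q in Q; the two argument orders correspond to the two variants defined above."""
    best = 0.0
    for p in P:
        nearest = min((div(q, p) if swap else div(p, q)) for q in Q)
        best = max(best, nearest)
    return best

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(5), size=50).clip(1e-12)    # two sets of probability vectors
Q = rng.dirichlet(np.ones(5), size=80).clip(1e-12)

print(bregman_hausdorff(P, Q, kl_bits))               # one argument order
print(bregman_hausdorff(P, Q, kl_bits, swap=True))    # the other argument order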
Furthermore, it would be possible to define symmetrized (primal and dual) Bregman–Hausdorff divergence as the maximum of the two variants. However, we refrain from symmetrizing it this way, for the same reason Bregman divergences are typically not symmetrized. Namely, each of the above definitions has a natural interpretation and applications. However, we will introduce a third variant, which will be naturally symmetric.
For popular divergences with established names and abbreviations, such as the KL and IS divergences, we shorten the notation for the corresponding Bregman–Hausdorff divergences: we write $H_{KL}$ for the KL–Hausdorff divergence and $H_{IS}$ for the IS–Hausdorff divergence.
The proposed Bregman–Hausdorff divergences can be used to compare the probabilistic predictions of a machine learning model with a reference model, as we showcase on the example of the K L divergence in the next paragraph.
Interpreting the Bregman–Hausdorff divergences with respect to the KL divergence. We can now extend the interpretation of the K L divergence presented in Section 2 to the Bregman–Hausdorff divergences based on this divergence. This new case involves not only pairs of probability vectors, but pairs of collections of such vectors.
Let P and Q be nonempty collections of probability vectors in $\Delta^{d-1}$. If we form a primal KL ball $B_{KL}(q; r)$, with a fixed radius $r \ge 0$, around every point $q \in Q$, then the region covered by these balls will contain all probability vectors that can approximate a vector in Q with an expected loss of at most r bits. Now, if the set P is contained in this region, then r is an upper bound on how inefficient the approximation of probabilities in Q is for some vector $p \in P$. Thus, by taking the infimum over all radii such that P is contained within the primal KL balls around Q, we can compute how efficient the approximation can be. This infimum is precisely the primal Bregman–Hausdorff divergence from P to Q, $H_{KL}(P \,\|\, Q)$; it measures the maximum expected efficiency loss (in bits) if P is used to reasonably approximate Q. In other words, for any probability vector $p \in P$, there exists a vector $q \in Q$ which p approximates with an expected efficiency loss of at most $H_{KL}(P \,\|\, Q)$ bits.
In contrast, the dual Bregman–Hausdorff divergence $H_{KL}'(P \,\|\, Q)$ measures the minimum radius for which the dual thickening of Q covers P. Each dual ball $B_{KL}'(q; r)$ contains all probability vectors that q approximates with an expected efficiency loss of at most r bits. Thus, when $r = H_{KL}'(P \,\|\, Q)$, the union of the dual Bregman balls captures the maximum expected number of bits lost for any probability vector $p \in P$ to be reasonably approximated by a probability vector in Q.
Hence, $H_{KL}(P \,\|\, Q)$ and $H_{KL}(Q \,\|\, P)$ measure how P and Q can approximate each other. Specifically, $H_{KL}(P \,\|\, Q)$ is the maximum loss of expected bits if any vector $p \in P$ is used to approximate some vector $q \in Q$: when $H_{KL}(P \,\|\, Q) = r$, every point of P is contained in some $B_{KL}(q; r)$, so every p approximates the center of a ball containing it. In the other direction, $H_{KL}(Q \,\|\, P)$ bounds the loss of expected bits when every vector $q \in Q$ is used to approximate some vector in P; in this case not every vector of P needs to be approximated.
We can use the Bregman–Hausdorff divergence in the assessment of the performance of machine learning models. Indeed, let $M_1$ and $M_2$ be two different classification models trained using the KL divergence loss (or, equivalently, the cross-entropy loss). We also let X, Z be two data sets, and denote the probabilistic predictions made by the models as $P = \{ M_1(x; \theta_1) \}_{x \in X}$ and $Q = \{ M_2(z; \theta_2) \}_{z \in Z}$. Then, $H_{KL}(P \,\|\, Q)$ quantifies a divergence from the set P of predictions made by $M_1$ towards the set Q of predictions made by $M_2$.
We stress that this measurement does not rely on any explicit pairing between the outputs of the two models (i.e., there is no obvious bijection between X and Z). Although the data sets X and Z are not explicitly paired, we can still make a reasonable numerical measurement between the two. This is the case when, for example, $M_1 = M_2$, and X and Z are training and test data, respectively. In this case, the values $H_{KL}(P \,\|\, Q)$ and $H_{KL}(Q \,\|\, P)$ can be used to gauge the generalization power of a model. Importantly, this measurement is consistent with the loss function used to train the model.
Chernoff–Bregman–Hausdorff distance. We propose one more natural distance measurement. As its name suggests, the Chernoff–Bregman–Hausdorff distance is based on the notion of the Chernoff point. As before, we let F be a function of Legendre type, defined on the domain Ω , and let P and Q be two nonempty subsets of Ω .
For each pair of points $(p, q) \in P \times Q$, write $c_{p,q}$ for the Chernoff point of the set $\{p, q\}$, and write $C = \{ c_{p,q} : (p, q) \in P \times Q \}$ for the collection of all the Chernoff points. Then, the primal (resp., dual) Chernoff–Bregman–Hausdorff distance is the smallest size of the dual (resp., primal) thickening of C that contains the union $P \cup Q$. To be more concrete, we define the primal Chernoff–Bregman–Hausdorff distance between P and Q as
$$CH_{D_F}(P, Q) = \inf \Big\{ r \ge 0 : P \cup Q \subseteq \bigcup_{c \in C} B_F'(c; r) \Big\},$$
and the dual Chernoff–Bregman–Hausdorff distance between P and Q as
$$CH_{D_F}'(P, Q) = \inf \Big\{ r \ge 0 : P \cup Q \subseteq \bigcup_{c \in C} B_F(c; r) \Big\}.$$
We emphasize that each divergence is named after the type of the Bregman ball that grows about the pair $(p, q) \in P \times Q$, and not after the balls growing about the Chernoff points. We visualize the primal and dual Chernoff–Bregman–Hausdorff distances in Figure 10. Unlike the Bregman–Hausdorff divergences, the Chernoff–Bregman–Hausdorff distance is symmetric.
In fact, the set C can be viewed as the ‘average’ of P and Q with respect to the chosen divergence at the level of sets. In contrast to symmetrizing the Bregman–Hausdorff divergence by taking the average $(H_{D_F}(P \,\|\, Q) + H_{D_F}(Q \,\|\, P))/2$, the Chernoff–Bregman–Hausdorff distance avoids mixing directions of divergence computations. In particular, in the context of the KL divergence, the corresponding Chernoff–Bregman–Hausdorff distance inherits the information-theoretical interpretation, as we will see shortly. In the case of the squared Euclidean distance, C contains the usual arithmetic average $\frac{p+q}{2}$ for each pair $(p, q) \in P \times Q$.
Again, for the KL and IS divergences, we shorten the notation to $CH_{KL}$ and $CH_{IS}$, respectively.
Interpreting the Chernoff–Bregman–Hausdorff distance for the KL divergence. As the primal Chernoff–Bregman–Hausdorff distance is defined by taking the infimum of the radius of a dual Bregman ball about the Chernoff points, $CH_{KL}$ gives the least number of expected bits lost using C to approximate both P and Q. Similarly, the dual Chernoff–Bregman–Hausdorff distance, where we center primal balls about each Chernoff point, gives the least number of expected bits lost to approximate C by either P or Q.
Returning to machine learning, for prediction sets $P = \{ M_1(x; \theta_1) \}_{x \in X}$ and $Q = \{ M_2(z; \theta_2) \}_{z \in Z}$, C can be viewed as the collection of ‘average’ probability distributions [20] for every $(p, q) \in P \times Q$. Therefore, $CH_{KL}(P, Q)$ measures the maximum expected loss of coding efficiency (in bits) when attempting to reasonably approximate both P and Q using C. In this sense, C contains ‘joint approximators’ for pairs from $P \times Q$. Similarly, $CH_{KL}'(P, Q)$ measures the maximum expected loss when using P and Q to approximate the set C. While the Bregman–Hausdorff divergence is applicable if either P or Q is a reference set of vectors, the Chernoff–Bregman–Hausdorff distance is a natural choice when P and Q play the same role.

6. Algorithms for Bregman–Hausdorff Divergences

In this section, we present the first algorithms for the Bregman–Hausdorff divergences. The algorithms rely on the data structure for nearest neighbor search, which we outline next. For the remainder of this section, we let F be a decomposable function of Legendre type, defined on the domain $\Omega$, and let P and Q be two finite, nonempty subsets of $\Omega$.
Bregman Kd-trees. We use the Bregman Kd-tree structure and search algorithm mentioned in Section 4. In [34], Pham and Wagner experimentally show that Kd-trees for Bregman divergences are efficient in a range of practical situations. In particular, they perform better than Cayton’s Bregman ball trees, and alleviate issues with certain other implementations.
The aforementioned Bregman Kd-tree implementation works for decomposable Bregman divergences [49,50], including the squared Euclidean distance and the KL and IS divergences. One additional benefit of this algorithm is its ability to compute the nearest neighbor in either direction.
Computing the Bregman–Hausdorff divergences. We first provide an algorithm for computing the Bregman–Hausdorff divergences between P and Q.
To compute the Bregman–Hausdorff divergence $H_{D_F}(P \,\|\, Q)$ from P to Q, we can use a version of the Kd-tree data structure for decomposable Bregman divergences [34]. We first construct a Kd-tree to represent P. Then, for each $q \in Q$, we search for $\rho = \arg\min_{p \in P} D_F(p \,\|\, q)$, maintaining the value of the largest divergence encountered so far, $D_F(\rho \,\|\, q)$. We denote this maintained value as haus in the three algorithms presented.
In Algorithm 1, we can replace line 2 and line 4 with any nearest neighbor structure and search. As the Bregman Kd-tree can be queried for divergences computed in both directions, this algorithm also works for computing the dual Bregman–Hausdorff divergence. For a proximity search algorithm with per-query time complexity $O(C(n, d))$, this algorithm runs in $O(m \cdot C(n, d))$ time.
Algorithm 1 Primal Bregman–Hausdorff divergence algorithm (basic version)
Require: Point clouds P and Q of size n and m; decomposable Bregman divergence D_F.
Ensure: Bregman–Hausdorff divergence H_{D_F}(P ∥ Q)
  1: haus ← 0
  2: KdTree ← build Kd-tree(P)
  3: for q ∈ Q do
  4:     nn = KdTree.query(q, D_F)            ▹ Find arg min_{p ∈ P} D_F(p ∥ q)
  5:     nn_div = D_F(nn, q)
  6:     haus = max(nn_div, haus)
  7: return haus
To approximate the Bregman–Hausdorff divergence, one can use a $(1+\epsilon)$-nearest neighbor algorithm.
We can accelerate Algorithm 1 by adding an early query termination during the Kd-tree search. To the best of our knowledge, this technique was introduced to approximate the one-sided Hausdorff distance between 3D meshes [48]. We adjust the Kd-tree query in line 4 as follows. During each query, candidates for $\arg\min_{p \in P} D_F(p \,\|\, q)$ are found. For any candidate $\rho \in P$, if $D_F(\rho \,\|\, q) \le haus$, then we terminate the query. We denote this adjusted search as the shell_query method on line 4 in Algorithm 2. The shell_query method returns the nearest neighbor if the search completes and Null otherwise. An illustration of the above considerations can be seen in Figure 11.
In the worst case scenario, the shell variant will have the same complexity as Algorithm 1, but we will see in Section 7 that reducing the number of complete searches provides significant speed-ups, even in high dimensions.
Algorithm 2 Primal Bregman–Hausdorff divergence shell algorithm (improved)
Require: Point clouds P and Q of size n and m, respectively; decomposable Bregman divergence D_F.
Ensure: Bregman–Hausdorff divergence H_{D_F}(P ∥ Q)
  1: haus ← 0
  2: KdTree ← build Kd-tree(P)
  3: for q ∈ Q do
  4:     nn = KdTree.shell_query(q, D_F, haus)
  5:     if nn ≠ Null then
  6:         haus ← D_F(nn, q)
  7: return haus
We will experimentally compare these implementations with a naive algorithm, where the Kd-tree search in line 4 of Algorithm 1 is replaced by a linear search.
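For completeness, here is a minimal Python sketch of such a linear-search baseline, together with the shell-style early termination described above; the Kd-tree is simply replaced by a full scan, and the function names are ours rather than those of the reference implementation.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def kl_bits(x, y):
    return float(np.sum(x * np.log2(x / y) + (y - x) / np.log(2)))

def hausdorff_linear(P, Q, div):
    """Naive baseline: for each q in Q, a full linear scan over P for the nearest neighbor."""
    haus = 0.0
    for q in Q:
        haus = max(haus, min(div(p, q) for p in P))
    return haus

def hausdorff_linear_shell(P, Q, div):
    """Same scan, but a query is abandoned as soon as a candidate within haus is found,
    since such a q cannot increase the running maximum."""
    haus = 0.0
    for q in Q:
        nearest = np.inf
        for p in P:
            d = div(p, q)
            if d <= haus:          # early termination ("shell") condition
                nearest = None
                break
            nearest = min(nearest, d)
        if nearest is not None:
            haus = max(haus, nearest)
    return haus

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(10), size=200).clip(1e-12)
Q = rng.dirichlet(np.ones(10), size=100).clip(1e-12)
assert abs(hausdorff_linear(P, Q, kl_bits) - hausdorff_linear_shell(P, Q, kl_bits)) < 1e-12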
Computing the Chernoff–Bregman–Hausdorff distance. We also provide an algorithm for the Chernoff–Bregman–Hausdorff distance. To compute the distance, we first determine the set of Chernoff points: $C = \{ c_{p,q} : (p, q) \in P \times Q \}$.
To approximate the Chernoff point $c_{p,q}$ for a pair $(p, q) \in P \times Q$, we perform a bisection search along the line segment connecting p and q, $\alpha p + (1 - \alpha) q$ with parameter $\alpha \in [0, 1]$. This search was proposed by Nielsen [20]. We assume it runs in $O(\beta(\epsilon)\, d)$ time to get $\epsilon$-close to the true Chernoff point, with d being the dimension of the ambient space in which the domain $\Omega$ lies. Thus, this component of the algorithm runs in $O(\beta(\epsilon)\, d\, m n)$ time. Letting $O(C(nm, d))$ be the per-query time complexity of the chosen proximity search structure built on the nm Chernoff points, the search phase runs in $O((n + m) \cdot C(nm, d))$ time. This gives a total running time of $O(\beta(\epsilon)\, d\, m n + (n + m) \cdot C(nm, d))$.
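A minimal sketch of this bisection, with the stopping rule expressed as a fixed number of iterations rather than an explicit ε, could look as follows; it is written for the KL divergence but accepts any divergence, and the helper names are ours.

import numpy as np

# Illustrative sketch; not from the paper's implementation.
def kl_bits(x, y):
    return float(np.sum(x * np.log2(x / y) + (y - x) / np.log(2)))

def chernoff_point(p, q, div, iters=50):
    """Bisection along the segment alpha*p + (1-alpha)*q for the point c with
    div(p, c) == div(q, c); the number of iterations controls the precision."""
    lo, hi = 0.0, 1.0                     # alpha = 0 gives c = q, alpha = 1 gives c = p
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        c = alpha * p + (1.0 - alpha) * q
        if div(p, c) > div(q, c):
            lo = alpha                    # c still diverges more from p; move it toward p
        else:
            hi = alpha
    alpha = 0.5 * (lo + hi)
    return alpha * p + (1.0 - alpha) * q

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])
c = chernoff_point(p, q, kl_bits)
print(kl_bits(p, c), kl_bits(q, c))       # the two divergences agree up to the tolerance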
As with the Bregman–Hausdorff divergence, Kd-trees may be replaced by other exact Bregman nearest neighbor structures and algorithms; a $(1+\epsilon)$ approximation may be used as well.
Computing the dual Chernoff–Bregman–Hausdorff distance requires mapping the input to the Legendre conjugate space. This adds a preprocessing step, but does not affect the complexity of the algorithm.

7. Experiments

In this section, we compute Bregman–Hausdorff divergences between various data sets. We work both with practical and synthetic data sets, and provide computation times. Generally, we use Algorithms 1 and 2 with domain $\Omega = \Delta^{d-1}$, where d depends on the chosen data set. We will run experiments using the KL and IS divergences, as well as the squared Euclidean distance. We will focus on Algorithms 1 and 2 because, unlike Algorithm 3, they promise to be efficient in practice. The computed values are exact up to machine precision, since the Bregman Kd-trees we use provide exact nearest neighbors.
Algorithm 3 Primal Chernoff–Bregman–Hausdorff distance algorithm
Require: Point clouds P and Q of size n and m; decomposable Bregman divergence D_F.
Ensure: Chernoff–Bregman–Hausdorff distance CH_{D_F}(P, Q)
  1: haus ← 0
  2: C ← empty array of size nm × dim(Ω)
  3: i ← 0
  4: for q ∈ Q do
  5:     for p ∈ P do
  6:         C[i] = Chernoff(p, q)            ▹ Compute the Chernoff points.
  7:         i = i + 1
  8: KdTree ← build Kd-tree(C)
  9: for a ∈ Q ∪ P do
 10:     nn = KdTree.query(a, D_F)            ▹ Find arg min_{c ∈ C} D_F(a ∥ c)
 11:     nn_div = D_F(a, nn)
 12:     haus = max(nn_div, haus)
 13: return haus
Compiler and hardware. Software was compiled with Clang 14.0.3. The experiments were performed on a single core of a 3.5 GHz ARM64-based CPU with 4 MB L2 cache using 32 GB RAM.
Data sets. We use predictions from machine learning models and synthetic data sets. We train two neural networks, $M_1$ and $M_2$, on a classification task using CIFAR100, which has 50,000 training images and 10,000 test images. Specifically, we perform transfer learning using EfficientNetB0 [51] pretrained on ImageNet as a backbone. The first model, $M_1$, is trained with fine-tuning and the second model, $M_2$, is trained without. Both models are trained using the KL divergence as the loss function. The models $M_1$ and $M_2$ achieve 80.22% and 71.74% test accuracy, respectively. From each model, we produce two sets of predictions: $(\text{trn}_i, \text{tst}_i)$, for $i \in \{1, 2\}$. The synthetic data sets are drawn uniformly from the open simplex in dimensions 50, 100, and 250. The target data set, P, has 100,000 sample points; the query data set, Q, has 20,000.
Bregman–Hausdorff computations. One concrete motivation for this measurement was the need to quantify how well one set of predictions approximates another. We set this up as a computation of the Bregman–Hausdorff divergence. We are especially interested in comparing the probabilistic predictions arising from two models of different quality, as well as the training and test data.
Additionally, we aim to investigate to what extent the choice of the underlying Bregman divergence affects the resulting Bregman–Hausdorff divergence. Of course, in practice, one would use an appropriate divergence. For example, if the data represent probability vectors, the KL divergence, and by extension the KL–Hausdorff divergence, is a natural choice. However, in practice the (squared) Euclidean distance and the related Hausdorff distance are often used, mostly due to limitations of computational tools. The following experiments confirm that the choice of divergence has a significant effect on the outputs of the corresponding Bregman–Hausdorff divergence.
Using the relative entropy loss. In Table 2, we compare the Bregman–Hausdorff divergence between the four sets of predictions. Divergences change by row, data sets by column, and the units for the K L divergence are in bits. We choose a subset of the possible pairings to highlight the asymmetry and some key interpretations, and also for brevity.
From the row containing the values of $H_{KL}$ in Table 2, the lowest value is $H_{KL}(\text{tst}_1 \,\|\, \text{trn}_1)$. This value shows that $\text{tst}_1$ predicts $\text{trn}_1$ with a maximum expected loss of 1.764 bits. In particular, $H_{KL}(\text{tst}_1 \,\|\, \text{trn}_1) < H_{KL}(\text{trn}_1 \,\|\, \text{tst}_1)$—the relation we would expect from $M_1$’s predictions on the training and test data sets. In contrast, $H_{SE}(\text{tst}_1 \,\|\, \text{trn}_1) = 0.371$ is the largest value in its row. In particular, $H_{SE}(\text{tst}_1 \,\|\, \text{trn}_1) > H_{SE}(\text{trn}_1 \,\|\, \text{tst}_1)$, which is the reverse of how we expect the soft predictions from the training data and test data to behave. On the surface, this would indicate that $\text{trn}_1$ is a poor predictor of $\text{tst}_1$. However, as the models were trained to minimize the KL divergence, and because of the difference between the Euclidean geometry and the Bregman geometry induced by the KL divergence, values of $H_{SE}$ do not carry the information-theoretical interpretation. This shows the importance of analyzing these models using the proposed KL–Hausdorff divergence rather than the standard one-sided Hausdorff distance.
We see a similar effect for the computation of $H_{IS}$ and $H_{IS}'$; the results suggest that the geometry induced by the IS divergence is drastically different from the geometry induced by the KL divergence, and thus the computations lack an obvious interpretation in this context. However, for comparing speech and sound data (as opposed to probabilistic predictions), the IS–Hausdorff divergence would be the preferred choice. In this case, the measurement would inherit an interpretation from the IS divergence [11]. So, while in practice it is clear that the KL–Hausdorff divergence is the only reasonable choice here, we provide the table to demonstrate that our algorithms can handle various Bregman divergences and that the outputs are highly sensitive to the chosen divergence.
Using the mean squared error loss. We now train $M_1$ and $M_2$ by minimizing the mean squared error (MSE) loss, which corresponds to the squared Euclidean distance. In this case, the training accuracy of $M_1$ is 73.97%, while the training accuracy of $M_2$ is 69.18%. For numerical stability, we decrease the learning rate for $M_2$ for this loss function. We stress that, like before, these models output classification predictions, interpreted as vectors in $\mathbb{R}^{100}$.
Similarly, we produce two sets of data per model, labeled as $(\text{tst}_i^{SE}, \text{trn}_i^{SE})$ for $i \in \{1, 2\}$. We compute the Bregman–Hausdorff divergences for the same pairs of sets as in Table 2 and record them in Table 3.
In contrast to minimizing with respect to the KL divergence, all KL–Hausdorff divergence values computed from the SE-minimized data sets are smaller. This tells us that $\text{tst}_i^{SE}$ and $\text{trn}_i^{SE}$ are more tightly clustered in the geometry induced by the KL divergence than the sets $\text{tst}_i$ and $\text{trn}_i$ from above. In contrast, $H_{SE}$ is larger when comparing the training data predictions, $\text{trn}_1^{SE}$ and $\text{trn}_2^{SE}$, between the two models, showing that the two resulting prediction sets are less tightly clustered when the minimized loss function changes.
Timings. Finally, we compare the computation speeds of the algorithms using the Bregman Kd-tree (Algorithms 1 and 2) with a version that uses a linear search. The results are shown in Table 4. The two leftmost columns report computation times for the two sets of predictions. For the Kd-tree algorithm without the shell acceleration, we see significant speed-ups in these two columns, while in the three rightmost columns the two searches take similar time. Since predictions from classification models tend to cluster near the vertices of the simplex, whereas randomly generated points are spread across the simplex, the distribution of points heavily influences the speed of the computation. Similarly, the run time increases with the dimension. This is expected, as the performance of Kd-trees is known to degrade in higher dimensions [52,53].
Perhaps surprisingly, when we apply early termination via the shell method, we maintain considerable speed-ups even in high dimensions. While the worst-case complexity of the shell method is the same as that of the Kd-tree search, we observe speed-ups of over 1000× even in dimension 250.
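The pruning idea behind the shell method can be illustrated with a short sketch (our own simplification, shown with a plain linear scan rather than the Kd-tree used in the actual implementation): while computing the one-sided divergence, the running maximum haus acts as an inner shell, and the nearest-neighbor search for a query can stop as soon as any point of the other set lies closer than haus, since that query can no longer increase the maximum.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # Kullback-Leibler divergence D_KL(p || q) in bits.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log2(p / q)))

def kl_hausdorff_shell(P, Q):
    # One-sided KL-Hausdorff divergence with shell-style early termination.
    haus = 0.0
    for p in P:
        best = np.inf
        for q in Q:
            d = kl(p, q)
            if d < haus:
                # p already has a neighbor closer than the running maximum,
                # so it cannot raise the maximum; abandon this query early.
                best = d
                break
            best = min(best, d)
        haus = max(haus, best)
    return haus
```

In the implementation benchmarked here, the same test is applied inside the Kd-tree search, which is where the large speed-ups reported in Table 4 come from.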
These speed-ups show that the Bregman–Hausdorff divergence is an efficient tool for the comparison of the outputs of machine learning models—as well as in other situations that require the comparison between two sets of vectors.
Alternative implementations. There are alternative approaches to Bregman nearest neighbor search, but they come with certain limitations. Implementing the computation of Bregman–Hausdorff divergences on top of Cayton's Bregman ball trees [31] would require significant changes to the source code; moreover, as shown in [34], Bregman Kd-trees are the preferred choice, as they are more efficient in practice. The implementations of Bregman ball trees and vantage point trees due to Nielsen, Piro, and Barlaud [32,33] are not usable on modern systems due to severe compilation issues. Our comparison is therefore limited to our Kd-tree algorithms and the linear search.

8. Conclusions

Modern machine learning models commonly rely on optimizing the Kullback–Leibler divergence. Consequently, the resulting vectors, such as the probabilistic predictions of a classifier, are naturally measured using this divergence. However, many standard tools that could be used to analyze such data are limited to the Euclidean distance.
One geometric tool that has found applications in various fields in the Euclidean context is the Hausdorff distance. It serves as a distance between two collections of points, without requiring any pairing or alignment between the inputs. While potentially useful in machine learning as well, the standard Hausdorff distance is limited to the metric setting, which excludes the Kullback–Leibler divergence and other Bregman divergences.
This paper is an attempt to bridge the gap between the necessity of using non-metric distances on the one hand, and the limited familiarity with such concepts and the paucity of computational tools supporting them on the other. Specifically, we outlined the field of Bregman geometry, including descriptions of existing tools that can be used in the above context. Importantly, we extended a popular geometric tool to this new setup: we defined several variants of Bregman–Hausdorff divergences, which allow one to compare two collections of vectors according to a chosen Bregman divergence. We highlighted situations in which each of these variants can be meaningfully used. In particular, when the underlying divergence is the Kullback–Leibler divergence, we explained that its Hausdorff counterpart inherits a clear information-theoretic interpretation.
To make these theoretical considerations practical, we also designed novel algorithms for computing the newly defined Bregman–Hausdorff divergences. To ensure efficiency, we applied the Bregman version of the Kd-tree data structure for exact nearest neighbor search, which we recently developed. We exploited the special structure of the Bregman–Hausdorff computations to better leverage the Kd-tree data structure. We implemented these three algorithms.
We benchmarked our implementations in scenarios arising from a machine learning setup. Our experiments show that our most optimized algorithm performs well in this scenario. Specifically, it achieves several orders of magnitude speed-up compared to more basic algorithms. This is surprising, given that the Kd-trees the algorithm uses are known to perform poorly in such high-dimensional scenarios. In fact, our straightforward application of Kd-trees scales significantly worse with dimension. Understanding this speed-up theoretically is an interesting future direction.
Overall, we hope that the Bregman–Hausdorff divergence will find practical applications. In this paper, we focused on the theoretical setup, the interpretation of the new measurement, and efficiency aspects. More broadly, we hope that geometric tools supporting Bregman divergences, and not only the Euclidean distance or other metrics, will eventually become more popular in machine learning.

Author Contributions

Conceptualization, T.P., H.D.P.K. and H.W.; methodology, T.P., H.D.P.K. and H.W.; software, T.P.; validation, T.P., H.D.P.K. and H.W.; formal analysis, T.P., H.D.P.K. and H.W.; investigation, T.P., H.D.P.K. and H.W.; writing—original draft preparation, T.P. and H.W.; writing—review and editing, T.P., H.D.P.K. and H.W.; visualization, T.P.; supervision, H.W.; funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2022 Google Research Scholar program awarded for project “Algorithms for Topological Analysis of Neural Networks”. Hana Dal Poz Kouřimská was supported by the DFG Project No. 524578210.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gray, R. Entropy and Information Theory; Springer: New York, NY, USA, 2013. [Google Scholar]
  2. Cover, T.; Thomas, J. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
  3. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  4. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar]
  5. Pathria, R.K.; Beale, P.D. Statistical Mechanics, 3rd ed.; Academic Press: Boston, MA, USA, 2011. [Google Scholar]
  6. Amari, S.I. Information Geometry and Its Applications; Springer: Toyko, Japan, 2016. [Google Scholar] [CrossRef]
  7. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  8. Bauschke, H.H.; Borwein, J.M. Legendre functions and the method of random Bregman projections. J. Convex Anal. 1997, 4, 27–67. [Google Scholar]
  9. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970. [Google Scholar] [CrossRef]
  10. Do, M.N.; Vetterli, M. Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans. Image Process. 2002, 11, 146–158. [Google Scholar] [CrossRef] [PubMed]
  11. Itakura, F. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968. [Google Scholar]
  12. Mahalanobis, P.C. On the generalised distance in statistics. Proc. Natl. Inst. Sci. India 1936, 2, 49–55. [Google Scholar]
  13. Sapatinas, T. Discriminant Analysis and Statistical Pattern Recognition. J. R. Stat. Soc. Ser. Stat. Soc. 2005, 168, 635–636. [Google Scholar] [CrossRef]
  14. Law, M.T.; Yu, Y.; Cord, M.; Xing, E.P. Closed-Form Training of Mahalanobis Distance for Supervised Clustering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3909–3917. [Google Scholar] [CrossRef]
  15. Li, L.; Sun, C.; Lin, L.; Li, J.; Jiang, S. A dual-layer supervised Mahalanobis kernel for the classification of hyperspectral images. Neurocomputing 2016, 214, 430–444. [Google Scholar] [CrossRef]
  16. Nock, R.; Nielsen, F. Fitting the Smallest Enclosing Bregman Ball. In Machine Learning: ECML 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 649–656. [Google Scholar] [CrossRef]
  17. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discret. Comput. Geom. 2010, 44, 281–307. [Google Scholar] [CrossRef]
  18. Nielsen, F. Chernoff information of exponential families. arXiv 2011, arXiv:1102.2684. [Google Scholar]
  19. Edelsbrunner, H.; Virk, Ž.; Wagner, H. Smallest Enclosing Spheres and Chernoff Points in Bregman Geometry. In Proceedings of the 34th International Symposium on Computational Geometry, Budapest, Hungary, 11–14 June 2018. [Google Scholar]
  20. Nielsen, F. An Information-Geometric Characterization of Chernoff Information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef]
  21. Nielsen, F. Revisiting Chernoff Information with Likelihood Ratio Exponential Families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef]
  22. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  23. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  24. Dirichlet, G.L. Über die Reduction der positiven quadratischen Formen mit drei unbestimmten ganzen Zahlen. J. Pure Appl. Math. Crelles J. 1850, 1850, 209–227. [Google Scholar] [CrossRef]
  25. Voronoi, G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadratiques positives parfaites. J. Pure Appl. Math. Crelles J. 1908, 1908, 97–102. [Google Scholar] [CrossRef]
  26. Voronoi, G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Deuxième mémoire. Recherches sur les parallélloèdres primitifs. J. Pure Appl. Math. Crelles J. 1908, 1908, 198–287. [Google Scholar] [CrossRef]
  27. Chazelle, B. An optimal convex hull algorithm in any fixed dimension. Discrete Comput. Geom. 1993, 10, 377–409. [Google Scholar] [CrossRef]
  28. Omohundro, S.M. Five Balltree Construction Algorithms. 1989. Available online: https://steveomohundro.com/wp-content/uploads/2009/03/omohundro89_five_balltree_construction_algorithms.pdf (accessed on 12 May 2025).
  29. Yianilos, P.N. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, Austin, TX, USA, 25–27 January 1993; pp. 311–321. [Google Scholar]
  30. Bentley, J.L. Multidimensional binary search trees used for associative searching. Commun. ACM 1975, 18, 509–517. [Google Scholar] [CrossRef]
  31. Cayton, L. Fast Nearest Neighbor Retrieval for Bregman Divergences. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, 5–9 July 2008; pp. 112–119. [Google Scholar] [CrossRef]
  32. Nielsen, F.; Piro, P.; Barlaud, M. Tailored Bregman ball trees for effective nearest neighbors. In Proceedings of the 25th European Workshop on Computational Geometry (EuroCG), Brussels, Belgium, 16–18 March 2009; pp. 29–32. [Google Scholar]
  33. Nielsen, F.; Piro, P.; Barlaud, M. Bregman Vantage Point Trees for Efficient Nearest Neighbor Queries. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, (ICME), New York, NY, USA, 28 June–3 July 2009; pp. 878–881. [Google Scholar] [CrossRef]
  34. Pham, T.; Wagner, H. Fast Kd-trees for the Kullback–Leibler Divergence and other Decomposable Bregman Divergences. In Proceedings of the 19th Algorithms and Data Structures Symposium (WADS 2025), Toronto, ON, Canada, 11–15 August 2025. [Google Scholar]
  35. Zhang, Z.; Ooi, B.C.; Parthasarathy, S.; Tung, A.K.H. Similarity Search on Bregman Divergence: Towards Non-Metric Indexing. Proc. VLDB Endow. 2009, 2, 13–24. [Google Scholar] [CrossRef]
  36. Song, Y.; Gu, Y.; Zhang, R.; Yu, G. BrePartition: Optimized High-Dimensional kNN Search with Bregman Distances. arXiv 2020, arXiv:2006.00227. [Google Scholar] [CrossRef]
  37. Boytsov, L.; Naidan, B. Engineering Efficient and Effective Non-metric Space Library. In Proceedings of the Similarity Search and Applications—6th International Conference, SISAP 2013, A Coruña, Spain, 2–4 October 2013; Brisaboa, N.R., Pedreira, O., Zezula, P., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2013; Volume 8199, pp. 280–293. [Google Scholar] [CrossRef]
  38. Malkov, Y.A.; Yashunin, D.A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 824–836. [Google Scholar] [CrossRef] [PubMed]
  39. Edelsbrunner, H.; Letscher, D.; Zomorodian, A. Topological persistence and simplification. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, 12–14 November 2000; IEEE: Piscataway, NJ, USA, 2000; pp. 454–463. [Google Scholar]
  40. Carlsson, G. Topology and data. Bull. Amer. Math. Soc. 2009, 46, 255–308. [Google Scholar] [CrossRef]
  41. Edelsbrunner, H.; Wagner, H. Topological Data Analysis with Bregman Divergences. In Proceedings of the 33th International Symposium on Computational Geometry (SoCG), Brisbane, Australia, 4–7 July 2017; pp. 67–86. [Google Scholar] [CrossRef]
  42. Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society: Providence, RI, USA, 2010. [Google Scholar]
  43. Edelsbrunner, H.; Ölsböck, K.; Wagner, H. Understanding Higher-Order Interactions in Information Space. Entropy 2024, 26, 637. [Google Scholar] [CrossRef]
  44. Hausdorff, F. Grundzüge der Mengenlehre; Göschens Lehrbücherei/Gruppe I: Reine und Angewandte Mathematik Series; Von Veit: Berlin, Germany, 1914. [Google Scholar]
  45. Huttenlocher, D.P.; Klanderman, G.A.; Rucklidge, W. Comparing Images Using the Hausdorff Distance. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 850–863. [Google Scholar] [CrossRef]
  46. Karimi, D.; Salcudean, S.E. Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks. arXiv 2019, arXiv:1904.10030. [Google Scholar] [CrossRef]
  47. Cignoni, P.; Rocchini, C.; Scopigno, R. Metro: Measuring Error on Simplified Surfaces. Comput. Graph. Forum 1998, 17, 167–174. [Google Scholar] [CrossRef]
  48. Guthe, M.; Borodin, P.; Klein, R. Fast and Accurate Hausdorff Distance Calculation between Meshes. J. WSCG 2005, 13, 41–48. [Google Scholar]
  49. Zhang, J. Divergence Function, Duality, and Convex Analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef]
  50. Nielsen, F.; Nock, R. The Dual Voronoi Diagrams with Respect to Representational Bregman Divergences. In Proceedings of the 2009 Sixth International Symposium on Voronoi Diagrams, Copenhagen, Denmark, 23–26 June 2009; pp. 71–78. [Google Scholar] [CrossRef]
  51. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: Breckenridge, CO, USA, 2019; pp. 6105–6114. [Google Scholar]
  52. Marimont, R.B.; Shapiro, M.B. Nearest Neighbour Searches and the Curse of Dimensionality. IMA J. Appl. Math. 1979, 24, 59–70. [Google Scholar] [CrossRef]
  53. Chávez, E.; Navarro, G.; Baeza-Yates, R.; Marroquín, J.L. Searching in metric spaces. ACM Comput. Surv. 2001, 33, 273–321. [Google Scholar] [CrossRef]
Figure 1. A visualization of the relative entropy.
Figure 2. Visualization of a Bregman divergence formula for a one-dimensional domain.
Figure 3. Left: concentric primal Itakura–Saito balls. Right: concentric primal generalized Kullback–Leibler balls.
Figure 4. Geometric interpretation of primal (left) and dual (right) Bregman balls in dimension one.
Figure 5. In blue (light): Primal KL balls with centers x and y intersect at the Chernoff point c. In magenta (dark): A dual KL ball of radius D_KL(x ∥ c) is drawn about c.
Figure 6. Clustering assignments for k-means with four centroids on Δ^2. Left is with the squared Euclidean distance, right with the KL divergence. Centroids are denoted by ×'s.
Figure 7. Voronoi diagrams on Δ^2 with fixed sites S, each marked with x. The left is computed with the Euclidean distance. The middle and right images compute the nearest site with respect to the KL divergence: the middle to each site and the right from each site.
Figure 8. One-sided Hausdorff distances with respect to the Euclidean metric (d_Euc) between the sets P (black points) and Q (black squares). The left image highlights d_Euc(P, Q) and the right image highlights d_Euc(Q, P). Clearly, d_Euc(P, Q) > d_Euc(Q, P).
Figure 9. The left portion of the image visualizes the primal Bregman–Hausdorff divergence from P to Q, and the right visualizes the Bregman–Hausdorff divergence from Q to P. The set P consists of black points, the set Q of black squares. The shaded regions are the primal thickenings of Q (on the left) and of P (on the right). One sees that the primal thickening of Q has to be of a large radius in order to contain the far left point of P. In contrast, the primal thickening of P can contain Q at a smaller radius. Thus, the difference between H_KL(P ∥ Q) and H_KL(Q ∥ P) can be used to see that Q clusters about P, but not vice versa.
Figure 10. On the left, we visualize the primal Chernoff–Bregman–Hausdorff distance between P and Q. On the right, we visualize the dual Chernoff–Bregman–Hausdorff distance between P and Q. The set P consists of black points, the set Q of black squares. Chernoff points are marked as ×. The left is a dual thickening of the Chernoff points while the right is a primal thickening.
Figure 11. A shrinking shell around a query q_i, whose inner radius is defined by haus and whose outer radius is defined by d(ρ_1 ∥ q_i). The search will terminate at ρ_2, since D_F(ρ_2 ∥ q_i) < haus, instead of returning ρ_3 as the true nearest neighbor.
Table 1. Example codes for a discrete probability distribution on four events.
ProbabilityCode1Code2Code3
E 1 1 / 2 0000
E 2 1 / 4 10110
E 3 1 / 8 0101110
E 4 1 / 8 1111111
Table 2. Bregman–Hausdorff divergences between outputs of two classification models trained using the relative entropy loss (which corresponds to the KL divergence). The values in the two H_KL rows are measured in bits.

|      | (tst_1 ∥ trn_1) | (trn_1 ∥ tst_1) | (trn_1 ∥ tst_2) | (trn_1 ∥ trn_2) | (trn_2 ∥ trn_1) |
| H_KL | 1.765 b | 2.215 b | 2.044 b | 2.236 b | 2.237 b |
| H_KL | 3.797 b | 4.541 b | 4.509 b | 4.343 b | 4.033 b |
| H_IS | 32,496.887 | 9,822,345.381 | 1,739,646,377.745 | 14,801,113.426 | 584,772.398 |
| H_IS | 3147.685 | 2987.831 | 1998.378 | 2360.0485 | 1309.6230 |
| H_SE | 0.371 | 0.296 | 0.243 | 0.234 | 0.271 |
Table 3. Bregman–Hausdorff divergences when M_1 and M_2 are trained to minimize the mean squared error loss (corresponding to the squared Euclidean distance).

|      | (tst_1^SE ∥ trn_1^SE) | (trn_1^SE ∥ tst_1^SE) | (trn_1^SE ∥ tst_2^SE) | (trn_1^SE ∥ trn_2^SE) | (trn_2^SE ∥ trn_1^SE) |
| H_KL | 1.888 b | 1.557 b | 1.938 b | 1.979 b | 1.888 b |
| H_KL | 2.719 b | 1.904 b | 2.965 b | 2.625 b | 2.527 b |
| H_IS | 891,154.1 | 12,281.8 | 251,969.8 | 12,875,882.0 | 2,985,641,328,333.4 |
| H_IS | 961.2 | 552.6 | 551.8 | 819.2 | 2080.3 |
| H_SE | 0.326 | 0.212 | 0.291 | 0.330 | 0.330 |
Table 4. Computation times for the KL– and IS–Hausdorff divergences using the Kd-tree search algorithms and a linear search. The two leftmost columns use data sets from the predictions of M_1 and M_2 in dimension 100; the four rightmost columns use synthetic data sets, querying a set of 20,000 sampled points against 100,000 points. Speed-up compares the Kd-shell and linear search computation times.

|    |          | (trn_1 ∥ tst_2) | (trn_2 ∥ trn_1) | Dim 10 | Dim 50 | Dim 100 | Dim 250 |
| KL | Kd-shell | 0.47 s | 1.63 s | 0.02 s | 0.78 s | 1.36 s | 2.19 s |
|    | Kd-tree  | 1.74 s | 6.79 s | 2.43 s | 481.15 s | 999.26 s | 2413.94 s |
|    | Linear   | 284.30 s | 1417.32 s | 112.84 s | 563.17 s | 1113.63 s | 2753.32 s |
|    | Speed-up | 598.54× | 866.86× | 4906.26× | 720.16× | 818.24× | 1254.36× |
| IS | Kd-shell | 0.21 s | 1.01 s | 0.01 s | 0.66 s | 1.48 s | 3.92 s |
|    | Kd-tree  | 7.09 s | 23.92 s | 3.85 s | 412.38 s | 738.37 s | 1645.400 s |
|    | Linear   | 279.70 s | 1391.39 s | 115.12 s | 574.26 s | 1129.01 s | 2803.430 s |
|    | Speed-up | 1338.29× | 1374.89× | 6395.77× | 862.25× | 759.25× | 714.98× |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
