1. Introduction
Various kinds of data—from sounds to images to text corpora—are routinely represented as finite sets of vectors. These vectors can be processed using a wide range of algorithms, often based on linear algebra. The intermediate representations, as well as the final outputs of such processing, are similarly sets of vectors.
Conveniently, the above setup allows for an intuitive geometric interpretation. Indeed, it is usual to equip the vector space in which such representations live with the Euclidean metric. Geometric objects in the geometry induced by this metric, such as the distance itself, balls and their intersections, bisectors, etc., help explain such algorithms in intuitive terms.
In recent years, however, other notions of distance have started to play an important role. One popular choice is the Kullback–Leibler divergence, often referred to as the relative entropy. In particular, a form of this divergence, the cross-entropy loss, is commonly used for training deep learning models.
Compared to the once popular mean squared error loss (based on the Euclidean metric), the cross-entropy loss (based on the Kullback–Leibler divergence) provides significantly better performance. While the Kullback–Leibler divergence is often viewed as a distance between probability vectors, it lacks standard features of a metric. In particular, it is generally not symmetric and does not satisfy the triangle inequality. As such, its behavior can be less intuitive.
It may therefore be surprising that the Kullback–Leibler divergence induces a well-behaved geometry. Moreover, there exists an infinite family of distance measures, so-called Bregman divergences, that induce similar geometries. The aforementioned Kullback–Leibler divergence is one of its most prominent members, along with the squared Euclidean distance.
There is a significant overlap between algorithms in machine learning and computational geometry. Nevertheless, computational geometry tends to focus on the Euclidean distance (and other metric distances). In contrast, the non-metric aspects of the Kullback–Leibler divergence (and other Bregman divergences) prevent computational geometry algorithms from working—at least at first glance.
It turns out that several popular algorithms can be extended to the Bregman setting—despite the lack of symmetry and triangle inequality, which are often deemed crucial. While this is an ongoing direction, there have been efforts to extend popular algorithms to operate within this framework.
In the first part of the paper, we offer a geometric perspective on Bregman divergences. We hope this perspective will streamline further development and analysis of algorithms at the intersection of machine learning and computational geometry; in particular, in the context of data measured using relative entropy, such as probabilistic predictions returned by a classifier trained using cross-entropy.
In the second part, we develop a crucial geometric tool in the context of Bregman geometry. The idea is simple: where a Bregman divergence provides a comparison between two vectors, we propose a natural way of comparing two sets of vectors. This idea is analogous to the Hausdorff distance between two sets and we therefore call it a Bregman–Hausdorff divergence. Notably, this measurement does not rely on any pairing or alignment between elements of the sets, and the sets may differ in size. This contrasts with the computation of a classifier’s loss during training, where each prediction is compared with the corresponding correct label.
Interestingly, the lack of symmetry characteristic to Bregman divergences allows for several different definitions—we select three, guided by the geometric interpretation of the original Hausdorff distance. Additionally, we propose first algorithms for computing these new divergences. These algorithms are enabled by recent developments in Bregman nearest-neighbor search, and we experimentally show they are efficient in practice.
Our contribution extends the arsenal of tools capable of handling data living in Bregman geometries. One crucial example of such data is the set of probabilistic predictions of modern classifiers trained with the cross-entropy loss.
Paper outline. In the first part of this paper, we introduce concepts from information theory and a geometric interpretation for the relative entropy (Section 2). This interpretation connects the relative entropy to a larger family of distance measures known as Bregman divergences. After a brief introduction to this family and the geometry its members induce (Section 3), we explain why the asymmetry is a beneficial property in the context of machine learning, and highlight computational tools that have been extended to this setting (Section 4).
The second part of this paper introduces three new measurements based on Bregman divergences. We provide definitions as well as interpretations in the context of comparing sets of discrete probability distributions (Section 5). We then provide efficient algorithms for these measurements (Section 6). In Section 7, we experimentally show that the new measurements can be efficiently computed in practical situations. In particular, we combine the theory and tools from the two previous sections to provide quantitative ways to analyze and compare machine learning models. Section 8 concludes the paper.
2. Information Theory and Relative Entropy
We begin by highlighting certain concepts from information theory, with the goal of providing an interpretation of the relative entropy. We will use this interpretation to develop intuition for the geometry induced by the relative entropy in Section 3. (More details on information theory can be found in [1,2].) In particular, we emphasize the inherent asymmetry of the relative entropy, which will inform our decision to focus on asymmetric versions of the Hausdorff distance later. We also provide a geometric interpretation of relative entropy. This interpretation is shared among all Bregman divergences, and will be our focus afterwards.
Setup. We first set up a running example to guide us through each definition in this section. We fix a small collection of events, each occurring with a known probability.
More generally, for $d$ events we encode these probabilities as a probability vector $p = (p_1, \dots, p_d)$, or in other words a discrete probability distribution. Geometrically, the space of all such vectors is the $(d-1)$-dimensional open probability simplex contained in $\mathbb{R}^d$, namely
$$\Delta = \Bigl\{\, p \in \mathbb{R}^d : p_i > 0 \text{ for all } i,\ \sum_{i=1}^{d} p_i = 1 \,\Bigr\}.$$
Going back to our example, we now plan to transmit information on sequences of observed events. To this end, we first encode each event as a finite sequence of bits, called a codeword. We aim to minimize the expected length of a codeword with the restriction that sequences of codewords be uniquely decodable.
Consider Table 1, which lists the probability of each event. The three rightmost columns provide three different codes.
Given a code and a discrete probability distribution p, we can compute the expected code length for the transmission of information about the events. Specifically, we compute the expected length of a codeword under each of the three codes.
While Code1 may be the most straightforward way to encode these four events, we can find a more efficient code. Although Code2 has a shorter expected code length, it is not uniquely decodable. Indeed, the sequence 0111 can be decoded in more than one way. In contrast, Code3 is decodable and has a shorter expected code length than Code1.
Shannon’s entropy. In his seminal paper [3], Shannon introduced a formula to compute the lower bound for the expected length of a code for a discrete probability distribution. Specifically, given a discrete probability distribution p, the Shannon entropy is defined as
$$H(p) = -\sum_{i=1}^{d} p_i \log_2 p_i,$$
with the convention that $0 \log_2 0 = 0$.
Returning to our example: for the discrete probability distribution p from Table 1, the entropy coincides with the expected code length of Code3. Thus, Code3 in Table 1 is the optimal way to encode these events.
Cross-entropy. Suppose we (erroneously) assume that the probability of the events is given by a probability distribution q, while in reality it is p. How inefficient is the code optimized for q compared to the code optimized for the true distribution p?
In this situation, we would assign longer codewords to less probable events, as measured by q. However, to compute the expected code length we must use the true probabilities of events, given by p.
The cross-entropy is an extension of Shannon’s entropy that provides a lower bound on the length of such codes:
$$H(p, q) = -\sum_{i=1}^{d} p_i \log_2 q_i.$$
In other words, the cross-entropy gives the lower bound for the expected code length for events with probabilities represented by p, assuming they occur with probabilities represented by q.
Given distributions p and q, their cross-entropy has a geometric interpretation. It is the approximation of $H(p)$ by the best affine approximation of H centered at q. Indeed, we can write the cross-entropy as
$$H(p, q) = H(q) + \langle \nabla H(q),\, p - q \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the standard dot product.
Relative entropy. Relative entropy is the difference between cross-entropy and entropy, $H(p, q) - H(p)$. It therefore measures the expected loss of coding efficiency incurred by using the ‘approximate’ probability q instead of the ‘true’ probability p.
The relative entropy is often viewed as a distance measure between two discrete probability distributions. However, unlike proper metric distances, it is generally not symmetric and does not satisfy the triangle inequality. Given $p, q \in \Delta$, we write $\mathrm{KL}(p \,\|\, q)$ to denote the relative entropy. We provide a further explanation of this notation at the end of this section, and expand on it in Section 3.
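To make these quantities concrete, the following minimal Python sketch (our illustration, not code from the original sources; the example vectors are arbitrary and not those of Table 1) computes the entropy, cross-entropy, and relative entropy in bits for strictly positive probability vectors.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits, for a strictly positive probability vector."""
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits: expected code length when the code is
    optimized for q while the events actually follow p."""
    return float(-np.sum(p * np.log2(q)))

def relative_entropy(p, q):
    """Relative entropy KL(p || q) = H(p, q) - H(p), in bits."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(entropy(p))             # 1.75 bits
print(cross_entropy(p, q))    # 2.0 bits
print(relative_entropy(p, q)) # 0.25 bits of expected coding inefficiency
```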
Usage. In machine learning models, relative entropy is often used as a loss function. Let us consider a multiclass classification task, with X being a data set and Y the collection of correct labels encoded as probability vectors. For a model M dependent on a parameter vector $\theta$ and an input $x \in X$, we write $M_\theta(x)$ for the resulting probability vectors, which are compared to the correct labels in Y. Then, minimizing the total relative entropy from the correct labels to the corresponding predictions can be interpreted as penalizing predictions that are poor approximations of the true distribution. Relative entropy has also been used in the training of Variational Autoencoders [4]. Outside of machine learning, relative entropy has seen use in statistical physics [5] and information geometry [6].
Asymmetry. The asymmetry of relative entropy has a tangible interpretation. Let p be as in Table 1, and choose q to be a probability vector that assigns zero probability to one of the events. If we attempt to approximate p using q, then $\mathrm{KL}(p \,\|\, q) = \infty$. This value reflects the fact that this event is impossible assuming q, and therefore no code was prepared for it (along with all other events that occur with zero probability). Hence, there exists no codeword of finite length that we could use to encode this event.
If, on the other hand, we approximate q using p, then $\mathrm{KL}(q \,\|\, p)$ is finite. This value reflects the fact that while the event cannot occur, the coding efficiency is decreased by accounting for a codeword that is never used. (We remark that this asymmetry also occurs when the outputs of relative entropy are finite in both directions.)
Recall that $H(p)$ is Shannon’s entropy of p, while the cross-entropy, $H(p, q)$, is an approximation of $H(p)$ based on an affine approximation of H at q. Thus, the relative entropy can be interpreted as the vertical distance between the graph of this affine approximation and the graph of H. See Figure 1 for illustration.
This geometric construction of relative entropy can be generalized. Indeed, the family of distance measures arising from this type of construction is known as Bregman divergences [7]. Like relative entropy, these distance measures generally lack symmetry and do not satisfy the triangle inequality. The relative entropy is a member of this family and is often referred to as the Kullback–Leibler (KL) divergence in this context. We will introduce this family next.
3. Background on Bregman Geometry
In this section, we introduce Bregman divergences [7], portray the KL divergence as an important instance, and give an overview of Bregman geometry. We will put emphasis on the geometry induced by the KL divergence.
Bregman divergences—definition and basic properties. Each Bregman divergence is generated by a function of Legendre type. Given an open convex set $\Omega \subseteq \mathbb{R}^d$, a function $F \colon \Omega \to \mathbb{R}$ is of Legendre type [8,9] if F is as follows:
1. strictly convex;
2. differentiable;
3. such that $\|\nabla F(x)\|$ tends to infinity whenever x approaches a boundary point of $\Omega$.
The last condition is often omitted, but it enables a correct application of the Legendre transformation. It is a useful tool coming from convex optimization [9], which we will review later.
Given a function F of Legendre type, the Bregman divergence [7] generated by F is the function
$$D_F \colon \Omega \times \Omega \to [0, \infty), \qquad D_F(x, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle.$$
In other words, the divergence in the direction from x to y is the difference between $F(x)$ and the best affine approximation of F at y, evaluated at x. We illustrate this in Figure 2. The construction also mirrors the geometric interpretation of the KL divergence from Figure 1.
Just like metrics, Bregman divergences are always non-negative: in fact, $D_F(x, y) \ge 0$, with equality if and only if $x = y$. Unlike metrics, however, they are generally not symmetric, and do not satisfy the triangle inequality. To emphasize the lack of these two properties, we refrain from referring to the output of $D_F$ as a ‘distance’. Similarly, to emphasize the lack of symmetry, we always keep the order of the arguments explicit when writing the divergence computed from x to y.
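The defining formula translates directly into code. The following short sketch (our illustration, not code accompanying the paper) evaluates $D_F$ for any generator supplied together with its gradient, and uses the squared Euclidean norm as a sanity check.

```python
import numpy as np

def bregman_divergence(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - float(np.dot(grad_F(y), x - y))

# Generator of the squared Euclidean distance: F(x) = ||x||^2.
sq_norm = lambda x: float(np.dot(x, x))
sq_norm_grad = lambda x: 2.0 * x

x = np.array([1.0, 2.0])
y = np.array([0.0, 0.5])
print(bregman_divergence(sq_norm, sq_norm_grad, x, y))  # 3.25
print(float(np.sum((x - y) ** 2)))                      # 3.25 = ||x - y||^2
```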
Relative entropy as a Bregman divergence. Given the function $F(x) = \sum_{i=1}^{d} x_i \log_2 x_i$, defined for $x \in (0, \infty)^d$, we have $\nabla F(x)_i = \log_2 x_i + \log_2 e$. Substituting in the formula for the Bregman divergence, we obtain
$$D_F(x, y) = \sum_{i=1}^{d} x_i \log_2 \frac{x_i}{y_i} + \log_2 e \sum_{i=1}^{d} (y_i - x_i).$$
This divergence is often called a generalized Kullback–Leibler divergence. Restricting it to the probability simplex, $\Delta$, we obtain a divergence that coincides with the relative entropy defined in the previous section. Following the statistics and information theory literature, we will refer to it as the Kullback–Leibler divergence and denote it by $\mathrm{KL}$. One can easily check that the generator of this divergence is indeed a function of Legendre type, also when restricted to the probability simplex.
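As a quick check, the same generic helper recovers the relative entropy on the simplex. The sketch below is ours; it uses the base-2 generator above, so the value is in bits, and it assumes strictly positive inputs.

```python
import numpy as np

def bregman_divergence(F, grad_F, x, y):
    return F(x) - F(y) - float(np.dot(grad_F(y), x - y))

# Generator F(x) = sum_i x_i log2(x_i), with gradient log2(x_i) + log2(e).
F_kl = lambda x: float(np.sum(x * np.log2(x)))
F_kl_grad = lambda x: np.log2(x) + np.log2(np.e)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
# On the probability simplex the generated divergence is the relative entropy in bits.
print(bregman_divergence(F_kl, F_kl_grad, p, q))  # ~0.265
print(float(np.sum(p * np.log2(p / q))))          # same value
```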
Other Bregman divergences. Among other examples of Bregman divergences, the most prominent one is the squared Euclidean distance (SE), generated by the square of the Euclidean norm, $F(x) = \|x\|^2$, on $\Omega = \mathbb{R}^d$; the resulting divergence is $D_F(x, y) = \|x - y\|^2$. We remark that restricting the domain to any bounded subset of $\mathbb{R}^d$ violates Condition 3 for a Legendre-type function. Consequently, under such a restriction the square of the Euclidean distance does not fulfill the above definition of a Bregman divergence.
The Itakura–Saito (IS) divergence [10], generated by the function $F(x) = -\sum_{i=1}^{d} \log x_i$ on $\Omega = (0, \infty)^d$, has seen success as a loss function in machine learning models analyzing speech and sound data [11].
All of the above divergences are often classified as decomposable Bregman divergences, since each of them decomposes as a sum of 1-dimensional divergences.
The Mahalanobis divergence [12] is commonly used in statistics as a distance notion between probability distributions [13]. It is generated by the convex quadratic form $F(x) = x^{\top} Q\, x$, where Q is the inverse of a variance–covariance matrix. Unlike the other examples, it is generally not a decomposable divergence. The Mahalanobis divergence has also seen success in machine learning—examples include supervised clustering [14] and the classification of hyperspectral images [15].
Bregman Geometry. Similarly to a metric, each Bregman divergence induces a geometry. We provide a brief overview of some key objects and features of Bregman geometries, and provide their information–theoretic interpretations in the case of the KL divergence.
A fundamental object in geometry is the ball. Due to the asymmetry of Bregman divergences, one can define two types of Bregman balls [16]. The primal Bregman ball of radius $r \ge 0$ centered at the point $q \in \Omega$ is defined as
$$B_F(q; r) = \{\, x \in \Omega : D_F(q, x) \le r \,\}$$
and is the collection of those points whose divergence from q does not exceed r. See Figure 3 for various illustrations. Primal Bregman balls have a particularly nice geometric interpretation: given a light source at the point $(q, F(q) - r)$ below the graph of F, the primal ball $B_F(q; r)$ is the illuminated part of the graph of F, projected vertically onto $\Omega$. We illustrate this in Figure 4, on the left.
The dual Bregman ball of radius $r \ge 0$ centered at the point $q \in \Omega$ is defined as
$$B'_F(q; r) = \{\, x \in \Omega : D_F(x, q) \le r \,\}$$
and is the collection of points whose divergence to q does not exceed r. The dual ball has a geometric interpretation similar to the primal ball. We first shift the tangent plane of the graph of F at q up by r. The dual Bregman ball $B'_F(q; r)$ is the portion of the graph of F below this plane, projected vertically onto $\Omega$. We illustrate this in Figure 4, on the right.
As seen in Figure 3, and observed in [17], primal Bregman balls can be non-convex when viewed as a subset of Euclidean space. In contrast, dual balls are always convex since $D_F(x, q)$ is convex in x.
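In code, the two ball types differ only in the order of the arguments passed to the divergence. A small sketch (ours, following the conventions above, with the KL divergence as the example):

```python
import numpy as np

def kl(p, q):
    """Relative entropy KL(p || q) in bits, for strictly positive vectors."""
    return float(np.sum(p * np.log2(p / q)))

def in_primal_ball(D, center, x, r):
    # Primal ball: divergence measured *from* the center, D(center, x) <= r.
    return D(center, x) <= r

def in_dual_ball(D, center, x, r):
    # Dual ball: divergence measured *to* the center, D(x, center) <= r.
    return D(x, center) <= r

c = np.array([0.6, 0.3, 0.1])
x = np.array([0.3, 0.4, 0.3])
print(in_primal_ball(kl, c, x, 0.5), in_dual_ball(kl, c, x, 0.5))
```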
It may be tempting to consider the two geometries induced by a Bregman divergence to and from a point, separately. However, there is a strong connection between the two, given by the Legendre transformation [9].
Legendre transform. The Legendre transform of a function F of Legendre type is defined on a conjugate domain $\Omega^* = \nabla F(\Omega)$ as
$$F^*(x^*) = \sup_{x \in \Omega} \bigl( \langle x^*, x \rangle - F(x) \bigr).$$
This construction induces a canonical map $\nabla F \colon \Omega \to \Omega^*$. It turns out that $F^*$ is a convex function of Legendre type, and thus it also generates a Bregman divergence, $D_{F^*}$. This divergence satisfies
$$D_{F^*}\bigl(\nabla F(x), \nabla F(y)\bigr) = D_F(y, x),$$
implying that the function mentioned above maps primal balls in $\Omega$ to dual balls in $\Omega^*$ [17].
Chernoff point. Another connection between the two geometries is given by the Chernoff point [18]. Following [19], we say a set $X \subseteq \Omega$ has an enclosing sphere if there exists a dual Bregman ball containing X. The enclosing sphere is then the boundary of this ball.
Moreover, every finite set $X \subseteq \Omega$ has a unique smallest enclosing Bregman sphere [16,19,20]. The center of this smallest enclosing sphere is known as the Chernoff point for X.
We are interested in a simple situation in which the set X consists of two points only, $X = \{p, q\}$. In this case, the Chernoff point c minimizes the divergence $D_F(p, c)$ subject to $D_F(p, c) = D_F(q, c)$; one can view it as lying midway between p and q with respect to the chosen Bregman divergence. Indeed, in the squared Euclidean case, it is the usual midpoint (arithmetic mean), namely $\tfrac{1}{2}(p + q)$. However, for other Bregman divergences this point will generally not be the midpoint, but some other point on the segment joining the two points. We also remark that the midpoint does play a special role for all Bregman divergences. We elaborate on this when discussing the Bregman k-means clustering algorithm. Additionally, the popular Jensen–Shannon divergence, and its variants, rely on the midpoint.
The Chernoff point can be visualized: for each point $p \in X$, consider the primal ball $B_F(p; r)$ growing about p, parameterized by the radius $r \ge 0$. Then, the Chernoff point is the point where all the balls intersect for the first time. In the case when $X = \{p, q\}$, let us denote this radius by $r^*$, and the Chernoff point by c. Then, c lies on the boundary of both $B_F(p; r^*)$ and $B_F(q; r^*)$, and p and q lie on the boundary of the dual ball $B'_F(c; r^*)$. A visualization of this interaction can be seen in Figure 5.
We refer to works by Nielsen [18,21] for more information on Chernoff points and their applications.
Information-theoretic interpretation of the geometric objects stemming from the KL divergence. In the language of information theory, the primal and dual balls have the following interpretation: Let $p \in \Delta$ be a probability vector, and $r \ge 0$ a radius, expressed in bits. Note that, since the divergence measures the expected efficiency loss in bits, the radius r can indeed be of any (non-negative) real value, and is not limited to being an integer. The primal ball $B_F(p; r)$ contains all probability vectors q that can approximate p with an expected loss of at most r bits. In contrast, the dual ball $B'_F(p; r)$ contains all probability vectors q that can be approximated by p with an expected loss of at most r bits. Consequently, the Chernoff point for two probability vectors p and q is the vector that approximates both p and q with the least loss of expected efficiency (as usual counted in bits).
4. Algorithms in Bregman Geometry
The development of algorithms for Bregman divergences is relatively young. We survey computational geometry algorithms that were adapted to the Bregman setting from the common Euclidean and metric settings, and highlight the necessary modifications.
k-means clustering. The first algorithm to be adapted was the k-means clustering algorithm. It was originally proposed by Lloyd [22] in 1957, and worked with the Euclidean distance. It was extended to arbitrary Bregman divergences by Banerjee and collaborators [23] in 2005.
In Euclidean space, the k-means clustering partitions a data set into k clusters. Each cluster is identified by a unique point, called a cluster centroid. Here, the centroid is chosen as the arithmetic mean of all the data points in the cluster. This choice is motivated by the fact that the mean is the unique point that minimizes the sum of the squares of the Euclidean distances (i.e., the variance) to all data points in a given cluster.
Surprisingly, the k-means algorithm still works when the square of the Euclidean distance is replaced with a Bregman divergence computed from data points. Indeed, Banerjee and collaborators show that for a finite set $X \subset \Omega$, the sum $\sum_{x \in X} D_F(x, p)$ is uniquely minimized by the point p being the arithmetic mean of X—independently of the choice of a Bregman divergence. Consequently, apart from the requirement of computing the divergence towards the centroid, the original algorithm works in the Bregman setting without modification. This remarkable fact was the first hint that other geometric algorithms may work in the Bregman setting. (See Figure 6 for a comparison of k-means clustering in different geometries.)
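A minimal sketch of Lloyd-style Bregman k-means (our illustration, not the reference implementation): points are assigned to the centroid toward which their divergence is smallest, and each centroid is then updated to the arithmetic mean of its cluster, which, as discussed above, minimizes the within-cluster sum of divergences for every Bregman divergence.

```python
import numpy as np

def bregman_kmeans(X, k, D, iters=50, seed=0):
    """Lloyd iterations with a Bregman divergence D(point, centroid).

    X: (n, d) array of points; k: number of clusters.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: the divergence is computed *from* each data point.
        labels = np.array([np.argmin([D(x, c) for c in centroids]) for x in X])
        # Update step: the arithmetic mean minimizes the sum of divergences
        # from the cluster members, regardless of the chosen divergence.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

kl = lambda p, q: float(np.sum(p * np.log2(p / q)))
X = np.random.default_rng(1).dirichlet(np.ones(3), size=200)  # points in the simplex
centroids, labels = bregman_kmeans(X, k=4, D=kl)
```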
Voronoi diagrams. Voronoi diagrams were first formally defined for two- and three-dimensional Euclidean space by Dirichlet [24] in 1850, and for general Euclidean spaces by Voronoi [25,26] in 1908. Given a finite set $S \subset \mathbb{R}^d$ of points, called sites, we partition the space $\mathbb{R}^d$ into Voronoi cells, such that the Voronoi cell of $s \in S$ contains all points in the space for which s is the closest site. More formally, $\mathrm{Vor}(s) = \{\, x \in \mathbb{R}^d : \|x - s\| \le \|x - s'\| \text{ for all } s' \in S \,\}$.
This definition for Voronoi diagrams was extended to the Bregman setting by Boissonnat, Nielsen, and Nock [17] in 2007. In the Euclidean space, Voronoi cells are convex polytopes (possibly unbounded). However, when another distance measure is used, the shape of these cells can change drastically. See Figure 7 for an illustration of Voronoi diagrams for various distance measures.
In the Euclidean case, the bisector of a pair of sites is a hyperplane, and the Voronoi cells arise as intersections of the resulting half-spaces. One method of computing these cells is by Chazelle’s half-space intersection algorithm [27]. In the Bregman version, Boissonnat, Nielsen, and Nock show that calculating the divergence to the sites yields Bregman bisectors that are also hyperplanes. Chazelle’s algorithm can therefore be used to compute the Bregman Voronoi diagram. However, when computing the divergence from a site, the Bregman bisectors are more general hypersurfaces. Consequently, the Bregman Voronoi cells may have curved faces. As a side note, the Legendre transform can be used to handle this harder scenario by mapping the input to the conjugate space, performing the computations there using hyperplanes, and mapping the result back.
k-Nearest neighbor search. In a metric space $(M, d)$, given a finite subset $S \subseteq M$ and a query $q \in M$, a k-nearest neighbor search finds the k elements of S closest to q. A common strategy to answer these queries is to spatially decompose the set S. This decomposition is often encoded as a tree data structure. This type of data structure includes ball trees [28], vantage point trees [29], and Kd-trees [30]. Collectively, these are referred to as metric trees—although many of them can be extended to the non-metric setting of Bregman divergences.
In the Bregman setting, due to the asymmetry, we consider two types of nearest neighbor search: given a query q, we may look for the element $s \in S$ minimizing $D_F(s, q)$, or for the one minimizing $D_F(q, s)$. Queries are performed by searching subtrees while updating a nearest neighbor candidate. Subtrees may be ignored (pruned) to accelerate the search if certain conditions, which depend on the type of tree, are met.
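The two query types differ only in the order of the arguments; below is a brute-force sketch (ours) of what the tree-based structures discussed next accelerate.

```python
import numpy as np

kl = lambda p, q: float(np.sum(p * np.log2(p / q)))

def nearest(S, query, D, direction="to"):
    """Brute-force Bregman nearest neighbor search.

    direction="to":   minimize D(query, s) over the sites s
    direction="from": minimize D(s, query) over the sites s
    """
    if direction == "to":
        return min(S, key=lambda s: D(query, s))
    return min(S, key=lambda s: D(s, query))

S = [np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.2, 0.6]), np.array([0.3, 0.4, 0.3])]
q = np.array([0.6, 0.3, 0.1])
print(nearest(S, q, kl, "to"), nearest(S, q, kl, "from"))
```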
In 2008, Cayton extended the concept of ball trees from metric spaces to the Bregman setting, and created software for Bregman ball trees [31]. The software works for the squared Euclidean distance and the KL divergence, with experimental support for the IS divergence. Cayton’s Bregman ball trees are constructed with the help of the aforementioned Bregman k-means clustering. The pruning decision involves a projection of a candidate point onto the surface of a Bregman ball. This is performed via a bisection search.
Nielsen, Piro, and Barlaud improved on Cayton’s ball tree algorithm [32]. In particular, they improved the construction by altering the initial points for the Bregman k-means algorithm, and introduced a branching factor to allow for more splits at each internal node. They also implemented a priority-based traversal instead of the standard recursive ball tree traversal. The same authors also adapted another data structure called a vantage point tree in 2009 [32,33]. Specifically, they replaced the metric ball with a dual Bregman ball, and in the pruning decision, they performed a bisection search to check intersections of Bregman balls.
Kd-trees were introduced by Bentley [30] in 1975, and extended to Bregman divergences by Pham and Wagner [34] in 2025. Unlike ball trees and vantage point trees, the construction of the Bregman Kd-tree is independent of the Bregman divergence. Indeed, the choice of divergence can be deferred to the time of performing each query. The query algorithm, surprisingly, is the same as in the Euclidean case. For decomposable Bregman divergences in particular, it allows each pruning decision to be made in effectively constant time, independently of the dimension of the data.
Cayton’s Bregman ball trees algorithm is specialized for the KL divergence, with an experimental implementation for the IS divergence. Nielsen, Piro, and Barlaud’s implementations show results for the KL and IS divergences. However, for both Bregman ball trees and vantage point trees, adding implementations for new divergences is nontrivial. In contrast, although Kd-trees work for a subfamily of Bregman divergences, a further extension of the implementation to new Bregman divergences is straightforward.
Apart from the data structures and algorithms described above, other exact searches have been extended to work in the Bregman setting. These include R-trees and VA-files (extended by Zhang and collaborators [35] in 2009), and BrePartition (introduced by Song, Gu, Zhang, and Yu [36] in 2020). The listed algorithms provide exact nearest neighbor queries in the Bregman setting [37]. We remark that there exist other nearest neighbor algorithms that work in the Bregman setting [37,38]. However, these algorithms focus on approximations, without guarantees of finding the exact nearest neighbors.
Computational topology. Topological concepts, such as homology groups, have been imported into computational geometry. One key concept is persistent homology [39], a stable geometric–topological descriptor of data, including point cloud data. It is the basis of the field called topological data analysis [40].
Building on geometric results of Boissonnat, Nielsen, and Nock [17], Edelsbrunner and Wagner [41] extended concepts of computational topology to the Bregman setting in 2017. In particular, basic concepts that allow for the computation of persistent homology were extended to the Bregman setting. These include generalizations of the alpha and Čech constructions [42]. One key result is the proof of contractibility of nonempty intersections of Bregman balls. Intuitively speaking, this result ensures that these constructions correctly capture the topology of data. More recent work focuses on implementation and experimental aspects [43].
The development of computational tools for Bregman divergences motivates the extension of other geometric concepts from metric spaces to the Bregman setting. In the following sections we concentrate on one of the most commonly used—the Hausdorff distance. In short, we aim to compare two sets of vectors embedded in a Bregman geometry.
In Section 5, we recall the definition of the Hausdorff distance and some of its properties. We then introduce the extension to the Bregman setting. Here, we offer two separate variants: the Bregman–Hausdorff and the Chernoff–Bregman–Hausdorff divergence. Finally, we offer an interpretation of both divergences based on the KL divergence, through the lens of information theory. In Section 6 and Section 7, we demonstrate how nearest neighbor algorithms can be used to compute the two divergences.
We hope that extending the basic concept to the Bregman setting may also open the door for further development. This would not be unprecedented: as mentioned above, the development of Bregman k-means enabled the development of efficient Bregman ball trees and Bregman vantage point trees.
5. Bregman–Hausdorff Divergence
Hausdorff distance is a very natural concept: introduced by Hausdorff [44] in 1914, it has since become the standard distance measure for comparing sets of points, used ubiquitously across multiple fields of mathematics.
Recently, the Hausdorff distance has also been used in applications. Indeed, it is a natural choice whenever two shapes need to be compared. For example, in computer vision, the Hausdorff distance has been used as a measurement for the largest segmentation error in image segmentation [45,46] and to compare 3D shapes, such as meshes [47,48].
We start this section by providing the definition of the Hausdorff distance in a metric space. We then extend this concept to the Bregman setting. The inherent asymmetry of Bregman divergences leads to a number of distinct definitions, which would all coincide in the metric setting. In devising the definitions, we are guided by the geometric and information–theoretical considerations. We elaborate on situations in which each of these new definitions finds a natural application.
Hausdorff distance in metric spaces. Given two sets P and Q in a metric space $(M, d)$, the one-sided Hausdorff distance from P to Q is defined as
$$\vec{d}_H(P, Q) = \sup_{p \in P} \inf_{q \in Q} d(p, q).$$
Similarly to the Bregman divergence, this measurement is not symmetric: in general, $\vec{d}_H(P, Q) \ne \vec{d}_H(Q, P)$. The Hausdorff distance between P and Q is the symmetrization of the two one-sided Hausdorff distances, and is given by the maximum,
$$d_H(P, Q) = \max\bigl\{ \vec{d}_H(P, Q),\ \vec{d}_H(Q, P) \bigr\}.$$
Equivalently, the one-sided Hausdorff distance can be defined using a so-called thickening. A thickening—sometimes also called an offset—of the set Q of size r consists of all those points in the ambient space M, whose distance to Q is at most r. In other words, it is the union of all balls of radius r centered at a point in Q.
The one-sided Hausdorff distance from the set P to Q is the radius of the smallest thickening of Q that contains P:
$$\vec{d}_H(P, Q) = \inf\Bigl\{\, r \ge 0 : P \subseteq \bigcup_{q \in Q} B(q, r) \,\Bigr\},$$
where $B(x, r)$ is the ball of radius r (with respect to the metric d) centered at x. (We illustrate the two one-sided Hausdorff distances in Figure 8.) The Hausdorff distance between P and Q is—as before—the maximum of the two radii, $\vec{d}_H(P, Q)$ and $\vec{d}_H(Q, P)$. It is a well-known fact that the Hausdorff distance defines a metric on the collection of closed, bounded, and nonempty subsets of the metric space $(M, d)$.
Bregman–Hausdorff divergence. To extend the notion of the one-sided Hausdorff distance to the Bregman setting, we use the geometric perspective of thickenings to help us select viable definitions.
Let F be a function of Legendre type, defined on the domain $\Omega \subseteq \mathbb{R}^d$, and let P and Q be two nonempty subsets of $\Omega$. The primal (resp., dual) thickening of Q of size $r \ge 0$ is the union of primal (resp., dual) balls of radius r, centered at the points in Q, with respect to the divergence $D_F$. We define the primal (resp., dual) (one-sided) Bregman–Hausdorff divergence from P to Q, with respect to the divergence $D_F$, as
$$\inf\Bigl\{\, r \ge 0 : P \subseteq \bigcup_{q \in Q} B_F(q; r) \,\Bigr\}
\qquad\text{and}\qquad
\inf\Bigl\{\, r \ge 0 : P \subseteq \bigcup_{q \in Q} B'_F(q; r) \,\Bigr\},$$
respectively. See Figure 9 for a visualization. We will refer to this new measurement as a divergence, rather than calling it a ‘one-sided distance’, to emphasize that it is generated by a Bregman divergence, and does not satisfy the triangle inequality.
Similarly to the one-sided Hausdorff distances, we have equivalent expressions for both the primal and the dual Bregman–Hausdorff divergence:
$$\sup_{p \in P} \inf_{q \in Q} D_F(q, p)
\qquad\text{and}\qquad
\sup_{p \in P} \inf_{q \in Q} D_F(p, q),$$
respectively. These expressions will be useful for computations. It is worth noting that the asymmetry of Bregman divergences allows for more definitions, which, however, deviate from the natural geometric interpretation of the original Hausdorff distance.
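Using these max–min expressions, both variants can be evaluated by brute force. The sketch below is ours and is quadratic in the input sizes (the algorithms of Section 6 replace the inner loop with a nearest-neighbor structure); it follows the ball conventions fixed above, so for the primal variant the divergence is computed from the centers in Q, and for the dual variant toward them.

```python
import numpy as np

kl = lambda p, q: float(np.sum(p * np.log2(p / q)))

def bregman_hausdorff(P, Q, D, variant="primal"):
    """One-sided Bregman-Hausdorff divergence from P to Q, by brute force.

    primal: max over p in P of min over q in Q of D(q, p)
    dual:   max over p in P of min over q in Q of D(p, q)
    """
    if variant == "primal":
        return max(min(D(q, p) for q in Q) for p in P)
    return max(min(D(p, q) for q in Q) for p in P)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=100)
Q = rng.dirichlet(np.ones(5), size=80)
print(bregman_hausdorff(P, Q, kl, "primal"), bregman_hausdorff(P, Q, kl, "dual"))
```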
Furthermore, it would be possible to define symmetrized (primal and dual) Bregman–Hausdorff divergence as the maximum of the two variants. However, we refrain from symmetrizing it this way, for the same reason Bregman divergences are typically not symmetrized. Namely, each of the above definitions has a natural interpretation and applications. However, we will introduce a third variant, which will be naturally symmetric.
For popular divergences with established names and abbreviations, such as the KL and IS divergences, we shorten the names of the corresponding Bregman–Hausdorff divergences to the KL–Hausdorff divergence and the IS–Hausdorff divergence, respectively.
The proposed Bregman–Hausdorff divergences can be used to compare the probabilistic predictions of a machine learning model with a reference model, as we showcase on the example of the KL divergence in the next paragraph.
Interpreting the Bregman–Hausdorff divergences with respect to the KL divergence. We can now extend the interpretation of the KL divergence presented in Section 2 to the Bregman–Hausdorff divergences based on this divergence. This new case involves not only pairs of probability vectors, but pairs of collections of such vectors.
Let P and Q be nonempty collections of probability vectors in $\Delta$. If we form a primal ball with a fixed radius $r \ge 0$ around every point $q \in Q$, then the region covered by these balls will contain all probability vectors that can approximate a vector in Q with an expected loss of at most r bits. Now, if the set P is contained in this region, then r is an upper bound on how inefficient the approximation of probabilities in Q is for some vector $p \in P$. Thus, by taking the infimum over all radii such that P is contained within the primal balls around Q, we can compute how efficient the approximation can be. The infimum is precisely the primal Bregman–Hausdorff divergence from P to Q; it measures the maximum expected efficiency loss (in bits) if P is used to reasonably approximate Q. In other words, for any probability vector $p \in P$, there exists a vector $q \in Q$, which p approximates with an expected efficiency loss of at most that many bits.
In contrast, the dual Bregman–Hausdorff divergence measures the minimum radius for which the dual thickening of Q covers P. Each dual ball $B'_F(q; r)$ contains all probability vectors that q approximates with an expected efficiency loss of at most r bits. Thus, when P is covered by the dual thickening of Q, the union of the dual Bregman balls captures the maximum expected number of bits lost for any probability vector $p \in P$ to be reasonably approximated by a probability vector in Q.
Hence, the primal and dual Bregman–Hausdorff divergences measure how P and Q can approximate each other. Specifically, the primal divergence from P to Q is the maximum loss of expected bits if any vector $p \in P$ is used to approximate some vector $q \in Q$. On the other hand, for the dual divergence, every vector in P is approximated by some probability vector in Q, but not every q will be used as an approximator. For the primal divergence, every point in P is contained in some primal ball centered at a point of Q, so every p approximates the center of a ball it is contained in.
We can use the Bregman–Hausdorff divergence in the assessment of the performance of machine learning models. Indeed, let us take two different classification models trained using the KL divergence loss (or equivalently the cross-entropy loss). We also let X and Z be two data sets, and denote the probabilistic predictions made by the models on them as P and Q, respectively. Then, a Bregman–Hausdorff divergence quantifies a divergence from the set of predictions Q towards the set of predictions P.
We stress that this measurement does not rely on any explicit pairing between the outputs of two models (i.e., there is no obvious bijection between X and Z). Although the data sets X and Z are not explicitly paired, we can still make a reasonable numerical measurement between the two. This is the case when, for example, the two models coincide, and X and Z are training and test data, respectively. In this case, the two one-sided Bregman–Hausdorff divergences between the corresponding prediction sets can be used to gauge the generalization power of a model. Importantly, this measurement is consistent with the loss function used to train the model.
Chernoff–Bregman–Hausdorff distance. We propose one more natural distance measurement. As its name suggests, the Chernoff–Bregman–Hausdorff distance is based on the notion of the Chernoff point. As before, we let F be a function of Legendre type, defined on the domain $\Omega$, and let P and Q be two nonempty subsets of $\Omega$.
For each pair of points $(p, q) \in P \times Q$, write $c_{p,q}$ for the Chernoff point of the set $\{p, q\}$, and write $C = \{\, c_{p,q} : (p, q) \in P \times Q \,\}$ for the collection of all the Chernoff points. Then, the primal (resp., dual) Chernoff–Bregman–Hausdorff distance is the smallest size of the dual (resp., primal) thickening of C that contains the union $P \cup Q$. To be more concrete, we define the primal Chernoff–Bregman–Hausdorff distance between P and Q as
$$\inf\Bigl\{\, r \ge 0 : P \cup Q \subseteq \bigcup_{c \in C} B'_F(c; r) \,\Bigr\}$$
and the dual Chernoff–Bregman–Hausdorff distance between P and Q as
$$\inf\Bigl\{\, r \ge 0 : P \cup Q \subseteq \bigcup_{c \in C} B_F(c; r) \,\Bigr\}.$$
We emphasize that each variant is named after the type of the Bregman ball that grows about the pair $\{p, q\}$, and not after the balls growing about the Chernoff points. We visualize the primal and dual Chernoff–Bregman–Hausdorff distances in Figure 10. Unlike the Bregman–Hausdorff divergences, the Chernoff–Bregman–Hausdorff distance is symmetric.
In fact, the set C can be viewed as the ‘average’ of P and Q with respect to the chosen divergence at the level of sets. In contrast to symmetrizing the Bregman–Hausdorff divergence by taking the average of the two one-sided variants, the Chernoff–Bregman–Hausdorff distance avoids mixing directions of divergence computations. In particular, in the context of the KL divergence, the corresponding Chernoff–Bregman–Hausdorff distance inherits the information–theoretical interpretation, as we will see shortly. In the case of the squared Euclidean distance, C contains the usual arithmetic average $\tfrac{1}{2}(p + q)$ for each pair $(p, q) \in P \times Q$.
Again, for the KL and IS divergences, we shorten the notation accordingly.
Interpreting the Chernoff–Bregman–Hausdorff distance for the KL divergence. As the primal Chernoff–Bregman–Hausdorff distance is defined by taking the infimum of the radius of dual Bregman balls about the Chernoff points, it gives the least number of expected bits lost using C to approximate both P and Q. Similarly, the dual Chernoff–Bregman–Hausdorff distance, where we center primal balls about each Chernoff point, gives the least number of expected bits lost to approximate C by either P or Q.
Returning to machine learning, for two models producing the prediction sets P and Q, C can be viewed as the collection of ‘average’ probability distributions [20] for every pair of predictions. Therefore, the primal Chernoff–Bregman–Hausdorff distance measures the maximum expected loss of coding efficiency (in bits) when attempting to reasonably approximate both P and Q using C. In this sense, C contains ‘joint approximators’ for pairs from $P \times Q$. Similarly, the dual Chernoff–Bregman–Hausdorff distance measures the maximum expected loss when using P and Q to approximate the set C. While the Bregman–Hausdorff divergence is applicable if either P or Q is a reference set of vectors, the Chernoff–Bregman–Hausdorff distance is a natural choice when P and Q play the same role.
6. Algorithms for Bregman–Hausdorff Divergences
In this section, we present first algorithms for the Bregman–Hausdorff divergences. The algorithms rely on the data structure for nearest neighbor search, which we outline next. For the remainder of this section, we let F be a decomposable function of Legendre type, defined on the domain $\Omega$, and let P and Q be two finite, nonempty subsets of $\Omega$.
Bregman Kd-trees. We use the Bregman Kd-tree structure and search algorithm mentioned in Section 4. In [34], Pham and Wagner experimentally show that Kd-trees for Bregman divergences are efficient in a range of practical situations. In particular, they perform better than Cayton’s Bregman ball trees, and alleviate issues with certain other implementations.
The aforementioned Bregman Kd-tree implementation works for decomposable Bregman divergences [49,50], including the squared Euclidean distance and the KL and IS divergences. One additional benefit of this algorithm is its ability to compute the nearest neighbor in either direction.
Computing the Bregman–Hausdorff divergences. We first provide an algorithm for computing the Bregman–Hausdorff divergences between P and Q.
To compute the Bregman–Hausdorff divergence from P to Q, we can use a version of the Kd-tree data structure for decomposable Bregman divergences [34]. We first construct a Kd-tree to represent P. Then, for each point of the other set, we query the tree for the nearest neighbor with respect to the appropriate direction of the divergence, maintaining the value of the largest divergence encountered so far. We denote this maintained value as r in the three algorithms presented.
In Algorithm 1, we can replace line 2 and line 4 with any nearest neighbor structure and search. As the Bregman Kd-tree can be queried for divergences computed in both directions, this algorithm also works for computing the dual Bregman–Hausdorff divergence. If a single nearest-neighbor query takes time T, the algorithm performs one query per point of the second set and therefore runs in $O(m\,T)$ time, where m is the number of queried points.
Algorithm 1 Primal Bregman–Hausdorff divergence algorithm (basic version)
Require: Point clouds P and Q of size n, m; decomposable Bregman divergence $D_F$.
Ensure: Bregman–Hausdorff divergence r
1: r ← 0
2: build Kd-tree on P
3: for each point x of the other point cloud do
4:  y ← nearest neighbor of x in the Kd-tree ▹ Find the point minimizing the divergence (in the chosen direction)
5:  d ← the divergence between x and y
6:  r ← max(r, d)
7: return r
To approximate the Bregman–Hausdorff divergence, one can instead use an approximate nearest neighbor algorithm.
We can accelerate Algorithm 1 by adding an early query termination during the Kd-tree search. To the best of our knowledge, this technique was introduced to approximate the one-sided Hausdorff distance between 3D meshes [48]. We adjust the Kd-tree query in line 4 as follows. During each query, candidates for the nearest neighbor are found. For any candidate, if the divergence between it and the query point (in the chosen direction) does not exceed the current value r, then we terminate the query. We denote this adjusted search as the shell method on line 4 in Algorithm 2. The shell method returns the nearest neighbor if the search completes, and an empty result otherwise. An illustration of the above considerations can be seen in Figure 11.
In the worst case scenario, the shell variant will have the same complexity as Algorithm 1, but we will see in Section 7 that reducing the number of complete searches provides significant speed-ups, even in high dimensions.
Algorithm 2 Primal Bregman–Hausdorff divergence shell algorithm (improved)
Require: Point clouds P and Q of size n, m, respectively; decomposable Bregman divergence $D_F$.
Ensure: Bregman–Hausdorff divergence r
1: r ← 0
2: build Kd-tree on P
3: for each point x of the other point cloud do
4:  y ← shell(x, r) ▹ Kd-tree query that terminates early once a candidate with divergence at most r is found
5:  if y is not empty then
6:   r ← the divergence between x and y
7: return r
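The effect of the early termination can be illustrated even without a tree: while searching for the nearest neighbor of a point, any candidate whose divergence does not exceed the current value of r already proves that this point cannot increase the final answer, so its search can be abandoned. Below is a linear-scan sketch of this idea (ours; Algorithm 2 applies the same test inside the Kd-tree query), written for the dual variant under the conventions above.

```python
import numpy as np

kl = lambda p, q: float(np.sum(p * np.log2(p / q)))

def bregman_hausdorff_shell(P, Q, D):
    """Dual Bregman-Hausdorff divergence from P to Q with early termination.

    Maintains r, the largest min-divergence seen so far; a point p is skipped
    as soon as some q with D(p, q) <= r is found, since p cannot increase r.
    The primal variant is obtained by swapping the arguments of D.
    """
    r = 0.0
    for p in P:
        best = np.inf
        terminated = False
        for q in Q:
            d = D(p, q)
            if d <= r:               # p already lies in the current thickening of Q
                terminated = True    # it cannot raise the answer; stop searching
                break
            best = min(best, d)
        if not terminated:
            r = max(r, best)
    return r

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=200)
Q = rng.dirichlet(np.ones(5), size=150)
print(bregman_hausdorff_shell(P, Q, kl))
```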
We will experimentally compare these implementations with a naive algorithm, where the Kd-tree search in line 4 of Algorithm 1 is replaced by a linear search.
Computing the Chernoff–Bregman–Hausdorff distance. We also provide an algorithm for the Chernoff–Bregman–Hausdorff distance. To compute the distance, we first determine the set of Chernoff points, $C = \{\, c_{p,q} : (p, q) \in P \times Q \,\}$.
To approximate the Chernoff point $c_{p,q}$ for a pair $(p, q) \in P \times Q$, we perform a bisection search along the line segment connecting p and q, parameterized as $(1 - \lambda)p + \lambda q$ with parameter $\lambda \in [0, 1]$. This search was proposed by Nielsen [20]. We assume it runs in $O(d \log \tfrac{1}{\varepsilon})$ time to be $\varepsilon$-close to the true Chernoff point, with d being the dimension of the ambient space in which the domain $\Omega$ lies. Thus this component of the algorithm runs in $O(nm\, d \log \tfrac{1}{\varepsilon})$ time. Letting T be the time complexity of the chosen proximity search algorithm, the search for the covering radius runs in $O((n + m)\, T)$ time. This gives us a total running time of $O\bigl(nm\, d \log \tfrac{1}{\varepsilon} + (n + m)\, T\bigr)$.
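The bisection itself can be sketched as follows (our illustration; it assumes, following the description above, that the Chernoff point is searched for on the segment between p and q, and that equality of the two divergences toward the candidate point is the stopping criterion).

```python
import numpy as np

kl = lambda p, q: float(np.sum(p * np.log2(p / q)))

def chernoff_point(p, q, D, eps=1e-9):
    """Approximate the Chernoff point of {p, q} by bisection on the segment.

    Searches for lam such that D(p, c) = D(q, c), where c = (1-lam)*p + lam*q;
    at that point the primal balls grown about p and q meet for the first time.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        lam = 0.5 * (lo + hi)
        c = (1.0 - lam) * p + lam * q
        if D(p, c) < D(q, c):   # the ball about p reaches c first: move c toward q
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return (1.0 - lam) * p + lam * q

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])
c = chernoff_point(p, q, kl)
print(kl(p, c), kl(q, c))   # approximately equal by construction
```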
As with the Bregman–Hausdorff divergence, Kd-trees may be replaced by other exact Bregman nearest neighbor structure and algorithms; an approximation may be used as well.
Computing the dual Chernoff–Bregman–Hausdorff distance requires mapping the input to the Legendre conjugate space. This adds a preprocessing step, but does not affect the complexity of the algorithm.
7. Experiments
In this section, we compute Bregman–Hausdorff divergences between various data sets. We work with both practical and synthetic data sets, and provide computation times. Generally, we use Algorithms 1 and 2 with a domain $\Omega \subseteq \mathbb{R}^d$, where d depends on the chosen data set. We will run experiments using the KL and IS divergences, as well as the squared Euclidean distance. We will focus on Algorithms 1 and 2 because, unlike Algorithm 3, they promise to be efficient in practice. The computed values are exact up to machine precision, since the Bregman Kd-trees we use provide exact nearest neighbors.
Algorithm 3 Primal Chernoff–Bregman–Hausdorff divergence algorithm
Require: Point clouds P and Q of size n, m; decomposable Bregman divergence $D_F$.
Ensure: Chernoff–Bregman–Hausdorff divergence r
1: r ← 0
2: C ← empty array of size n · m
3: i ← 0
4: for p in P do
5:  for q in Q do
6:   C[i] ← Chernoff(p, q) ▹ Compute the Chernoff points.
7:   i ← i + 1
8: build Kd-tree on C
9: for each point x of P ∪ Q do
10:  c ← nearest Chernoff point of x in the Kd-tree ▹ Find the Chernoff point minimizing the divergence
11:  d ← the divergence between x and c
12:  r ← max(r, d)
13: return r
Compiler and hardware. Software was compiled with Clang 14.0.3. The experiments were performed on a single core of a 3.5 GHz ARM64-based CPU with 4 MB L2 cache using 32 GB RAM.
Data sets. We use predictions from machine learning models and synthetic data sets. We train two neural networks on a classification task using CIFAR100, which has 50,000 training images and 10,000 test images. Specifically, we perform transfer learning using EfficientNetB0 [51] pretrained on ImageNet as a backbone. The first model is trained with fine-tuning and the second model is trained without. Both models are trained using the KL divergence as the loss function. The two models achieve 80.22% and 71.74% test accuracy, respectively. From each model, we produce two sets of predictions: one on the training data and one on the test data. The synthetic data sets are drawn uniformly from the open simplex in dimension 50, 100, and 250. The target data set, P, has 100,000 sample points; the query data set, Q, has 20,000.
Bregman–Hausdorff computations. One concrete motivation for this measurement was the need to quantify how well one set of predictions approximates another. We set this up as a computation of the Bregman–Hausdorff divergence. We are especially interested in comparing the probabilistic predictions arising from two models of different quality, as well as the training and test data.
Additionally, we aim to investigate to what extent the choice of the underlying Bregman divergence affects the resulting Bregman–Hausdorff divergence. Of course, in practice, one would use an appropriate divergence. For example, if the data represent probability vectors, the KL divergence, and by extension the KL–Hausdorff divergence, is a natural choice. However, in practice the (squared) Euclidean distance and the related Hausdorff distance are often used, mostly due to limitations of computational tools. The following experiments confirm that the choice of divergence has a significant effect on the outputs of the corresponding Bregman–Hausdorff divergence.
Using the relative entropy loss. In Table 2, we compare the Bregman–Hausdorff divergence between the four sets of predictions. Divergences change by row, data sets by column, and the units for the KL divergence are in bits. We choose a subset of the possible pairings to highlight the asymmetry and some key interpretations, and also for brevity.
In the row of Table 2 containing the values of the KL–Hausdorff divergence, the lowest value is 1.764. This value shows that the corresponding set of predictions approximates the other with a maximum expected loss of 1.764 bits, which is the relation we would expect between a model’s predictions on the training and test data sets. In contrast, the largest value in the row corresponding to the (squared) Euclidean distance is 0.371, and the relative order of the values in this row is the reverse of how we expect the soft predictions from the training data and test data to behave. On the surface, this would indicate that one prediction set is a poor predictor of the other. However, as the models were trained to minimize the KL divergence, and because of the difference between the Euclidean geometry and the Bregman geometry induced by the KL divergence, values from this row do not carry the information–theoretical interpretation. This shows the importance of analyzing these models using the proposed KL–Hausdorff divergence rather than the standard one-sided Hausdorff distance.
We see a similar effect for the computations based on the IS divergence; the results suggest that the geometry induced by the IS divergence is drastically different from the geometry induced by the KL divergence, and thus the computations lack an obvious interpretation in this context. However, for comparing speech and sound data (as opposed to probabilistic predictions) the IS–Hausdorff divergence would be the preferred choice. In this case, the measurement would inherit an interpretation from the IS divergence [11]. So, while in practice it is clear that the KL–Hausdorff divergence is the only reasonable choice for the data at hand, we provide the table to demonstrate that our algorithms can handle various Bregman divergences and that the outputs are highly sensitive to the chosen divergence.
Using the mean squared error loss. In this case, we train the two models by minimizing the mean squared error (MSE) loss, which corresponds to the squared Euclidean distance. Here, the training accuracy of the first model is 73.97% while the training accuracy of the second is 69.18%. For numerical stability, we decrease the learning rate for this loss function. We stress that, like before, these models output classification predictions, interpreted as vectors in the probability simplex.
Similarly, we produce two sets of predictions per model, one on the training data and one on the test data. We compute the Bregman–Hausdorff divergences for the same pairs of sets as in Table 2 and record them in Table 3.
In contrast to minimizing with respect to the KL divergence, all KL–Hausdorff divergence values computed from the SE-minimized prediction sets are smaller. This tells us that these prediction sets are more closely clustered together in the geometry induced by the KL divergence than their counterparts from above. In contrast, the divergence is larger when comparing the training data predictions of the two models, showing that the two resulting prediction sets are less clustered when changing the minimization function.
Timings. Finally, we compare the computation speeds of the algorithms using the Bregman Kd-tree (Algorithms 1 and 2) and a version that uses a linear search. The results can be seen in Table 4. The left columns are computation times for two sets of predictions. For the Kd-tree algorithm without the shell acceleration, we see significant speed-ups in the left two columns, while the right three columns show similar times for the two searches. As predictions from classification models tend to cluster near the vertices of the simplex and randomly generated points have more spread across the simplex, we see that the distribution of points heavily influences the speed of computations. Similarly, the run time increases as the dimension increases. This is expected, as the performance of Kd-trees is known to degrade in higher dimensions [52,53].
Perhaps surprisingly, when we apply early termination via the shell method, we see that we maintain considerable speed-ups even in high dimensions. While the worst case complexity of the shell method is still equivalent to the Kd-tree search, we see up to 1000× speed-up even in dimension 250.
These speed-ups show that the Bregman–Hausdorff divergence is an efficient tool for the comparison of the outputs of machine learning models—as well as in other situations that require the comparison between two sets of vectors.
Alternative implementations. There are alternative approaches to Bregman nearest neighbor search, but they come with certain limitations. To implement computations for Bregman–Hausdorff divergences using Cayton’s Bregman ball trees [31], significant changes to the source code would be required. However, as shown in [34], Bregman Kd-trees are a preferred choice as they are more efficient in practice. Implementations of ball trees and VP trees (due to Nielsen, Piro, and Barlaud) are not usable on modern systems due to severe compilation issues. Our comparison is therefore limited to our Kd-tree algorithms and the linear search.
8. Conclusions
Modern machine learning models commonly rely on optimizing the Kullback–Leibler divergence. As a result, the vectors they produce, such as the probabilistic predictions of a classifier, are naturally measured using this divergence. However, many standard tools that could be used to analyze such data are limited to the Euclidean distance.
One geometric tool that found applications in various fields in the Euclidean context is the Hausdorff distance. It serves as a distance between two collections of points—without requiring any pairing or alignment between the inputs. While potentially useful also in machine learning, the standard Hausdorff distance is limited to the metric setting—as opposed to the Kullback–Leibler divergence and other Bregman divergences.
This paper is an attempt to bridge the gap between the necessity of using non-metric distances on the one hand, and the limited familiarity with such concepts and the paucity of computational tools supporting them on the other. Specifically, we outlined the field of Bregman geometry, including descriptions of existing tools that can be used in the above context. Importantly, we extended a popular geometric tool to this new setup: We defined several variants of Bregman–Hausdorff divergences, which allow one to compare two collections of vectors according to a chosen Bregman divergence. We highlighted situations in which each of these variants can be meaningfully used. In particular, when the underlying divergence is the Kullback–Leibler divergence, we explained that its Hausdorff counterpart inherits a clear information–theoretical interpretation.
To make these theoretical considerations practical, we also designed novel algorithms for computing the newly defined Bregman–Hausdorff divergences. To ensure efficiency, we applied the Bregman version of the Kd-tree data structure for exact nearest neighbor search, which we recently developed. We exploited the special structure of the Bregman–Hausdorff computations to better leverage the Kd-tree data structure. We implemented these three algorithms.
We benchmarked our implementations in scenarios arising from a machine learning setup. Our experiments show that our most optimized algorithm performs well in this scenario. Specifically, it achieves several orders of magnitude speed-up compared to more basic algorithms. This is surprising, given that the Kd-trees the algorithm uses are known to perform poorly in such high-dimensional scenarios. In fact, our straightforward application of Kd-trees scales significantly worse with dimension. Understanding this speed-up theoretically is an interesting future direction.
Overall, we hope that the Bregman–Hausdorff divergence will find applications. In this paper, we focused on the theoretical setup, interpretations of the new measurement, as well as efficiency aspects. More broadly, we hope that geometric tools supporting Bregman divergences—and not only the Euclidean distance or other metrics—will eventually become more popular in machine learning.