Article
Peer-Review Record

Bregman–Hausdorff Divergence: Strengthening the Connections Between Computational Geometry and Machine Learning

Mach. Learn. Knowl. Extr. 2025, 7(2), 48; https://doi.org/10.3390/make7020048
by Tuyen Pham 1,*, Hana Dal Poz Kouřimská 2 and Hubert Wagner 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 March 2025 / Revised: 7 May 2025 / Accepted: 13 May 2025 / Published: 26 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

25 March 2025 - make-3553603 report


Bregman–Hausdorff divergence: strengthening the connections between computational geometry and machine learning
by Tuyen Pham, Hana Dal Poz Kouřimská, and Hubert Wagner


The paper discusses Bregman geometry, which I would like to call Bregman spaces instead. It justifies and motivates the mathematical formalism based on information theory and the various entropy-related concepts. Some commonly used and relevant decomposable Bregman divergences, which I prefer to call constructions, are discussed as examples to be used later, along with extending the Hausdorff distances to such Bregman spaces. Three algorithms have been introduced based on the use of Kd-trees, but the discussion points out that other geometric algorithms can be adapted to use Bregman spaces instead of the usual metric spaces. In Section 3, "Background on Bregman geometry", the paper presents the geometric interpretations of the primal Bregman ball and the dual Bregman ball; this is very similar to topological spaces where the topology is defined by balls around points. In this respect, the relationship between the primal and dual topology is mentioned in connection with the corresponding Legendre transform. This section also discusses Chernoff points but does not seem to connect them to weighted averages, only to the midpoint when the concept is applied to the Euclidean case. In Part II the paper focuses on applying the mathematical tools discussed to compare two models, M1 and M2, starting with Section 7, "Experiments". Two tables show the outcomes of the experiments, which in my opinion need significant extra work on the methodology employed; the paper also needs a more elaborate conclusion section. Some of the work shown and discussed in the Experiments section seems misleading to me.

Thus, my conclusion is that this is an interesting paper. After a revision it could be publishable, but it needs some more major work to improve its claims. Below are some of my suggestions for improving the paper.

I) As stated in line 649, "the Bregman–Hausdorff divergence will be a valuable tool", I also anticipate that such Bregman–Hausdorff analyses would be valuable for the advancement of machine learning and AI systems. However, this has to be demonstrated by utilizing the scientific method with appropriate experiments and setups. To me, it does not make sense to compare Kd-shell to the Linear Search algorithm to show qualitative speed-up, as shown in Table 2 and commented on in line 644. It is more appropriate to compare Kd-shell to Kd-tree and Kd-tree to Linear, but Linear has to be done in the best way possible, that is, the KdTree data structure may have to be replaced by a data structure suitable for the specific bisection search. So the Speed-up row of numbers in Table 2 is not needed; people can see the speed improvement.

II) It does not make sense to me to train models by optimizing on the Kullback–Leibler construction and then to test and compare by using the Itakura–Saito (IS) framework. The authors should also train models on relevant data, such as speech and sound-related data, by optimizing with the Itakura–Saito (IS) measure and then comparing the models. Similarly, one has to consider a data set that is suitable for the squared Euclidean distance (SE) as well, and then compare the models.


III) Since some of the terminology feels awkward to me, and probably to other readers too, I suggest elaborating on why specific terms were used instead of others. For example, instead of divergence, one can use one-sided distance or asymmetric measure. Elements of topology have been utilized as elements of geometry, but there is no standard distance measure. However, an asymmetric measure could easily be symmetrized, but such explorations were not touched upon until the application of the Hausdorff ideas. So, I suggest the authors carefully introduce such potentially controversial terminology.


Minor comments and suggestions:

1) Some labels and axes in Fig. 1 must be corrected. It seems that x & y must be q or p.

2) line 470: P is missing as the name for M1;

3) lines 541 and 542 are missing units for the value in line 542, and similarly in line 567.

4) in the algorithms KdTree.query and KdTree.shell_query, the utilization of the extra argument max_haus was not explained well.

5) The text around lines 587 to 580 is confusing. Is M1 a pretrained transfer learning model with fine-tuning while M2 does not have fine-tuning? What is the fine-tuning on?

6) Why aren't all possible combinations shown in Table 1, e.g. (trn2 || tst1), (tst2 || trn2), etc.?

7) In Table 2, one can remove the Speed-up rows as discussed in II, and also the P Size and Q Size rows, since they are the same.

8) How can one assess the errors in the values obtained and shown in the two tables?


If the authors are to modify the paper, it will be best to use \color{blue} or any other suitable color to indicate the more substantial changes in the next version. There is no need to show the old text, only where the new substantial changes are, beyond a few words or major formula modifications.

 

Author Response

Summary of responses to Reviewer 1:

Thank you for your time; we appreciate your careful reading of the paper and the detailed comments.

In overview, we have made the following changes:

  1. We ran new experiments in which the Bregman–Hausdorff divergences are computed for two models trained with respect to other Bregman divergences.
  2. We explain our preference for the term 'divergence' rather than 'one-sided distance' or 'metric'.
  3. We have extended the conclusion of our paper.
  4. We have addressed the minor comments and suggestions.
  5. New or significantly altered text is marked in blue, as suggested.

Detailed responses:

Comment 1: This section also discusses Chernoff points but does not seem to connect them to weighted averages, only to the midpoint when the concept is applied to the Euclidean case.

Response 1: We now clarify the relationship between the Chernoff point and the midpoint (arithmetic mean) for Bregman divergences and mention that the two coincide in the Euclidean case. We also mention some other roles that the midpoint plays, e.g. in the Jensen–Shannon divergence. We had decided not to elaborate on the relationship between the Chernoff point and weighted means, as we do not use this property.

Comment 2: In Part II the paper focuses on applying the mathematical tools discussed to compare two models, M1 and M2, starting with Section 7, "Experiments". Two tables show the outcomes of the experiments, which in my opinion need significant extra work on the methodology employed; the paper also needs a more elaborate conclusion section. Some of the work shown and discussed in the Experiments section seems misleading to me.

Response 2: We have expanded upon the conclusion of the experiments and of the overall paper. In particular we clarify the role of the experimental results reported in Table 1, which primarily aim to show that the choice of the underlying divergence results in very different measurements.

Comment 3: As stated in line 649, "the Bregman–Hausdorff divergence will be a valuable tool", I also anticipate that such Bregman–Hausdorff analyses would be valuable for the advancement of machine learning and AI systems. However, this has to be demonstrated by utilizing the scientific method with appropriate experiments and setups.

Response 3: We agree that the usefulness of the method should really be tested in specific, practical scenarios and compared with an alternative method.  We definitely did not intend to make such a claim without supporting it. 

The experiments aimed to show that (1) the new measurement is efficiently computable in practice; and (2) choosing a different underlying divergence will result in very different values. Along with the theoretical justification we provide, we hope the paper is enough to interest practitioners in applying this tool.

Such experiments, showcasing the usage of this new tool in various practical situations, fall outside the scope of this paper. While we agree they would be highly valuable, this would be another, large project, which would significantly delay the paper.

 

Comment 4: To me, it does not make sense to compare Kd-shell to the Linear Search algorithm to show qualitative speed-up, as shown in Table 2 and commented on in line 644. It is more appropriate to compare Kd-shell to Kd-tree and Kd-tree to Linear [...]

Response 4: In this case we stand behind our decision. We had decided that the user would like to see how much is gained if one switches from a simple-to-implement baseline algorithm (Linear Search) to a more complex algorithm. In this case it's clear that the Kd-tree algorithm scales poorly, so we viewed the "Kd-shell to Linear" speedup as the most relevant information. (Of course, the information about individual speedups can be inferred from the table.)

Comment  5: [...] but Linear has to be done in the best way possible, that is, the KdTree data structure may have to be replaced by a data structure suitable for the specific bisection search. 

Response 5: We definitely agree that ideally one would compare the two implementations using various *exact* Bregman nearest neighbor data structures.

We are aware of two viable public implementations: (1) Bregman ball trees by Cayton and (2) improved Bregman ball trees and VP trees by Nielsen, Piro, and Barlaud. In our previous paper on Bregman Kd-trees, we carefully compared with Cayton’s implementation, and the Kd-trees were consistently more efficient. We have experienced issues with compiling and using (2) on a modern system. We’ve added this explanation into the paper as well.

Comment 6: So the Speed-up row of numbers in Table 2 is not needed; people can see the speed improvement.

Response  6: The speed-up line is there to provide a quick summary for the reader.

Comment 7: It does not make sense to me to train models by optimizing on the Kullback–Leibler construction and then to test and compare by using the Itakura–Saito (IS) framework. The authors should also train models on relevant data, such as speech and sound-related data, by optimizing with the Itakura–Saito (IS) measure and then comparing the models. Similarly, one has to consider a data set that is suitable for the squared Euclidean distance (SE) as well, and then compare the models.

Response 7: This is an important point, and we reworked this part of the paper. 

We definitely agree that the mismatch between the divergence used for training and for subsequent analysis makes no sense in practice! This is one of the things we tried to demonstrate in this table – specifically that the choice of divergence has a big influence on the outcome. 

(As a side note, this mismatch is something I have unfortunately encountered: due to a lack of tools, and/or of familiarity with other distances/divergences, data that should be measured with KL (e.g. probabilistic predictions of a network trained with cross/relative entropy) is instead measured with the Euclidean distance.)

Let us clarify our setup (which we also clarified in the paper). As the two models being compared are classification models, they are trained via the KL divergence regardless of the origin of the data. So even if the input data was, for example, speech data compared using the Itakura-Saito divergence, in this setup we would still work with probabilistic predictions measured with KL.

However, we definitely agree that presenting data measured using another divergence would be valuable. To this end, we have run and included new results for models trained with the Mean Squared Error loss (corresponding to the squared Euclidean distance).
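For concreteness, here is a minimal sketch (not our experimental code) of the three decomposable Bregman divergences discussed here, evaluated on strictly positive vectors such as probabilistic predictions; conventions, such as a factor of 1/2 for SE, may differ from the paper.

```python
import numpy as np

# Generalized Kullback-Leibler divergence D_KL(p || q); on the probability
# simplex the "- p + q" terms cancel, recovering relative entropy.
def kl(p, q):
    return np.sum(p * np.log(p / q) - p + q)

# Itakura-Saito divergence D_IS(p || q).
def itakura_saito(p, q):
    return np.sum(p / q - np.log(p / q) - 1.0)

# Squared Euclidean distance; some conventions include a factor of 1/2.
def squared_euclidean(p, q):
    return np.sum((p - q) ** 2)

# Two probabilistic predictions (strictly positive, summing to 1).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), itakura_saito(p, q), squared_euclidean(p, q))
```

Evaluating the same pair of vectors under the three divergences already gives quite different values, which is the point Table 1 is meant to illustrate.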


Comment 8: Since some of the terminology feels awkward to me, and probably to other readers too, I suggest elaborating on why specific terms were used instead of others. For example, instead of divergence, one can use one-sided distance or asymmetric measure. Elements of topology have been utilized as elements of geometry, but there is no standard distance measure. However, an asymmetric measure could easily be symmetrized, but such explorations were not touched upon until the application of the Hausdorff ideas. So, I suggest the authors carefully introduce such potentially controversial terminology.

Response 8: We maintain the use of the term 'divergence' for our new definitions to stay true to the fact that they are generated from Bregman divergences. One reason not to use 'one-sided distance', for example, is that the one-sided Hausdorff distance satisfies the triangle inequality while the Bregman–Hausdorff divergence does not. We have added a small passage addressing this shortly after our definitions. Additionally, we tried to avoid the word 'measure' as it carries different connotations; in particular, it can easily be confused with 'probability measures', since we talk about probabilities.
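To make the contrast concrete, the displays below sketch the objects involved; the notation and argument order in the manuscript may differ, and with an asymmetric divergence other variants are possible.

```latex
% Bregman divergence generated by a strictly convex function F
\[
  D_F(p \,\|\, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle .
\]
% One-sided (directed) Hausdorff distance with respect to a metric d
\[
  \vec{d}_H(P, Q) = \sup_{p \in P} \, \inf_{q \in Q} \, d(p, q) .
\]
% One variant of the Bregman--Hausdorff divergence: d is replaced by D_F
\[
  \vec{D}_F(P \,\|\, Q) = \sup_{p \in P} \, \inf_{q \in Q} \, D_F(p \,\|\, q) .
\]
```

The directed Hausdorff construction inherits the triangle inequality from the underlying metric d, which is exactly what is lost once d is replaced by an asymmetric Bregman divergence.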

Comment 9: Minor comments and suggestions:

Response 9: We appreciate the careful reading of the paper. We have corrected all the smaller mistakes (except that we decided to keep the speed-up line, as mentioned above).

 

Comment 10: How can one assess the errors in the values obtained and shown in the two tables?

Response 10: As for the values of the Bregman–Hausdorff divergence, the computed values are exact (up to machine precision). This is because the Kd-tree we use allows for exact nearest neighbor search. As for the timings, these are averaged values.


Comment 11: If the authors are to modify the paper, it will be best to use \color{blue} or any other suitable color to indicate the more substantial changes in the next version. There is no need to show the old text, only where the new substantial changes are, beyond a few words or major formula modifications.

Response 11: That’s a good idea, done.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper introduces a Bregman-type divergence between sets of vectors, which is obtained by replacing the metric in the definition of the Hausdorff distance by a standard Bregman divergence. Algorithms for calculating the newly defined divergences are presented. Their usefulness is tested experimentally by calculations comparing a pair of neural networks.

The paper is well-written, be it somewhat lengthy.

l133 Relative entropy/divergence is also used in other domains, e.g. Statistical Physics and Information Geometry.

l137 It may disturb readers that the divergence is applied to vectors in R^n while the definition in l130 is given for probability distributions. Explain that the definition follows in Section 3.

 

Author Response

Responses to Reviewer 2:

Thank you for your time and valuable feedback. As detailed below we made appropriate changes. Significant changes are marked in blue.

Comment 1: The paper introduces a Bregman-type divergence between sets of vectors, which is obtained by replacing the metric in the definition of the Hausdorff distance by a standard Bregman divergence. Algorithms for calculating the newly defined divergences are presented. Their usefulness is tested experimentally by calculations comparing a pair of neural networks.

Response  1: That’s a good summary.

Comment 2: The paper is well-written, [...]

Response 2: Thank you!

Comment 3: [...] be it somewhat lengthy.

Response 3: For what it's worth, we were surprised how many words seemed to be needed to carefully explain these concepts in a self-contained way.

Comment 4: l133 Relative entropy/divergence is also used in other domains, e.g. Statistical Physics and Information Geometry.

Response 4: We now mention these additional fields in which relative entropy is used when introducing it.

Comment 5: l137 It may disturb readers that the divergence is applied to vectors in R^n while the definition in l130 is given for probability distributions. Explain that the definition follows in Section 3.

Response 5: Thank you for raising this issue. It's a valuable comment, pointing out a problem with our introduction. We now clarify much earlier that we primarily work with discrete probability distributions, which we view as vectors in R^d restricted to the open probability simplex (which we now also define earlier).
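For reference, assuming standard notation (the manuscript's exact symbols may differ), the open probability simplex we restrict to can be written as:

```latex
% Open probability simplex in R^d: strictly positive coordinates summing to 1
\[
  \Delta^{\circ} = \Bigl\{\, x \in \mathbb{R}^{d} \;:\; x_i > 0 \text{ for all } i, \ \sum_{i=1}^{d} x_i = 1 \,\Bigr\} .
\]
```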

This was very useful feedback. If there are similar issues, we would be happy to iron them out. It’s really important for us to make this paper readable to the members of the machine learning, information extraction and related communities!

Reviewer 3 Report

Comments and Suggestions for Authors

In this paper, the authors did 2 jobs: for the first part they gave a survey on the asymmetric Bregman geometry, starting with a good presentation of the entropy and cross entropy concepts; for the second part they gave a new definition of what they called the Bregman-Hausdorff divergence and gave some relevant algorithms for computing them. Since there is no proof, the validation is done by experiments. The paper is in general well-written.

What I am not quite happy with is the validation part, i.e., pages 18-19. Does your data have ground truth? It would be good to have some empirical comparison with the ground truth. If not, then maybe some extra details should be given.

 

Author Response

Responses to Reviewer 3:

Thank you for your time and valuable comments. As detailed below, we made appropriate changes. Significant changes are marked in blue.

Comment 1: In this paper, the authors did 2 jobs: for the first part they gave a survey on the asymmetric Bregman geometry, starting with a good presentation of the entropy and cross entropy concepts; for the second part they gave a new definition of what they called the Bregman-Hausdorff divergence and gave some relevant algorithms for computing them. 

Response 1: We appreciate the positive summary.

Comment  2: Since there is no proof, the validation is done by experiments. 

Response 2: We decided that a formal correctness proof of these algorithms is not needed, since they are relatively straightforward. One aspect which did require extensive proofs was the correctness of the Kd-tree method in the Bregman case. This is highly intricate, and is described in our previous paper (which we cite and which is now published).

Comment 3: The paper is in general well-written. 

Response 3: We did our best, but if there is anything specific we could improve, we would be happy to do this. It’s really important for us to make this paper readable to the members of the machine learning, information extraction and related communities!

Comment 4: What I am not quite happy with is the validation part, i.e., pages 18-19. Does your data have ground truth? It would be good to have some empirical comparison with the ground truth. If not, then maybe some extra details should be given.

Response 4: We hope we understood the question as intended. 

In most cases there is no pre-existing ground truth for the outputs of the proposed algorithms, since the concept they compute is new. We did compare their outputs to the outputs of existing algorithms in the Euclidean case (they were the same in all cases). The linear search we implemented (and use as a baseline) is so straightforward that we essentially treat it as ground truth. 
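As an illustration of why we consider it safe to use as a reference, a linear-search baseline for a one-sided Bregman–Hausdorff computation takes only a few lines; the sketch below (not our exact implementation) assumes the KL-based variant and a particular argument order.

```python
import numpy as np

# Generalized Kullback-Leibler divergence D_KL(p || q).
def kl(p, q):
    return np.sum(p * np.log(p / q) - p + q)

# Linear-search baseline: max over p in P of the min over q in Q of D(p || q).
def directed_bregman_hausdorff(P, Q, divergence=kl):
    return max(min(divergence(p, q) for q in Q) for p in P)

# Example: two small sets of probabilistic predictions (each row sums to 1).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=5)
Q = rng.dirichlet(np.ones(3), size=7)
print(directed_bregman_hausdorff(P, Q))
```

Any faster data structure, such as the Kd-tree variants, can then be checked against this quadratic-time reference.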

Apart from the above (in response to other reviewers' comments), we reworked this section.

 

Overall, it’s really important for us to make this paper readable to the members of the machine learning, information extraction and related communities. Please let us know if there are any other issues we should work on.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is significantly improved and the authors have taken the time to address the suggestions made in my report. I have only one minor comment: in line 652, "provide" appears twice.

 
