Peer-Review Record

Aggregating Image Segmentation Predictions with Probabilistic Risk Control Guarantees

Mathematics 2025, 13(11), 1711; https://doi.org/10.3390/math13111711

by Joaquin Alvarez and Edgar Roman-Rangel^*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 4 April 2025 / Revised: 13 May 2025 / Accepted: 21 May 2025 / Published: 23 May 2025

(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

How the combinations are considered of pretrained predictive algorithms.
How the threshold value was considered for the individual prediction set of each agent to obtain a merged weighted-average set predictor.
How finite-sample statistical guarantees by leveraging valid p-values from concentration inequalities for bounded random variables.
Algorithm 1 needs some more clarity on how they were implemented. Need a clear, brief explanation.
Is the writer using a hybridised LTT framework? (Figure 2) The content in this section presents some ambiguity during the reading process.
How do the writers get a guarantee to reallocate the risk budget?
Explain the process of how polyp segmentation was implemented.
The visualisation is weak. We require images with high-quality resolution. Make sure to increase the font size. Ensure that every image includes the x- and y-axis.
We need statistical data on how the table 2 values were generated

Author Response

We thank the reviewers for their valuable revision of our manuscript. We have attended to their comments, which have helped improve the presentation of our work.

Please find below a point-by-point response to the reviewers’ comments. Additionally, we now present an improved version of our document, with the revised sections highlighted so that our responses are easy to identify.

Reviewer 1

Comment 1. How the combinations are considered of pretrained predictive algorithms.

Response. We appreciate this comment, which made us realize that the combination process was not clear enough. To make it clearer, we have separated Equation (1) from the paragraph where it was initially placed. Its explanation spans from lines 191 through 195. We also included an explanation in lines 316 to 320.

Comment 2. How the threshold value was considered for the individual prediction set of each agent to obtain a merged weighted-average set predictor.

Response. Thank you for your question; we noticed that this detail was not clear. We treat the interval of the guarantee as a finite budget, which we allocate through a uniform distribution, whose intervals were set empirically. We have improved the explanation of this detail in Equation (8) and in lines 316-320.

Comment 3. How finite-sample statistical guarantees by leveraging valid p-values from concentration inequalities for bounded random variables.

Response. We acknowledge that this point was unclear. We use hypothesis testing for calibration. We have highlighted this explanation in lines 232 to 235.
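As an illustration of the calibration-by-hypothesis-testing idea mentioned in this response, the following is a minimal sketch and not the authors' implementation. It assumes a loss bounded in [0, 1], a Hoeffding-type p-value for the null hypothesis "the risk at this threshold exceeds alpha", and a Bonferroni correction over the candidate thresholds. The function names, the choice of Hoeffding's inequality, and the Bonferroni correction are all assumptions made for the example; the manuscript may use a different inequality or testing procedure.

```python
import math


def hoeffding_p_value(empirical_risk, n, alpha):
    """P-value for H0: true risk > alpha, for a loss bounded in [0, 1].

    Assumption: Hoeffding's inequality is used; it gives
    p = exp(-2 * n * max(alpha - empirical_risk, 0)^2).
    """
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def select_thresholds(empirical_risks, n, alpha, delta):
    """Return the thresholds whose null hypothesis is rejected.

    empirical_risks maps each candidate threshold to its empirical risk
    on n calibration samples. The error budget delta is split evenly
    across the candidates (Bonferroni), which is an assumption here.
    """
    m = len(empirical_risks)
    return [
        lam
        for lam, risk in empirical_risks.items()
        if hoeffding_p_value(risk, n, alpha) <= delta / m
    ]


if __name__ == "__main__":
    # Hypothetical calibration results, for illustration only.
    risks = {0.3: 0.02, 0.5: 0.05, 0.7: 0.12}
    print(select_thresholds(risks, n=1000, alpha=0.1, delta=0.1))
```

Under these assumptions, every returned threshold has risk at most alpha simultaneously with probability at least 1 - delta over the calibration sample.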

Comment 4. Algorithm 1 needs some more clarity on how they were implemented. Need a clear, brief explanation.

Response. We have provided an explanation of its logic and process in lines 284-290. Additionally, we also provided details about coding language and libraries in lines 459-462, and 475-476.

Comment 5. Is the writer using a hybridised LTT framework? (Figure 2) The content in this section presents some ambiguity during the reading process.

Response. We have noticed that this explanation was also unclear. Indeed, we use a hybridized LTT framework. We have made it clearer in lines 316-320.

Comment 6. How do the writers get a guarantee to reallocate the risk budget?

Response. We defined it such that it is uniformly distributed across all cases considered during the evaluation. The explanation in the manuscript is the same as that for Comment 2. That is, we have improved the explanation of this detail in Equation (8) and in lines 316-320.

Comment 7. Explain the process of how polyp segmentation was implemented.

Response. Thank you for this question. We have provided details about the polyp segmentation, including base models, implementation, hyperparameters, and language. These details are in lines 459-461, 475-476, and 478-479.

Comment 8. The visualisation is weak. We require images with high-quality resolution. Make sure to increase the font size. Ensure that every image includes the x- and y-axis.

Response. We have revised and improved the images. Thank you.

Comment 9. We need statistical data on how the table 2 values were generated.

Response. We thank you for this observation. We noticed that it can be unclear as the reading progresses. We have made it clearer in the Experimental Setting section, in lines 454, 461, 504-505.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents a promising framework for risk-controlled aggregation of segmentation models, but revisions are necessary to address methodological ambiguities, improve experimental rigor, and enhance clarity. Addressing these issues will strengthen the contribution and ensure broader impact in both AI and medical communities.

1. The distinction between "prediction sets" (Equation 1) and "majority vote prediction sets" (Equation 3) is not clearly motivated. The rationale for using two different formulations (weighted average vs. majority vote) needs explicit justification.

2. The statement in Equation (4) claims marginal coverage but does not explicitly address conditional coverage (e.g., per-image guarantees). This could mislead readers unfamiliar with conformal prediction.

3. The experiments compare aggregated models against individual constituent models but omit comparisons to some latest existing ensemble methods.

4. Figure 1 (example aggregation) is referenced but not included in the provided content.

5. The authors should follow up on the latest research progress.

6. Tables 2 and 3 report "FFR" (likely a typo for FPR) and lack units or confidence intervals.

7. Results in Tables 2 and 3 report empirical FPR/FNR reductions (e.g., "50% reduction") but lack statistical tests (e.g., p-values) to validate these claims.

8. References [13,18,40] are authored by the same group, risking overemphasis on in-group work.

Comments on the Quality of English Language

1. The phrase "high-probability risk control guarantees" in the abstract is vague.

2. Abstract: "without compromising the false negative rate at any user-specified false negative rate tolerance" → redundant phrasing.

Author Response

We thank the reviewers for their valuable revision of our manuscript. We have attended to their comments, which have helped improve the presentation of our work.

Reviewer 2

Comment 1. The distinction between "prediction sets" (Equation 1) and "majority vote prediction sets" (Equation 3) is not clearly motivated. The rationale for using two different formulations (weighted average vs. majority vote) needs explicit justification.

Response. We agree that this difference was not so clear. It is worth mentioning that the goal of this work is to propose a methodology that works well for both cases. Therefore, we have included the corresponding explanation in lines 202-204.
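To make the two formulations discussed in this comment concrete, the sketch below contrasts a weighted-average aggregation with a majority vote over binary segmentation masks. It is an illustration under stated assumptions, not a reproduction of Equations (1) and (3): masks are flat lists of 0/1 pixel labels, the function names are hypothetical, and ties at exactly one half are counted as included, which may differ from the manuscript's convention.

```python
def weighted_average_set(masks, weights, threshold):
    """Keep a pixel when its weighted vote share reaches the threshold.

    masks: list of equally sized 0/1 lists, one per model.
    weights: non-negative weight per model (normalised internally).
    """
    total = sum(weights)
    n_pixels = len(masks[0])
    return [
        1 if sum(w * m[i] for w, m in zip(weights, masks)) / total >= threshold else 0
        for i in range(n_pixels)
    ]


def majority_vote_set(masks):
    """Majority vote as the special case of equal weights and threshold 1/2."""
    return weighted_average_set(masks, [1.0] * len(masks), 0.5)


if __name__ == "__main__":
    # Three hypothetical model outputs over four pixels.
    masks = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0]]
    print(majority_vote_set(masks))
    print(weighted_average_set(masks, [0.6, 0.2, 0.2], 0.5))
```

In this reading, the majority vote is one point in the family of weighted averages, which is one way to see why a single calibration procedure could serve both cases.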

Comment 2. The statement in Equation (4) claims marginal coverage but does not explicitly address conditional coverage (e.g., per-image guarantees). This could mislead readers unfamiliar with conformal prediction.

Response. Thank you for pointing this out. Indeed, at this point we investigate the guarantees on average, not per image. We included this explanation in lines 218-222.
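The distinction raised in this comment can be written out as follows. The notation is an assumption chosen for illustration (loss L, set predictor C with calibrated threshold \hat\lambda, tolerance \alpha, error level \delta) and is not copied from Equation (4) of the manuscript.

```latex
% Marginal (on-average) guarantee: the risk is an expectation over test
% images (X, Y), and the probability is over the calibration sample.
\[
  \mathbb{P}\Big( \mathbb{E}\big[ L\big(C_{\hat\lambda}(X), Y\big) \big] \le \alpha \Big) \ge 1 - \delta
\]
% Conditional (per-image) guarantee: a stronger statement that the
% marginal guarantee does not imply.
\[
  \mathbb{E}\big[ L\big(C_{\hat\lambda}(X), Y\big) \,\big|\, X = x \big] \le \alpha
  \quad \text{for every image } x
\]
```

A predictor can satisfy the first statement while exceeding the tolerance on particular images, as long as the average over images stays below \alpha.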

Comment 3. The experiments compare aggregated models against individual constituent models but omit comparisons to some latest existing ensemble methods.

Response. At this point, we are not targeting a comparison against existing methods, but rather proposing a method for ensuring guarantees for a specific combination model. We have clarified this point in lines 152-158.

Comment 4. Figure 1 (example aggregation) is referenced but not included in the provided content.

Response. Thank you for the comment. We have now emphasized the reference to Figure 1 in line 80.

Comment 5. The authors should follow up on the latest research progress.

Response. Thank you for this recommendation. Indeed, we have followed recent contributions in the research areas, as presented in the related work section.

Comment 6. Tables 2 and 3 report "FFR" (likely a typo for FPR) and lack units or confidence intervals.

Response. The typo was fixed, thank you. FPR is a rate and therefore has no units. Moreover, the comparison is empirical at this point, so no confidence intervals are shown. We have included an explanation of this fact in lines 155-158 and in lines 563-570.

Comment 7. Results in Tables 2 and 3 report empirical FPR/FNR reductions (e.g., "50% reduction") but lack statistical tests (e.g., p-values) to validate these claims.

Response. Thank you for this comment. At this point, the comparison is empirical, so no confidence intervals or statistical tests are shown. We have included an explanation of this fact in lines 155-158 and in lines 563-570.

Comment 8. References [13,18,40] are authored by the same group, risking overemphasis on in-group work.

Response. Thank you for this observation. We have removed one of those references. The two remaining are important to highlight specific contributions.

Reviewer 3 Report

Comments and Suggestions for Authors

In Section 3.1 Settings: The explanation for the selection of the value ranges for λ1 and λ2 is insufficient. It is recommended to provide a more detailed justification.

In Section 4 Experimental Settings: The experiments are based solely on the PolypGen and Jun Cheng's brain tumor datasets. The lack of dataset diversity may limit the generalizability of the research findings. It is suggested to include a more diverse set of datasets. In addition, a deeper analysis and summary of the experimental results is also recommended.

Author Response

We thank the reviewers for their valuable revision of our manuscript. We have attended to their comments, which have helped improve the presentation of our work.

Reviewer 3

Comment 1. In Section 3.1 Settings: The explanation for the selection of the value ranges for λ1 and λ2 is insufficient. It is recommended to provide a more detailed justification.

Response. Thank you for your observation. We agree that such an explanation was unclear. We have provided the corresponding details in lines 400-403 and in 422-427.

Comment 2. In Section 4 Experimental Settings: The experiments are based solely on the PolypGen and Jun Cheng's brain tumor datasets. The lack of dataset diversity may limit the generalizability of the research findings. It is suggested to include a more diverse set of datasets. In addition, a deeper analysis and summary of the experimental results is also recommended.

Response. Thank you for this comment. We showcase our proposed method on two challenging datasets. We acknowledge in the conclusion section that this method can generalize to other datasets, and that such an investigation must be regarded as future work. Additionally, we included a deeper analysis of our current results.
