Article

Aggregating Image Segmentation Predictions with Probabilistic Risk Control Guarantees

by Joaquin Alvarez and Edgar Roman-Rangel *
Department of Computer Science, Instituto Tecnológico Autónomo de México, Mexico City 01080, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1711; https://doi.org/10.3390/math13111711
Submission received: 4 April 2025 / Revised: 13 May 2025 / Accepted: 21 May 2025 / Published: 23 May 2025
(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)

Abstract

In this work, we introduce a framework to combine arbitrary image segmentation algorithms from different agents under data privacy constraints and produce an aggregated prediction set satisfying finite-sample risk control guarantees. We leverage distribution-free uncertainty quantification techniques to aggregate deep neural networks for image segmentation tasks. Our method can be applied to merge the predictions of multiple agents whose prediction sets are arbitrarily dependent. Moreover, we perform experiments in medical imaging tasks to illustrate the proposed framework. Our results show that, relative to any of the constituent models, the framework reduced the empirical false positive rate by 50% without compromising the false negative rate.

1. Introduction

Novel medical AI tools have introduced several challenges when adopted in practice. Very commonly, research studies and predictive algorithms are built under different methodologies and assumptions, and their performance is assessed on different population demographics [1]. Furthermore, comparisons among predictive algorithms may not be straightforward, given that in various prediction tasks there is no single measure that evaluates all the relevant properties of a model [2]. This situation is exacerbated by the inherent complexity of many of these predictive algorithms, which may ultimately be treated as black-box models. In turn, clinicians are left with the task of figuring out, amid a complex landscape of alternatives and factors, which algorithm is likely to perform best for their target patient population and when to deploy it in practice.
In the context of our work, we address how to combine arbitrarily dependent image segmentation predictions in a black-box manner with risk control guarantees that are satisfied with high probability. Suppose that we have two agents (for example, two hospitals), each with their own image segmentation algorithms for the same task, e.g., polyp segmentation, or brain tumor segmentation from an MRI scan. The agents may have built their own predictive model under broadly different considerations, for example, using different datasets (based on different patient population demographics), using different architectures, or even solving a different optimization problem during training. It may happen that the agents trained their models with some overlapping data, for example, public data, gold-standard data or data from common patients between the two hospitals. These agents would like to combine the predictions from their algorithms without the need to retrain a new model, but without sharing their own data due to privacy constraints (e.g., competitive or regulatory reasons, or patients’ consent restrictions). For example, this is motivated by possible incentives to collaborate in order to make predictions on some marginalized target population, whose distribution may be different from the distribution of the training dataset that each agent used to train their own models. Another possible motivation is that the two agents benefit from making a merged prediction, for example, many hospitals share the same patients, so making a shared diagnosis can provide better clinical results than separate diagnoses [1,3,4,5,6].
Deep neural networks have become a powerful tool for decision-making in medical settings, although historically they have been overlooked by most practitioners and clinicians, partly because of a lack of trust in these tools [7]. Recently, there has been remarkable progress toward changing this situation, due to successful efforts to build rigorous statistical guarantees for these models [8,9,10,11]. In this work, we propose a framework that contributes to this recent progress in statistical uncertainty quantification in AI. Despite the impressive performance of deep learning algorithms in medical imaging tasks, uncertainty quantification is crucial for deploying them safely in high-stakes scenarios, rather than blindly trusting them without a valid notion of the uncertainty of their predictions. Moreover, feature visualizations of the properties and behavior of neural networks may be insufficient to guarantee their reliability [12]. Conformal prediction and its related frameworks [13,14,15,16] have proven to be versatile in practice for uncertainty quantification. They work for a diverse spectrum of predictive algorithms, making few assumptions and offering desirable properties. For example, some of them are adaptive, in that they report higher uncertainty for harder predictions. Moreover, these frameworks are typically distribution-free, meaning that they do not require distributional assumptions on the data.
In this work, we leverage recent distribution-free uncertainty quantification frameworks [13,17] to ensemble deep neural networks for medical image segmentation tasks. Ensembling refers to combining the predictions of multiple models with the hope of obtaining a better model (in terms of a loss function, for example) than any of the constituent single models. Specifically, the term deep ensembles refers to aggregating individual deep neural networks. There is a variety of ways to combine deep neural networks; two of the most common involve averaging predictions or building majority voting systems [18,19,20,21,22]. Furthermore, averaging itself can be carried out in several ways for deep ensembles, for example, by averaging either the logits or the probabilities of the constituent models [20].
In this work, we introduce an ensemble framework based on the weighted average of deep neural networks in a pool-then-calibrate fashion similar to the one described in [18], but using different post-processing approaches based on distribution-free uncertainty quantification. The pool-then-calibrate strategy refers to post-processing deep ensembles after they have been trained separately and aggregated. It has shown promising results in low-data regimes and on out-of-distribution (OOD) samples [18]. Moreover, we also consider weighted averaging of the predictions and do not limit ourselves to plain arithmetic means. Our work is especially focused on small ensembles, allowing us to understand the prediction of the ensemble visually in terms of the predictions of the constituent models. For example, consider Figure 1, where the prediction task is to segment a colonoscopy image to predict tumor pixels (labeled with white pixels). The prediction of the ensemble can be visually analyzed in terms of the predictions of the constituent deep neural networks. NN1 produces substantially more false positives (blue pixels) and fewer false negatives (red region) than NN2, which produces more false negatives and almost no false positives. The ensemble produces fewer false negatives than either constituent model and fewer false positives than the mask predicted by NN1.
We are motivated by the question of how to aggregate arbitrary image segmentation deep models with statistical guarantees on some relevant risk that we aim to control (e.g., the expected false negative rate), while obtaining an ensemble that outperforms any constituent model on another relevant risk that trades off with the controlled one (e.g., the false positive rate). We make the following contributions:
  • We provide finite-sample risk control guarantees for ensembles using distribution-free uncertainty quantification, leveraging health information exchanges between medical agents to address the practical considerations in healthcare: shared patients between medical institutions, as well as data privacy restrictions.
  • Our approach provides statistical guarantees even when the distribution of the private dataset of each agent is different from that of the target population for the ensemble.
  • To our knowledge, this is the first work to use the Learn then Test (LTT) calibration framework to merge prediction sets in medical AI.
The rest of this work is organized as follows: In Section 2 we review the literature related to our work. In Section 3 we present our setup, with details of our proposed calibration approach to aggregate image segmentation prediction sets under data privacy constraints. In Section 4 we present the experimental setting, with two case studies of our proposed framework to predict (a) polyps from colonoscopy images [23] and (b) brain tumors from MRI scans [24]. In Section 5 we present the results of our experiments, and Section 6 contains the conclusions of this work.

2. Related Work

As related work in medical image segmentation with ensembles, ref. [25] introduces a useful algorithm to favor diversity when creating an ensemble model by aggregating models with low-correlated predictions at each iteration. In our experiments we use U-Net architectures [26] similar to theirs. Their work uses heterogeneous constituent models, whereas we consider homogeneous ensembles in our experiments; homogeneous ensembles are those in which the members of the ensemble share the same architecture. On the one hand, ref. [25] considers bigger ensemble sizes than the ones we study in this work, but uses uniform averages (arithmetic means). In our work, we allow asymmetric allocations to the predictions of the members of the ensemble. These allocations are weights that represent the contribution of each model to the overall prediction of the ensemble. We should highlight that our method works in a black-box manner, so it is also valid when dealing with heterogeneous ensembles.
Previous work has studied the benefits of ensemble learning to improve and stabilize the performance of neural networks in the specific context of polyp segmentation [27,28]. Our approach does not offer adaptability by incorporating complementary high-level semantic features as in previous literature [27], but it offers distribution-free risk control guarantees, which we consider a remarkable distinction of our work compared to previous literature on dual ensemble neural networks for image segmentation. On a similar basis, refs. [29,30] also study dual ensemble approaches to segmentation in medical contexts, but do not incorporate any notion of black-box, distribution-free uncertainty quantification associated with the improved benefits obtained with their ensemble approach. On the other hand, comparing our approach with the celebrated Learn to Threshold framework [31], we can also point out that it does not consider distribution-free guarantees when thresholding the so-called predicted likelihood map to build prediction masks.
Our method is different from other federated learning (FL) frameworks to aggregate brain tumor segmentation predictions [32,33]. We contribute a framework that accounts for the uncertainty quantification even when each agent trained an algorithm with data from different population demographics, without the need to retrain.
Another relevant framework that also does not require model retraining or data pooling is [34]. Their work introduces a DeGroot consensus mechanism for aggregating the agents' predictions, though it differs from ours: our focus is on finite-sample statistical guarantees and experiments in computer vision, whereas the theory in [34] is asymptotic in the sample size, with experiments outside the scope of our work.
Recent work has introduced ensemble approaches for conformal methods [22,35,36]. Our work is inspired by such approaches, building on novel applications for medical imaging using the LTT framework [13]. On the other hand, distribution-free uncertainty quantification frameworks such as [14,16,17] have considered the problem of polyp segmentation to illustrate or motivate their approaches; however, the current state of the literature has not yet considered an aggregation approach of polyp segmentation models and brain tumor prediction using distribution-free uncertainty quantification. With this current state of the literature, we believe that our work can illustrate and potentially motivate further research directions that remain unexplored using ensembles with uncertainty quantification for image segmentation tasks.
We consider the performance of image segmentation algorithms to be outside the scope of our work. Instead, our contribution focuses on a framework to combine such high-performance black-box models to obtain aggregated predictions through an ensemble, with special emphasis on finite-sample risk control guarantees. We also highlight that this method has limitations when used as a model selection tool after calibration, because the selection criteria are based on empirical risk minimization rather than on provable risk control that quantifies the improved performance of the ensemble.

3. Calibrating Deep Ensembles with Statistical Guarantees

In this section, we first formalize the setup of the problem we address. Then, we summarize the Learn Then Test (LTT) framework adapted to the setup that we explore, and finally, we hybridize the LTT framework to ensemble deep learning models in image segmentation tasks.
Rather than having to make a hard choice of one pretrained deep learning model over another, in this work we consider combinations of pretrained predictive algorithms, so that a user can perform model selection among a class of ensemble neural networks. Our approach conveniently allows the aggregation of different predictive algorithms, which may have diverse (and sometimes unknown) properties. For example, one black box may tend to overestimate the number of pixels with tumors in many instances (a more conservative model in terms of false negative rate control), while another black box may tend to underestimate the number of pixels with a polyp. In turn, aggregating such black-box models can allow us to make the former less conservative and the latter more conservative when combining their predictions. This approach is motivated by health information exchange between hospitals that share patients, so that a shared diagnosis improves predictions in terms of the trade-off between false negatives and false positives.
We formalize the description of our problem and proceed to introduce our approach to solving it. We begin by describing the setup.

3.1. Setup

Formally, we have a random sample that constitutes our calibration dataset $\mathcal{D}_{\mathrm{cal}} := \{(X_i, Y_i)\}_{i=1}^{n}$ from a distribution $P_{X,Y}$ on the space $\mathcal{X} \times \mathcal{Y}$. For example, $X_i$ can be a colonoscopy image and $Y_i$ a binary mask (encoded through indices of a matrix that represents the location of the pixels in the image) that identifies the pixels where there is a polyp/tumor within the image. Consider two pretrained neural networks $\hat{f}_k : \mathcal{X} \to [0,1]^{R \times N}$, for $k = 1, 2$, which make per-pixel binary predictions in a segmentation setting and which we use to create prediction masks. These neural networks take as input an image in $\mathcal{X}$ (where $\mathcal{X} \subseteq \mathbb{R}^{R \times N \times 3}$ if the image is represented in an RGB system, or $\mathcal{X} \subseteq \mathbb{R}^{R \times N}$ if the image is black and white) and output scores representing predicted probabilities for the category labeled 1 at each pixel of the input image, so that $Y_i \in \mathcal{Y} := 2^{\{1,2,\dots,R\} \times \{1,2,\dots,N\}}$ is the set containing the pixel indices where the image $X_i$ has label 1, where $2^{\{1,2,\dots,R\} \times \{1,2,\dots,N\}}$ denotes the power set of $\{1,2,\dots,R\} \times \{1,2,\dots,N\}$. We introduce a hyperparameter to ensemble the likelihood maps through a convex combination of the neural networks, and a threshold hyperparameter to obtain prediction sets, both formally defined next. We denote the ensemble model by
$$\tilde{f}_{\lambda_1} := \lambda_1 \hat{f}_1 + (1 - \lambda_1)\, \hat{f}_2, \qquad \lambda_1 \in [0,1]. \tag{1}$$
Note that the ensemble $\tilde{f}_{\lambda_1}$ is a weighted average of the models $\hat{f}_1$ and $\hat{f}_2$, allocating a weight of $\lambda_1$ to $\hat{f}_1$ and $1 - \lambda_1$ to $\hat{f}_2$.
In order to build prediction masks, we introduce a hyperparameter $\lambda_2 \in [0,1]$ to threshold the likelihood map. We define $\lambda := (\lambda_1, \lambda_2)$ and denote the corresponding hyperparameter space by $\Lambda := [0,1] \times [0,1]$.
Consider an arbitrary pair $(X_s, Y_s) \in \mathcal{X} \times \mathcal{Y}$ sampled from $P_{X,Y}$. Our ensemble prediction set is given by the following:
$$T_\lambda(X_s) := \left\{ (i,j) \in \{1,2,\dots,R\} \times \{1,2,\dots,N\} : \tilde{f}_{\lambda_1}(X_s)_{(i,j)} \geq 1 - \lambda_2 \right\}, \tag{2}$$
where $\tilde{f}_{\lambda_1}(X_s)_{(i,j)}$ represents the entry $(i,j)$ of the matrix $\tilde{f}_{\lambda_1}(X_s)$.
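As a concrete illustration of (1) and (2), the following minimal NumPy sketch pools two per-pixel likelihood maps and thresholds the result; the array names (f1_probs, f2_probs) and their shapes are our own assumptions for illustration, not part of the original formulation.

```python
import numpy as np

def ensemble_prediction_set(f1_probs, f2_probs, lam1, lam2):
    """Prediction mask T_lambda(X): pool two (R, N) likelihood maps with weight
    lam1 (Equation (1)) and keep pixels whose pooled score is >= 1 - lam2
    (Equation (2)). Returns a boolean (R, N) mask."""
    pooled = lam1 * f1_probs + (1.0 - lam1) * f2_probs  # convex combination
    return pooled >= 1.0 - lam2
```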
In order to motivate that our proposed calibration approach can be adapted to other ensemble schemes, we also consider a formulation with arbitrary image segmentation algorithms $C_k$, $k = 1, \dots, K$. In particular, $C_k$ may be built from some model $\hat{f}_k$ with threshold $\lambda_{2,k} \in (0,1)$, for each $k = 1, \dots, K$, leading to prediction sets of the form
$$C_k(X_s) := \left\{ (i,j) \in \{1,2,\dots,R\} \times \{1,2,\dots,N\} : \hat{f}_k(X_s)_{(i,j)} \geq 1 - \lambda_{2,k} \right\}. \tag{3}$$
In this work we take $\lambda_{2,k} = \lambda_2$, the same threshold for the individual prediction set of each agent, and obtain a merged weighted-average set predictor of the form
$$C^{W}_{\lambda}(X_s) := \left\{ (i,j) : \sum_{k=1}^{K} w_k \, \mathbb{1}\{(i,j) \in C_k(X_s)\} > \lambda_1 \right\}, \tag{4}$$
where $\lambda_1 \in (0,1)$ and $w_k \in [0,1]$ for all $k \in \{1,\dots,K\}$, with $\sum_{k=1}^{K} w_k = 1$. In this way, $\sum_{k=1}^{K} w_k \, \mathbb{1}\{(i,j) \in C_k(X_s)\}$ represents the collective vote that determines whether the pixel $(i,j)$ should be included in the prediction set. With this particular setup for generalized majority vote prediction sets, the calibration is also carried out over a hyperparameter space $\Lambda := [0,1] \times [0,1]$. In this formulation, our prediction set is denoted by $C^{W}_{\lambda_1,\lambda_2}(X_s)$ to emphasize the dependence on the two hyperparameters.
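The weighted vote in (4) can likewise be sketched in a few lines; this is a minimal illustration under our own naming assumptions (member_masks as boolean arrays, one per agent), not the exact implementation used in our experiments.

```python
def weighted_majority_vote_set(member_masks, weights, lam1):
    """C^W_lambda(X): include pixel (i, j) when the weighted vote of the K
    individual prediction masks exceeds lam1 (Equation (4)).
    member_masks: list of K boolean (R, N) arrays; weights: K nonnegative
    numbers summing to 1."""
    vote = sum(w * m.astype(float) for w, m in zip(weights, member_masks))
    return vote > lam1
```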
Our work leverages the Learn Then Test (LTT) framework to provide statistical guarantees with marginal coverage of the form
$$\mathbb{P}\Big( \mathbb{E}\big[\ell(S_\lambda(X_s), Y_s) \mid \mathcal{D}_{\mathrm{cal}}\big] \leq \alpha \Big) \geq 1 - \delta, \tag{5}$$
for any user-specified values of $\alpha, \delta \in (0,1)$, where $S_\lambda(X_s) = T_\lambda(X_s)$ in the case of formulation (2) and $S_\lambda(X_s) = C^{W}_{\lambda_1,\lambda_2}(X_s)$ in formulation (4), for a given loss $\ell(S_\lambda(X_s), Y_s)$ bounded in $[0,1]$, e.g., the false negative rate, which we define next. The probability is computed over the calibration dataset, which we use to calibrate $\lambda$ accordingly. By marginal guarantees, we mean that the stated statistical coverage at level $1 - \delta$ is achieved on average over the calibration dataset, as opposed to conditional coverage, which would allow us to obtain risk control guarantees for specific image instances.
We point the reader to an excellent reference that addresses how to choose $\alpha$ in a rigorous data-driven way [37], but we consider such a data-driven choice of $\alpha$ to be outside the scope of our work. It is important to be thoughtful in the choice of the bound $\alpha \in (0,1)$, especially to account for data dependence, which typically arises in practice when accounting for risk–reward tradeoffs. Naïve approaches can introduce biases that violate the kind of statistical guarantees that we seek for our deep ensembles.
For calibrating, we consider the false negative rate (FNR):
$$\ell(S_\lambda(X_s), Y_s) := 1 - \frac{|S_\lambda(X_s) \cap Y_s|}{|Y_s|}, \tag{6}$$
where $|\cdot|$ denotes the number of elements in a set and $\cap$ denotes the intersection of sets. We can also monitor the IoU, binary cross-entropy, false positive proportion, etc. We focus on the FNR because it is a loss function bounded in $[0,1]$, so we can provide finite-sample statistical guarantees by leveraging valid p-values derived from concentration inequalities for bounded random variables, which we use to calibrate the models via hypothesis testing; the details are provided in the next subsection. Moreover, this metric penalizes not including label-1 pixels in the prediction set.
In the case in which $Y_s = \emptyset$ (namely, $|Y_s| = 0$, i.e., when the image does not contain pixels labeled with a 1), we set the FNR to zero by convention. In our post-processing step, we calibrate based on the FNR, but we also monitor other loss functions for informative and comparison purposes. We define the false positive rate (FPR) on a given $(X_s, Y_s)$ pair, when using a prediction set $S_\lambda(X_s)$, as
$$\mathrm{FPR} := 1 - \frac{|S_\lambda(X_s)^c \cap Y_s^c|}{|Y_s^c|}, \tag{7}$$
which penalizes including pixels in the prediction set whose true label is a 0 (thus, smaller values are better).
The complement of $S_\lambda(X_s)$, denoted by $S_\lambda(X_s)^c$, is defined as $S_\lambda(X_s)^c := \{1,\dots,R\} \times \{1,\dots,N\} \setminus S_\lambda(X_s)$; analogously, we take $Y_s^c := \{(i,j) \in \{1,\dots,R\} \times \{1,\dots,N\} : (i,j) \notin Y_s\}$.
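The two losses in (6) and (7) translate directly into code on boolean masks; the sketch below follows the FNR convention stated above, while the analogous convention for images with no label-0 pixels is our own assumption, not stated in the text.

```python
import numpy as np

def fnr(pred_mask, true_mask):
    """False negative rate (Equation (6)): fraction of label-1 pixels that the
    prediction set misses. Returns 0 when the image has no label-1 pixels."""
    n_pos = true_mask.sum()
    if n_pos == 0:
        return 0.0
    return 1.0 - np.logical_and(pred_mask, true_mask).sum() / n_pos

def fpr(pred_mask, true_mask):
    """False positive rate (Equation (7)): fraction of label-0 pixels included
    in the prediction set. The zero-negative-pixel convention is an assumption."""
    n_neg = np.logical_not(true_mask).sum()
    if n_neg == 0:
        return 0.0
    return 1.0 - np.logical_and(~pred_mask, ~true_mask).sum() / n_neg
```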
Next, we discuss the approach that we take to calibrate the hyperparameters $\lambda \in [0,1] \times [0,1]$ in such a way that we satisfy guarantees of the form of (5).

3.2. The Learn Then Test Framework

We give a brief description of the Learn then Test (LTT) framework [13] adapted to our work. We proceed by discretizing the hyperparameter space $\Lambda$ into a finite set $\tilde{\Lambda} \subset \Lambda$. Specifically, we discretize $\Lambda$ into a set of $(M+1)(Q-1)$ pairs and take
$$\tilde{\Lambda} := \tilde{\Lambda}_1 \times \tilde{\Lambda}_2 := \left\{ \tfrac{m}{M} : m \in \{0,1,\dots,M\} \right\} \times \left\{ \tfrac{q}{Q} : q \in \{1,\dots,Q-1\} \right\} \subset \Lambda. \tag{8}$$
In the case of a prediction set of the form (2), the number of ensemble models that we obtain is $M+1$. Each of these models allocates different weights to the constituent members of the ensemble. For each of these $M+1$ ensemble neural networks, we consider $Q-1$ thresholds to make binary classifications at each pixel. In the case of majority vote prediction sets of the form (4), $\tilde{\Lambda}_1$ represents the consensus thresholds used to obtain a majority vote; thus, the discretization $\tilde{\Lambda}_1$ should capture the points at which the aggregated vote jumps. So, for example, if $w_k = \tfrac{1}{K}$, then we consider $\tilde{\Lambda}_1 = \{\tfrac{1}{K}, \tfrac{2}{K}, \dots, 1 - \tfrac{1}{K}\}$. $\tilde{\Lambda}_2$ contains the thresholds that we explore to obtain binary masks with each of the members of the prediction set, analogously to the prediction set (2).
We adopt the notation $\tilde{\Lambda} = \{\lambda_{i,j} : i \in \{0,\dots,M\} \text{ and } j \in \{1,\dots,Q-1\}\}$, where $\lambda_{i,j} := (\tfrac{i}{M}, \tfrac{j}{Q})$ for each $i \in \{0,\dots,M\}$ and each $j \in \{1,\dots,Q-1\}$. We obtain valid p-values (formally defined next; for example, we may consider those provided by [13] or an alternative [38], see Appendix A for more details) for each null hypothesis $H_{0,\lambda_{i,j}} : \mathbb{E}[\ell(S_{\lambda_{i,j}}(X_s), Y_s)] > \alpha$. Namely, this null hypothesis states that the risk is not controlled under the hyperparameter pair $\lambda_{i,j}$ with the set predictor $S_{\lambda_{i,j}}$.
Definition 1 (Valid p-value).
Given a null hypothesis $H_0$ in a hypothesis testing problem, a valid p-value (or super-uniform p-value), $\hat{p}$, is a random variable satisfying
$$\mathbb{P}(\hat{p} \leq u \mid H_0) \leq u, \quad \text{for every } u \in [0,1]. \tag{9}$$
That is, under the null hypothesis, the cumulative distribution function of $\hat{p}$ is bounded above by the cumulative distribution function of a uniform random variable on $[0,1]$.
Smaller observations of a valid p-value, $\hat{p}$, represent stronger statistical evidence against the null hypothesis. See Appendix A for the explicit closed-form valid p-values that we use in this work. We need valid p-values as part of LTT to provide them as input to an FWER-controlling algorithm in a multiple testing setting.
Definition 2 (FWER-controlling algorithm).
Given a collection of null hypotheses $\{H_t\}_{t=1}^{T}$ with corresponding valid p-values $p_t$, $t = 1, \dots, T$, an FWER-controlling algorithm at level $\delta \in (0,1)$, $\mathcal{A} = \mathcal{A}(p_1, \dots, p_T; \delta)$, is a procedure to decide on each $H_t$ satisfying
$$\mathbb{P}\big( \mathcal{A}(p_1, \dots, p_T; \delta) \cap \mathcal{I} \neq \emptyset \big) \leq \delta, \tag{10}$$
where $\mathcal{A}(p_1, \dots, p_T; \delta)$ denotes the set of indices whose null hypothesis was rejected by the procedure and $\mathcal{I} := \{i \in \{1,\dots,T\} \mid H_i \text{ is true}\}$. That is, the probability that the procedure produces at least one false positive is less than or equal to $\delta$.
We will make use of one particular FWER-controlling algorithm which we present next.
The order in which we arrange the null hypotheses when implementing Algorithm 1 is fundamental. The algorithm begins by deciding whether to reject $H_1$ by comparing its p-value to $\delta$, and iterates over the index, comparing each valid p-value with $\delta$, until it exits the while loop for one of two reasons: either a valid p-value exceeds $\delta$, or the procedure finishes by rejecting all the null hypotheses. Thus, the ordering should reflect the user's beliefs about the null hypotheses, placing the most promising hypothesis to be rejected first.
Algorithm 1 Fixed Sequence Algorithm
Inputs: $\{H_k\}_{k=1}^{m}$ a collection of ordered null hypotheses to test, $\{p_k\}_{k=1}^{m}$ the corresponding valid p-values of the nulls, $\delta \in (0,1)$ a level at which to control the FWER.
Output: $\mathcal{O}$, the set of rejected null hypotheses.
1: $j \leftarrow 1$        ▹ initialize the index to check the nulls, beginning from the most promising
2: $\mathcal{O} \leftarrow \emptyset$        ▹ initialize the set of rejected nulls
3: while $p_j \leq \delta$ do
4:     $\mathcal{O} \leftarrow \mathcal{O} \cup \{j\}$        ▹ $H_j$ gets rejected
5:     if $j = m$ then
6:         break
7:     end if
8:     $j \leftarrow j + 1$
9: end while
10: return $\mathcal{O}$
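For readers who prefer code, the following short Python sketch mirrors Algorithm 1; the function name and the list-based interface are our own choices for illustration.

```python
def fixed_sequence_test(p_values, delta):
    """Fixed sequence testing (Algorithm 1): scan the ordered p-values (most
    promising null first) and reject while p_j <= delta; stop at the first
    p-value that exceeds delta. Returns the indices of rejected nulls."""
    rejected = []
    for j, p in enumerate(p_values):
        if p > delta:
            break
        rejected.append(j)
    return rejected
```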
Proposition 1.
The Fixed Sequence algorithm controls the FWER at level δ.
See [13] for a proof.
Next, we present the key property of LTT, which formalizes the finite-sample risk control guarantee that we leverage to aggregate image segmentation algorithms. This guarantee is based on implementing an FWER-controlling algorithm.
Theorem 1 (Theorem 1 from LTT [13]).
Let $p_k$ be a valid p-value for each $H_{0,\lambda_k}$. Let $\alpha \in (0,1)$ and let $\delta \in (0,1)$ be an error budget. If $\Gamma := \mathcal{A}(\{p_k\}; \delta)$ denotes the corresponding values of $\lambda \in \tilde{\Lambda}$ rejected by some FWER-controlling algorithm at level $\delta$, then
$$\mathbb{P}\Big( \sup_{\lambda \in \Gamma} \mathbb{E}[\ell(S_\lambda(X), Y)] \leq \alpha \Big) \geq 1 - \delta, \tag{11}$$
where we define the supremum over the empty set as $-\infty$ if the procedure did not reject any null hypothesis.
A natural question that arises next is which FWER-controlling algorithm to implement for the problem at hand. Here we consider some alternatives: one is a classical Bonferroni correction, and another, more powerful approach is Holm's procedure. These are quite versatile approaches that require almost no assumptions. However, we would like to consider approaches that leverage the fact that the risk $R(\lambda_1, \lambda_2) := \mathbb{E}[\ell(S_\lambda(X), Y)]$ is non-increasing in $\lambda_2$ for any fixed value of $\lambda_1$. Indeed, note that for any given $(X_s, Y_s) \sim P_{X,Y}$ and any fixed $\lambda_1 \in (0,1)$, the prediction set $S_\lambda(X_s)$ satisfies
$$\lambda_2' > \lambda_2 \;\Longrightarrow\; S_{\lambda_1, \lambda_2}(X_s) \subseteq S_{\lambda_1, \lambda_2'}(X_s). \tag{12}$$
Hence, the false negative rate is non-increasing in $\lambda_2$ for any fixed value of $\lambda_1$. Note that (12) holds both for $S_\lambda(X_s) = C^{W}_\lambda(X_s) = \{(i,j) : \sum_{k=1}^{K} w_k \mathbb{1}\{(i,j) \in C_k(X_s)\} > \lambda_1\}$ and for $S_\lambda(X_s) = T_\lambda(X_s) := \{(i,j) \in \{1,2,\dots,R\} \times \{1,2,\dots,N\} : \tilde{f}_{\lambda_1}(X_s)_{(i,j)} \geq 1 - \lambda_2\}$. Hence, since the false negative rate is monotone in $\lambda_2$ for fixed values of $\lambda_1$, this monotonicity is preserved in expectation (namely, for the risk), as illustrated in Figure 2.
Note that in Figure 2, all the trajectories decrease in $\lambda_2$ for any fixed value of $\lambda_1$. We will leverage this partial knowledge about the properties of the risk function. This is crucial for setting up the problem formulation in a pool-then-calibrate manner later on, and it is the key idea that we use to hybridize the LTT framework with the fixed sequence procedure and a union bound. The idea is to implement LTT on a discretized subset of the ensemble hyperparameter space, leveraging the fact that the risk decreases in $\lambda_2$ for any fixed value of $\lambda_1$, and assigning a homogeneous risk budget $\delta/(M+1)$ to each trajectory.
When thinking about this problem in terms of the corresponding graph in the multiple testing setting, as described by a general graphical procedure [39], we notice that we have no prior knowledge suggesting a graph design that is obviously optimal in terms of power, i.e., in terms of maximizing the number of rejected null hypotheses. However, at a sub-graph level, we do identify structure: for any fixed value of $\lambda_1$, the corresponding super-uniform p-values will be non-increasing in $\lambda_2$ almost surely. This happens because of property (12). Still, considering the whole graph, there is no trivial way of designing it completely from prior knowledge. An ingenious approach to remedy this problem is presented in Appendix D of [13], namely the idea of learning the graph, that is, implementing Split Fixed Sequence Testing. However, that approach requires performing an additional split of the calibration dataset. It is important to keep in mind that in a low-data regime, an additional split may not be affordable. Low-data regimes are common in many healthcare contexts, and this work is no exception: medical imaging technologies are usually expensive, and large-scale studies are unusual, making large datasets the exception rather than the norm.
Hence, our proposal is to use a union bound (which can be thought of as a Bonferroni correction at the confidence level), described in the following subsection, which also comes at a cost. Both approaches pay a price in terms of power, and to the best of our knowledge, there is no theoretically optimal split size for learning the graph. On the one hand, with our union-bound approach we must reduce the budget used for the tests. On the other hand, Split Fixed Sequence Testing requires performing an additional split, which means that less data is available for the calibration procedure, something quite undesirable if we already have a small calibration dataset to begin with.

3.3. Local Calibration with Global Statistical Guarantees

Our proposal is to reallocate the risk budget δ by applying LTT “locally” to obtain global statistical guarantees. In a sense, the approach can be thought of as a model selection procedure implementing LTT along with the fixed sequence procedure, leveraging the structure of the problem that we want to solve.
We partition the discretized hyperparameter space:
$$\tilde{\Lambda} = \bigcup_{m=0}^{M} \Phi_m, \tag{13}$$
where $\Phi_m := \{m/M\} \times \{q/Q : q \in \{1,\dots,Q-1\}\}$ for each $m \in \{0,\dots,M\}$; that is, $\Phi_m$ is a subset of $\mathbb{R}^2$ consisting of points that fix an ensemble hyperparameter (the consensus hyperparameter in the case of prediction sets (4)) and include all the threshold hyperparameters of the discretized hyperparameter space. Thus, $\Phi_i \cap \Phi_j = \emptyset$ for every $i \neq j$. Let $\Gamma_k$ be the set of hyperparameters in $\Phi_k$ for which an FWER-controlling algorithm at level $\delta/(M+1)$ rejected the corresponding null hypotheses. Naturally, this can be the fixed sequence testing procedure, Algorithm 1, with a convenient ordering of the null hypotheses that leverages the monotonicity of the loss function over the values of each $\Phi_k$. See Appendix A for pseudocode of the overall calibration method that we use; it is based on the fixed sequence procedure presented in Algorithm 1.
We make use of Algorithm 1 for each ensemble that is explored to obtain global statistical guarantees, as formalized below.
Proposition 2.
Let $\{\Phi_k\}_{k=0}^{M}$ be a partition of $\tilde{\Lambda}$ of the form introduced above. Let $\delta \in (0,1)$. Let $\Gamma_k$ be the set of hyperparameters in $\Phi_k$ for which an FWER-controlling algorithm at level $\delta/(M+1)$ rejected the corresponding null hypotheses, implementing LTT in $\Phi_k$ for each $k \in \{0,\dots,M\}$. Define $\Gamma := \bigcup_{k=0}^{M} \Gamma_k$. Then,
$$\mathbb{P}\Big( \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] \leq \alpha \ \text{ for every } \lambda \in \Gamma \Big) \geq 1 - \delta, \tag{14}$$
where we define the risk over the empty set as $-\infty$ in the degenerate case in which the procedure does not reject any null hypothesis.
Proof. 
We make use of De Morgan's laws and the finite version of Boole's inequality:
$$\begin{aligned}
\mathbb{P}\Big( \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] \leq \alpha \ \ \forall \lambda \in \Gamma \Big)
&= \mathbb{P}\Big( \bigcap_{k=0}^{M} \Big\{ \sup_{\lambda \in \Gamma_k} \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] \leq \alpha \Big\} \Big) \\
&= 1 - \mathbb{P}\Big( \bigcup_{k=0}^{M} \Big\{ \sup_{\lambda \in \Gamma_k} \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] > \alpha \Big\} \Big) \\
&\geq 1 - \sum_{k=0}^{M} \mathbb{P}\Big( \sup_{\lambda \in \Gamma_k} \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] > \alpha \Big) \\
&= -M + \sum_{k=0}^{M} \mathbb{P}\Big( \sup_{\lambda \in \Gamma_k} \mathbb{E}[\ell(S_\lambda(X_s), Y_s)] \leq \alpha \Big) \\
&\overset{(a)}{\geq} -M + \sum_{k=0}^{M} \Big( 1 - \frac{\delta}{M+1} \Big) = 1 - \delta,
\end{aligned}$$
where step (a) follows from Theorem 1 of [13], with the convention that the risk equals $-\infty$ when the involved sets are empty, i.e., the supremum over the empty set is defined as $-\infty$.    □
More generally, if we have useful prior information about the ensemble structure, we may allocate a different confidence budget to each FWER-controlling algorithm instead of taking $\delta/(M+1)$ for each $k \in \{0,1,\dots,M\}$, choosing $\delta_0, \dots, \delta_M \in (0,1)$ such that $\sum_{k=0}^{M} \delta_k = \delta$.
An observation worth highlighting is that frameworks which require the risk to be a monotone function of a single hyperparameter (e.g., [17]) do not directly apply to the problem we initially considered, since the risk is a function of two hyperparameters, namely $R(\lambda_1, \lambda_2)$. However, given the structure that we identified in the two-dimensional problem, our approach reformulates the original two-dimensional problem as multiple "local" one-dimensional problems in which the risk satisfies the monotonicity requirement. Thus, we could also hybridize our framework with the uniform concentration bounds (UCBs) presented in [17]. Though this is a possible formulation, we consider it to be outside the scope of this work.
Another relevant remark is that our formulation is similar to designing a graph following [39] in which we run a fallback procedure such that the initial error-budget allocations for $H_{0,\lambda_{i,Q-1}}$ are $\delta/(M+1)$ for each $i \in \{0,1,\dots,M\}$, and equal 0 for the rest of the null hypotheses, which correspond to the nodes of the graph. In such a graph, the edge weights are equal to one in the direction of decreasing $\lambda_2$ for fixed values of $\lambda_1$, creating a chain that propagates the error budget across the values of $\lambda_2$ as $\lambda_2$ decreases, and with a weight of 1 for the edges that go from the node $H_{0,\lambda_{i,1}}$ to $H_{0,\lambda_{i+1,Q-1}}$ for each $i \in \{0,1,\dots,M-1\}$; these particular edges connect the ensemble models across the graph. This graph structure can be found in Appendix D of [13]. We highlight that different formulations can yield the same data-driven decisions: in this case, implementing a fixed sequence test with a Bonferroni correction is essentially equivalent to implementing such a fallback procedure with a sequential graphical approach.

4. Experimental Setting

In this section, we present experiments in polyp segmentation and brain tumor segmentation, where we implement the ideas presented in the previous sections. The discretization that we take for $\Lambda$ is
$$\tilde{\Lambda} := \tilde{\Lambda}_1 \times \tilde{\Lambda}_2, \tag{15}$$
with $\tilde{\Lambda}_1 := \{0.00, 0.25, 0.50, 0.75, 1.00\}$ for prediction sets of the form (2), and $\tilde{\Lambda}_1 := \{1/5, 2/5, 3/5, 4/5\}$ for prediction sets of the form (4). For $\tilde{\Lambda}_2$, depending on the context, we take $\tilde{\Lambda}_2 = \{0.01, 0.02, \dots, 0.99\}$ or sometimes an even finer grid (e.g., $\tilde{\Lambda}_2 = \{0.005, 0.010, \dots, 0.995\}$). The motivation behind this discretization is as follows: for $\lambda_1$ we explore a uniform discretization of $[0,1]$ with increments of $0.25$ for the weights of the constituent models, whereas for $\lambda_2$ we explore a finer range of values for the threshold in Equation (2).
For the brain tumor prediction case study, we explore merging prediction sets that generalize a majority vote approach by calibrating the threshold, namely a particular case of (4) allocating uniform weights to the constituent voting members of the ensemble:
$$C^{W}_{\lambda}(X) := \left\{ (i,j) : \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\{(i,j) \in C_k(X)\} > \lambda_1 \right\}, \qquad \lambda_1 \in (0,1). \tag{16}$$
We take $C_k$, $k = 1, \dots, K$, given explicitly by $C_k(X) := \{(i,j) : \hat{f}_{\lambda_{1,k}}(X)_{(i,j)} \geq 1 - \lambda_2\}$. In this way, we obtain different prediction sets by allocating different weights to the ensembles and using the same threshold for each ensemble model. We take $K = 5$ and $\lambda_{1,k} \in \{0.00, 0.25, 0.50, 0.75, 1.00\}$.
More generally, we can allocate different weights to the prediction sets:
$$C^{W}_{\lambda}(X) := \left\{ (i,j) : \sum_{k=1}^{K} w_k \, \mathbb{1}\{(i,j) \in C_k(X)\} > \lambda_1 \right\}, \qquad \lambda \in \Lambda, \tag{17}$$
where $w_k \in [0,1]$ for all $k \in \{1,\dots,K\}$ and $\sum_{k=1}^{K} w_k = 1$. For example, in the case where $\lambda_{1,k} \in \{0.00, 0.25, 0.50, 0.75, 1.00\}$, we can explore non-uniform weights, e.g., $w_1 = 0.10 = w_5$, $w_2 = 0.25 = w_4$ and $w_3 = 0.30$, in which the individual models contribute less weight to the overall prediction, whereas the uniformly weighted ensemble model makes the most representative contribution to the overall prediction set. Figure 3 provides a graphical representation of these non-uniform allocations in the majority vote prediction set for $K = 5$ constituent models. Different allocations can be motivated by the option to incorporate prior knowledge that the user may have about the performance of each model, encoding this prior knowledge in the weights that determine the contribution of each model to the aggregated prediction set.
For the brain tumor segmentation experiments with majority vote prediction sets, our discretization of $\Lambda$ is as follows:
$$\tilde{\Lambda} := \tilde{\Lambda}_1 \times \tilde{\Lambda}_2 := \left\{ \tfrac{1}{5}, \tfrac{2}{5}, \tfrac{3}{5}, \tfrac{4}{5} \right\} \times \{0.01, 0.02, \dots, 0.99\} \subset \Lambda. \tag{18}$$
The discretization of $\tilde{\Lambda}_1$ accounts for the fact that $\frac{1}{5}\sum_{k=1}^{5} \mathbb{1}\{(i,j) \in C_k(X)\}$ takes jumps, attaining values in $\tilde{\Lambda}_1$ as $\lambda_2$ grows, until it eventually reaches 1 when all the prediction sets include the pixel in question. Hence, bigger values of $\lambda_1$ represent a stronger consensus requirement for including a pixel in the majority-vote prediction set.

4.1. Polyp Segmentation Dataset

One of the best recent efforts to build a large-scale, high-quality and comprehensive dataset for polyp segmentation tasks was carried out by [23]. The authors provided a curated dataset called PolypGen, which contains 3762 annotated images from six different medical centers. Moreover, the entire dataset was verified by expert gastroenterologists. As the authors highlighted, they tried to make the dataset as realistic as possible; thus, ambiguous annotations were mostly removed.
PolypGen is rich in itself, in that it incorporates data and information from diverse sources, as well as diagnoses from a diverse population. Furthermore, the colonoscopy imaging was carried out with a diverse collection of endoscopic systems.
After a careful review from expert gastroenterologists, PolypGen’s final labeled dataset consists of 1537 images. Indeed, a big fraction of the initially annotated images were rejected by the experts because they did not consider them to be clinically acceptable. Table 1 shows the decomposition by dataset size for each of the six medical centers. For more details and properties about the dataset, we refer the reader to [23].
Importantly, in terms of pixel classification, we should emphasize that we are dealing with a binary classification with imbalanced classes, which is naturally common for some segmentation problems, in that for any given image, the number of pixels which do not have a polyp exceeds the number of pixels where there is a polyp. See Figure 4 to verify this observation.
Figure 4 shows the unbalanced classes in the calibration dataset. Indeed, a trivial model that predicts no polyp pixels at all would achieve an average accuracy of 0.915 in the calibration dataset. Additionally, we can see in Figure 4 that the unbalanced categories are preserved across the datasets of each of the six medical centers: all of the histograms show a large concentration of images with very few polyp-labeled pixels, a strong positive skew, and an average proportion of polyp-labeled pixels close to 0.1; that is, roughly 10% of the pixels are labeled with a polyp, on average, across all the images.
In order to train our constituent U-Net architectures [26], we performed stratified sampling, splitting the whole dataset into three disjoint subsets: two for training the neural networks (one per network) and one for calibrating the ensemble. The original images have varying pixel sizes; we resized all of them to the same dimensions of 256 × 256 pixels. The U-Net architectures of the constituent models in the ensemble were identical but trained on disjoint subsets from the dataset split; each model has a total of 13,846,273 parameters. The calibration results that we present in the next section for the polyp segmentation experiments were obtained with a total of 465 images. These images were used for the results presented in Table 2.
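As an illustration of this preprocessing, the following sketch resizes image/mask pairs to 256 × 256 and makes a simple random three-way split; the file-list interface is hypothetical, and the split shown is a plain random one rather than the stratified sampling we actually used.

```python
import cv2
import numpy as np

def load_and_resize(image_paths, mask_paths, size=(256, 256)):
    """Resize each image/mask pair to 256 x 256; masks are kept binary."""
    images, masks = [], []
    for img_p, msk_p in zip(image_paths, mask_paths):
        img = cv2.resize(cv2.imread(img_p), size)
        msk = cv2.resize(cv2.imread(msk_p, cv2.IMREAD_GRAYSCALE), size,
                         interpolation=cv2.INTER_NEAREST)
        images.append(img / 255.0)
        masks.append((msk > 127).astype(np.uint8))
    return np.array(images), np.array(masks)

def three_way_split(n, seed=0):
    """Disjoint index sets: train NN1, train NN2, calibration."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    a, b = n // 3, 2 * n // 3
    return idx[:a], idx[a:b], idx[b:]
```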

4.2. Brain Tumor Segmentation Dataset

We use the dataset from Jun Cheng [24]. The dataset consists of 3064 T1-weighted contrast-enhanced labeled images (an MRI modality used to detect tumors) with three kinds of brain tumors, acquired at Nanfang Hospital and Tianjin Medical University General Hospital, China, from 2005 to 2010. The dataset was first published online in 2015 and last modified in 2017.
The relevant literature using Convolutional Neural Networks and U-Net architectures for brain tumor segmentation, which inspired this experiment, is [40,41]. We proceeded similarly to the polyp segmentation task, splitting the dataset into three disjoint sets to train the constituent U-Net models and to calibrate different ensemble and majority vote prediction sets. We trained two U-Net architectures for grayscale images with dimensions 256 × 256, with a total of 599,073 parameters each. The calibration process uses a dataset with 920 images.
For all of our experiments, we used Python 3.10.0, with the modules OpenCV (cv2) 4.10.0, TensorFlow 2.15.0, NumPy 1.26.4, and Matplotlib 3.9.1.

5. Experimental Results

We first report the results and interpretation from implementing our calibration framework for image segmentation prediction ensembles in the polyp segmentation setting. Afterwards, we report the results for the brain tumor segmentation task using a majority vote prediction set. Finally, we highlight the key takeaways from our work.

5.1. Polyp Segmentation

In Figure 5 we can empirically validate the trade-off between false positives and false negatives. Overall, the FNR decays as λ2 increases, whereas the FPR grows with λ2. From another perspective, for any given λ2, the prediction set with λ1 = 1 has the lowest false positive rate compared to the other ensemble hyperparameters, but it also has the highest false negative rate uniformly across λ2.
Overall, with Figure 5 we can empirically verify the trade-off between the FNR and the FPR, given that the curves in Figure 5a are decreasing in λ 2 , whereas the curves in Figure 5b are increasing in λ 2 .
In Figure 6 we note the sensitivity of different metrics to the values of the ensemble hyperparameter λ1 when we run the calibration procedure for a specific value of δ. Both the FPR and the IoU exhibit a convex relationship with λ1 for different values of the risk tolerance threshold α.
Notably, Figure 6a shows that the output of the procedure minimizes the FPR with ensembles that differ from the constituent models; that is, the FPR is minimized at values of λ1 with λ1 ≠ 0 and λ1 ≠ 1. Future work could study the stability of the minimizer when varying δ and α. In Figure 6 we observe that the FPR is minimized at λ1 = 0.50, or at values of λ1 very close to 0.50, across different values of α. More detailed results exploring this idea are presented in Table 2, where we can verify that, in most cases, the minimum FPR is reached at λ1 = 0.50. The supporting setup for these results is documented in Section 4.
Interestingly, Figure 7 has a similar shape to Figure 5b. As more pixels are classified with a 1, the more often we wrongly assign a 1 in the prediction mask to pixels whose true label is 0 (producing false positives).
We can verify from Table 2 that the output of our procedure with λ1 = 0.50 produces an aggregated prediction set whose FPR is substantially smaller than the comparable FPR obtained with any of the constituent image segmentation algorithms. Concretely, the ensembles with λ1 = 0.50 achieved a reduction of 50% in the FPR compared to the constituent models.
Another finding in Table 2 is the low sensitivity of the results to changes in δ, which favors our technique, because adjusting the confidence level for each ensemble does not sacrifice much power when performing a Bonferroni correction (union bound). We also remark that for bigger values of α the calibration results become more sensitive to changes in δ. We hypothesize that this can be explained by the partial concavity of the false negative rate as a function of λ2 observed in Figure 5a.
We can see from Table 2 that the outputs of the procedure using the PRW valid p-value and the Hoeffding–Bentkus (H-B) valid p-value are the same, with some exceptions. In those exceptions (for example, taking α = 0.2 and δ = 0.20 with an ensemble hyperparameter of λ1 = 0.25), the procedure with the H-B p-value outputs thresholds with smaller false positive rates, making the results using Hoeffding–Bentkus (H-B) more favorable than those of PRW.

5.2. Brain Tumor Segmentation

In Figure 8 we include a sample illustrating the experiment with a particular configuration. We should note the trade-off between false positive and false negative predictions in certain instances, as well as the benefits that the ensemble provides. For example, in some instances, one model may produce many false positives and the other may produce few of them but many false negatives. An ensemble allows us to obtain a middle point between the two of them, producing fewer false positives than any of the constituent models for any given FNR tolerance, on average.
Table 3 presents the calibration experimental results for a prediction set given by
$$C^{M}_{\lambda}(X) := \left\{ (i,j) : \frac{1}{5} \sum_{k=1}^{5} \mathbb{1}\{(i,j) \in C_k(X)\} > \lambda_1 \right\}, \qquad \lambda_1 \in \tilde{\Lambda}_1, \tag{19}$$
where we take $C_k(X) := \{(i,j) : \hat{f}_{\lambda_{1,k}}(X)_{(i,j)} \geq 1 - \lambda_2\}$ for each $\lambda_{1,k} \in \{0.00, 0.25, 0.50, 0.75, 1.00\}$.
Table 3 shows that, for fixed values of α and δ, the obtained values of λ2 increase as the consensus hyperparameter λ1 increases. This is explained by the fact that a higher consensus hyperparameter requires stronger agreement between the constituent models to include a pixel in the prediction set. Thus, to achieve the required FNR control, the threshold on the likelihood map needs to be smaller so that more pixels can be included in the prediction set, avoiding the error of excluding tumor-labeled pixels from our prediction set. From Table 3, when we increase α, the false negative rate tolerance, the calibration results lead to smaller values of the false positive rate. When the tolerance is small (α = 0.10) and the consensus hyperparameter is λ1 = 4/5, the procedure did not find a hyperparameter λ2 satisfying the required FNR control, which is explained by the fact that a higher consensus hyperparameter leads to smaller prediction sets for fixed values of λ2, producing more false negatives.
Figure 9 presents the results of this table as plots.
One relationship to be highlighted from the graphs in Figure 9 is that, for any fixed value of λ2, the false negative rate increases with λ1. This happens because the prediction sets become smaller as more consensus (agreement between the constituent voting members) is required to include each pixel in the voting prediction set; hence, with large values of λ1 we make fewer false positives but more false negatives.
Overall, the advantage of the generalized majority vote prediction set is that it allows us to incorporate the predictions of many agents and establish a consensus agreement to obtain an aggregated prediction set with reduced false positives. Furthermore, we showed how to calibrate a consensus agreement threshold that minimizes the empirical FPR from a discretized hyperparameter space for comparable majority vote prediction sets that provably satisfy a false negative rate control.

6. Conclusions

We explored different ensemble methods for image segmentation tasks, with an emphasis on a theoretical framework that ensures risk control guarantees satisfied with high probability. We illustrated this technique in two experimental settings where there is a trade-off between key risks: the false positive rate and the false negative rate. Our technique exploits the ensemble to address this trade-off in a black-box fashion. We should emphasize that if the method is used for model selection, one limitation of our work is that the ultimate model choice (e.g., minimizing the empirical false positive rate) is based on an empirical comparison of the false positive rates, with no coverage guarantee quantifying how much the selected model outperforms the constituent models of the ensemble in this metric.

Main Takeaway from This Work

Medical institutions using deep learning models for image segmentation tasks can benefit from joining efforts and merging their prediction models for shared patients, without the need to retrain a model and without the need to share their private data, while obtaining risk control guarantees on key metrics of the aggregated diagnosis. This allows them to improve (on average) their predictions in instances where, without aggregating with other agents, they would perform worse.

Author Contributions

Conceptualization, J.A. and E.R.-R.; methodology, J.A.; software, J.A.; validation, J.A. and E.R.-R.; formal analysis, J.A. and E.R.-R.; investigation, J.A. and E.R.-R.; data curation, J.A.; writing—original draft preparation, J.A.; writing—review and editing, J.A. and E.R.-R.; visualization, J.A.; supervision, E.R.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We thank the support of the Asociación Mexicana de Cultura, A. C.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Fixed Sequence Testing and Valid p-Values

The PRW valid p-value is given by
$$p^{\mathrm{PRW}}(\hat{R}, \alpha) := \begin{cases} \dfrac{\alpha\,(n - n\hat{R})}{n\alpha - n\hat{R}}\; \mathbb{P}\big\{\mathrm{Bin}(n,\alpha) \leq n\hat{R}\big\}, & \text{if } \hat{R} \in \big[0, \tfrac{\gamma(\alpha)-1}{n}\big), \\[2ex] \max\left\{1,\; \dfrac{\alpha\,(n - \gamma(\alpha) + 1)}{n\alpha - \gamma(\alpha) + 1}\right\} \mathbb{P}\big(\mathrm{Bin}(n,\alpha) \leq \gamma(\alpha) - 1\big), & \text{if } \tfrac{\gamma(\alpha)-1}{n} \leq \hat{R}, \end{cases} \tag{A1}$$
where we define $\gamma(\alpha) := \min\{k \in \mathbb{N} : k \geq \alpha n\}$ for every $\alpha \in (0,1)$. The Hoeffding–Bentkus (H-B) valid p-value is given by
$$p^{\mathrm{HB}}(\hat{R}, \alpha) := \min\Big( e\, \mathbb{P}\big(\mathrm{Bin}(n,\alpha) \leq \lceil n\hat{R} \rceil\big),\; \exp\{-n\, h(\min(\hat{R}, \alpha), \alpha)\} \Big), \tag{A2}$$
where $h(a,b) := a \log(a/b) + (1-a)\log\big(\tfrac{1-a}{1-b}\big)$. Both are valid p-values satisfying Definition 1; see [17,38] for proofs.
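As an illustration, here is a minimal Python sketch of the Hoeffding–Bentkus p-value (A2); we show only the H-B case, and the edge-case handling for empirical risks of 0 or 1 follows the usual 0·log 0 = 0 convention, which is our own assumption.

```python
import numpy as np
from scipy.stats import binom

def hb_p_value(r_hat, n, alpha):
    """Hoeffding-Bentkus valid p-value for H0: E[loss] > alpha, given the
    empirical risk r_hat over n calibration points (loss bounded in [0, 1])."""
    a = min(r_hat, alpha)
    # h(a, alpha) with the convention 0 * log(0) = 0
    t1 = a * np.log(a / alpha) if a > 0 else 0.0
    t2 = (1 - a) * np.log((1 - a) / (1 - alpha)) if a < 1 else 0.0
    hoeffding = np.exp(-n * (t1 + t2))
    bentkus = np.e * binom.cdf(np.ceil(n * r_hat), n, alpha)
    return min(hoeffding, bentkus)
```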
We can use either of these valid p-values as input to test each of the null hypotheses with any FWER-controlling algorithm. In particular, we use the following FWER-controlling algorithm for the case studies of this work.
As a remark about the algorithm, we may store only the last rejected null hypothesis for each $\Gamma_k$, because it is the one that minimizes the false positive rate among all the candidates that control the FNR below $\alpha$ with high probability; this is what we did to report the results in the tables. Also, notice that we could compute the empirical risk and the valid p-value inside the while loop to be more efficient and omit unnecessary computations.
Algorithm A1 Fixed Sequence Algorithm adapted to our framework
Inputs: $\mathcal{D}_n := \{(X_i, Y_i)\}_{i=1}^{n}$ our calibration dataset, $\{\Phi_t\}_{t=0}^{M}$ a partition of the discretized hyperparameter space, $\delta \in (0,1)$ a level at which to control the FWER, $\alpha \in (0,1)$ our desired risk control threshold.
Output: $\Gamma := \{\Gamma_k\}_{k=0}^{M}$, a family of subsets of $\Lambda$ containing $(\lambda_1, \lambda_2)$ pairs satisfying the local and global statistical guarantees of (14).
1: $\Gamma \leftarrow \emptyset$        ▹ initialize the family of sets of rejected hyperparameters
2: for $k = 0 : M$ do
3:     $\Gamma_k \leftarrow \emptyset$        ▹ initialize the set of pairs satisfying the global guarantees for the k-th ensemble
4:     for $j = 1 : Q-1$ do
5:         $\hat{R}_{k,j} \leftarrow \frac{1}{n} \sum_{i=1}^{n} \ell(S_{\lambda_{k,j}}(X_i), Y_i)$        ▹ evaluate the empirical risk
6:         $p_{k,j} \leftarrow p(\hat{R}_{k,j}, \alpha)$        ▹ obtain a valid p-value (either the PRW or the H-B)
7:     end for
8:     $j \leftarrow Q-1$        ▹ initialize the sub-index to run the fixed sequence test, beginning with the smallest valid p-value
9:     while $p_{k,j} \leq \delta/(M+1)$ and $j \geq 1$ do
10:         $\Gamma_k \leftarrow \Gamma_k \cup \{\lambda_{k,j}\}$        ▹ $H_{k,j}$ is rejected
11:         $j \leftarrow j - 1$
12:     end while
13:     $\Gamma \leftarrow \Gamma \cup \{\Gamma_k\}$        ▹ $\Gamma$ is updated
14: end for
15: return $\Gamma$
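A compact Python sketch of Algorithm A1 follows; the interfaces are our own assumptions: losses[k][j] is assumed to hold the per-image losses for the k-th ensemble at the j-th threshold, ordered so that larger j means larger λ2 (and hence smaller empirical FNR), and p_value_fn can be, for instance, the hb_p_value function sketched above.

```python
import numpy as np

def calibrate_local_ltt(losses, lam2_grid, alpha, delta, M, p_value_fn):
    """Local LTT calibration (Algorithm A1): run a fixed sequence test within
    each ensemble k at level delta / (M + 1) and collect the rejected
    (k, lambda_2) pairs, which jointly satisfy the guarantee in (14)."""
    budget = delta / (M + 1)  # union-bound (Bonferroni) allocation per ensemble
    Gamma = []
    for k in range(M + 1):
        rejected_k = []
        # start from the largest threshold index, i.e., the most promising null
        for j in range(len(lam2_grid) - 1, -1, -1):
            n = len(losses[k][j])
            r_hat = float(np.mean(losses[k][j]))
            if p_value_fn(r_hat, n, alpha) <= budget:
                rejected_k.append((k, lam2_grid[j]))
            else:
                break  # stop at the first null that is not rejected
        Gamma.append(rejected_k)
    return Gamma
```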

References

1. Zhang, F.; Kreuter, D.; Chen, Y.; Dittmer, S.; Tull, S.; Shadbahr, T.; Schut, M.; Asselbergs, F.; Kar, S.; Sivapalaratnam, S.; et al. Recent methodological advances in federated learning for healthcare. Patterns 2024, 5, 101006.
2. Kelly, C.J.; Karthikesalingam, A.; Suleyman, M.; Corrado, G.C.; King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019, 17, 195.
3. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. npj Digit. Med. 2020, 3, 119.
4. Menachemi, N.; Rahurkar, S.; Harle, C.A.; Vest, J.R. The benefits of health information exchange: An updated systematic review. J. Am. Med. Inform. Assoc. 2018, 25, 1259–1265.
5. Everson, J.; Adler-Milstein, J. Gaps in health information exchange between hospitals that treat many shared patients. J. Am. Med. Inform. Assoc. 2018, 25, 1114–1121.
6. Nakayama, M.; Inoue, R.; Miyata, S.; Shimizu, H. Health information exchange between specialists and general practitioners benefits rural patients. Appl. Clin. Inform. 2021, 12, 564–572.
7. Allen, B.; Agarwal, S.; Coombs, L.P.; Dreyer, K.; Wald, C. 2020 ACR Data Science Institute Artificial Intelligence Survey. J. Am. Coll. Radiol. JACR 2021, 18, 1153–1159.
8. Angelopoulos, A.N.; Pomerantz, S.; Do, S.; Bates, S.; Bridge, C.P.; Elton, D.C.; Lev, M.H.; González, R.G.; Jordan, M.I.; Malik, J. Conformal Triage for Medical Imaging AI Deployment. medRxiv 2024.
9. Vazquez, J.; Facelli, J.C. Conformal prediction in clinical medical sciences. J. Healthc. Inform. Res. 2022, 6, 241–252.
10. Lu, C.; Angelopoulos, A.N.; Pomerantz, S. Improving Trustworthiness of AI Disease Severity Rating in Medical Imaging with Ordinal Conformal Prediction Sets. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2022, Singapore, 18–22 September 2022; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 545–554.
11. Olsson, H.; Kartasalo, K.; Mulliqi, N.; Capuccini, M.; Ruusuvuori, P.; Samaratunga, H.; Delahunt, B.; Lindskog, C.; Janssen, E.A.; Blilie, A.; et al. Estimating diagnostic uncertainty in artificial intelligence assisted pathology using conformal prediction. Nat. Commun. 2022, 13, 7761.
12. Geirhos, R.; Zimmermann, R.S.; Bilodeau, B.; Brendel, W.; Kim, B. Don’t trust your eyes: On the (un)reliability of feature visualizations. arXiv 2024, arXiv:2306.04719.
13. Angelopoulos, A.N.; Bates, S.; Candès, E.J.; Jordan, M.I.; Lei, L. Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control. arXiv 2022, arXiv:2110.01052.
14. Angelopoulos, A.N.; Bates, S.; Fisch, A.; Lei, L.; Schuster, T. Conformal Risk Control. arXiv 2023, arXiv:2208.02814.
15. Zecchin, M.; Simeone, O. Localized Adaptive Risk Control. arXiv 2024, arXiv:2405.07976.
16. Blot, V.; Angelopoulos, A.N.; Jordan, M.I.; Brunel, N.J.B. Automatically Adaptive Conformal Risk Control. arXiv 2024, arXiv:2406.17819.
17. Bates, S.; Angelopoulos, A.; Lei, L.; Malik, J.; Jordan, M. Distribution-free, risk-controlling prediction sets. J. ACM (JACM) 2021, 68, 1–34.
18. Rahaman, R.; Thiery, A.H. Uncertainty Quantification and Deep Ensembles. arXiv 2021, arXiv:2007.08792.
19. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 757–774.
20. Buchanan, E.K.; Pleiss, G.; Cunningham, J.P. The Effects of Ensembling on Long-Tailed Data. In Proceedings of the NeurIPS 2023 Workshop Heavy Tails in Machine Learning, New Orleans, LA, USA, 15 December 2023.
21. Bousselham, W.; Thibault, G.; Pagano, L.; Machireddy, A.; Gray, J.; Chang, Y.H.; Song, X. Efficient Self-Ensemble for Semantic Segmentation. arXiv 2022, arXiv:2111.13280.
22. Gasparin, M.; Ramdas, A. Merging uncertainty sets via majority vote. arXiv 2024, arXiv:2401.09379.
23. Ali, S.; Jha, D.; Ghatwary, N.; Realdon, S.; Cannizzaro, R.; Salem, O.E.; Lamarque, D.; Daul, C.; Riegler, M.A.; Anonsen, K.V.; et al. A multi-centre polyp detection and segmentation dataset for generalisability assessment. Sci. Data 2023, 10, 75.
24. Cheng, J. Brain Tumor Dataset. 2017. Available online: https://figshare.com/articles/dataset/brain_tumor_dataset/1512427/5 (accessed on 26 July 2024).
25. Georgescu, M.I.; Ionescu, R.T.; Miron, A.I. Diversity-Promoting Ensemble for Medical Image Segmentation. arXiv 2022, arXiv:2210.12388.
26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
27. Xu, C.; Fan, K.; Mo, W.; Cao, X.; Jiao, K. Dual ensemble system for polyp segmentation with submodels adaptive selection ensemble. Sci. Rep. 2024, 14, 6152.
28. Nanni, L.; Cuza, D.; Lumini, A.; Loreggia, A.; Brahman, S. Polyp Segmentation with Deep Ensembles and Data Augmentation. In Artificial Intelligence and Machine Learning for Healthcare: Vol. 1: Image and Data Analytics; Lim, C.P., Vaidya, A., Chen, Y.W., Jain, T., Jain, L.C., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 133–153.
29. Hong, T.T.T.; Thanh, N.C.; Long, T.Q. Polyp segmentation in colonoscopy images using ensembles of U-Nets with EfficientNet and asymmetric similarity loss function. In Proceedings of the 2020 RIVF International Conference on Computing and Communication Technologies (RIVF), Ho Chi Minh, Vietnam, 14–15 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
30. Welikala, R.; Fraz, M.; Williamson, T.; Barman, S. The automated detection of proliferative diabetic retinopathy using dual ensemble classification. Int. J. Diagn. Imaging 2015, 2, 64–71.
31. Guo, X.; Yang, C.; Liu, Y.; Yuan, Y. Learn to Threshold: ThresholdNet with Confidence-Guided Manifold Mixup for Polyp Segmentation. IEEE Trans. Med. Imaging 2021, 40, 1134–1146.
32. Li, W.; Milletarì, F.; Xu, D.; Rieke, N.; Hancox, J.; Zhu, W.; Baust, M.; Cheng, Y.; Ourselin, S.; Cardoso, M.J.; et al. Privacy-preserving Federated Brain Tumour Segmentation. In Proceedings of the MLMI@MICCAI, Shenzhen, China, 13 October 2019.
33. Sheller, M.J.; Reina, G.A.; Edwards, B.; Martin, J.; Bakas, S. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers, Part I 4. Springer: Cham, Switzerland, 2019; pp. 92–104.
34. Mendler-Dünner, C.; Guo, W.; Bates, S.; Jordan, M. Test-time collective prediction. Adv. Neural Inf. Process. Syst. 2021, 34, 13719–13731.
35. Cherubin, G. Majority vote ensembles of conformal predictors. Mach. Learn. 2019, 108, 475–488.
36. Gauraha, N.; Spjuth, O. Synergy conformal prediction. In Proceedings of the Tenth Symposium on Conformal and Probabilistic Prediction and Applications, Virtual, 8–10 September 2021; Carlsson, L., Luo, Z., Cherubin, G., An Nguyen, K., Eds.; Proceedings of Machine Learning Research (PMLR): New York, NY, USA, 2021; Volume 152, pp. 91–110.
37. Nguyen, D.T.; Pathak, R.; Angelopoulos, A.N.; Bates, S.; Jordan, M.I. Data-Adaptive Tradeoffs among Multiple Risks in Distribution-Free Prediction. arXiv 2024, arXiv:2403.19605.
38. Alvarez, J. A distribution-free valid p-value for finite samples of bounded random variables. arXiv 2024, arXiv:2405.08975.
39. Bretz, F.; Posch, M.; Glimm, E.; Klinglmueller, F.; Maurer, W.; Rohmeyer, K. Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes, or parametric tests. Biom. J. Biom. Z. 2011, 53, 894–913.
40. AlTahhan, F.E.; Khouqeer, G.A.; Saadi, S.; Elgarayhi, A.; Sallah, M. Refined Automatic Brain Tumor Classification Using Hybrid Convolutional Neural Networks for MRI Scans. Diagnostics 2023, 13, 864.
41. Badža, M.M.; Barjaktarović, M.Č. Classification of Brain Tumors from MRI Images Using a Convolutional Neural Network. Appl. Sci. 2020, 10, 1999.
Figure 1. An example of our aggregation approach for image segmentation. False positives are blue, false negatives are red, true positives are white, and true negatives are black. The ensemble is a weighted average of NN1 and NN2. When combining the predictions of NN1 and NN2, we observe an asymmetric trade-off between false positives and false negatives. Our ensemble accounts for this trade-off and finds a favorable operating point at which to combine the predictions of the two neural networks (NN1 and NN2).
Figure 2. Representative 3D plot of our risk function motivating our framework. The trajectories represent the values of the risk surface for different fixed values of λ1, and the plane z = α represents the risk tolerance threshold. The blue parts of the trajectories of R(λ) represent values of λ such that R(λ) < α.
Figure 3. Weight allocations for the merged prediction set.
Figure 4. Histograms showing the proportion of pixels labeled 1 (with tumor) relative to the total number of pixels per image for the dataset of each hospital.
Figure 5. Average losses on the calibration dataset for different ensemble neural networks.
Figure 6. FPR and IoU as functions of λ1 for different false negative rate control thresholds calibrated via fixed sequence testing, taking δ = 0.10 (hence a level of δ/5 = 0.02 for each ensemble) and using the H-B valid p-value.
Figure 7. Average fraction of pixels where each ensemble predicts a tumor in the calibration dataset, relative to the total number of pixels per image (namely, 256² pixels).
Figure 8. Sample of MRI scan images used in this work. Predictions correspond to λ1 = 0.50 and λ2 = 0.92 using Equation (2). True positives are green pixels, false negatives are red pixels, and false positives are yellow pixels.
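As a complement to Figures 1 and 8, the following sketch illustrates one natural reading of the two-model ensemble suggested by the captions: a convex combination of the per-pixel scores of NN1 and NN2 with weight λ1, binarized at the threshold λ2. Equation (2) of the paper is not reproduced here, so this functional form, the function name weighted_ensemble_mask, and the 256 × 256 array shapes are assumptions made for illustration only.

import numpy as np

def weighted_ensemble_mask(scores_nn1, scores_nn2, lam1, lam2):
    # Hypothetical reading of the two-model ensemble: a convex combination of the
    # per-pixel scores of NN1 and NN2 (weight lam1), binarized at the threshold lam2.
    # Equation (2) is not reproduced in this back matter; this form is an assumption.
    merged = lam1 * scores_nn1 + (1.0 - lam1) * scores_nn2
    return merged >= lam2                 # boolean mask of pixels predicted as tumor

# Illustrative call with the thresholds reported in the Figure 8 caption.
rng = np.random.default_rng(0)
s1, s2 = rng.uniform(size=(256, 256)), rng.uniform(size=(256, 256))
mask = weighted_ensemble_mask(s1, s2, lam1=0.50, lam2=0.92)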
Figure 9. False negative rate (a) and false positive rate (b) of the majority vote approach for different voting thresholds, as functions of the per-pixel threshold for binary classification.
Table 1. Breakdown of the 1537 images in the final version of the PolypGen dataset by medical center.
Medical Center | Number of Images in the Dataset
Ambroise Paré Hôpital, Paris, France | 256
Istituto Oncologico Veneto, Padova, Italy | 301
Centro Riferimento Oncologico, IRCCS, Italy | 457
Oslo University Hospital, Oslo, Norway | 227
John Radcliffe Hospital, Oxford, UK | 208
University of Alexandria, Alexandria, Egypt | 88
Table 2. Calibration results using the fixed sequence Algorithm A1 and different valid p-values (H-B and PRW).
α | δ | λ1 | λ2 (H-B) | FPR (H-B) | λ2 (PRW) | FPR (PRW)
0.10 | 0.05 | 0.00 | 0.970 | 0.4696 | 0.970 | 0.4696
0.10 | 0.05 | 0.25 | 0.970 | 0.2206 | 0.970 | 0.2206
0.10 | 0.05 | 0.50 | 0.970 | 0.1770 | 0.970 | 0.1770
0.10 | 0.05 | 0.75 | 0.975 | 0.2495 | 0.975 | 0.2495
0.10 | 0.05 | 1.00 | 0.980 | 0.3637 | 0.980 | 0.3637
0.10 | 0.10 | 0.00 | 0.970 | 0.4696 | 0.970 | 0.4696
0.10 | 0.10 | 0.25 | 0.970 | 0.2206 | 0.970 | 0.2206
0.10 | 0.10 | 0.50 | 0.970 | 0.1770 | 0.970 | 0.1770
0.10 | 0.10 | 0.75 | 0.975 | 0.2495 | 0.975 | 0.2495
0.10 | 0.10 | 1.00 | 0.980 | 0.3637 | 0.980 | 0.3637
0.10 | 0.20 | 0.00 | 0.970 | 0.4696 | 0.970 | 0.4696
0.10 | 0.20 | 0.25 | 0.970 | 0.2206 | 0.970 | 0.2206
0.10 | 0.20 | 0.50 | 0.970 | 0.1770 | 0.970 | 0.1770
0.10 | 0.20 | 0.75 | 0.975 | 0.2495 | 0.975 | 0.2495
0.10 | 0.20 | 1.00 | 0.980 | 0.3637 | 0.980 | 0.3637
0.20 | 0.05 | 0.00 | 0.960 | 0.2126 | 0.960 | 0.2126
0.20 | 0.05 | 0.25 | 0.960 | 0.1205 | 0.960 | 0.1205
0.20 | 0.05 | 0.50 | 0.960 | 0.0929 | 0.960 | 0.0929
0.20 | 0.05 | 0.75 | 0.965 | 0.1030 | 0.965 | 0.1030
0.20 | 0.05 | 1.00 | 0.975 | 0.1960 | 0.975 | 0.1960
0.20 | 0.10 | 0.00 | 0.960 | 0.2126 | 0.960 | 0.2126
0.20 | 0.10 | 0.25 | 0.960 | 0.1205 | 0.960 | 0.1205
0.20 | 0.10 | 0.50 | 0.960 | 0.0929 | 0.960 | 0.0929
0.20 | 0.10 | 0.75 | 0.965 | 0.1030 | 0.965 | 0.1030
0.20 | 0.10 | 1.00 | 0.975 | 0.1960 | 0.975 | 0.1960
0.20 | 0.20 | 0.00 | 0.960 | 0.2126 | 0.960 | 0.2126
0.20 | 0.20 | 0.25 | 0.955 | 0.0831 | 0.960 | 0.1205
0.20 | 0.20 | 0.50 | 0.960 | 0.0929 | 0.960 | 0.0929
0.20 | 0.20 | 0.75 | 0.965 | 0.1030 | 0.965 | 0.1030
0.20 | 0.20 | 1.00 | 0.975 | 0.1960 | 0.975 | 0.1960
0.30 | 0.05 | 0.00 | 0.945 | 0.1114 | 0.945 | 0.1114
0.30 | 0.05 | 0.25 | 0.935 | 0.0086 | 0.935 | 0.0086
0.30 | 0.05 | 0.50 | 0.935 | 0.0067 | 0.935 | 0.0067
0.30 | 0.05 | 0.75 | 0.940 | 0.0091 | 0.945 | 0.0128
0.30 | 0.05 | 1.00 | 0.960 | 0.0352 | 0.960 | 0.0352
0.30 | 0.10 | 0.00 | 0.940 | 0.0970 | 0.945 | 0.1114
0.30 | 0.10 | 0.25 | 0.930 | 0.0057 | 0.935 | 0.0086
0.30 | 0.10 | 0.50 | 0.935 | 0.0067 | 0.935 | 0.0067
0.30 | 0.10 | 0.75 | 0.940 | 0.0091 | 0.940 | 0.0091
0.30 | 0.10 | 1.00 | 0.955 | 0.0238 | 0.955 | 0.0238
0.30 | 0.20 | 0.00 | 0.940 | 0.0970 | 0.940 | 0.0970
0.30 | 0.20 | 0.25 | 0.930 | 0.0057 | 0.930 | 0.0057
0.30 | 0.20 | 0.50 | 0.930 | 0.0050 | 0.935 | 0.0067
0.30 | 0.20 | 0.75 | 0.935 | 0.0066 | 0.940 | 0.0091
0.30 | 0.20 | 1.00 | 0.955 | 0.0238 | 0.955 | 0.0238
Table 3. Calibration results for the majority vote prediction set using the H-B valid p-value with a uniform average across the voting members. We use a dash “−” to denote that the output was the empty set, meaning that the procedure did not reject any null hypothesis.
α | δ | λ1 | λ2 | FPR
0.10 | 0.05 | 1/5 | 0.98 | 0.0219
0.10 | 0.05 | 2/5 | 0.99 | 0.0276
0.10 | 0.05 | 3/5 | 0.99 | 0.0231
0.10 | 0.05 | 4/5 | − | −
0.10 | 0.10 | 1/5 | 0.98 | 0.0219
0.10 | 0.10 | 2/5 | 0.99 | 0.0276
0.10 | 0.10 | 3/5 | 0.99 | 0.0231
0.10 | 0.10 | 4/5 | − | −
0.10 | 0.20 | 1/5 | 0.98 | 0.0219
0.10 | 0.20 | 2/5 | 0.99 | 0.0276
0.10 | 0.20 | 3/5 | 0.99 | 0.0231
0.10 | 0.20 | 4/5 | − | −
0.15 | 0.05 | 1/5 | 0.93 | 0.0123
0.15 | 0.05 | 2/5 | 0.95 | 0.0127
0.15 | 0.05 | 3/5 | 0.97 | 0.0134
0.15 | 0.05 | 4/5 | − | −
0.15 | 0.10 | 1/5 | 0.93 | 0.0123
0.15 | 0.10 | 2/5 | 0.94 | 0.0117
0.15 | 0.10 | 3/5 | 0.96 | 0.0117
0.15 | 0.10 | 4/5 | 0.99 | 0.0145
0.15 | 0.20 | 1/5 | 0.92 | 0.0116
0.15 | 0.20 | 2/5 | 0.94 | 0.0117
0.15 | 0.20 | 3/5 | 0.96 | 0.0117
0.15 | 0.20 | 4/5 | 0.99 | 0.0145
0.20 | 0.05 | 1/5 | 0.82 | 0.0074
0.20 | 0.05 | 2/5 | 0.86 | 0.0072
0.20 | 0.05 | 3/5 | 0.91 | 0.0074
0.20 | 0.05 | 4/5 | 0.98 | 0.0100
0.20 | 0.10 | 1/5 | 0.82 | 0.0074
0.20 | 0.10 | 2/5 | 0.86 | 0.0072
0.20 | 0.10 | 3/5 | 0.91 | 0.0074
0.20 | 0.10 | 4/5 | 0.98 | 0.0100
0.20 | 0.20 | 1/5 | 0.80 | 0.0068
0.20 | 0.20 | 2/5 | 0.86 | 0.0072
0.20 | 0.20 | 3/5 | 0.90 | 0.0069
0.20 | 0.20 | 4/5 | 0.98 | 0.0100
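Table 3 and Figure 9 evaluate a majority-vote prediction set built with uniform weights across five voting members (voting thresholds λ1 ∈ {1/5, …, 4/5}) and a per-pixel threshold λ2. The following sketch shows one way such a merged mask can be formed under that description; the function name majority_vote_mask and the choice of a strict inequality at the voting threshold are assumptions made for illustration.

import numpy as np

def majority_vote_mask(score_maps, vote_threshold, pixel_threshold):
    # Sketch of a majority-vote prediction set: each member's score map is binarized
    # at the per-pixel threshold (lambda_2 in Table 3), and a pixel enters the merged
    # set when the uniform average of the members' votes exceeds the voting threshold
    # (lambda_1 in {1/5, ..., 4/5}). Whether the comparison is strict is not specified
    # by the table; a strict inequality is assumed here.
    votes = np.stack([s >= pixel_threshold for s in score_maps], axis=0)
    return votes.mean(axis=0) > vote_threshold

# Illustrative call: five voting members, voting threshold 3/5, per-pixel threshold 0.96.
rng = np.random.default_rng(0)
maps = [rng.uniform(size=(256, 256)) for _ in range(5)]
merged = majority_vote_mask(maps, vote_threshold=3/5, pixel_threshold=0.96)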