Article

CUES: A Multiplicative Composite Metric for Evaluating Clinical Prediction Models: Theory, Inference, and Properties

by
Ali Mohammad Alqudah
1,* and
Zahra Moussavi
1,2
1
Biomedical Engineering Program, University of Manitoba, Winnipeg, MB R3T 5V6, Canada
2
Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(3), 398; https://doi.org/10.3390/math14030398
Submission received: 7 December 2025 / Revised: 15 January 2026 / Accepted: 20 January 2026 / Published: 23 January 2026
(This article belongs to the Section E3: Mathematical Biology)

Abstract

Evaluating artificial intelligence (AI) models in clinical medicine requires more than conventional metrics such as accuracy, Area Under the Receiver Operating Characteristic (AUROC), or F1-score, which often overlook key considerations such as fairness, reliability, and real-world utility. We introduce CUES as a multiplicative composite score for clinical prediction models; it is defined as CUES = (C · U · E · S)^{1/4}, where C represents calibration, U integrated clinical utility, E equity across patient subpopulations, and S sampling stability. We formally establish boundedness, monotonicity, and differentiability on the domain (0, 1]^4, derive first-order sensitivity relations, and provide asymptotic approximations for its sampling distribution via the delta method. To facilitate inference, we propose bootstrap procedures for constructing confidence intervals and for comparative model evaluation. Analytic examples illustrate how CUES can diverge from traditional metrics, capturing dimensions of predictive performance that are essential for clinical reliability but often missed by AUROC or F1-score alone. By integrating multiple facets of clinical utility and robustness, CUES provides a comprehensive tool for model evaluation, comparison, and selection in real-world medical applications.

1. Introduction

The integration of artificial intelligence (AI) into clinical medicine holds the promise of revolutionizing diagnostics, prognostics, and treatment planning [1,2,3]. From interpreting medical images to predicting patient outcomes, AI models are increasingly being developed to support clinicians and improve patient care. However, translating these models from research to routine clinical practice is fraught with challenges, not least the critical task of performance evaluation. The adage “what you measure is what you get” is particularly salient in this high-stakes domain; if we rely on inadequate or incomplete metrics, we risk developing and deploying models that are not only ineffective but potentially harmful [4,5,6].
For decades, the evaluation of classification models has been dominated by a set of standard metrics, including accuracy, precision, recall, F1-score, and the Area Under the Receiver Operating Characteristic Curve (AUROC) [7,8,9]. While these metrics have served well in many traditional machine learning applications, they possess fundamental limitations that render them insufficient for the nuanced requirements of clinical AI. The most prominent limitation is their poor performance on imbalanced datasets, a common characteristic of clinical data where the prevalence of a disease or condition is often low. Accuracy, for example, can be deceptively high for a model that simply predicts the majority class (e.g., “no disease”), offering no real clinical value. Similarly, while AUROC is prevalence-invariant, it measures a model’s ability to rank patients by risk, not whether the risk predictions themselves are accurate or lead to good clinical decisions [10,11,12]. These limitations of AUROC and accuracy in clinical prediction have been extensively documented in the literature [13,14]; our contribution is not to restate these limitations, but to provide a unified framework that operationalizes them during model evaluation and selection.

1.1. The Need for a New Evaluation Metric

Traditional evaluation metrics for clinical AI models, such as accuracy, AUROC, and F1-score, are widely used but have significant limitations when applied to real-world healthcare settings. These metrics are often misleading in the presence of imbalanced datasets, commonly found in clinical data, and fail to reflect the actual clinical value of a model [7,10,11,15]. For instance, a model can achieve high accuracy by simply predicting the majority class while completely missing critical but rare cases. Furthermore, these metrics typically ignore essential aspects such as:
  • Clinical utility [3]: Do predictions actually result in better patient outcomes?
  • Calibration [4]: Are the predicted probabilities trustworthy?
  • Equity [16]: Does the model perform fairly across diverse patient groups, or does it exacerbate existing health disparities?
  • Stability [17]: Is the model’s performance consistent across different samples or environments?
Relying solely on current metrics risks deploying AI solutions that are statistically sound but clinically unsafe, unfair, or unreliable [4,5,6,18]. Therefore, a new evaluation score is urgently needed, one that incorporates a holistic perspective on a model’s fitness for clinical use.

1.2. What Makes This Score Different from Current Scores

The new metric CUES (Calibration, Utility, Equity, Stability) fundamentally differs from traditional metrics by addressing all critical dimensions required for trustworthy clinical AI, as discussed in [7,11,15,19]; it provides the following:
  • Composite design: CUES is an unweighted geometric mean of four core components: utility (clinical usefulness), calibration (probability accuracy), equity (fairness across subgroups), and stability (robustness to sampling).
  • Holistic evaluation: Each component is independently normalized, meaning that poor performance in one area (e.g., high bias or instability) will substantially reduce the overall score, preventing one strong aspect from masking significant deficiencies elsewhere.
  • Explicit fairness and robustness: By measuring subgroup disparities and bootstrapped stability, CUES directly evaluates areas that traditional metrics ignore, such as algorithmic bias and performance fluctuations [16,17,20,21].
While individual performance measures remain essential for diagnostic insight, many practical tasks in clinical AI, such as model selection, hyperparameter tuning, feature selection, and regulatory approval, require a principled scalar criterion to enable consistent comparison across candidate models. CUES addresses this need by integrating calibration, utility, equity, and stability into a single decision-oriented metric. In practical terms, CUES provides a single, interpretable value that summarizes all essential facets of clinical model performance, something current standards cannot offer, as they often require examining several metrics separately (and sometimes missing critical failures) [1,4,6,8].
In such decision-making settings, reporting multiple metrics without an explicit aggregation rule leaves the final choice subjective and potentially inconsistent. CUES is introduced to address this gap by providing a conservative, decision-oriented composite score that formalizes trade-offs among calibration, clinical utility, equity, and stability. Importantly, CUES is not intended to replace component-wise evaluation, but to complement it by ensuring that no critical deficiency is overlooked during model comparison.
Although many evaluation strategies have been proposed, ranging from accuracy-based metrics and AUROC [7,10,11,12] to decision-curve analysis for clinical utility [7,13,20,21] and fairness-aware performance measures [3,16,17,20,21], these approaches remain fragmented and often insufficient for the complexities of clinical practice. They frequently overlook critical dimensions such as calibration [2,4], robustness to data shifts [2,17], and subgroup equity [3,16], while imbalanced datasets further limit the reliability of standard metrics [7,10,11,15]. Consequently, there is a clear need for a unified, clinically grounded metric that synthesizes these essential but disparate components. In this work, we introduce CUES, a composite score that integrates utility, calibration, equity, and stability into a single interpretable measure, aiming to provide a more comprehensive and clinically meaningful assessment of model performance. The next section presents the formal definition of the CUES score and its mathematical properties.

2. Materials and Methods

In this section, we first describe how each component of the CUES framework is estimated in practice using finite samples. Theoretical properties, including asymptotic distributions and statistical inference, are presented subsequently to clarify the behavior of the proposed estimators.

2.1. Overview of the CUES Framework

The four components of CUES (calibration, utility, equity, and stability) were selected because they represent non-substitutable dimensions of clinical model performance. Calibration ensures the trustworthiness of predicted probabilities; utility reflects real-world clinical benefit; equity safeguards against systematic performance disparities across patient subgroups; and stability captures robustness to sampling variability and deployment uncertainty. Deficiencies in any one of these dimensions can independently render a clinical model unsafe or misleading, even when performance on the remaining dimensions is strong. CUES is calculated as a geometric mean of its core components: utility (U), calibration (C), equity (E), and stability (S). The use of a geometric mean ensures that poor performance in any single component significantly reduces the overall score, preventing a high score in one area from masking deficiencies in another. Let a predictive model produce probability estimates p̂_i ∈ [0, 1] for instances i = 1, …, n with true binary outcomes y_i ∈ {0, 1}. Let G = {G_1, …, G_m} be a (possibly empty) partition of the population into predefined subgroups used for the equity assessment. Although CUES produces a single composite score, it is intended to be interpreted jointly with its constituent components. Component-wise reporting of calibration, utility, equity, and stability is essential for diagnostic interpretation and enables direct identification of the primary contributors to a reduced composite score. Define four normalized component scores
C, U, E, S ∈ [0, 1]
where each component is a function of the model outputs and outcomes (and subgroup assignment where relevant):
C = C({(p̂_i, y_i)}_{i=1}^{n})
U = U({(p̂_i, y_i)}_{i=1}^{n})
E = E({(p̂_i, y_i)}_{i=1}^{n}, G)
S = S({(p̂_i, y_i)}_{i=1}^{n})
Here, C is the calibration score, U the clinical utility score, E the equity (fairness) score across G, and S the stability (robustness) score. Each component is normalized to lie in [0, 1] (1 = best possible performance, 0 = worst or baseline performance) by the normalization procedures described in the methods. Equal weighting of the four components is adopted as a neutral default in the absence of application-specific priorities. This choice does not imply that all dimensions are universally equally important but rather reflects a conservative stance in which no critical dimension is assumed to be secondary by default. The composite CUES score is defined as the fourth root (geometric mean) of the product of the four components:
CUES = (C(p̂, y) · U(p̂, y) · E(p̂, y, G) · S(p̂, y))^{1/4}
where p̂ = (p̂_1, …, p̂_n) and y = (y_1, …, y_n). The aggregation of CUES components is based on the geometric mean, which can be viewed as a special case of the generalized mean family. The framework naturally supports weighted aggregation, allowing users to specify component weights to reflect clinical, ethical, or regulatory priorities.
The arithmetic mean can mask severe deficiencies [22], and the minimum operator is excessively stringent [23]. The harmonic mean [24], another common choice, places disproportionate weight on the smallest component, a well-known property in multi-criteria aggregation that can result in excessive score attenuation under moderate imbalance. Compared to these alternatives, the geometric mean provides a principled compromise: it is scale-invariant, zero-preserving (i.e., any zero-valued component yields a zero composite score), and enforces proportional penalization of deficiencies without allowing any single component to dominate the overall evaluation. The geometric mean therefore offers a more stable and interpretable trade-off between sensitivity to localized weaknesses and robustness of the overall assessment.
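The zero-preserving, proportional-penalty behavior described above can be illustrated in a few lines. This is a minimal sketch; the function name `cues_score` is ours and is not part of any released package:

```python
def cues_score(c: float, u: float, e: float, s: float) -> float:
    """Unweighted geometric mean of the four normalized CUES components.

    Any component equal to zero forces the composite to zero, so a single
    weak dimension cannot be masked by strong performance elsewhere."""
    for x in (c, u, e, s):
        if not 0.0 <= x <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return (c * u * e * s) ** 0.25

balanced = cues_score(0.8, 0.8, 0.8, 0.8)    # equal components pass through: 0.8
skewed = cues_score(0.95, 0.95, 0.95, 0.10)  # ~0.54, well below the arithmetic mean (~0.74)
```

The second call shows the intended conservatism: three near-perfect components cannot compensate for one deficient one, whereas an arithmetic mean would still report a reassuring 0.74.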
Detailed mathematical definitions and the normalization mappings for C , U , E , and S follow in the methods section, and proof sketches for all theoretical propositions are provided in Appendix A. While the specific definitions of clinical utility functions, subgroup partitions for equity, and perturbation schemes for stability are necessarily application-dependent, CUES is intentionally designed as a modular framework. Each component can be adapted to reflect clinically meaningful priorities without altering the mathematical structure of the composite score.
To promote reproducibility, we recommend default choices where applicable: (i) decision-curve-based net benefit integrated over clinically relevant thresholds for utility, (ii) subgroup definitions based on established demographic or clinical variables for equity, and (iii) bootstrap resampling for stability assessment. Sensitivity analyses across reasonable design choices are encouraged and can be used to assess the robustness of CUES-based model comparisons.
Assumptions and conventions
  • Each component is mapped/scaled to [ 0,1 ] .
  • If any component is undefined for a dataset, the corresponding mapping must be specified (e.g., define E = 1 if no subgrouping is used).
  • Logarithms use the natural base when needed.
Formally:
  • Domain: (C, U, E, S) ∈ [0, 1]^4.
  • Range: CUES ∈ [0, 1].

2.2. Estimation of Individual Components

This subsection describes the practical estimation of each CUES component using observed data. All component scores are computed directly from model predictions and outcomes without reliance on asymptotic approximations. Theoretical properties of the resulting estimators are discussed separately in Section 2.3.

2.2.1. Utility (U): Integrated Net Benefit

Utility quantifies the clinical usefulness of a model’s predictions across a range of decision thresholds. It is derived from Decision Curve Analysis (DCA), which evaluates diagnostic and prognostic models by plotting the net benefit against threshold probability [7,15,25,26]. Net benefit considers the balance between true positives and false positives, weighted by the relative harm of false positives compared to false negatives [5,11,15,26]. For a given threshold t, the net benefit NB(t) is calculated as
NB(t) = TP(t)/N − (FP(t)/N) · t/(1 − t)
where TP(t) and FP(t) are the numbers of true and false positives at threshold t, respectively; N is the total number of patients; and t is the threshold probability, interpreted as the minimum disease probability at which treatment would be considered.
To obtain a single utility score, we integrate the net benefit over a range of clinically relevant thresholds. The integrated net benefit ( I N B ) is calculated as the area under the net benefit curve, normalized by the maximum possible net benefit [25,27]. We use weighted integration to allow for the prioritization of certain thresholds, defined by a weighting function w ( t ) .
INB_model = ∫_{t_min}^{t_max} NB(t) w(t) dt
The utility score U is then normalized using the integrated net benefit of a perfect model (INB_perfect) and of a baseline model (INB_baseline), the latter typically being a strategy of treating all or none. In Decision Curve Analysis (DCA), a perfect model assigns a predicted probability of 1 to every true positive case and 0 to every true negative case. Therefore, at any threshold t, the perfect model identifies all P positive cases (TP = P) and never recommends treatment for negative cases (FP = 0). Its net benefit is
NB_perfect(t) = P/N
which equals the prevalence and forms a horizontal reference line in DCA plots.
For comparison, the net benefit of a “treat all” strategy is
NB_all(t) = P/N − ((N − P)/N) · t/(1 − t)
and the “treat none” strategy has NB_none(t) = 0, which we use as the baseline for utility normalization. Using these definitions, the utility score is computed as
U = (INB_model − INB_baseline) / (INB_perfect − INB_baseline + ε), U ← clip(U, 0, 1)
where ε (e.g., 10⁻⁶) prevents division by zero when the denominator is very small, ensuring U ∈ [0, 1]. The integrated net benefit of a model is computed as a weighted integral over decision thresholds t ∈ [0, 1]:
INB_model = ∫_0^1 NB_model(t) w(t) dt
with analogous definitions for INB_baseline and INB_perfect. Here, w(t) is a weighting function that determines the relative contribution of each threshold. A uniform weighting (w(t) = 1) treats all thresholds equally; alternatively, w(t) can be customized, commonly via a Beta density w(t) ∝ Beta(t; α, β), to emphasize the thresholds that are clinically most relevant (e.g., low-risk screening, high-specificity decision points, or intermediate actionable ranges) [28,29].
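The utility computation above can be sketched as follows. This is a minimal illustration under uniform threshold weighting, the treat-none baseline (INB_baseline = 0), and an assumed threshold grid of [0.05, 0.95]; the helper names are ours:

```python
import numpy as np

def net_benefit(p_hat, y, t):
    """NB(t) = TP/N - (FP/N) * t / (1 - t) at decision threshold t."""
    pred = p_hat >= t
    n = len(y)
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    return tp / n - (fp / n) * t / (1.0 - t)

def _trapezoid(vals, grid):
    """Trapezoidal integration, kept explicit to avoid NumPy version differences."""
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

def utility_score(p_hat, y, t_grid=None, w=None, eps=1e-6):
    """U = INB_model / (INB_perfect + eps) with treat-none baseline (NB = 0),
    clipped to [0, 1]."""
    if t_grid is None:
        t_grid = np.linspace(0.05, 0.95, 91)   # assumed clinically relevant range
    if w is None:
        w = np.ones_like(t_grid)               # uniform weighting w(t) = 1
    prevalence = float(np.mean(y))
    nb_model = np.array([net_benefit(p_hat, y, t) for t in t_grid])
    nb_perfect = np.full_like(t_grid, prevalence)   # NB_perfect(t) = P/N
    inb_model = _trapezoid(nb_model * w, t_grid)
    inb_perfect = _trapezoid(nb_perfect * w, t_grid)
    return float(np.clip(inb_model / (inb_perfect + eps), 0.0, 1.0))
```

A perfect classifier attains U ≈ 1, while a model that never recommends treatment scores U = 0; replacing the uniform weights with a Beta density over `t_grid` recovers the weighted variant described above.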

2.2.2. Calibration (C): Brier Skill Score

Calibration assesses the agreement between predicted probabilities and observed outcomes. A well-calibrated model is essential for clinical trust and decision-making. We use the Brier Score ( B S ) as a measure of calibration, which is the mean squared difference between predicted probabilities and actual outcomes [4,5,12,19,30]:
BS = (1/N) ∑_{i=1}^{N} (p_i − y_i)²
where p_i is the predicted probability for instance i and y_i is the true binary outcome (0 or 1) for instance i. A lower Brier Score indicates better calibration. To normalize this into a score between 0 and 1 (where 1 is perfect), we use the Brier Skill Score (BSS), which compares the model’s Brier Score to that of a reference model [12,31,32], typically a model that always predicts the prevalence π of the positive class:
BS_ref = π(1 − π)
The calibration component C is then defined as
C = max(0, 1 − BS / (BS_ref + ε))
This ensures that C = 1 for a perfectly calibrated model and C = 0 for a model that performs no better than predicting the prevalence [2,4,8,9].
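The calibration component reduces to a few lines; this is a minimal sketch of the clipped Brier Skill Score as defined above (the function name is ours):

```python
import numpy as np

def calibration_score(p_hat, y, eps=1e-6):
    """C = max(0, 1 - BS / (BS_ref + eps)), where BS_ref = pi * (1 - pi)
    is the Brier score of a model that always predicts the prevalence pi."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    bs = float(np.mean((p_hat - y) ** 2))
    pi = float(np.mean(y))
    bs_ref = pi * (1.0 - pi)   # reference: constant prevalence predictor
    return max(0.0, 1.0 - bs / (bs_ref + eps))
```

Perfect probabilistic predictions yield C = 1, while the constant prevalence predictor itself scores C ≈ 0, matching the intended anchoring of the scale.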

2.2.3. Equity (E): Worst-Case Disparity with Variance Penalty

Equity measures the fairness of the model’s utility across different predefined subgroups. It is crucial to ensure that AI models do not exacerbate existing health disparities by performing poorly on specific demographic groups [3,6,16,17]. Our equity component E is based on the utility scores calculated for each subgroup. In the implementation provided, the dataset is divided into two subgroups according to the median of the first feature of the test set (X_test[:, 0]):
  • Subgroup 1: U_1 = utility score for all samples with X_test,i,0 ≤ median (first partition).
  • Subgroup 2: U_2 = utility score for all samples with X_test,i,0 > median (second partition).
The equity score is then defined as
E = max(0, min(1, 1 − |U_1 − U_2|))
This ensures that E ∈ [0, 1]: E is close to 1 when subgroup utilities are similar and decreases toward zero as the disparity between them grows. If no subgrouping is performed, equity defaults to E = 1.0, indicating perfect equity by assumption [11,16,17,18].
Extended Equity (E) for Demographic and Anthropometric Fairness
The equity (E) component of CUES quantifies model fairness across clinically meaningful subgroups, accounting for anthropometric and demographic covariates such as age, sex, BMI, neck circumference, and body fat distribution [17]. In its simplest form, E measures the complement of the absolute difference in utility (U) [33,34] between the following two subgroups:
E = 1 − |U_{g1} − U_{g2}|
where g 1 and g 2 represent subgroups of a stratification variable (e.g., median split of a continuous attribute). This formulation ensures that a model achieving equivalent clinical utility across subgroups obtains a maximal E = 1 , while significant disparities reduce the fairness score.
However, in medical and physiological domains such as obstructive sleep apnea (OSA) screening, patient variability is strongly influenced by anthropometric and demographic confounders, including age, sex, body mass index (BMI), neck circumference, and body fat distribution [35]. These parameters are not mere nuisance variables but are clinically meaningful determinants of both disease manifestation and signal characteristics (e.g., breathing sounds, airway dynamics). Consequently, fairness should not be assessed solely on abstract data partitions but should explicitly account for population heterogeneity arising from these biological factors. To generalize E in this context, the equity component can be reformulated as the mean subgroup consistency in clinical utility across K predefined demographic or anthropometric strata:
E = 1 − (1/(K − 1)) ∑_{k=1}^{K} |U_k − Ū|
where U_k denotes the utility within subgroup k (e.g., male, female, young, old, high-BMI, low-BMI) and Ū is the overall mean utility. This expression quantifies how uniformly a model performs across clinically relevant population partitions. A value of E ≈ 1 implies consistent predictive utility across subgroups, indicating demographic robustness and low bias.
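The generalized equity component can be sketched as follows. This is an illustration under one assumption we make explicit: the 1/(K − 1) normalizer, chosen so that for K = 2 the score reduces exactly to the two-subgroup form 1 − |U_1 − U_2|; the function name is ours:

```python
import numpy as np

def equity_score(subgroup_utils):
    """E = 1 - (1/(K-1)) * sum_k |U_k - mean(U)|, clipped to [0, 1].

    Assumption: the 1/(K-1) normalizer makes the K = 2 case reduce to
    1 - |U1 - U2|. With fewer than two subgroups, E defaults to 1
    (perfect equity by convention)."""
    u = np.asarray(subgroup_utils, dtype=float)
    k = u.size
    if k < 2:
        return 1.0
    mad = float(np.sum(np.abs(u - u.mean()))) / (k - 1)
    return float(np.clip(1.0 - mad, 0.0, 1.0))
```

For example, subgroup utilities (0.9, 0.6) give E = 0.7, the same value the two-group formula produces, while a wider spread across three strata is penalized more heavily than a narrow one.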

2.2.4. Stability (S): Normalized Confidence Interval

Stability assesses the model’s robustness to sampling variability. A stable model yields consistent utility scores across different bootstrapped samples of the data. We quantify stability by repeatedly bootstrapping the dataset, calculating the utility score for each bootstrap sample, and then examining the variability of these utility scores [20,21,36]. Let U^(b) denote the utility score obtained from the b-th bootstrap sample. We calculate the coefficient of variation (CV) of the bootstrapped utility scores as follows:
CV = SD(U^(b)) / (mean(U^(b)) + ε)
where SD(U^(b)) is the standard deviation of the bootstrapped utility scores, mean(U^(b)) is their mean, and ε is a small constant (e.g., 10⁻⁶) to prevent division by zero if the mean utility is very close to zero. A lower CV indicates higher stability [20,27,37]. The stability score S is then defined using an exponential decay function to map CV to the [0, 1] range as follows:
S = exp(−λ · CV)
where λ is a sensitivity parameter that controls how quickly the score drops with increasing variability; a higher λ makes the score more sensitive to instability. This formulation ensures that S = 1 for perfect stability (CV = 0) and that S approaches zero as CV increases.
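The bootstrap-based stability computation can be sketched as follows; this is a minimal illustration, and the default λ = 5 and 200 resamples are illustrative choices of ours, not prescribed values:

```python
import numpy as np

def stability_score(p_hat, y, utility_fn, n_boot=200, lam=5.0, eps=1e-6, seed=0):
    """S = exp(-lam * CV), where CV is the coefficient of variation of the
    utility score across bootstrap resamples of the evaluation data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_utils = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample observations with replacement
        boot_utils[b] = utility_fn(p_hat[idx], y[idx])
    cv = boot_utils.std() / (boot_utils.mean() + eps)
    return float(np.exp(-lam * cv))
```

A utility that does not vary across resamples yields CV = 0 and hence S = 1; any sampling variability pushes S below 1, and more quickly so for larger λ.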

2.3. Practical Estimation Algorithm for CUES

For clarity and reproducibility, we summarize the practical computation of the CUES score in Algorithm 1. This algorithm describes the finite-sample estimation procedure independently of theoretical or asymptotic considerations.
Algorithm 1. Estimation of the CUES score
1. Train the prediction model and obtain predicted probabilities p̂_i on the evaluation dataset.
2. Compute clinical utility U using integrated net benefit via decision curve analysis.
3. Compute the calibration score C using the Brier Skill Score.
4. Compute subgroup-specific utility scores and derive the equity score E.
5. Perform bootstrap resampling to estimate stability S.
6. Normalize all component scores to lie in [0, 1].
7. Compute the composite CUES score using the geometric mean in Equation (6).

2.4. The CUES Curve: Visualization and Interpretation

The CUES Curve provides a graphical representation of an AI model’s clinical utility across a continuum of decision thresholds, normalized to reflect meaningful benefit relative to baseline and a perfect model. This visualization complements the composite CUES score by enabling practitioners to examine how model performance varies with changes in the intervention threshold, a crucial step for informed clinical decision-making. The CUES curve is constructed through the following steps:

2.4.1. Net Benefit Computation

For each threshold value, the net benefit is calculated for:
  • The model: using its predicted probabilities.
  • Treat-all strategy: as if all patients are predicted to be positive.
  • Treat-none strategy: as if no patients are predicted to be positive.
  • Perfect model: using true labels as predictions.

2.4.2. Normalization

The model’s net benefit is normalized relative to the best baseline (treat-all or treat-none) and the perfect possible benefit at each threshold as follows:
Normalized Net Benefit(t) = max(0, (NB_model(t) − NB_baseline(t)) / (NB_perfect(t) − NB_baseline(t) + ε))
where ϵ is a small constant to avoid division by zero.

2.4.3. Area Under the CUES Curve (AUC-CUES)

The area under this normalized benefit curve across all reasonable decision thresholds is computed (using trapezoidal integration). This value reflects the overall utility efficiency of the model over the full clinical decision spectrum:
AUC-CUES = ∫_{threshold_min}^{threshold_max} Normalized Net Benefit(t) dt
A higher AUC-CUES indicates broader, more reliable clinical benefit [8,9,10].

2.4.4. Visualization

The visualization highlights:
  • thresholds where the model provides maximal clinical benefit;
  • thresholds where the model underperforms or offers no gain over the basic strategies; and
  • the overall consistency of the model’s utility, informing clinicians about advisable operating points.

2.4.5. Theoretical Comparison with Existing Metrics (Decoupling from AUROC)

High AUROC does not imply high CUES: AUROC measures ranking ability and is prevalence-invariant, whereas calibration C (e.g., Brier Skill Score) and utility U (integrated net benefit) can be low while AUROC is high. Consider a model that ranks patients perfectly but outputs probabilities whose magnitudes are systematically distorted (e.g., compressed into a narrow range). Since CUES multiplies calibration and utility, a poor C or U will substantially lower the final score even if AUROC is near 1 [4,11,15,19]. An explicit construction is given in Appendix A (e.g., scaling the logit outputs by a small factor preserves the ordering but destroys calibration).
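This decoupling can be checked numerically: shrinking the logits by a small factor leaves the ranking (and hence AUROC) unchanged while pushing all probabilities toward 0.5, which drives the Brier Skill Score toward zero. A minimal sketch on synthetic data; the rank-based AUROC helper is ours and ignores ties:

```python
import numpy as np

def auroc(p, y):
    """Mann-Whitney form: probability a random positive outranks a random negative."""
    pos, neg = p[y == 1], p[y == 0]
    return float((pos[:, None] > neg[None, :]).mean())

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
logit = np.where(y == 1, 1.0, -1.0) + rng.normal(0.0, 0.5, n)

p_cal = 1.0 / (1.0 + np.exp(-logit))            # reasonably calibrated probabilities
p_shrunk = 1.0 / (1.0 + np.exp(-0.05 * logit))  # same ordering, compressed toward 0.5

def brier(p):
    return float(np.mean((p - y) ** 2))

pi = float(np.mean(y))

def bss(p):
    """Brier Skill Score against the prevalence predictor."""
    return 1.0 - brier(p) / (pi * (1.0 - pi))
```

Because the shrinkage is strictly monotone, `auroc(p_cal, y)` and `auroc(p_shrunk, y)` coincide, yet `bss(p_shrunk)` collapses toward zero while `bss(p_cal)` remains substantial, so the two models receive very different CUES scores.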
Theorem (equivalence condition). If a model is perfectly calibrated (C = 1), achieves perfect integrated net benefit over the considered thresholds (U = 1), is fully equitable across subgroups (E = 1), and is stable (S = 1), then CUES = 1. Conversely, if CUES = 1, then each component must equal 1; both directions follow directly from the definition and boundedness.

2.4.6. Interpretation

The CUES curve serves as a practical and interpretable tool for visualizing and quantifying how model predictions translate into actionable clinical benefits across a range of threshold settings. It directly supports model selection and threshold optimization by spotlighting both strengths and deficiencies that may be missed by scalar metrics alone. This approach grounds evaluation not just in point estimates but in a truly holistic understanding of clinical AI performance [8,9,36].

2.5. Gradient-Based Optimization of CUES (Optional)

Optimization of CUES is optional and not required for model evaluation. In this work, CUES is primarily used as a post hoc evaluation metric; optimization is discussed here for completeness and potential future extensions. If one wishes to optimize CUES directly (for example, during model training), the gradients of CUES with respect to the model parameters θ can be computed using the chain rule:
∂CUES/∂θ = ∑_{X ∈ {C, U, E, S}} (∂CUES/∂X) · (∂X/∂θ)
with
∂CUES/∂X = CUES / (4X)
thus,
∇_θ CUES = (CUES / 4) ∑_{X ∈ {C, U, E, S}} (1/X) · ∂X/∂θ
where components X may themselves be non-differentiable or estimated via complex procedures; in that case, surrogate differentiable proxies can be used.
  • Convexity: the mapping (C, U, E, S) ↦ (C · U · E · S)^{1/4} is not convex on [0, 1]^4. Consequently, direct optimization is non-convex and may have multiple local maxima; stochastic gradient methods, simulated annealing, or surrogate losses can be used.
  • Practical approach: rather than optimizing CUES end-to-end, one can optimize component proxies (e.g., Brier loss for calibration, a decision-curve-informed loss for utility) and combine them in a multi-objective framework. While direct end-to-end optimization of CUES with respect to model parameters is ideal in theory, it is often infeasible in practice given the non-differentiability and non-convexity of several CUES components.
As an alternative, we adopt a pragmatic hyperparameter-tuning strategy in which CUES is evaluated as an out-of-fold metric via grid search and cross-validation. This approach does not require a closed-form gradient or direct parameter-level optimization but still enables principled model and hyperparameter selection via the CUES criterion. By comparing cross-validated optimization of CUES and traditional accuracy, we empirically assess the impact of the selected objective on both overall model performance and utility component profiles. This methodology complements the theoretical framework while being immediately applicable across a broad range of standard machine learning models.

2.6. Theoretical Properties and Statistical Inference

Although CUES is implemented using finite-sample estimators and bootstrap-based procedures, the following theoretical results are included to establish mathematical validity, consistency, and inferential grounding. These results are not required for practical computation of CUES, but justify uncertainty quantification, hypothesis testing, and principled comparison between models.

2.6.1. Boundedness

For C, U, E, S ∈ [0, 1], we have 0 ≤ CUES ≤ 1. The product P = CUES⁴ = C · U · E · S lies in [0, 1] because each factor lies in [0, 1]; taking the fourth root preserves the interval: 0 ≤ P^{1/4} ≤ 1.

2.6.2. Zero Sensitivity

If any component equals 0, then CUES = 0: if C = 0 (or any other factor is zero), the product C · U · E · S is 0, and hence so is its fourth root.

2.6.3. Monotonicity

CUES is strictly increasing in each component on (0, 1]^4. Formally, for fixed U, E, S > 0, ∂CUES/∂C > 0. For C > 0,
∂CUES/∂C = (1/4) (C · U · E · S)^{−3/4} · (U · E · S) = CUES / (4C) > 0
Analogous expressions hold for U , E , S . Thus, increasing any component increases CUES.

2.6.4. Relative Sensitivity

∂ log(CUES) / ∂ log C = 1/4
Hence, an x% relative change in any component produces approximately an (x/4)% relative change in CUES (to first order), since
log CUES = (1/4) (log C + log U + log E + log S)
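The 1/4 elasticity can be verified numerically; for instance, a 4% relative increase in a single component changes CUES by roughly 1% (the component values below are arbitrary illustrations):

```python
# Illustrative component values (not from any real model).
c, u, e, s = 0.80, 0.70, 0.90, 0.85

base = (c * u * e * s) ** 0.25
bumped = (c * (1.04 * u) * e * s) ** 0.25  # 4% relative increase in U alone
rel_change = bumped / base - 1.0           # ~0.0099, i.e. about 1%
```

The exact ratio is 1.04^{1/4} − 1 ≈ 0.00985, matching the first-order prediction of x/4 = 1% up to the usual second-order correction.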

2.6.5. Continuous Mapping Theorem

If estimators C ^ n , U ^ n , E ^ n , S ^ n are (pointwise) consistent for C , U , E , S as n , then
ĈUES_n := (Ĉ_n · Û_n · Ê_n · Ŝ_n)^{1/4} →_p (C · U · E · S)^{1/4} = CUES
This is an application of the continuous mapping theorem: the fourth root of a convergent product converges to the fourth root of the limit.

2.6.6. Asymptotic Distribution (Delta Method Sketch)

To assess the variability of the CUES score and derive approximate confidence intervals, we consider its asymptotic distribution. Since CUES is a non-linear function of the four component estimates (Ĉ, Û, Ê, Ŝ), standard multivariate central limit results do not apply directly; the delta method provides a convenient way to approximate the sampling distribution of a smooth function of asymptotically normal estimators. Suppose √n(θ̂_n − θ) is asymptotically normal for θ̂_n = (Ĉ, Û, Ê, Ŝ) with covariance matrix Σ, and define g(θ) = (C · U · E · S)^{1/4} [20,37]. Then, by the delta method,
n (   g ( θ ^ n ) g ( θ )   ) d N   ( 0 ,   g ( θ ) Σ   g ( θ ) )
where
g ( θ ) = ( g ( θ ) 4 C , g ( θ ) 4 U , g ( θ ) 4 E , g ( θ ) 4 S )
This yields an approximate standard error for C U E S ^ and facilitates hypothesis testing.
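The gradient above gives a plug-in standard error directly. A minimal Python sketch, assuming hypothetical component estimates and an illustrative diagonal covariance matrix (both are assumptions, not values from our experiments):

```python
import math

def cues(theta):
    c, u, e, s = theta
    return (c * u * e * s) ** 0.25

def delta_method_var(theta, cov):
    """First-order variance grad(g)' * Cov * grad(g),
    using dg/dtheta_i = g(theta) / (4 * theta_i)."""
    g = cues(theta)
    grad = [g / (4.0 * t) for t in theta]
    return sum(grad[i] * cov[i][j] * grad[j]
               for i in range(4) for j in range(4))

# Hypothetical component estimates and covariance (illustrative only).
theta_hat = [0.80, 0.70, 0.90, 0.85]
cov = [[0.0004 if i == j else 0.0 for j in range(4)] for i in range(4)]

se = math.sqrt(delta_method_var(theta_hat, cov))
print(round(se, 4))
```

An approximate Wald interval is then $\widehat{\mathrm{CUES}} \pm 1.96\,\mathrm{se}$, although in practice we prefer the bootstrap intervals described next.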

2.6.7. Bootstrap-Based Inference

Because the component estimators are often complex (DCA integrals, bootstrapped stability, subgroup utilities), a non-parametric bootstrap (resampling observations with replacement and recomputing $C$, $U$, $E$, $S$, and hence CUES) yields an empirical distribution for $\widehat{\mathrm{CUES}}$. Percentile or bias-corrected and accelerated (BCa) intervals can then be used as CIs [20,21,36].
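A minimal sketch of the percentile bootstrap follows. The component computation here is a toy placeholder built from simple summaries of synthetic scores; in practice, C, U, E, and S would each be recomputed on every resample:

```python
import random
import statistics

def cues_from_sample(sample):
    """Toy placeholder: in practice C, U, E, and S are each recomputed
    on the resampled observations; here they are simple summaries."""
    m = statistics.mean(sample)
    c, u, e, s = m, 0.9 * m, min(1.0, m + 0.1), 0.95 * m
    return (c * u * e * s) ** 0.25

random.seed(0)
data = [random.uniform(0.5, 0.9) for _ in range(200)]  # synthetic per-case scores

B = 500
boot = sorted(cues_from_sample(random.choices(data, k=len(data)))
              for _ in range(B))
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B) - 1]  # 95% percentile CI
print(round(lo, 3), round(hi, 3))
```

BCa intervals additionally correct the percentile endpoints for bias and skewness of the bootstrap distribution.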

2.6.8. Statistical Testing and Confidence Statements

Bootstrap confidence intervals for CUES: resample the dataset $B$ times; for each bootstrap replicate $b$, compute $\mathrm{CUES}^{(b)}$. The percentile CI is $[\mathrm{CUES}^{(\alpha/2)}, \mathrm{CUES}^{(1-\alpha/2)}]$ [20,21,36].
Hypothesis test for the difference between two models (paired bootstrap):
  • Null hypothesis $H_0\!: \mathrm{CUES}_A = \mathrm{CUES}_B$ [27,36,37].
  • For each bootstrap sample, compute $\Delta^{(b)} = \mathrm{CUES}_A^{(b)} - \mathrm{CUES}_B^{(b)}$. Estimate the p-value as the proportion of $\Delta^{(b)}$ at least as extreme as the observed $\Delta$. Use a two-sided test or compute BCa intervals on $\Delta$ [20,21,36].
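The paired bootstrap test can be sketched as follows. The scoring function is a toy stand-in for recomputing CUES, the per-case scores are synthetic, and the bootstrap deltas are centered at the observed difference to approximate the null distribution:

```python
import random

def cues_score(values):
    """Toy stand-in for recomputing CUES on a resample; in practice the
    four components C, U, E, S are recomputed from the resampled data."""
    m = sum(values) / len(values)
    return (m ** 4) ** 0.25  # degenerate case where all components equal m

random.seed(1)
n = 150
model_a = [random.uniform(0.60, 0.95) for _ in range(n)]  # synthetic per-case scores
model_b = [random.uniform(0.55, 0.90) for _ in range(n)]

obs_delta = cues_score(model_a) - cues_score(model_b)

B = 1000
extreme = 0
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]          # paired resample
    d = (cues_score([model_a[i] for i in idx])
         - cues_score([model_b[i] for i in idx]))
    # Center at the observed delta to approximate the null distribution.
    if abs(d - obs_delta) >= abs(obs_delta):
        extreme += 1
p_value = extreme / B
print(round(obs_delta, 3), round(p_value, 3))
```

Resampling the same indices for both models preserves the pairing, so the test accounts for the correlation between the two models' scores on shared cases.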
Error propagation (first order), using the delta method:
$$\mathrm{Var}\bigl(\widehat{\mathrm{CUES}}\bigr) \approx \nabla g(\theta)^{\top}\, \mathrm{Cov}(\hat{\theta})\, \nabla g(\theta)$$
This expresses how uncertainty in the components propagates to the variance of CUES. CUES is intended to be interpreted jointly with its component scores, which must always be reported to preserve transparency and diagnostic insight.

2.7. CUES as a Feature Selection Method

The CUES framework, initially designed to evaluate model performance across calibration (C), utility (U), equity (E), and stability (S), can also be used as a feature selection method. Unlike conventional feature importance approaches that primarily focus on predictive power (e.g., coefficients in linear models or tree-based importances), CUES decomposes performance into interpretable components, providing a multi-faceted measure of feature contribution.
To perform feature selection, we computed feature-wise contributions to each CUES component and to the overall CUES score. This was achieved by repeatedly perturbing individual features in the test set (e.g., by shuffling each feature several times) and measuring the resulting change in CUES for each permutation. This procedure generates a distribution of CUES contributions per feature, capturing the variability and robustness of each feature's impact on model performance. Features causing a larger drop in CUES or its components across permutations are considered more critical for maintaining performance, not only in terms of accuracy but also fairness (equity) and robustness (stability). Formally, let $\Delta C_i, \Delta U_i, \Delta E_i, \Delta S_i$ denote the change in the respective CUES components when feature $i$ is perturbed; the perturbed score is then
$$\mathrm{CUES}_i = (C - \Delta C_i)^{1/4}\,(U - \Delta U_i)^{1/4}\,(E - \Delta E_i)^{1/4}\,(S - \Delta S_i)^{1/4}$$
and $\Delta \mathrm{CUES}_i = \mathrm{CUES} - \mathrm{CUES}_i$ denotes the corresponding change in the overall score.
Features are ranked based on their mean or median impact on overall CUES across permutations, and this ranking can be used to select the top-k features for model training. Compared with traditional methods such as SHAP, CUES-based feature selection incorporates multi-objective criteria, ensuring that selected features simultaneously contribute to predictive performance, robustness, and fairness. For each classifier, we computed component-wise and overall CUES contributions per feature across multiple perturbations, allowing direct comparison with conventional explainable AI metrics (e.g., SHAP) and visualizing distributions using violin plots. This highlights features that are critical not only for predictive accuracy but also for equitable and stable model behavior.
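The perturbation procedure can be sketched in Python. The evaluator below is a toy accuracy-based placeholder (the full method would recompute all four CUES components after each perturbation), and the data, model, and two-feature layout are synthetic assumptions for illustration:

```python
import random

def cues_eval(X, y, predict):
    """Toy CUES stand-in (accuracy only); in practice this would
    recompute all four components C, U, E, S on (X, y)."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_cues_importance(X, y, predict, n_repeats=20, seed=0):
    """Mean drop in the (toy) CUES score when each feature is shuffled."""
    rng = random.Random(seed)
    baseline = cues_eval(X, y, predict)
    importances = {}
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)                                  # perturb feature j
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - cues_eval(Xp, y, predict))
        importances[j] = sum(drops) / len(drops)
    return importances

# Synthetic data: feature 0 drives the label, feature 1 is pure noise.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = [int(row[0] > 0.5) for row in X]
predict = lambda x: int(x[0] > 0.5)

imp = permutation_cues_importance(X, y, predict)
print({j: round(v, 3) for j, v in imp.items()})
```

In this toy setup, shuffling the informative feature produces a large mean drop while shuffling the noise feature produces none, which is the ranking signal used for top-k selection.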

2.8. Computational Considerations and Scalability

The computational complexity of the CUES framework scales linearly with the number of samples and evaluated subgroups, with additional overhead introduced by bootstrap-based stability estimation and perturbation analyses. In practice, we find that a moderate number of bootstrap iterations (e.g., 100–200) is sufficient to obtain stable estimates of all CUES components, consistent with prior work [13,14] on resampling-based uncertainty estimation.
For large-scale datasets, computational burden can be reduced through stratified subsampling, parallelized bootstrap procedures, or early stopping criteria once component estimates converge. Approximate calibration methods, such as fixed-bin reliability diagrams or isotonic regression with constrained knots, further reduce runtime without materially affecting CUES rankings. Importantly, all CUES components are inherently parallel and can be efficiently implemented on modern multicore or GPU-enabled systems, making the framework scalable to real-world clinical datasets. Table 1 summarizes the computational cost associated with each component of the CUES framework.
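An early-stopping bootstrap of the kind described above can be sketched as follows; the statistic, tolerance, and window size are illustrative choices, not values used in our experiments:

```python
import random
import statistics

def bootstrap_until_stable(stat, data, max_b=1000, window=50, tol=1e-3, seed=0):
    """Early-stopping bootstrap: stop once the running mean of the
    bootstrap statistic changes by less than `tol` between checkpoints."""
    rng = random.Random(seed)
    draws = []
    prev_mean = None
    for b in range(1, max_b + 1):
        resample = rng.choices(data, k=len(data))
        draws.append(stat(resample))
        if b % window == 0:                     # convergence checkpoint
            cur_mean = statistics.mean(draws)
            if prev_mean is not None and abs(cur_mean - prev_mean) < tol:
                break
            prev_mean = cur_mean
    return statistics.mean(draws), len(draws)

rng = random.Random(1)
data = [rng.uniform(0.5, 0.9) for _ in range(100)]   # synthetic scores
est, used = bootstrap_until_stable(statistics.mean, data)
print(round(est, 3), used)
```

Because each bootstrap draw is independent, the loop body also parallelizes trivially across workers, which is the basis of the parallel implementation noted above.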

2.9. Datasets and Experiment Setup

To evaluate the CUES framework, multiple publicly available datasets were used to cover both binary and multi-class classification scenarios. All datasets were preprocessed using standardized procedures, including missing-value imputation, one-hot encoding for categorical variables, and feature scaling with StandardScaler. The datasets include the following:
  • Breast cancer dataset (binary): Obtained from sklearn datasets, containing 569 samples with 30 numeric features distinguishing malignant and benign cases.
  • Pima diabetes dataset (binary): Retrieved from OpenML, consisting of 768 samples with eight clinical features. The target variable indicates whether a patient tested positive for diabetes.
  • Heart disease dataset (binary and multi-class): Sourced from OpenML (Cleveland dataset). The binary version distinguishes between low (<50% stenosis) and high (>50% stenosis) disease severity, whereas the multi-class version extends to five ordered categories. Mixed numeric and categorical features were handled using median imputation and one-hot encoding.
  • Dermatology dataset (multi-class): Obtained from OpenML, comprising 366 patient samples and 34 clinical attributes for the classification of six dermatological conditions. Missing numeric values were filled with column medians.
All datasets were divided into training and testing subsets using repeated stratified k-fold cross-validation (five folds, two repeats, random state = 42) to ensure stable performance estimates. All experiments were conducted in Python 3.10 using the scikit-learn library. The provided implementation automates data loading, preprocessing, model training, metric computation, bootstrapping, and visualization of CUES components. Four standard classifiers were evaluated across both binary and multi-class settings:
  • Logistic Regression (LR): a linear model serving as a strong, interpretable baseline.
  • Support Vector Machine (SVMrbf): a non-linear classifier using a radial basis function kernel.
  • K-Nearest Neighbors (KNN): a distance-based algorithm capturing local data structure (k = 7).
  • Bagging Ensemble Classifier (Bagging): an ensemble of base estimators that reduces variance and improves stability.
Each model was trained and tested under this repeated stratified cross-validation scheme, with the preprocessing described above (missing-value imputation, one-hot encoding of categorical features, and standard scaling of numerical features). The evaluation covered two main tasks:
  • Within-dataset validation, where classifiers were trained and tested on the same dataset using cross-validation.
  • Cross-dataset generalization, where models trained on one dataset were evaluated on another to assess transferability and robustness to feature-space alignment.
Performance was measured using both traditional metrics (AUROC, AUPRC (Area Under the Precision–Recall Curve), accuracy, sensitivity, specificity) and the proposed CUES components (C, U, E, S). Bootstrapping (n = 200) was applied to compute the stability component (S) and derive confidence-adjusted CUES estimates. All numerical results were automatically exported as CSV tables, and visualizations (CUES curves, calibration plots, and component bar charts) were saved for each model–dataset pair.
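The cross-validation protocol can be reproduced with a minimal scikit-learn pipeline. The snippet below is an illustrative sketch using the built-in breast cancer dataset and AUROC scoring, not the full evaluation code (which additionally computes the CUES components and bootstrap estimates):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Repeated stratified 5-fold CV with two repeats, as described above.
X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(len(scores), round(scores.mean(), 3))  # 10 folds in total
```

The same `cv` object can be reused across classifiers so that every model sees identical fold assignments.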

3. Validation and Results

Overall, the experimental results validate the effectiveness of the proposed CUES (calibration, utility, equity, and stability) framework in providing a more holistic evaluation of AI models for clinical applications. Across all binary and multi-class classification tasks, CUES consistently captured dimensions of model performance such as calibration reliability, clinical utility, fairness, and stability that conventional metrics like AUROC and accuracy often failed to represent.

3.1. Binary Classification Performance

The results for binary classification are summarized in Table 2 and Table 3, demonstrating the superior performance of Logistic Regression across all datasets, particularly for the breast cancer dataset, where it achieved a mean CUES score of 0.903, indicating excellent calibration and clinical utility. In contrast, while the diabetes dataset achieved an AUROC of ≈0.83, its corresponding CUES score dropped to 0.50, reflecting poor calibration and limited real-world benefit despite high discriminative ability. The heart disease dataset showed intermediate CUES values (~0.61), suggesting reasonable discrimination but reduced stability and equity between subgroups. Figure 1, Figure 2 and Figure 3 show the normalized net benefit curves, calibration curves, and CUES components for the evaluated classifiers in these scenarios.
Table 3 compares multiple classifiers across three datasets using both traditional metrics and the proposed CUES score. Logistic Regression consistently achieved the highest CUES values, particularly for the Breast Cancer dataset (0.903), indicating strong calibration, clinical utility, fairness, and stability. In contrast, the diabetes dataset shows high AUROC (~0.83) but a substantially lower CUES score (0.473), highlighting poor calibration and limited real-world benefit despite good discrimination. Heart disease models achieved intermediate CUES scores (~0.61), reflecting moderate calibration and some subgroup disparities. Across classifiers, ensemble (Bagging) and distance-based (KNN) methods exhibited slightly lower CUES values than Logistic Regression and SVM, likely due to sensitivity to sampling variation, as reflected in the stability component. Notably, the “AUC CUES” column summarizes performance across thresholds, providing additional insight beyond the single CUES score. These results underscore that CUES adds interpretive power by revealing clinical trustworthiness and fairness that traditional metrics alone may obscure.
These findings underscore the added interpretive power of CUES, whereas traditional performance measures (e.g., AUROC and accuracy) would classify both the breast cancer and diabetes models as “strong performers,” CUES exposes critical weaknesses in clinical trustworthiness and fairness. Furthermore, across all classifiers, ensemble and distance-based methods (Bagging and KNN) exhibited slightly lower CUES values than Logistic Regression and SVM, likely due to their sensitivity to sampling variation, as reflected in lower stability (S) components.

3.2. Multi-Class Classification Performance

The multi-class results in Table 4 and Table 5 reveal that the dermatology dataset achieved the highest overall CUES scores (0.85–0.88) across all classifiers, demonstrating excellent calibration and robustness across subgroups. The heart disease (multi-class) dataset produced moderate CUES values (~0.40–0.42), mirroring its binary version and suggesting that increased class granularity introduces greater instability and equity variation [2,3,4]. The SVMrbf and Logistic models consistently achieved the strongest balance between accuracy and calibration, with minimal performance degradation between binary and multi-class settings [16,17,20,38]. Figure 4, Figure 5 and Figure 6 show the normalized net benefit curves, calibration curves, and CUES components for the evaluated classifiers in these scenarios.

3.3. Correlation Analysis

Finally, correlation results summarized in Table 6 indicate a strong monotonic relationship between the CUES score and conventional discriminative metrics. The Spearman correlation coefficients between CUES and AUROC/AUPRC were consistently high (ρ = 0.94–0.97) [2,8,9,10], confirming that while CUES aligns with standard measures in general trends, it additionally captures aspects such as fairness, calibration, and robustness that AUROC and AUPRC overlook.

3.4. Detailed Analysis of Binary and Multi-Class Findings

The detailed examination of both binary and multi-class results further underscores how the CUES framework provides a more interpretable and clinically aligned assessment than conventional metrics.

3.4.1. Binary Classification Analysis

In the breast cancer dataset, all classifiers achieved high discriminative performance (AUROC > 0.99), yet the CUES score revealed subtle but meaningful differences in model quality. Logistic Regression, with a CUES score of 0.937, outperformed more complex models such as SVM and Bagging, emphasizing that simplicity and calibration can coexist with strong predictive accuracy. The nearly perfect equity (E = 1.0) and stability (S > 0.93) components confirm that Logistic Regression produced reliable, equitable predictions across subgroups and repeated sampling.
In contrast, the diabetes dataset demonstrated a pronounced gap between traditional and CUES-based metrics. Despite an AUROC of 0.83, all models showed CUES scores below 0.50, indicating that their predicted probabilities were poorly calibrated and offered limited clinical utility. The low utility (U ≈ 0.25) and calibration (C ≈ 0.30) components highlight that even statistically “acceptable” classifiers can fail to deliver trustworthy risk estimates. This discrepancy highlights one of CUES’s key strengths, as it penalizes models that perform well on paper but offer limited real-world decision support.
For the heart disease (binary) dataset, models achieved moderate CUES values (~0.61), consistent with AUROC scores around 0.90. The reduced equity (E ≈ 0.80) and stability (S ≈ 0.85) components suggest subgroup variability and sensitivity to sampling issues that traditional metrics do not expose. These findings demonstrate CUES’s role in identifying models that might perform inconsistently across patient populations.

3.4.2. Multi-Class Classification Analysis

In multi-class experiments, the dermatology dataset consistently achieved the highest overall CUES scores (~0.93–0.94), affirming the robustness of the framework across multiple output classes. Models such as LogisticOVR and SVMrbf maintained strong calibration and equity, with minimal utility degradation as the number of output classes increased. This suggests that CUES generalizes effectively to multi-class medical prediction tasks, capturing nuances of model behavior across different disease categories.
The heart disease (multi-class) dataset yielded moderate CUES values (~0.61), consistent with its binary version. While AUROC and AUPRC remained relatively high, CUES identified reduced calibration and higher performance variability, particularly among minority outcome classes. This finding highlights the importance of incorporating both equity and stability into model assessment, as these are crucial considerations for clinical deployment, particularly in rare but severe disease stages where misclassification can have significant consequences.

3.4.3. Cross-Dataset and Correlation Analysis

The cross-dataset experiments showed that CUES effectively penalizes models that fail to generalize across domains, particularly when feature distributions differ between datasets. For instance, a model trained on the breast cancer dataset performed poorly on heart disease data, resulting in a significant decrease in CUES despite maintaining moderate AUROC values. This sensitivity to generalization validates the stability component (S) as a key measure of real-world robustness.
The Spearman correlation analysis confirmed strong positive correlations between CUES and traditional metrics (ρ ≈ 0.94–0.97), demonstrating that while CUES aligns with overall model trends, it captures additional dimensions of performance that AUROC and AUPRC overlook. This indicates that CUES can serve as both a complementary and corrective framework, reinforcing trustworthy models while exposing hidden deficiencies in clinical utility and fairness.

3.5. Interpretation of CUES Components

The experimental results, as visualized in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11, clearly demonstrate that traditional metrics alone are insufficient for robust clinical evaluation of AI systems. Each figure presents the stacked CUES components calibration (C), utility (U), equity (E), and stability (S) for different classifiers and datasets, showcasing why CUES offers a fuller picture than AUROC or accuracy alone.
For example, Figure 7 shows that while all classifiers perform reasonably well on traditional metrics, the CUES components differentiate them by clinical reliability. Although both Logistic models achieve high scores, subtle variations in their stability (S) and equity (E) components remain apparent, distinguishing the most trustworthy and consistent predictors.
In contrast, Figure 8 illustrates that, even with moderate scores in utility and equity, the overall CUES values are much lower for all classifiers in this dataset. This reveals that traditional metrics like AUROC might mislead one to overestimate clinical suitability if not considered alongside calibration or equity.
Figure 9 shows that simple, interpretable models such as Logistic Regression and its OVR variant consistently yield the highest composite CUES scores. This underscores how prioritizing parsimony and transparency, as reflected in higher calibration (C) and stability (S) components, can yield models that are both effective and clinically trustworthy.
The analyses in Figure 10 and Figure 11 further underscore CUES’ utility by highlighting cases in which even high-performing models on traditional metrics exhibit deficiencies in equity (E) or stability (S), thereby lowering their composite CUES scores. This is especially crucial for datasets with known subgroup variabilities or biases.
Across all figures, the CUES framework quantifies clinical reliability in a way that traditional metrics cannot. The stacked bar visualizations make it explicit which models are balanced across all required dimensions; for instance, stability and equity gaps are visually apparent, guiding model selection and refinement. By referencing Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11, it becomes evident that CUES bridges the gap between technical performance and clinical applicability, offering a standard yet nuanced approach for evaluating, comparing, and ultimately selecting models ready for real-world healthcare deployment. The framework’s visual and quantitative clarity ensures that decision-makers can spot not just the best-performing model but the most robust and fair one for clinical practice.

3.6. CUES Feature Selection

To assess the utility of CUES as a feature selection tool, we computed feature-wise contributions to the overall CUES for each classifier and dataset using multiple perturbations of each feature. This procedure generated a distribution of CUES contributions per feature, capturing the variability of each feature’s impact on model performance. Features were ranked by their mean (or median) contribution across permutations, with higher contributions indicating greater importance for maintaining not only predictive performance but also fairness and robustness.

3.6.1. Binary Classification Results

In binary datasets, features with the most significant contributions to overall CUES corresponded to known domain-relevant predictors. For example, in the breast cancer dataset, the top-ranked features according to CUES largely overlapped with those identified by conventional importance measures, while also highlighting features that affect equity and stability. The feature contribution distributions were visualized using violin and bar plots to capture both variability and overall importance. The violin plots illustrate sample-level variability across permutations, while the bar charts summarize the mean absolute CUES values for each feature, highlighting their overall contribution to model performance. As shown in Figure 12, the left panel presents the analysis for the heart disease dataset using the KNN classifier, and the right panel presents the results for the diabetes dataset using the logistic classifier. Features that consistently exhibited high CUES values across samples and permutations were identified as strong candidates for top-k feature selection in subsequent model training.

3.6.2. Multi-Class Classification Results

In multi-class datasets, CUES highlighted features that consistently influenced model performance across multiple classes. By ranking features based on their overall CUES distributions, it was possible to identify a subset of features that preserved predictive utility while improving stability and fairness, capturing patterns often overlooked by traditional feature importance methods such as SHAP values or coefficient magnitudes. As an illustration, Figure 13 presents the overall CUES feature importance analysis for the multi-class datasets. The violin plots depict sample-level variability across permutations, while the bar charts summarize the mean absolute CUES values, reflecting each feature’s overall contribution to model performance. Specifically, the left panel shows the results for the dermatology dataset using the logistic classifier, and the right panel shows the results for the heart disease dataset using the bagging classifier. Features demonstrating consistently high CUES values across samples and permutations were identified as strong candidates for top-k feature selection in subsequent model training.

3.6.3. Results of Retraining Using Top Selected Features

To assess the impact of CUES-guided feature selection, the models were retrained on subsets of the top-ranked features identified in the original analysis. Specifically, we evaluated retraining performance using the top 50% (Figure 14) and top 75% (Figure 15) of features based on their overall CUES contributions in the breast cancer dataset, using the Bagging classifier as an example.
Overall, retraining using CUES-guided feature subsets confirms that the method can identify a compact, informative set of features that preserves model accuracy while enhancing interpretability, stability, and fairness. These results further support the utility of CUES as a practical tool for feature selection in both binary and multi-class classification tasks. Table 7 shows the performance of all models on binary and multi-class datasets, using 50% and 75% of selected features via CUES.

3.7. Optimization Results

The optimization process compared classifier hyperparameter tuning based on accuracy versus the CUES (calibration, utility, equity, stability) metric across three classifiers: SVM, Bagging, and KNN. For each classifier, two separate grid searches were conducted:
  • Accuracy-based grid search: Hyperparameters were optimized to maximize cross-validated classification accuracy.
  • CUES-based grid search: Hyperparameters were optimized to maximize the overall CUES score, which incorporates calibration (C), utility (U), equity (E), and stability (S).
The best scores and selected hyperparameters for each method were recorded. Table 8 shows the hyperparameter search grid values.
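A CUES-based grid search can be implemented by passing a custom scorer to scikit-learn's `GridSearchCV`. The scorer below is a simplified two-factor stand-in (calibration via the Brier score and utility via AUROC), since the full E and S components require subgroup labels and resampling; the grid values are illustrative, not those of Table 8:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def toy_cues_scorer(estimator, X, y):
    """Simplified two-factor stand-in for the CUES objective:
    calibration (1 - Brier score) and utility (AUROC), combined
    as a geometric mean. The full metric also includes E and S."""
    p = estimator.predict_proba(X)[:, 1]
    c = 1.0 - brier_score_loss(y, p)
    u = roc_auc_score(y, p)
    return (c * u) ** 0.5

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
grid = GridSearchCV(model, param_grid={"svc__C": [0.1, 1.0, 10.0]},
                    scoring=toy_cues_scorer, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Passing a callable with the `(estimator, X, y)` signature lets the same search machinery optimize any composite score, so only the scorer changes between the accuracy-based and CUES-based searches.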
The evaluation of classifier performance demonstrates the trade-offs between classical metrics and the explainable CUES framework. For instance, the SVM achieved near-perfect scores across all metrics and CUES components (accuracy = 1, sensitivity = 1, specificity = 1, CUES = 0.9999), indicating both strong predictive performance and stability (Figure 16). In comparison, Bagging maintained perfect accuracy and sensitivity but showed slightly lower CUES components (CUES = 0.9606), reflecting some variability in explainable utility measures. The KNN classifier exhibited lower accuracy (0.9533) and sensitivity (0.9329), with CUES also reduced to 0.9245, highlighting the trade-off between conventional performance metrics and the overall explainable utility (Figure 17 and Figure 18). The bar plots and heatmap allow visual inspection of these differences, emphasizing how optimization for CUES versus accuracy affects both metric scores and the interpretability of predictions.
CUES-based optimization provides a more holistic evaluation than accuracy alone by accounting for utility, calibration, and model stability. Across the four classifiers, CUES-optimized hyperparameters often differ from accuracy-optimized ones, emphasizing the benefit of explainable AI metrics in guiding model selection.

3.8. Results of Demographic and Anthropometric Fairness

To test how anthropometric or demographic data affect the CUES score, especially the E score, a comprehensive anthropometric equity analysis was conducted on the Body Measurements Obesity Risk dataset, comprising 500 samples and 12 features, with a binary target distribution of 220 versus 280. Four classifiers, Bagging, KNN, Logistic Regression, and SVM with RBF kernel, were evaluated using the CUES metric alongside feature-wise equity assessment, with anthropometric features including height, weight, waist circumference, hip circumference, BMI, WHR, gender, and age.
The Bagging classifier achieved the highest overall CUES score of 0.947, indicating superior balanced performance across utility, calibration, equity, and stability, whereas KNN performed worst with a CUES of 0.717, reflecting a relative improvement of 32.1% between the best and worst classifiers. Logistic Regression and SVMrbf scored 0.723 and 0.806, respectively, highlighting that ensemble methods can better maintain consistent performance across diverse anthropometric subgroups. Decomposition of CUES into its components revealed that Bagging outperformed other classifiers in all dimensions, with the equity component reaching 0.951, emphasizing the critical role of fairness in overall model performance. Figure 19 shows a comprehensive summary of anthropometric equity analysis for different classifiers.
Feature-wise equity analysis demonstrated notable disparities, particularly for Waist-to-Hip Ratio (WHR), which had the lowest equity (0.674 ± 0.203, range 0.219–1.000), suggesting substantial differences in model performance between subgroups defined by this feature. In contrast, height, BMI, weight, gender, and age exhibited high equity scores above 0.9, indicating consistent predictive performance across these subgroups. In comparison, waist circumference showed moderate equity (0.806 ± 0.127), suggesting potential benefit from subgroup-specific calibration. The equity heatmap further illustrates that Bagging consistently achieved higher equity scores across all features, whereas KNN displayed lower fairness, particularly in WHR and waist circumference.
Examination of the correlation between feature-wise equity and overall CUES revealed a Pearson correlation of r = 0.933 (p = 6.686 × 10⁻²), suggesting a positive but non-significant relationship between overall model performance and fairness. Overall, these results underscore that ensemble models like Bagging not only achieve superior predictive performance but also maintain higher fairness across anthropometric subgroups. Features with lower equity, such as WHR, should be carefully monitored, and feature-specific calibration or stratified validation may be warranted. The CUES metric provides a practical, unified framework for evaluating both performance and fairness, facilitating the reliable deployment of models across heterogeneous biomedical populations.

4. Discussion

The experimental results collectively emphasize that traditional evaluation metrics alone are insufficient for assessing the clinical reliability of AI systems [2,4,8,9]. While measures such as AUROC, AUPRC, and accuracy provide valuable information about overall discriminative power, they fail to capture whether a model’s predictions are well-calibrated, fair across patient subgroups, or robust to sampling variability [4,15,16,17,19]. These overlooked aspects are critical for clinical decision support, where model outputs directly influence diagnostic or treatment choices.
The CUES framework addresses these limitations by integrating four essential components (calibration, utility, equity, and stability) into a unified composite score. By doing so, it quantifies dimensions that align closely with clinical reliability and ethical deployment. For example, a model with high AUROC but poor calibration and subgroup disparities receives a low CUES score, signaling its unsuitability for clinical use [4,5,18,38]. Conversely, models with balanced calibration and stable performance are rewarded, even if their AUROC is marginally lower [2,3,6]. CUES is intended to be interpreted jointly with its component scores, which must always be reported to preserve transparency and diagnostic insight.
The multiplicative structure of CUES enforces a conservative evaluation principle, whereby deficiencies in any critical dimension cannot be masked by strengths in others. To preserve interpretability, we emphasize the importance of reporting all four component scores alongside the composite value. This decomposition enables transparent clinical interpretation and supports targeted model refinement.
This multidimensional evaluation provides a clinically meaningful hierarchy of models, guiding practitioners and developers toward systems that are both statistically strong and ethically sound. Results across multiple datasets further confirm that Logistic Regression, a simpler and more interpretable model, often achieved the highest CUES scores, demonstrating that parsimony and transparency can enhance real-world trustworthiness [5,6,19]. Such findings are particularly relevant for regulatory frameworks, where explainability and fairness are increasingly prioritized alongside accuracy. Although the equity analysis in current experiments uses simple median splits for subgroup evaluation, future studies should apply CUES to more granular demographic and clinical subgroups. Extending the metric to complex, multi-ethnic, or multi-center datasets would allow CUES to uncover and mitigate hidden biases, thereby validating its broader applicability for safe and equitable deployment in healthcare.
Moreover, by explicitly quantifying stability and equity, CUES promotes more responsible model selection across heterogeneous healthcare populations. This is vital given the known biases in medical datasets and the variability of patient subgroups. The framework’s ability to highlight disparities encourages targeted model refinement and ensures broader generalizability and equitable clinical outcomes. While CUES incorporates decision-curve-based net benefit, future work should explore in greater depth how CUES-guided model selection translates into tangible improvements in clinical decision-making and patient outcomes. For instance, integrating simulated clinical pathways, decision analyses, or prospective patient-impact projections would provide stronger evidence that the metric yields actionable benefits in practice. Such analyses would further reinforce the real-world credibility and clinical relevance of CUES.
An additional advantage of CUES is its ability to support feature selection in a performance- and fairness-aware manner. By quantifying feature-wise contributions to overall CUES across multiple perturbations, the framework identifies not only features critical for predictive accuracy but also those that maintain calibration, stability, and equity. This multidimensional importance measure complements, rather than replaces, traditional explainable AI tools such as SHAP, which primarily capture predictive contributions. In our experiments, CUES-based rankings often highlighted features overlooked by conventional importance metrics but essential for robust, equitable model behavior. Visualizing per-feature CUES contributions through violin and bar plots enables direct assessment of variability across samples and perturbations, providing interpretable guidance for top-k feature selection. This approach ensures that model simplification or feature reduction does not compromise clinical reliability, making CUES particularly valuable for biomedical datasets where fairness, robustness, and interpretability are critical.
In several empirical examples, calibration emerges as the dominant limiting factor of the composite CUES score. This behavior is intentional rather than incidental. The multiplicative structure of CUES enforces a conservative evaluation principle: a model producing unreliable probability estimates should be penalized regardless of its discriminative ability or subgroup equity. Such dominance therefore highlights clinically meaningful failure modes rather than obscuring them. To preserve interpretability, all component scores are reported alongside the composite CUES value.
Articulating how CUES aligns with regulatory and ethical frameworks (e.g., FDA, EU MDR) for AI in healthcare would further strengthen its translational impact. Detailing such alignments would help position CUES not only as a scientific standard but also as a practical tool for ethical and regulatory compliance, facilitating adoption by clinicians, policymakers, and developers [39,40]. Overall, CUES bridges the gap between technical performance and clinical interpretability, offering a standardized yet flexible approach to AI evaluation. Its adoption could foster more transparent, fair, and dependable machine learning systems in healthcare, helping ensure that predictive models truly serve clinical decision-making rather than simply optimizing statistical scores.
Beyond analytical and experimental contributions, the CUES framework also aligns strongly with emerging regulatory and ethical guidelines for AI in healthcare. Regulatory bodies such as the U.S. FDA and the EU AI Act emphasize transparency, fairness, and robustness as prerequisites for clinical deployment of AI-driven decision systems [34,41]. Specifically, the FDA’s framework calls for validated, explainable, and stable model behavior across populations, while the EU AI Act mandates documentation of fairness, bias mitigation, and performance monitoring [33,42]. CUES directly operationalizes these principles by providing a quantifiable measure of equity, calibration, and stability, enabling developers and regulators to assess whether models meet these ethical and regulatory standards. Integrating CUES into evaluation pipelines could thus facilitate compliance with international guidelines, promote responsible AI adoption, and accelerate the translation of predictive models into clinical practice [34,43,44].
Optimizing model hyperparameters with CUES as the target objective proved highly effective for guiding models toward clinically reliable performance. Across multiple classifiers, tuning for CUES consistently produced models that balanced traditional metrics such as accuracy, sensitivity, and specificity with calibration, stability, and fairness. Even classifiers that did not achieve the highest AUROC values exhibited improved overall clinical reliability when optimized for CUES, highlighting enhanced stability and subgroup fairness. These results demonstrate that CUES-guided optimization not only identifies statistically strong models but also ensures they are robust, interpretable, and ethically sound, validating the framework’s practical utility for selecting models suited for real-world clinical deployment.
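As a sketch, selecting hyperparameters by CUES rather than AUROC amounts to scoring each candidate configuration with the composite and keeping the maximizer. The evaluator below returns made-up component values (an assumption, for illustration only); in practice `evaluate` would train the model with `params` and compute C, U, E, and S on validation data.

```python
def grid_search_cues(param_grid, evaluate):
    """Select hyperparameters by maximizing the composite CUES score.

    evaluate(params) -> dict with keys "C", "U", "E", "S" computed on
    validation data; the training/evaluation pipeline is assumed given.
    """
    best_params, best_score = None, -1.0
    for params in param_grid:
        comp = evaluate(params)
        score = (comp["C"] * comp["U"] * comp["E"] * comp["S"]) ** 0.25
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative evaluator with fabricated numbers: here, stronger
# regularization improves calibration and stability but slightly
# lowers utility, mimicking a common trade-off.
def toy_evaluate(params):
    lam = params["reg"]
    return {"C": 0.6 + 0.3 * lam, "U": 0.9 - 0.1 * lam,
            "E": 0.85, "S": 0.7 + 0.2 * lam}

grid = [{"reg": v} for v in (0.0, 0.25, 0.5, 0.75, 1.0)]
best_params, best_score = grid_search_cues(grid, toy_evaluate)
```

Under these toy numbers the CUES-optimal configuration is the most regularized one, even though its utility component alone is not the highest, mirroring the behavior described above.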
A practical limitation of equity and stability estimation arises when subgroup sizes are small or highly imbalanced. In such cases, component estimates may exhibit increased variance. To mitigate this issue, subgroup aggregation based on clinical similarity, minimum sample-size thresholds, or uncertainty-aware reporting of equity and stability scores may be employed. Alternatively, hierarchical or Bayesian shrinkage approaches can be incorporated to stabilize subgroup-level estimates. Importantly, CUES is flexible to these adaptations and does not require uniform subgroup sizes to remain informative.
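The shrinkage idea can be sketched with a simple empirical-Bayes-style pooling rule, where each subgroup's estimate is pulled toward the pooled mean with strength inversely proportional to its sample size. The smoothing constant `m` and the example scores are illustrative assumptions, not values from the paper.

```python
def shrink_subgroup_scores(scores, sizes, m=20):
    """Shrink subgroup-level component estimates toward the pooled mean.

    A simple pooling sketch: the weight on a subgroup's own estimate is
    n / (n + m), so small subgroups borrow strength from the population.
    m is an analyst-chosen smoothing constant (assumption: m = 20
    pseudo-observations).
    """
    pooled = sum(sc * n for sc, n in zip(scores, sizes)) / sum(sizes)
    return [(n / (n + m)) * sc + (m / (n + m)) * pooled
            for sc, n in zip(scores, sizes)]

# A tiny subgroup with an extreme estimate is pulled toward the pooled
# mean, while a large subgroup barely moves (illustrative values).
raw = [0.95, 0.70, 0.20]   # per-subgroup equity-style scores
sizes = [500, 300, 8]      # the last subgroup is very small
shrunk = shrink_subgroup_scores(raw, sizes)
```

This stabilizes the equity and stability components without requiring uniform subgroup sizes, in line with the adaptations discussed above.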
We acknowledge that a wide range of performance metrics has been proposed in the literature, and that no composite evaluation framework is unique. CUES is not intended to exhaustively incorporate all possible metrics, but to provide a minimal, interpretable, and extensible foundation capturing dimensions that are repeatedly emphasized as essential for clinical reliability. Additional components can be incorporated into the framework if clinically justified, without altering its mathematical structure.
Importantly, the CUES framework is also particularly applicable to datasets containing anthropometric features, which are frequently used in biomedical research and clinical risk stratification. By quantifying calibration, equity, utility, and stability, CUES enables a systematic evaluation of how variations in anthropometric variables, such as BMI, neck circumference, or body fat distribution, affect model performance across subgroups. This capability helps researchers detect and mitigate biases associated with body composition or demographic characteristics, ensuring that predictive models remain robust and equitable across heterogeneous populations. Additionally, CUES-based feature analyses can identify which anthropometric measures meaningfully contribute to predictive accuracy while preserving stability and equity, guiding both model selection and feature prioritization for studies investigating the clinical and physiological implications of body metrics.

5. Conclusions

This study introduced the CUES (calibration, utility, equity, and stability) framework as a comprehensive evaluation metric for assessing the performance and clinical readiness of artificial intelligence models in medicine. Unlike traditional metrics such as accuracy, AUROC, or F1-score, CUES captures the multidimensional aspects of model reliability by integrating measures of calibration, clinical utility, fairness, and robustness.
Through extensive experiments on both binary and multi-class clinical datasets, CUES demonstrated its ability to uncover shortcomings that conventional metrics overlook, such as poor calibration, inequitable subgroup performance, and instability under data variability. The results consistently showed that models with high CUES scores were not only accurate but also trustworthy and clinically interpretable, reinforcing the importance of multidimensional evaluation in healthcare AI. By offering a single interpretable score that reflects real-world suitability, the CUES framework promotes transparent model comparison, informed clinical adoption, and ethical deployment of AI systems.
Importantly, CUES also enables a fairness- and robustness-aware feature selection mechanism. By computing feature-wise contributions to overall CUES across samples and perturbations, the framework identifies features that are critical not just for predictive accuracy, but also for maintaining calibration, equity, and stability. This approach complements traditional explainable AI methods, such as SHAP, by offering a multi-objective perspective on feature importance. Visualizations such as violin and bar plots of per-feature CUES contributions illustrate both variability and mean impact, supporting the selection of top-k features for model training without compromising clinical reliability. This ensures that simplified models or reduced feature sets preserve interpretability and ethical performance, an essential consideration for biomedical applications where patient safety and fairness are paramount.
CUES is also particularly applicable to datasets containing anthropometric features, such as BMI, neck circumference, and body fat distribution. By quantifying calibration, equity, utility, and stability, CUES enables systematic assessment of how variations in these features affect model performance across patient subgroups. This facilitates the identification of potential biases in body composition and ensures that predictive models remain robust and fair across heterogeneous populations. Feature-wise CUES analyses can further highlight which anthropometric variables contribute meaningfully to predictive accuracy without undermining stability or equity, guiding both model selection and feature prioritization in studies exploring the clinical and physiological implications of body metrics.
Moreover, optimizing model hyperparameters using CUES as the objective proved highly effective in promoting clinically reliable performance. Tuning for CUES consistently produced models that balanced traditional metrics such as accuracy, sensitivity, and specificity with calibration, stability, and fairness. Even classifiers that did not achieve the highest AUROC values exhibited improved overall clinical reliability when optimized for CUES, demonstrating enhanced stability and subgroup fairness. These results confirm that CUES-guided optimization identifies not only statistically strong models but also robust, interpretable, and ethically sound models, validating the framework’s practical utility for real-world clinical deployment.
Future work will extend the CUES framework to larger, real-world clinical datasets and explore its integration into model development pipelines for automated bias detection, domain adaptation, and regulatory evaluation. While the current study establishes a strong theoretical and empirical foundation, several promising directions remain open. These include extending the CUES framework to unsupervised and survival prediction tasks, developing a differentiable surrogate loss to enable direct end-to-end optimization, and incorporating a complexity penalty term to balance model parsimony with predictive performance. Furthermore, validating CUES on clinical registries and longitudinal datasets will be essential to evaluate its temporal stability and generalizability. Ultimately, adopting the CUES framework can help align AI model evaluation with clinical priorities, ensuring that high-performing algorithms are not only statistically valid but also safe, fair, and effective in patient care.

Author Contributions

Conceptualization, A.M.A. and Z.M.; methodology, A.M.A. and Z.M.; software, A.M.A.; validation, A.M.A. and Z.M.; formal analysis, A.M.A. and Z.M.; investigation, A.M.A. and Z.M.; data curation, A.M.A. and Z.M.; writing—original draft preparation, A.M.A.; writing—review and editing, A.M.A. and Z.M.; visualization, A.M.A. and Z.M.; supervision, Z.M.; project administration, Z.M.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Data Availability Statement

The code of this metric is available through GitHub at https://github.com/aliqudah/CUES (accessed on 6 December 2025).

Acknowledgments

We acknowledge the support of the NSERC (Natural Sciences and Engineering Research Council of Canada). During the preparation of this manuscript/study, the author(s) used Grammarly (https://www.grammarly.com/) for the purposes of improving grammar, spelling, punctuation, and overall clarity of the text.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AUC: Area Under the Curve
AUPRC: Area Under the Precision–Recall Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
BCa: Bias-Corrected and Accelerated
BS: Bootstrap
BSS: Between-Subjects Sum of Squares
BMI: Body Mass Index
CI: Confidence Interval
COVID: Coronavirus Disease
CUES: Calibration, Utility, Equity, Stability Metric
CV: Cross-Validation
DCA: Decision Curve Analysis
EU: European Union
FDA: Food and Drug Administration
FPR: False Positive Rate
FN: False Negative
INB: Incremental Net Benefit
KNN: K-Nearest Neighbors
LR: Logistic Regression
ML: Machine Learning
NB: Naive Bayes
NPV: Negative Predictive Value
OSA: Obstructive Sleep Apnea
OVR: One-Versus-Rest
PPV: Positive Predictive Value
PR: Precision–Recall
ROC: Receiver Operating Characteristic
SaMD: Software as a Medical Device
SD: Standard Deviation
SVM: Support Vector Machine
TP: True Positive
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis
UAR: Unweighted Average Recall

Appendix A. Proof Sketches for Theoretical Properties of the CUES Metric

This appendix provides detailed proof sketches supporting the theoretical properties of the proposed CUES (calibration–utility–equity–stability) metric introduced in Section 2. The results formalize the boundedness, monotonicity, sensitivity, consistency, and asymptotic behavior of CUES under the assumptions stated in the main text. Throughout, we adopt the notation and component definitions introduced in Section 2.2, Section 2.3, Section 2.4, Section 2.5 and Section 2.6.

Appendix A.1. Definition and Notation

Let $C, U, E, S \in [0,1]$ denote the normalized calibration, utility, equity, and stability components, respectively. Each component is constructed as a bounded functional of the data-generating distribution $P$, as described in Section 2.2, Section 2.3, Section 2.4, Section 2.5 and Section 2.6. The CUES metric is defined as $\mathrm{CUES} = f(C, U, E, S) := (C \cdot U \cdot E \cdot S)^{1/4}$. Let $\hat{C}, \hat{U}, \hat{E}, \hat{S}$ denote the corresponding empirical estimators.
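A minimal sketch of the composite, directly mirroring the definition above; the component values passed in are placeholders for the estimators of Section 2.

```python
def cues(c: float, u: float, e: float, s: float) -> float:
    """Composite CUES score: the geometric mean of the four components.

    Each argument is a normalized component in [0, 1]; any component
    equal to zero drives the composite to zero (failure dominance).
    """
    for name, v in (("C", c), ("U", u), ("E", e), ("S", s)):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"component {name}={v} outside [0, 1]")
    return (c * u * e * s) ** 0.25

score = cues(0.9, 0.8, 0.95, 0.85)    # bounded: always in [0, 1]
zeroed = cues(0.0, 0.99, 0.99, 0.99)  # any zero component forces CUES = 0
```

The two example calls exercise the boundedness and failure-dominance properties proved in Appendix A.2 and Appendix A.3.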

Appendix A.2. Boundedness: The CUES Metric Satisfies $0 \le \mathrm{CUES} \le 1$

Proof. 
By construction, each component is normalized to lie in the unit interval. Therefore,
$$0 \le C \cdot U \cdot E \cdot S \le 1.$$
The function $x \mapsto x^{1/4}$ is continuous and monotone increasing on $[0,1]$, preserving bounds. Hence, $0 \le (C \cdot U \cdot E \cdot S)^{1/4} \le 1$. This ensures that CUES is interpretable on a standardized scale comparable across datasets and models. □

Appendix A.3. Failure Dominance and Non-Compensability

If any component equals zero, then $\mathrm{CUES} = 0$.
Proof. 
Suppose $C = 0$. Then $C \cdot U \cdot E \cdot S = 0$, and consequently $\mathrm{CUES} = 0^{1/4} = 0$.
The same argument applies if $U = 0$, $E = 0$, or $S = 0$. This property formalizes the non-compensatory design principle of CUES: severe failure in calibration, utility, equity, or stability cannot be offset by strong performance in other dimensions. □

Appendix A.4. Monotonicity in Each Component

CUES is strictly increasing in each component on $(0,1]^4$.
Proof. 
Consider the partial derivative with respect to $C$:
$$\frac{\partial f}{\partial C} = \frac{1}{4}\,(C U E S)^{-3/4}\, U E S.$$
Since $U, E, S > 0$, the derivative is strictly positive for all $C \in (0,1]$. By symmetry, identical expressions hold for $U$, $E$, and $S$. Therefore, improving any single component while holding the others fixed strictly increases CUES. This guarantees that CUES respects component-wise performance improvements without introducing paradoxical behavior. □
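The closed-form partial derivative can be checked numerically against a central finite difference, a quick sanity test of the monotonicity argument; the component values below are arbitrary illustrative points in (0, 1].

```python
def cues(c, u, e, s):
    # Composite CUES score: geometric mean of the four components
    return (c * u * e * s) ** 0.25

C, U, E, S = 0.7, 0.8, 0.9, 0.6
h = 1e-6

# Central finite difference for the partial derivative with respect to C
numeric = (cues(C + h, U, E, S) - cues(C - h, U, E, S)) / (2 * h)

# Closed form: (1/4) * (C*U*E*S)**(-3/4) * U*E*S
analytic = 0.25 * (C * U * E * S) ** (-0.75) * U * E * S

gap = abs(numeric - analytic)
```

The agreement of the two values (and their positivity) confirms the strict component-wise monotonicity claimed above.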

Appendix A.5. Relative Sensitivity and Balanced Contribution: To First Order, CUES Responds Equally to Relative Changes in Each Component

Proof. 
Taking logarithms, $\log(\mathrm{CUES}) = \frac{1}{4}(\log C + \log U + \log E + \log S)$. Differentiating yields
$$\frac{d\,\mathrm{CUES}}{\mathrm{CUES}} = \frac{1}{4}\left(\frac{dC}{C} + \frac{dU}{U} + \frac{dE}{E} + \frac{dS}{S}\right).$$
Thus, a 1% relative increase in any single component produces an approximately 0.25% relative increase in CUES, assuming the other components remain fixed. This property ensures balanced weighting across dimensions without requiring explicit tuning parameters. □
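This first-order relation is easy to verify numerically: bumping a single component by 1% should move the composite by roughly 0.25%. The component values below are arbitrary illustrative points.

```python
def cues(c, u, e, s):
    # Composite CUES score: geometric mean of the four components
    return (c * u * e * s) ** 0.25

base = cues(0.7, 0.8, 0.9, 0.6)
bumped = cues(0.7 * 1.01, 0.8, 0.9, 0.6)  # 1% relative increase in C only

relative_change = (bumped - base) / base  # first-order prediction: ~0.0025
```

The small residual between the observed and predicted change is the second-order term dropped by the differential approximation.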

Appendix A.6. Continuity and Consistency of the CUES Estimator: If $\hat{C} \xrightarrow{p} C$, $\hat{U} \xrightarrow{p} U$, $\hat{E} \xrightarrow{p} E$, and $\hat{S} \xrightarrow{p} S$, Then $\widehat{\mathrm{CUES}} \xrightarrow{p} \mathrm{CUES}$

Proof. 
The mapping $f(c, u, e, s) = (c\,u\,e\,s)^{1/4}$ is continuous on $[0,1]^4$. Since each component estimator is consistent under standard regularity conditions (e.g., empirical risk minimization for utility, law of large numbers for calibration error, subgroup-wise averaging for equity, and bootstrap resampling for stability), the joint convergence holds. By the Continuous Mapping Theorem, $f(\hat{C}, \hat{U}, \hat{E}, \hat{S}) \xrightarrow{p} f(C, U, E, S)$. □

Appendix A.7. Asymptotic Normality via the Delta Method

Assume
$$\sqrt{n}\left(\begin{bmatrix} \hat{C} \\ \hat{U} \\ \hat{E} \\ \hat{S} \end{bmatrix} - \begin{bmatrix} C \\ U \\ E \\ S \end{bmatrix}\right) \xrightarrow{d} N(0, \Sigma),$$
for some covariance matrix $\Sigma$. Then,
$$\sqrt{n}\left(\widehat{\mathrm{CUES}} - \mathrm{CUES}\right) \xrightarrow{d} N\!\left(0,\; \nabla f^{\top} \Sigma\, \nabla f\right).$$
Proof. 
The function $f$ is continuously differentiable on $(0,1]^4$. Applying the multivariate delta method yields asymptotic normality with gradient
$$\nabla f = \frac{1}{4}\,(C U E S)^{-3/4} \begin{bmatrix} U E S \\ C E S \\ C U S \\ C U E \end{bmatrix}.$$
This result provides theoretical justification for inference on CUES when component estimators are asymptotically normal. □
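Given the gradient above, the delta-method variance is a quadratic form in an estimate of the covariance matrix. The sketch below uses an illustrative diagonal Σ (an assumption, since Σ is unknown in practice and is typically estimated or bypassed via the bootstrap).

```python
def delta_var(c, u, e, s, sigma):
    """Asymptotic variance of CUES via the multivariate delta method.

    sigma: 4x4 covariance matrix (list of lists) of the component
    estimators; returns grad(f)^T * sigma * grad(f) for
    f(C, U, E, S) = (C*U*E*S)**0.25.
    """
    prod = c * u * e * s
    scale = 0.25 * prod ** (-0.75)
    grad = [scale * u * e * s,   # d f / d C
            scale * c * e * s,   # d f / d U
            scale * c * u * s,   # d f / d E
            scale * c * u * e]   # d f / d S
    return sum(grad[i] * sigma[i][j] * grad[j]
               for i in range(4) for j in range(4))

# Diagonal covariance example: independent component estimators,
# each with variance 0.01 (illustrative numbers only)
sigma = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
var_cues = delta_var(0.7, 0.8, 0.9, 0.6, sigma)
```

Dividing the result by $n$ gives the approximate variance of the CUES estimator at sample size $n$.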

Appendix A.8. Bootstrap Validity for Finite-Sample Inference

While the asymptotic variance expression is analytically tractable, it depends on the unknown covariance matrix Σ . In practice, we therefore rely on bootstrap resampling, as described in Section 2.6.7, to estimate the sampling distribution of CUES. Under standard regularity conditions, the bootstrap consistently approximates the distribution of smooth functionals such as CUES, making it well suited for uncertainty quantification in moderate sample sizes.
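A percentile bootstrap for CUES can be sketched as follows. Here `component_fn` stands for the component estimators of Section 2; it is replaced by a deliberately simplified toy estimator so the example is self-contained, and those toy components are NOT the paper's definitions.

```python
import random

def cues(c, u, e, s):
    # Composite CUES score: geometric mean of the four components
    return (c * u * e * s) ** 0.25

def bootstrap_ci(y_true, y_prob, component_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for CUES.

    component_fn(y_true, y_prob) must return the tuple (C, U, E, S)
    computed on one resample; its definition is assumed given.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yt = [y_true[i] for i in idx]
        yp = [y_prob[i] for i in idx]
        scores.append(cues(*component_fn(yt, yp)))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy component estimator: accuracy-style proxies for C and U with
# fixed E and S, purely so the snippet runs end to end.
def toy_components(yt, yp):
    acc = sum((p > 0.5) == bool(t) for t, p in zip(yt, yp)) / len(yt)
    return max(acc, 1e-6), max(acc, 1e-6), 0.9, 0.9

y_true = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1] * 5
y_prob = [0.2, 0.8, 0.3, 0.7, 0.9, 0.1, 0.6, 0.4, 0.8, 0.7] * 5
low, high = bootstrap_ci(y_true, y_prob, toy_components, n_boot=500)
```

BCa corrections or comparative tests between two models follow the same resampling skeleton, replacing the percentile step accordingly.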

References

  1. Wynants, L.; Van Calster, B.; Collins, G.S.; Riley, R.D.; Heinze, G.; Schuit, E.; Bonten, M.M.J.; Dahly, D.L.; Damen, J.A.; Debray, T.P.A.; et al. Prediction Models for Diagnosis and Prognosis of COVID-19: Systematic Review and Critical Appraisal. BMJ 2020, 369, m1328. [Google Scholar] [CrossRef] [PubMed]
  2. Siontis, G.C.M.; Tzoulaki, I.; Siontis, K.C.; Ioannidis, J.P.A. Comparisons of Established Risk Prediction Models for Cardiovascular Disease: Systematic Review. BMJ 2012, 344, e3318. [Google Scholar] [CrossRef] [PubMed]
  3. Snell, K.I.; Archer, L.; Ensor, J.; Bonnett, L.J.; Debray, T.P.; Phillips, B.; Collins, G.S.; Riley, R.D. External validation of clinical prediction models: Simulation-based sample size calculations were more reliable than rules-of-thumb. J. Clin. Epidemiol. 2021, 135, 79–89. [Google Scholar] [CrossRef] [PubMed]
  4. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles Heel of Predictive Analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef]
  5. Harrell, F.E.; Lee, K.L.; Mark, D.B. Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Stat. Med. 1996, 15, 361–387. [Google Scholar] [CrossRef]
  6. Collins, G.S.; Reitsma, J.B.; Altman, D.G.; Moons, K.G.M. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD Statement. Ann. Intern. Med. 2015, 162, 55–63. [Google Scholar] [CrossRef]
  7. Vickers, A.J.; Elkin, E.B. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med. Decis. Mak. 2006, 26, 565–574. [Google Scholar] [CrossRef]
  8. Hand, D.J. Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Mach. Learn. 2009, 77, 103–123. [Google Scholar] [CrossRef]
  9. Cook, N.R. Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation 2007, 115, 928–935. [Google Scholar] [CrossRef]
  10. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  11. Pencina, M.J.; D’Agostino, R.B.; Vasan, R.S. Evaluating the Added Predictive Ability of a New Marker: From Area Under the ROC Curve to Reclassification and Beyond. Stat. Med. 2008, 27, 157–172. [Google Scholar] [CrossRef] [PubMed]
  12. Brier, G.W. Verification of Forecasts Expressed in Terms of Probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
  13. Steyerberg, E.W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating; Statistics for Biology and Health; Springer International Publishing: Cham, Switzerland, 2019; ISBN 978-3-030-16398-3. [Google Scholar]
  14. Riley, R.D.; Archer, L.; Snell, K.I.E.; Ensor, J.; Dhiman, P.; Martin, G.P.; Bonnett, L.J.; Collins, G.S. Evaluation of Clinical Prediction Models (Part 2): How to Undertake an External Validation Study. BMJ 2024, 384, e074820. [Google Scholar] [CrossRef] [PubMed]
  15. Vickers, A.J.; van Calster, B.; Steyerberg, E.W. Net Benefit Approaches to the Evaluation of Prediction Models, Molecular Markers, and Diagnostic Tests. BMJ 2016, 352, i6. [Google Scholar] [CrossRef]
  16. Rajkomar, A.; Hardt, M.; Howell, M.D.; Corrado, G.; Chin, M.H. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann. Intern. Med. 2018, 169, 866–872. [Google Scholar] [CrossRef]
  17. Obermeyer, Z.; Powers, B.; Vogeli, C.; Mullainathan, S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 2019, 366, 447–453. [Google Scholar] [CrossRef]
  18. van Smeden, M.; Moons, K.G.M.; de Groot, J.A.H.; Collins, G.S.; Altman, D.G.; Eijkemans, M.J.C.; Reitsma, J.B. Sample Size for Binary Logistic Prediction Models: Beyond Events per Variable Criteria. Stat. Methods Med. Res. 2019, 28, 2455–2474. [Google Scholar] [CrossRef]
  19. Steyerberg, E.W.; Vickers, A.J.; Cook, N.R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M.J.; Kattan, M.W. Assessing the Performance of Prediction Models: A Framework for Some Traditional and Novel Measures. Epidemiology 2010, 21, 128–138. [Google Scholar] [CrossRef]
  20. Efron, B.; Tibshirani, R. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Stat. Sci. 1986, 1, 54–75. [Google Scholar] [CrossRef]
  21. Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
  22. Dubois, D.; Prade, H. A Review of Fuzzy Set Aggregation Connectives. Inf. Sci. 1985, 36, 85–121. [Google Scholar] [CrossRef]
  23. Dujmovic, J.J. Continuous Preference Logic for System Evaluation. IEEE Trans. Fuzzy Syst. 2007, 15, 1082–1099. [Google Scholar] [CrossRef]
  24. Rozovsky, L.V. Comparison of Arithmetic, Geometric, and Harmonic Means. Math. Notes 2021, 110, 118–125. [Google Scholar] [CrossRef]
  25. Kerr, K.F.; Brown, M.D.; Zhu, K.; Janes, H. Assessing the Clinical Impact of Risk Prediction Models with Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J. Clin. Oncol. 2016, 34, 2534–2540. [Google Scholar] [CrossRef] [PubMed]
  26. van Houwelingen, H.C.; le Cessie, S. Predictive Value of Statistical Models. Stat. Med. 1990, 9, 1303–1325. [Google Scholar] [CrossRef]
  27. DeGroot, M.H.; Fienberg, S.E. The Comparison and Evaluation of Forecasters. J. R. Stat. Soc. Ser. D Stat. 1983, 32, 12–22. [Google Scholar] [CrossRef]
  28. Kass, R.E.; Wasserman, L. A Reference Bayesian Test for Nested Hypotheses and Its Relationship to the Schwarz Criterion. J. Am. Stat. Assoc. 1995, 90, 928–934. [Google Scholar] [CrossRef]
  29. Parmigiani, G. Decision Theory: Principles and Approaches; John Wiley & Sons: Chichester, UK, 2009; ISBN 978-0-471-49657-1. [Google Scholar]
  30. Murphy, A.H. A New Vector Partition of the Probability Score. J. Appl. Meteorol. 1973, 12, 595–600. [Google Scholar] [CrossRef]
  31. Pepe, M.S.; Janes, H.; Longton, G.; Leisenring, W.; Newcomb, P. Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker. Am. J. Epidemiol. 2004, 159, 882–890. [Google Scholar] [CrossRef]
  32. Nagelkerke, N.J.D. A Note on a General Definition of the Coefficient of Determination. Biometrika 1991, 78, 691–692. [Google Scholar] [CrossRef]
  33. U.S. Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. 2021. Available online: https://www.fda.gov/media/145022/download (accessed on 2 November 2025).
  34. Kotter, E.; D’Antonoli, T.A.; Cuocolo, R.; Hierath, M.; Huisman, M.; Klontzas, M.E.; Martí-Bonmatí, L.; May, M.S.; Neri, E.; Nikolaou, K.; et al. Guiding AI in Radiology: ESR’s Recommendations for Effective Implementation of the European AI Act. Insights Imaging 2025, 16, 33. [Google Scholar] [CrossRef] [PubMed]
  35. Elwali, A.; Moussavi, Z. A Novel Decision Making Procedure during Wakefulness for Screening Obstructive Sleep Apnea Using Anthropometric Information and Tracheal Breathing Sounds. Sci. Rep. 2019, 9, 11467. [Google Scholar] [CrossRef] [PubMed]
  36. DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics 1988, 44, 837–845. [Google Scholar] [CrossRef] [PubMed]
  37. Oehlert, G.W. A Note on the Delta Method. Am. Stat. 1992, 46, 27–29. [Google Scholar] [CrossRef]
  38. Riley, R.D.; Ensor, J.; Snell, K.I.E.; Debray, T.P.A.; Altman, D.G.; Moons, K.G.M.; Collins, G.S. Calculating the Minimum Sample Size Required for Developing a Multivariable Prediction Model: Part II—Binary and Time-to-Event Outcomes. Stat. Med. 2019, 38, 1276–1296. [Google Scholar] [CrossRef]
  39. Murdoch, B. Privacy and Artificial Intelligence: Challenges for Protecting Health Information in a New Era. BMC Med. Ethics 2021, 22, 122. [Google Scholar] [CrossRef]
  40. Wiesinger, B.; Smith, R.; Morley, J.; Floridi, L. Ethics and Biases in Health AI Model Development. NPJ Digit. Med. 2020, 3, 86. [Google Scholar] [CrossRef]
  41. Onitiu, D.; Wachter, S.; Mittelstadt, B. How AI Challenges the Medical Device Regulation: Patient Safety, Benefits, and Intended Uses. J. Law Biosci. 2024, lsae007. [Google Scholar] [CrossRef]
  42. Ueda, D.; Kakinuma, T.; Fujita, S.; Kamagata, K.; Fushimi, Y.; Ito, R.; Matsui, Y.; Nozaki, T.; Nakaura, T.; Fujima, N.; et al. Fairness of Artificial Intelligence in Healthcare: Review and Recommendations. Jpn. J. Radiol. 2024, 42, 3–15. [Google Scholar] [CrossRef]
  43. Vardas, E.P.; Marketou, M.; Vardas, P.E. Medicine, Healthcare and the AI Act: Gaps, Challenges and Future Implications. Eur. Heart J.—Digit. Health 2025, 6, 833–839. [Google Scholar] [CrossRef]
  44. Bignami, E.; Darhour, L.J.; Franco, G.; Guarnieri, M.; Bellini, V. AI Policy in Healthcare: A Checklist-Based Methodology for Structured Implementation. J. Anesth. Analg. Crit. Care 2025, 5, 56. [Google Scholar] [CrossRef]
Figure 1. CUES component visualization for breast cancer (Logistic Regression). Each panel summarizes complementary aspects of model performance: (a) normalized net benefit (utility curve, U ) across decision thresholds; (b) reliability diagram showing calibration performance ( C ); (c) bar chart of CUES components ( C , U , E , S ) and overall CUES score; (d,e) bootstrap-based utility stability analysis showing temporal variation and distribution of U ; (f) threshold-wise stability band ( ± 1 SD) for the CUES curve; (g,h) equity analysis comparing subgroup CUES curves and utility balance ( E ); (i) distribution of predicted probabilities for positive and negative classes. Together, these plots provide a holistic view of model utility, calibration, equity, and stability in imbalanced binary data.
Figure 2. CUES component visualization for diabetes (Bagging). Each panel summarizes complementary aspects of model performance: (a) Normalized net benefit (utility curve, U ) across decision thresholds; (b) reliability diagram showing calibration performance ( C ); (c) bar chart of CUES components ( C , U , E , S ) and overall CUES score; (d,e) bootstrap-based utility stability analysis showing temporal variation and distribution of U ; (f) threshold-wise stability band ( ± 1 SD) for the CUES curve; (g,h) equity analysis comparing subgroup CUES curves and utility balance ( E ); (i) distribution of predicted probabilities for positive and negative classes. Together, these plots provide a holistic view of model utility, calibration, equity, and stability in imbalanced binary data.
Figure 3. CUES component visualization for diabetes (KNN). Each panel summarizes complementary aspects of model performance: (a) normalized net benefit (utility curve, U ) across decision thresholds; (b) reliability diagram showing calibration performance ( C ); (c) bar chart of CUES components ( C , U , E , S ) and overall CUES score; (d,e) bootstrap-based utility stability analysis showing temporal variation and distribution of U ; (f) threshold-wise stability band ( ± 1 SD) for the CUES curve; (g,h) equity analysis comparing subgroup CUES curves and utility balance ( E ); (i) distribution of predicted probabilities for positive and negative classes. Together, these plots provide a holistic view of model utility, calibration, equity, and stability in imbalanced binary data.
Figure 4. Macro-OVR CUES visualization for dermatology (logistic model). Each panel summarizes complementary aspects of model performance: (a) macro-averaged normalized net benefit curve (utility, U ); (b) macro-OVR reliability diagram (calibration, C ); (c) mean CUES component scores ( C , U , E , S ); (d,e) bootstrap-based macro-utility stability analysis; (f) stability band ( ± 1 SD) around the Macro-OVR CUES curve; (g) per-class CUES curves illustrating class-wise trade-offs; (h) bar chart of per-class utility ( U ) values; (i) probability distributions across classes. This figure summarizes how multi-class models balance calibration, utility, equity, and robustness under the Macro-OVR framework.
Figure 5. Macro-OVR CUES visualization for heart diseases (logistic model). Each panel summarizes complementary aspects of model performance: (a) macro-averaged normalized net benefit curve (utility, U ); (b) macro-OVR reliability diagram (calibration, C ); (c) mean CUES component scores ( C , U , E , S ); (d,e) bootstrap-based macro-utility stability analysis; (f) stability band ( ± 1 SD) around the Macro-OVR CUES curve; (g) per-class CUES curves illustrating class-wise trade-offs; (h) bar chart of per-class utility ( U ) values; (i) probability distributions across classes. This figure summarizes how multi-class models balance calibration, utility, equity, and robustness under the Macro-OVR framework.
Figure 6. Macro-OVR CUES visualization for heart diseases (KNN). Each panel summarizes complementary aspects of model performance: (a) macro-averaged normalized net benefit curve (utility, U ); (b) Macro-OVR reliability diagram (calibration, C ); (c) mean CUES component scores ( C , U , E , S ); (d,e) bootstrap-based macro-utility stability analysis; (f) stability band ( ± 1 SD) around the Macro-OVR CUES curve; (g) per-class CUES curves illustrating class-wise trade-offs; (h) bar chart of per-class utility ( U ) values; (i) probability distributions across classes. This figure summarizes how multi-class models balance calibration, utility, equity, and robustness under the Macro-OVR framework.
Figure 7. Stacked CUES component scores for multiple classifiers trained on the breast cancer dataset. The figure demonstrates how overall CUES values reflect critical dimensions of clinical reliability beyond traditional metrics, highlighting differences in model stability and equity.
Figure 8. Stacked CUES component scores for classifiers on the heart disease binary dataset. The components illustrate the importance of calibration and equity, showing that moderate traditional performance does not guarantee clinical suitability when important reliability factors are lacking.
Figure 9. Stacked CUES component scores for classifiers on the dermatology dataset. Here, interpretable and straightforward models, such as Logistic Regression, achieve high composite CUES scores, demonstrating the benefits of transparency and stability for trustworthy clinical predictions.
Figure 10. Stacked CUES component scores for classifiers on the heart disease multi-class dataset. The figure shows how subgroup variability and equity deficiencies lower CUES values, prompting a refined model selection to achieve more equitable clinical impact.
Figure 11. Stacked CUES component scores for classifiers on the diabetes dataset. The visualization reveals cases in which stability or equity gaps reduce the overall CUES score, reinforcing CUES’s role in identifying reliable and fair models for clinical deployment.
Figure 12. Overall CUES feature-importance analysis for the binary datasets. Bar chart (A) displays the mean absolute CUES values, summarizing each feature’s overall contribution to model performance on the heart disease dataset with the KNN classifier. Bar chart (B) displays the corresponding values for the diabetes dataset with a logistic classifier.
Figure 13. Overall CUES feature-importance analysis for the multi-class datasets. Bar chart (A) displays the mean absolute CUES values, summarizing each feature’s overall contribution to model performance on the dermatology dataset with a logistic classifier. Bar chart (B) displays the corresponding values for the heart disease dataset with a bagging classifier.
Figure 14. CUES component visualization for breast cancer (Bagging) using only the top 50% features selected using the CUES Score. Each panel summarizes complementary aspects of model performance: (a) normalized net benefit (utility curve, U ) across decision thresholds; (b) reliability diagram showing calibration performance ( C ); (c) bar chart of CUES components ( C , U , E , S ) and overall CUES score; (d,e) bootstrap-based utility stability analysis showing temporal variation and distribution of U ; (f) threshold-wise stability band ( ± 1 SD) for the CUES curve; (g,h) equity analysis comparing subgroup CUES curves and utility balance ( E ); (i) distribution of predicted probabilities for positive and negative classes. Together, these plots provide a holistic view of model utility, calibration, equity, and stability in imbalanced binary data.
Figure 15. CUES component visualization for dermatology (Bagging) using only the top 50% features selected using the CUES Score. Each panel summarizes complementary aspects of model performance: (a) normalized net benefit (utility curve, U ) across decision thresholds; (b) reliability diagram showing calibration performance ( C ); (c) bar chart of CUES components ( C , U , E , S ) and overall CUES score; (d,e) bootstrap-based utility stability analysis showing temporal variation and distribution of U ; (f) threshold-wise stability band ( ± 1 SD) for the CUES curve; (g,h) equity analysis comparing subgroup CUES curves and utility balance ( E ); (i) distribution of predicted probabilities for positive and negative classes. Together, these plots provide a holistic view of model utility, calibration, equity, and stability in imbalanced binary data.
Figure 16. Accuracy vs. CUES (cross-validation scores), showing how the optimal hyperparameters differ when models are tuned for accuracy versus CUES. CUES-optimized models often achieved slightly lower accuracy but a higher overall CUES score, reflecting improved calibration and stability.
Figure 17. Components and metrics comparison: bar plots of accuracy, sensitivity, specificity, and the individual CUES components (C, U, E, S), allowing inspection of trade-offs between classical metrics and the CUES utility components.
Figure 18. Heatmap: a comprehensive heatmap of all metrics and CUES components highlights each classifier’s performance profile. Darker colors indicate higher values, allowing rapid comparison across classifiers and optimization strategies.
Figure 19. Comprehensive anthropometric equity analysis across classifiers. The figure includes overall CUES performance, decomposition into utility, calibration, equity, and stability, a heatmap of feature-wise equity, the relationship between equity and CUES, utility disparities, and a summary of feature-wise equity across all models.
Table 1. Computational complexity and scalability of CUES components.
| Component | Primary Operation | Time Complexity | Parallelizable |
|---|---|---|---|
| Utility (U) | Decision Curve Analysis across decision thresholds | O(N × T) | Yes |
| Calibration (C) | Reliability diagram / calibration error computation | O(N) | Yes |
| Equity (E) | Subgroup-wise utility estimation | O(N × G) | Yes |
| Stability (S) | Bootstrap resampling (B iterations) | O(B × N) | Yes |
| CUES Aggregation | Geometric mean of normalized components | O(1) | Yes |

N denotes the number of samples, T the number of decision thresholds, G the number of predefined subgroups, and B the number of bootstrap iterations. Time complexity is expressed in Big-O notation O(·), which characterizes asymptotic computational cost. All CUES components are parallelizable because computations are independent across samples, thresholds, subgroups, or bootstrap iterations, enabling efficient execution on multicore or GPU-enabled systems.
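The aggregation step in Table 1 is inexpensive: once the four components have been normalized into (0, 1], CUES is simply their geometric mean, CUES = (C·U·E·S)^(1/4). A minimal sketch of that step (the function name and input validation are ours, not the paper’s reference implementation):

```python
import numpy as np

def cues_score(c, u, e, s):
    """Geometric mean of the four normalized CUES components.

    Assumes each component has already been mapped into (0, 1]; the
    multiplicative form means any single weak component drags the
    composite down. Illustrative sketch only.
    """
    comps = np.array([c, u, e, s], dtype=float)
    if np.any(comps <= 0.0) or np.any(comps > 1.0):
        raise ValueError("all components must lie in (0, 1]")
    return float(comps.prod() ** 0.25)

# A weak equity term caps the composite well below the strong components:
print(round(cues_score(0.9, 0.9, 0.5, 0.9), 3))  # ≈ 0.777
```

Because the score is a geometric rather than arithmetic mean, no amount of calibration or utility can compensate for a component near zero, which is the intended behavior for clinical gatekeeping.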
Table 2. Binary classification: multiple classifiers CUES performance.
| Dataset | Classifier | CUES Score | C | U | E | S | AUC CUES |
|---|---|---|---|---|---|---|---|
| Breast Cancer | Logistic | 0.903 | 0.914 | 0.900 | 0.862 | 0.941 | 0.900 |
| Breast Cancer | SVMrbf | 0.909 | 0.909 | 0.894 | 0.894 | 0.942 | 0.894 |
| Breast Cancer | Bagging | 0.858 | 0.858 | 0.834 | 0.813 | 0.936 | 0.834 |
| Breast Cancer | KNN | 0.866 | 0.870 | 0.846 | 0.824 | 0.932 | 0.846 |
| Diabetes | Logistic | 0.473 | 0.302 | 0.246 | 0.933 | 0.732 | 0.246 |
| Diabetes | SVMrbf | 0.443 | 0.276 | 0.220 | 0.897 | 0.718 | 0.220 |
| Diabetes | Bagging | 0.446 | 0.278 | 0.224 | 0.913 | 0.715 | 0.224 |
| Diabetes | KNN | 0.383 | 0.204 | 0.169 | 0.944 | 0.705 | 0.169 |
| Heart Disease | Logistic | 0.614 | 0.525 | 0.474 | 0.801 | 0.746 | 0.474 |
| Heart Disease | SVMrbf | 0.602 | 0.487 | 0.429 | 0.859 | 0.756 | 0.429 |
| Heart Disease | Bagging | 0.587 | 0.460 | 0.413 | 0.849 | 0.760 | 0.413 |
| Heart Disease | KNN | 0.589 | 0.473 | 0.416 | 0.841 | 0.746 | 0.416 |
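The U column above is derived from decision-curve analysis. As a sketch of the underlying quantity, the code below computes the standard net benefit at a threshold t, NB(t) = TP/N − (FP/N)·t/(1 − t); how CUES normalizes this curve into a [0, 1] utility component is a separate modeling step, and the toy data here are purely illustrative:

```python
import numpy as np

def net_benefit(y_true, y_prob, t):
    """Decision-curve net benefit at probability threshold t.

    Treating a case when p >= t; false positives are penalized by the
    odds t/(1-t) implied by the threshold. Illustrative sketch only.
    """
    y_true = np.asarray(y_true)
    pred = np.asarray(y_prob) >= t
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - (fp / n) * t / (1.0 - t)

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])
print(round(net_benefit(y, p, 0.5), 3))  # 0.125
```

Sweeping t over a grid of clinically plausible thresholds and integrating the resulting curve is what makes U an *integrated* utility rather than a single-threshold statistic.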
Table 3. Binary classification: multiple classifiers performance comparison between CUES and traditional metrics.
| Dataset | Classifier | CUES Score | AUC CUES | AUROC | AUPRC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|
| Breast Cancer | Logistic | 0.903 | 0.900 | 0.995 | 0.997 | 0.976 | 0.990 | 0.953 |
| Breast Cancer | SVMrbf | 0.909 | 0.894 | 0.995 | 0.997 | 0.969 | 0.975 | 0.960 |
| Breast Cancer | Bagging | 0.858 | 0.834 | 0.988 | 0.989 | 0.961 | 0.975 | 0.939 |
| Breast Cancer | KNN | 0.866 | 0.846 | 0.987 | 0.986 | 0.965 | 0.989 | 0.925 |
| Diabetes | Logistic | 0.473 | 0.246 | 0.830 | 0.719 | 0.768 | 0.565 | 0.877 |
| Diabetes | SVMrbf | 0.443 | 0.220 | 0.823 | 0.711 | 0.762 | 0.575 | 0.862 |
| Diabetes | Bagging | 0.446 | 0.224 | 0.819 | 0.698 | 0.756 | 0.616 | 0.831 |
| Diabetes | KNN | 0.383 | 0.169 | 0.780 | 0.624 | 0.727 | 0.543 | 0.826 |
| Heart Disease | Logistic | 0.614 | 0.474 | 0.911 | 0.907 | 0.845 | 0.786 | 0.894 |
| Heart Disease | SVMrbf | 0.602 | 0.429 | 0.896 | 0.886 | 0.835 | 0.800 | 0.864 |
| Heart Disease | Bagging | 0.587 | 0.413 | 0.890 | 0.884 | 0.818 | 0.771 | 0.858 |
| Heart Disease | KNN | 0.589 | 0.416 | 0.889 | 0.850 | 0.832 | 0.808 | 0.852 |
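The comparisons in Table 3 are point estimates; the bootstrap procedures proposed in the paper turn differences between two models into interval statements. Below is a paired percentile-bootstrap sketch for the difference in a score between two models on the same cases. The `score_fn` argument stands in for the full CUES computation (here replaced by a trivial accuracy-at-0.5 score), and the synthetic data and pairing scheme are illustrative assumptions, not the paper’s exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(y, pa, pb, score_fn, n_boot=1000, alpha=0.05):
    """Paired percentile bootstrap CI for score(A) - score(B).

    Resamples cases with replacement, keeping the pairing between the
    two models' predictions so the comparison stays apples-to-apples.
    """
    y, pa, pb = map(np.asarray, (y, pa, pb))
    n = len(y)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # one bootstrap resample of cases
        diffs[i] = score_fn(y[idx], pa[idx]) - score_fn(y[idx], pb[idx])
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

acc = lambda y, p: np.mean((p >= 0.5) == y)  # stand-in for a CUES pipeline
y = rng.integers(0, 2, 300)
pa = np.clip(y * 0.6 + rng.normal(0.2, 0.2, 300), 0, 1)   # stronger model
pb = np.clip(y * 0.3 + rng.normal(0.35, 0.3, 300), 0, 1)  # weaker model
lo, hi = bootstrap_diff_ci(y, pa, pb, acc)
print(lo > 0)  # an interval excluding 0 suggests a genuine difference
```

Replacing `acc` with a function that computes CUES on the resampled cases yields the comparative inference described in the abstract.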
Table 4. Multi-class classification: multiple classifiers CUES performance.
| Dataset | Classifier | CUES Score | C | U | E | S | AUC CUES |
|---|---|---|---|---|---|---|---|
| Heart Disease | Logistic | 0.421 | 0.262 | 0.229 | 0.849 | 0.629 | 0.229 |
| Heart Disease | SVMrbf | 0.400 | 0.235 | 0.194 | 0.871 | 0.653 | 0.194 |
| Heart Disease | Bagging | 0.390 | 0.231 | 0.205 | 0.824 | 0.608 | 0.205 |
| Heart Disease | KNN | 0.414 | 0.248 | 0.227 | 0.848 | 0.636 | 0.227 |
| Dermatology | Logistic | 0.877 | 0.948 | 0.926 | 0.747 | 0.910 | 0.926 |
| Dermatology | SVMrbf | 0.875 | 0.948 | 0.931 | 0.739 | 0.907 | 0.931 |
| Dermatology | Bagging | 0.848 | 0.913 | 0.888 | 0.752 | 0.854 | 0.888 |
| Dermatology | KNN | 0.852 | 0.920 | 0.890 | 0.732 | 0.885 | 0.890 |
Table 5. Multi-class classification: multiple classifiers performance comparison between CUES and traditional metrics.
| Dataset | Classifier | CUES Score | AUC CUES | AUROC | AUPRC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|
| Heart Disease | Logistic | 0.421 | 0.229 | 0.818 | 0.589 | 0.682 | 0.564 | 0.820 |
| Heart Disease | SVMrbf | 0.400 | 0.194 | 0.812 | 0.585 | 0.682 | 0.541 | 0.819 |
| Heart Disease | Bagging | 0.390 | 0.205 | 0.801 | 0.588 | 0.652 | 0.544 | 0.805 |
| Heart Disease | KNN | 0.414 | 0.227 | 0.807 | 0.572 | 0.654 | 0.504 | 0.804 |
| Dermatology | Logistic | 0.877 | 0.926 | 0.998 | 0.992 | 0.978 | 0.975 | 0.996 |
| Dermatology | SVMrbf | 0.875 | 0.931 | 0.999 | 0.993 | 0.975 | 0.973 | 0.995 |
| Dermatology | Bagging | 0.848 | 0.888 | 0.993 | 0.974 | 0.954 | 0.949 | 0.990 |
| Dermatology | KNN | 0.852 | 0.890 | 0.996 | 0.978 | 0.963 | 0.965 | 0.993 |
Table 6. Correlations between CUES and performance metrics.
| Scope | Spearman (CUES, AUROC) | Spearman (CUES, AUPRC) |
|---|---|---|
| Binary Single | 1.000 | 1.000 |
| Multi Many | 0.952 | 0.881 |
| Binary Many | 0.979 | 0.979 |
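The rank agreement reported in Table 6 is ordinary Spearman correlation between per-model metric values. A self-contained version for models with no tied scores (the four score vectors below are made-up numbers for illustration, not rows from our tables):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, assuming no tied values.

    Double argsort converts raw scores to ranks; the result is the
    Pearson correlation of those ranks.
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical scores for four models: identical rankings give rho = 1.
cues = [0.90, 0.61, 0.47, 0.42]
auroc = [0.995, 0.911, 0.830, 0.818]
print(spearman(cues, auroc))  # 1.0
```

A value of 1.000 (as in the Binary Single row) means CUES and the traditional metric order the models identically, even though their magnitudes diverge sharply.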
Table 7. Multiple classifiers performance comparison between CUES and traditional metrics using two selected features subsets.
| Dataset | Classifier | Feature Subset | CUES Score | AUC CUES | AUROC | AUPRC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Binary | | | | | | | | | |
| Breast Cancer | Logistic | Top 50% (15) | 0.950 | 0.908 | 0.997 | 0.998 | 0.975 | 0.970 | 0.948 |
| | | Top 75% (22) | 0.962 | 0.929 | 0.997 | 0.998 | 0.989 | 0.987 | 0.976 |
| | SVMrbf | Top 50% (15) | 0.957 | 0.929 | 0.996 | 0.997 | 0.984 | 0.981 | 0.967 |
| | | Top 75% (22) | 0.964 | 0.944 | 0.997 | 0.998 | 0.988 | 0.984 | 0.972 |
| | Bagging | Top 50% (15) | 0.970 | 0.940 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | Top 75% (22) | 0.979 | 0.957 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | KNN | Top 50% (15) | 0.907 | 0.852 | 0.996 | 0.997 | 0.968 | 0.961 | 0.934 |
| | | Top 75% (22) | 0.926 | 0.879 | 0.997 | 0.998 | 0.975 | 0.970 | 0.948 |
| Diabetes | Logistic | Top 50% (4) | 0.357 | 0.128 | 0.754 | 0.594 | 0.710 | 0.641 | 0.868 |
| | | Top 75% (6) | 0.503 | 0.254 | 0.836 | 0.713 | 0.780 | 0.731 | 0.892 |
| | SVMrbf | Top 50% (4) | 0.534 | 0.297 | 0.859 | 0.773 | 0.801 | 0.754 | 0.908 |
| | | Top 75% (6) | 0.570 | 0.337 | 0.882 | 0.807 | 0.815 | 0.770 | 0.920 |
| | Bagging | Top 50% (4) | 0.905 | 0.807 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | Top 75% (6) | 0.908 | 0.807 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | KNN | Top 50% (4) | 0.598 | 0.378 | 0.892 | 0.821 | 0.818 | 0.777 | 0.912 |
| | | Top 75% (6) | 0.595 | 0.362 | 0.886 | 0.800 | 0.797 | 0.764 | 0.872 |
| Heart Disease | Logistic | Top 50% (12) | 0.677 | 0.493 | 0.923 | 0.916 | 0.838 | 0.833 | 0.897 |
| | | Top 75% (18) | 0.714 | 0.551 | 0.939 | 0.932 | 0.868 | 0.863 | 0.915 |
| | SVMrbf | Top 50% (12) | 0.750 | 0.587 | 0.951 | 0.942 | 0.881 | 0.878 | 0.915 |
| | | Top 75% (18) | 0.795 | 0.649 | 0.962 | 0.958 | 0.908 | 0.906 | 0.927 |
| | Bagging | Top 50% (12) | 0.905 | 0.804 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | Top 75% (18) | 0.911 | 0.827 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | KNN | Top 50% (12) | 0.643 | 0.422 | 0.890 | 0.885 | 0.792 | 0.790 | 0.818 |
| | | Top 75% (18) | 0.709 | 0.515 | 0.927 | 0.923 | 0.855 | 0.853 | 0.873 |
| Multi-class | | | | | | | | | |
| Dermatology | Logistic | Top 50% (17) | 0.750 | 0.902 | 0.999 | 0.993 | 0.975 | 0.973 | 0.995 |
| | | Top 75% (25) | 0.760 | 0.940 | 0.999 | 0.996 | 0.981 | 0.979 | 0.996 |
| | SVMrbf | Top 50% (17) | 0.826 | 0.930 | 0.999 | 0.994 | 0.973 | 0.969 | 0.995 |
| | | Top 75% (25) | 0.829 | 0.942 | 0.999 | 0.995 | 0.975 | 0.972 | 0.995 |
| | Bagging | Top 50% (17) | 0.966 | 0.963 | 1.000 | 1.000 | 0.997 | 0.997 | 0.999 |
| | | Top 75% (25) | 0.970 | 0.968 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | KNN | Top 50% (17) | 0.778 | 0.826 | 0.995 | 0.978 | 0.932 | 0.929 | 0.986 |
| | | Top 75% (25) | 0.814 | 0.873 | 0.997 | 0.988 | 0.948 | 0.948 | 0.990 |
| Heart Disease | Logistic | Top 50% (6) | 0.388 | 0.159 | 0.785 | 0.591 | 0.607 | 0.470 | 0.760 |
| | | Top 75% (9) | 0.481 | 0.244 | 0.849 | 0.652 | 0.700 | 0.594 | 0.825 |
| | SVMrbf | Top 50% (6) | 0.435 | 0.188 | 0.856 | 0.695 | 0.719 | 0.593 | 0.830 |
| | | Top 75% (9) | 0.566 | 0.326 | 0.912 | 0.824 | 0.789 | 0.690 | 0.872 |
| | Bagging | Top 50% (6) | 0.826 | 0.698 | 0.981 | 0.976 | 0.941 | 0.941 | 0.966 |
| | | Top 75% (9) | 0.889 | 0.776 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | KNN | Top 50% (6) | 0.506 | 0.285 | 0.865 | 0.699 | 0.710 | 0.586 | 0.828 |
| | | Top 75% (9) | 0.523 | 0.296 | 0.873 | 0.722 | 0.706 | 0.595 | 0.825 |
Table 8. Values for hyperparameter optimization grid search.
| Classifier | Parameters |
|---|---|
| SVM | C = 10, γ = ‘scale’ |
| Bagging | Estimators = 20 |
| KNN | Neighbors = 3 |
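Table 8 lists the winning grid values, and Figure 16 shows that the winner can depend on the objective being maximized. The sketch below illustrates the selection logic in a framework-agnostic way: the grid keys and the toy `evaluate` function are hypothetical stand-ins for fitting a classifier and scoring it on held-out folds, constructed so that accuracy and CUES prefer different settings:

```python
import itertools

def select_best(grid, evaluate, key):
    """Exhaustive grid search: return the parameter combination that
    maximizes the metric named by `key` in evaluate()'s result dict."""
    best, best_val = None, -1.0
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        metrics = evaluate(params)
        if metrics[key] > best_val:
            best, best_val = params, metrics[key]
    return best, best_val

# Hypothetical grid loosely mirroring Table 8.
grid = {"C": [1, 10], "gamma": ["scale", "auto"]}

# Toy stand-in: pretend C=10 wins on accuracy but C=1 wins on CUES,
# the pattern observed when calibration and stability are rewarded.
def evaluate(params):
    acc = 0.90 if params["C"] == 10 else 0.88
    cues = 0.55 if params["C"] == 10 else 0.62
    return {"acc": acc, "cues": cues}

print(select_best(grid, evaluate, "acc")[0]["C"],
      select_best(grid, evaluate, "cues")[0]["C"])  # 10 1
```

In practice the same loop can be realized with any grid-search utility by supplying a scoring function that computes CUES from predicted probabilities instead of accuracy.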