Article

Prevalence-Insensitive Evaluation of Diagnostic Systems Under Class Imbalance: The Harmonic Mean of Per-Class Sensitivity

by
Jesús S. Aguilar-Ruiz
School of Engineering, Pablo de Olavide University, 41013 Seville, Spain
Mathematics 2025, 13(24), 3956; https://doi.org/10.3390/math13243956
Submission received: 10 November 2025 / Revised: 6 December 2025 / Accepted: 10 December 2025 / Published: 12 December 2025

Abstract

In safety-critical diagnosis under class imbalance, widely used global metrics—e.g., accuracy, F-score, and MCC—conflate class-conditional behavior with class priors, often yielding overly optimistic assessments of overall performance, thereby hindering stable comparisons across datasets. We formalize the prevalence sensitivity of these indices and advance a prevalence-insensitive evaluation built solely from class-conditional rates (binary TPR/TNR; multiclass per-class sensitivities). We analyze the arithmetic, geometric, and harmonic means as class-symmetric aggregators and provide a geometric characterization of their isocurves. Empirically, on 18 UCI datasets, accuracy exceeded the harmonic mean by ∼3% on average, but the gap widened on skewed sets (e.g., cardiotocography: Acc = 0.896 vs. H = 0.803, +11.6%). On a 6756-sample, 35-class tumor dataset, high overall agreement (Acc = 0.906, Cohen's κ = 0.899) coexisted with dispersed per-class sensitivities (A = 0.795, G = 0.743, H = 0.658) and moderate MCP (0.564), evidencing weak minority-class performance despite strong accuracy. These results show that reliance on accuracy or κ can overstate performance precisely where it matters most, inflating perceived safety and obscuring rare-class failure modes. Prevalence-insensitive, class-symmetric means—especially the harmonic mean of per-class sensitivity—yield conservative, comparable summaries better aligned with risk-aware evaluation and deployment in critical systems.

1. Introduction

Model evaluation is central to the scientific value of machine learning (ML): the chosen metric defines what counts as reliable performance for a given application and directly shapes model selection and deployment decisions. Because different metrics emphasize different aspects of behavior (e.g., error rate versus sensitivity to class imbalance), they can rank the same models differently on the same data [1]. In safety-critical contexts, this makes the metric choice part of the problem specification itself, since it encodes which kinds of errors the system must avoid and how uneven class performance should be properly penalized [2].
There exists a wide ecosystem of performance indices—accuracy and error rate, class-conditional sensitivity/recall, F-scores [3,4], rank-based and agreement-based measures such as the Matthews Correlation Coefficient (MCC) [5] and Cohen's κ [6], the ROC curve [7,8,9], and multiclass generalizations—each with distinct statistical and operational semantics [10,11,12]. This richness is useful but hazardous if metrics are chosen without regard to data and objectives, or when reported scores are internally inconsistent [13].
In many applications the target (positive) class is rare (e.g., fraud, medical screening, fault detection); more broadly, in multiclass settings one or more classes may have very low prevalence—despite being just as important as the common classes—so imbalance manifests as uneven per-class prevalences rather than a single positive–negative asymmetry. Under such class imbalance, metrics that mix class-conditional terms (probabilities conditioned on the true class; e.g., true positive rate) with class priors (unconditional probabilities of each class, i.e., class proportions) can be dominated by the majority class, masking poor minority performance. Foundational studies and surveys document both the ubiquity of imbalance and its effects on learning and evaluation [14,15,16,17]. Consequently, conclusions drawn from prevalence-sensitive metrics can be unstable across datasets that differ only in class proportions.
When the scientific goal is to develop a predictive model that is accurate and reliable, it is preferable to base model selection and validation on prevalence-insensitive criteria that reflect intrinsic class-conditional performance. Using aggregates of per-class sensitivity and threshold-free analyses (e.g., Multiclass Classification Performance (MCP) [18]) decouples model quality from the class prior, yielding estimates that remain stable under base-rate shifts. Operational costs and prevalences can then be incorporated at deployment through decision analysis, without confounding model selection with prevalence effects [19].
Thus, a principled route to prevalence-insensitivity is to construct evaluation indices from the class-conditional true positive rate (sensitivity) and true negative rate (specificity)—and, in the multiclass case, from per-class sensitivities. Because these are conditioned on the true class, they are invariant to changes in prevalence and thus provide a stable basis for summary measures [20].
Aggregating sensitivity (and specificity) via the Pythagorean means (arithmetic, geometric, and harmonic) yields class-symmetric, prevalence-insensitive summaries with different penalizations of imbalance between components. This means-based perspective makes explicit how the choice of mean encodes the desired conservativeness against asymmetric errors, while remaining robust to class-prior changes [21,22].
For binary datasets, the arithmetic mean (balanced accuracy), A, and the geometric mean, G, are widely reported in imbalanced-learning research, whereas the harmonic mean, H, of sensitivity and specificity is comparatively rare despite its appealingly conservative nature. Its use is documented in domain-specific studies (e.g., biomedical diagnostics), yet it remains less common as a headline metric than balanced accuracy or the G-mean in general ML practice [23,24,25]. In multiclass settings, the harmonic mean of per-class true positive rates is even less common, despite its ability to reveal weak per-class performance in critical systems.
In safety-critical diagnosis, the evaluation criterion must encode minimum class-wise detection requirements and remain stable under prevalence shifts; otherwise, aggregate scores can coexist with unacceptable under-detection of specific conditions. Adopting prevalence-insensitive, class-symmetric summaries built from per-class sensitivities—and, in particular, the harmonic mean—aligns model selection with clinical safety by rewarding uniformly adequate coverage rather than compensatory trade-offs. This choice yields estimates that are robust across datasets and supports transparent auditing of class-wise performance.
This paper advocates prevalence-insensitive evaluation in critical domains and, in particular, the use of measures built from per-class true positive rates. It addresses three shortcomings in current practice: (i) the widespread reliance on prevalence-sensitive indices—such as accuracy and F-score—whose values vary with class proportions and can therefore overstate reliability under imbalance; (ii) the absence of a conservative, class-symmetric summary that prevents high overall scores when any class is under-detected; and (iii) the scarcity of multiclass-oriented formulations that remain invariant to base-rate shifts while retaining clear interpretability.
On the UCI benchmarks, we observe a modest yet systematic overestimation by global metrics relative to class-symmetric means. In the multiclass cancer case study, high overall scores coexist with low per-class sensitivities for several tumor types; these global aggregates mask class-specific weaknesses and can lead to missed diagnoses with serious clinical consequences. In summary, these findings support the harmonic mean as a prudent, conservative choice to mitigate overestimation when some classes exhibit weak sensitivity.
Our contributions are: (i) a formal analysis that separates prevalence-sensitive from prevalence-insensitive measures by examining their dependence on class priors; (ii) a systematic treatment that frames the Pythagorean means—arithmetic, geometric, and harmonic—as class-symmetric aggregators of true-positive rates (binary: sensitivity and specificity; multiclass: per-class sensitivities); (iii) practical guidance for safety-critical evaluation, recommending the harmonic mean of per-class sensitivities as a conservative, imbalance-robust summary; and (iv) empirical validation across standard benchmarks and a high-dimensional, 35-class cancer diagnosis task, demonstrating consistent gaps between accuracy-like indices and prevalence-insensitive summaries.

2. Prevalence-Insensitive vs. Prevalence-Sensitive Measures

Let $(X, Y) \sim P$ with $Y \in \{0, 1\}$, where $Y = 1$ denotes the positive class and $Y = 0$ the negative class. A (deterministic) classifier is a mapping $h : \mathcal{X} \to \{0, 1\}$. More generally, a scoring model outputs a function $s : \mathcal{X} \to \mathbb{R}$, which is converted into a classifier by thresholding: for $t \in \mathbb{R}$, the induced classifier is $\hat{Y}_t(x) := \mathbf{1}\{s(x) \geq t\}$. Let $\pi = P(Y = 1)$ denote the positive-class prevalence (so $P(Y = 0) = 1 - \pi$). We write $\mathbf{1}\{E\}$ for the indicator of an event $E$, equal to 1 if $E$ holds, and 0 otherwise.
Performance summaries aggregate different types of errors that have distinct operational meaning. Global metrics that average over classes (e.g., overall accuracy) depend on the class prior π , while class-conditional rates are prevalence-invariant and thus separate the intrinsic discriminative ability of a classifier from the composition of the test population.
For a sample $\{(x_i, y_i)\}_{i=1}^{n}$ and a fixed classifier $\hat{Y}$, define the counts
$$\mathrm{TP} = \sum_{i=1}^{n} \mathbf{1}\{\hat{Y}(x_i) = 1,\ y_i = 1\}, \qquad \mathrm{FP} = \sum_{i=1}^{n} \mathbf{1}\{\hat{Y}(x_i) = 1,\ y_i = 0\},$$
$$\mathrm{FN} = \sum_{i=1}^{n} \mathbf{1}\{\hat{Y}(x_i) = 0,\ y_i = 1\}, \qquad \mathrm{TN} = \sum_{i=1}^{n} \mathbf{1}\{\hat{Y}(x_i) = 0,\ y_i = 0\}.$$
Let P = TP + FN and N = TN + FP be the numbers of positive and negative instances, with n = P + N . The confusion matrix is shown in Table 1.
The fundamental class-conditional rates are
$$\mathrm{TPR} = P(\hat{Y} = 1 \mid Y = 1) = \frac{\mathrm{TP}}{P} \quad (\text{True Positive Rate}), \qquad \mathrm{TNR} = P(\hat{Y} = 0 \mid Y = 0) = \frac{\mathrm{TN}}{N} \quad (\text{True Negative Rate}), \qquad \mathrm{FPR} = P(\hat{Y} = 1 \mid Y = 0) = \frac{\mathrm{FP}}{N} \quad (\text{False Positive Rate}).$$
These rates quantify, respectively, sensitivity to positives, specificity to negatives, and the spurious activation rate on negatives. Because they are conditional on the true class, TPR and TNR are invariant to changes in π .
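For concreteness, these counts and rates can be computed directly from label vectors. The following is a minimal Python sketch of Equations (1) and (2) (the function name and the 0/1 label encoding are our own illustrative choices; it assumes both classes occur in the sample):

```python
import numpy as np

def class_conditional_rates(y_true, y_pred):
    """TPR, TNR, FPR from binary labels (1 = positive, 0 = negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / (tp + fn)  # sensitivity: conditioned on Y = 1
    tnr = tn / (tn + fp)  # specificity: conditioned on Y = 0
    fpr = fp / (fp + tn)  # spurious activation rate: equals 1 - TNR
    return tpr, tnr, fpr
```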
In binary classification with class label $Y \in \{0, 1\}$, the prevalence $\pi := P(Y = 1)$ (class prior) can be highly skewed. Class imbalance is ubiquitous in ML and profoundly affects model selection and metric interpretation [14,15,16]. A central distinction is between prevalence-insensitive (class-imbalance-invariant) measures—those that depend only on class-conditional performance—and prevalence-sensitive measures—those whose values change with $\pi$ even if the class-conditional behavior of the classifier remains fixed. Below we structure key measures into these two groups and formalize their dependence.

2.1. Prevalence-Insensitive (Class-Imbalance-Invariant) Measures

Let TPR (sensitivity) and TNR (specificity) be as defined in Equation (2). These class-conditional rates are invariant to $\pi$ by definition. Several standard summaries inherit this invariance.

2.1.1. Balanced Accuracy (Arithmetic Mean A of TPR and TNR)

$$A = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}$$
It aggregates per-class sensitivities additively and equals 1 iff both class-conditional sensitivities are 1. Properties are developed in [26]. By construction, $A$ is invariant to $\pi$ since it is a function of $(\mathrm{TPR}, \mathrm{TNR})$ only.

2.1.2. Youden’s J (Informedness)

$$J = \mathrm{TPR} + \mathrm{TNR} - 1 = 2A - 1$$
Introduced for diagnostic testing [27], J is a linear rescaling of A and hence also prevalence-invariant. It measures the vertical distance of a ROC point from the no-skill diagonal [9].

2.1.3. G-Mean (Geometric Mean of TPR and TNR)

$$G = \sqrt{\mathrm{TPR} \cdot \mathrm{TNR}}$$
Widely used in imbalanced learning as a class-symmetric criterion that penalizes uneven performance [15]. As a product of class-conditional rates, $G$ is independent of $\pi$.

2.1.4. H-Mean (Harmonic Mean of TPR and TNR)

$$H = \frac{2\,\mathrm{TPR}\cdot\mathrm{TNR}}{\mathrm{TPR} + \mathrm{TNR}}$$
This metric is stricter than $A$ and $G$ under dispersion. It has mainly been adopted in biomedical applications [28,29,30]. As a symmetric mean of $(\mathrm{TPR}, \mathrm{TNR})$, it is prevalence-invariant.
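The four summaries of Sections 2.1.1–2.1.4 follow immediately from a (TPR, TNR) pair. A small sketch (ours; the example pair is arbitrary) makes their relative strictness visible:

```python
import math

def binary_summaries(tpr, tnr):
    """Prevalence-insensitive summaries built from (TPR, TNR) only."""
    A = (tpr + tnr) / 2                 # balanced accuracy
    J = tpr + tnr - 1                   # Youden's J = 2A - 1
    G = math.sqrt(tpr * tnr)            # geometric mean
    H = 2 * tpr * tnr / (tpr + tnr) if tpr + tnr > 0 else 0.0  # harmonic mean
    return A, J, G, H

# Uneven class-conditional performance: H penalizes the gap most.
print(binary_summaries(0.95, 0.60))  # A = 0.775, J ~ 0.55, G ~ 0.755, H ~ 0.735
```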

2.1.5. The ROC Curve

For a score $s(X)$ thresholded at $t$, the ROC locus is $(\mathrm{FPR}(t), \mathrm{TPR}(t))$ with $\mathrm{FPR} = 1 - \mathrm{TNR}$. Both the ROC curve and its area depend only on the pair of class-conditional score distributions and are therefore invariant to $\pi$ [9].

2.2. Prevalence-Sensitive (Class-Imbalance-Dependent) Measures

The following measures vary with $\pi$ even when $(\mathrm{TPR}, \mathrm{TNR})$ are fixed; they mix class-conditional terms with priors or marginals.

2.2.1. Overall Accuracy

The decomposition $\mathrm{Acc}(t) = P(\hat{Y}_t = Y) = \pi\,\mathrm{TPR}(t) + (1 - \pi)\,\mathrm{TNR}(t)$ shows that accuracy conflates the class prior with conditional performance, whereas TPR and TNR isolate sensitivity to positives and negatives, respectively. Thus, for fixed $(\mathrm{TPR}, \mathrm{TNR})$, accuracy increases as the prior mass shifts toward the better-performing class [14,15].

2.2.2. Precision

Precision, also named Positive Predictive Value (PPV), is $\mathrm{PPV} = P(Y = 1 \mid \hat{Y} = 1)$, which by Bayes' rule equals
$$\mathrm{PPV} = \frac{\pi\,\mathrm{TPR}}{\pi\,\mathrm{TPR} + (1 - \pi)(1 - \mathrm{TNR})},$$
explicitly dependent on $\pi$.

2.2.3. F1-Score

The $F_1$-score [3,4],
$$F_1 = \frac{2\,\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}},$$
inherits the PPV dependence.
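To make this dependence on π concrete, one can hold (TPR, TNR) fixed and sweep the prevalence through the closed forms above; the sketch below is ours, with illustrative rate values:

```python
def prevalence_sweep(tpr=0.8, tnr=0.9):
    """Accuracy, PPV, and F1 as functions of prevalence pi, with (TPR, TNR) fixed."""
    for pi in (0.50, 0.10, 0.01):
        acc = pi * tpr + (1 - pi) * tnr
        ppv = pi * tpr / (pi * tpr + (1 - pi) * (1 - tnr))
        f1 = 2 * ppv * tpr / (ppv + tpr)
        print(f"pi = {pi:.2f}: Acc = {acc:.3f}, PPV = {ppv:.3f}, F1 = {f1:.3f}")

prevalence_sweep()
# pi = 0.50: Acc = 0.850, PPV = 0.889, F1 = 0.842
# pi = 0.10: Acc = 0.890, PPV = 0.471, F1 = 0.593
# pi = 0.01: Acc = 0.899, PPV = 0.075, F1 = 0.137
```

The class-conditional behavior never changes, yet PPV and F1 collapse as the positive class becomes rare, while accuracy drifts toward TNR.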

2.2.4. Precision-Recall Curve

Precision-recall (PR) curves and their area are likewise prevalence-sensitive [11,31]. Since the PR curve plots precision against TPR (recall), and precision depends on $\pi$, the area under the PR curve is sensitive to class imbalance.

2.2.5. Matthews Correlation Coefficient

In binary classification, outcomes are summarized by the $2 \times 2$ confusion matrix shown in Table 1, whose elements are defined in Equation (1). The Matthews correlation coefficient (MCC) [5] is defined as
$$\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
It equals the Pearson correlation between predicted and true binary labels and takes values in $[-1, 1]$ (1 for perfect prediction, $-1$ for total disagreement). MCC is not invariant to $\pi$: for fixed $(\mathrm{TPR}, \mathrm{TNR})$ the numerator and denominator scale differently with the numbers of positives and negatives, i.e., $(P, N) = (\pi n, (1 - \pi) n)$, so MCC varies with $\pi$ [1].

2.2.6. Cohen’s κ

Let $p_o = \mathrm{Acc}$ be the observed agreement and $p_e$ the chance agreement computed from the empirical marginals; then
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
In the binary case, if $\pi = \Pr(Y = 1)$ is the prevalence and $\rho = \Pr(\hat{Y} = 1)$ the predicted positive rate, then
$$p_o = \pi\,\mathrm{TPR} + (1 - \pi)\,\mathrm{TNR}, \qquad p_e = \pi\rho + (1 - \pi)(1 - \rho),$$
so both $p_o$ and $p_e$ depend on $\pi$ (and on $\rho$). The prevalence paradox is the empirical fact that $\kappa$ can be low—even near zero—despite very high accuracy when classes are rare: as $\pi$ moves away from $1/2$, $p_e$ increases toward $p_o$, shrinking the numerator $p_o - p_e$ [6,32]. Hence $\kappa$ is prevalence-sensitive.
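The shrinking numerator can be reproduced numerically from the two displays above; a brief sketch (ours, with arbitrary fixed rates) shows κ collapsing as the positive class becomes rare:

```python
def cohens_kappa(pi, tpr, tnr):
    """Kappa from prevalence and fixed class-conditional rates."""
    rho = pi * tpr + (1 - pi) * (1 - tnr)  # predicted positive rate
    p_o = pi * tpr + (1 - pi) * tnr        # observed agreement
    p_e = pi * rho + (1 - pi) * (1 - rho)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

for pi in (0.50, 0.10, 0.01):
    print(f"pi = {pi:.2f}: kappa = {cohens_kappa(pi, 0.8, 0.9):.3f}")
# pi = 0.50: kappa = 0.700 ; pi = 0.10: kappa = 0.534 ; pi = 0.01: kappa = 0.121
```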

2.3. Summary

In general, prevalence-insensitive measures (TPR, TNR, A-mean, J, G-mean, H-mean, ROC) quantify intrinsic discrimination independently of $\pi$, which is critical for comparing models across datasets with different priors. Prevalence-sensitive measures (accuracy, PPV, F1, PR, MCC, $\kappa$) remain essential for operational decision-making because they reflect the interaction between model behavior and base rates. However, under imbalance they must be interpreted jointly with class-conditional rates and priors [15,16] or complemented by a class-specific perspective [33].
Prevalence-insensitive evaluation should be the default when reliability across classes matters. Metrics that mix class-conditional terms with class priors (e.g., accuracy and F1) can change merely because the class proportions change, even if the class-conditional behavior of the classifier does not. In contrast, indices constructed from true positive rates—binary TPR/TNR or, in the multiclass case, the vector of per–class TPRs—are invariant to prevalence by definition. Consequently, any symmetric summary of these components (e.g., the Pythagorean means of per-class TPRs: arithmetic, geometric, and harmonic) inherits prevalence-insensitivity, enabling fair comparisons across datasets and shifts in base rates.

3. The Pythagorean Means

The Pythagorean school, founded by Pythagoras of Samos (6th–5th century BCE), treated number and proportion as the underlying structure of reality. Its program linked arithmetic, geometry, and musical harmony. Within this tradition arose the triad of the so-called Pythagorean means—arithmetic (equality of increments), geometric (proportionality), and harmonic (reciprocity)—which later authors (e.g., Euclid, Nicomachus) formalized and transmitted.
Henceforth, we write A, G, and H for the arithmetic, geometric, and harmonic mean, respectively.
Let $x_1, \dots, x_n > 0$ and let $w_1, \dots, w_n > 0$ be weights with $W = \sum_{i=1}^{n} w_i$.
$$\text{Arithmetic mean:} \quad A_w = \frac{1}{W}\sum_{i=1}^{n} w_i x_i$$
$$\text{Geometric mean:} \quad G_w = \prod_{i=1}^{n} x_i^{\,w_i/W}$$
$$\text{Harmonic mean:} \quad H_w = \frac{W}{\sum_{i=1}^{n} w_i/x_i}$$
In the unweighted case ($w_i = 1$, $W = n$) these reduce to
$$A = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad G = \Big(\prod_{i=1}^{n} x_i\Big)^{1/n}, \qquad H = \frac{n}{\sum_{i=1}^{n} 1/x_i}.$$
All three means are symmetric and satisfy the chain of inequalities $H_w \leq G_w \leq A_w$, with equality throughout if and only if $x_1 = \cdots = x_n$.
While the Pythagoreans singled out the triad (A,G,H) for their distinct conceptual roles, later analysis unified them under the power means (a.k.a. generalized means), defined for x i > 0 by
$$M_r(x_1, \dots, x_n) = \begin{cases} \left(\dfrac{1}{n}\sum_{i=1}^{n} x_i^r\right)^{1/r} & r \neq 0 \\[6pt] \left(\prod_{i=1}^{n} x_i\right)^{1/n} & r = 0 \end{cases}$$
so that $M_1 = A$, $M_0 = G$, and $M_{-1} = H$.
In general, the monotonicity property ($r < s \Rightarrow M_r \leq M_s$) yields $H \leq G \leq A$. This perspective clarifies the structural position of the Pythagorean means as canonical representatives of the exponents $-1, 0, 1$ within a continuous one-parameter family.
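A single function suffices to traverse this one-parameter family; the sketch below (ours; $r = 0$ handled as the geometric-mean limit) recovers A, G, and H at $r = 1, 0, -1$ and exhibits the monotonicity in $r$:

```python
import math

def power_mean(xs, r):
    """Power mean M_r of positive numbers; M_1 = A, M_0 = G (limit), M_-1 = H."""
    n = len(xs)
    if r == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** r for x in xs) / n) ** (1 / r)

xs = [1.0, 1.0, 1.0, 0.16]
print([round(power_mean(xs, r), 4) for r in (1, 0, -1)])
# [0.79, 0.6325, 0.4324] -- monotone in r, so H <= G <= A
```

The same vector of values reappears in the worked multiclass example of Section 6.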
Beyond the three classical Pythagorean means, generalized families—such as the Kolmogorov–Nagumo–de Finetti quasi-arithmetic means (Appendix A) and the Gini means (Appendix B)—provide parametric lenses for analyzing monotonicity, convexity, and the comparative behavior of TPR/TNR aggregators. These families subsume the Pythagorean triad as special cases and enable principled trade-offs, while remaining prevalence-insensitive because they can be built exclusively from class-conditional quantities [34,35,36,37,38,39].

4. Mean-Based Evaluation from TPR and TNR

Let $\mathrm{TPR} \in [0, 1]$ and $\mathrm{TNR} \in [0, 1]$ denote, respectively, the true positive rate (sensitivity) and the true negative rate (specificity) of a binary classifier. Any class-symmetric scalar summary based solely on $(\mathrm{TPR}, \mathrm{TNR})$ is invariant to the class prior $\pi = P(Y = 1)$.

4.1. Arithmetic Mean (A)

$$A = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} \in [0, 1]$$
It aggregates sensitivity and specificity with equal weight and is widely used as a prevalence-robust alternative to overall accuracy [9,10].
For $w \in [0, 1]$,
$$A_w = w\,\mathrm{TPR} + (1 - w)\,\mathrm{TNR},$$
a weighted arithmetic mean that reduces to $A$ at $w = \tfrac{1}{2}$. The family $\{A_w\}$ traces a line segment in the unit square connecting the axes-aligned extremes.
As an evaluation criterion it preserves the probabilistic semantics of class-conditional sensitivities and is invariant to class-prior skew, thereby mitigating the dominance of the majority class in imbalanced settings. In binary classification it is a strictly proper scoring rule [40] for the pair $(\mathrm{TPR}, \mathrm{TNR})$, and it admits a principled uncertainty treatment: exact posteriors and credible intervals can be derived under conjugate priors, enabling classifier comparisons on finite samples without relying on asymptotics [26]. In modern ML practice—particularly in imbalanced binary tasks such as anomaly/fraud detection, medical diagnosis, or cost-sensitive classification—balanced accuracy is routinely reported to reflect symmetric performance across the positive and negative strata, and it is often preferred when one wants an additive aggregation of TPR and TNR with a transparent interpretation as average per-class sensitivity.

4.2. Geometric Mean (G)

$$G = \sqrt{\mathrm{TPR} \cdot \mathrm{TNR}} \in [0, 1]$$
A weighted variant uses exponents $(w, 1 - w)$:
$$G_w = \mathrm{TPR}^{\,w}\,\mathrm{TNR}^{\,1 - w} \quad \text{for } w \in [0, 1].$$
Both $G$ and $G_w$ are strictly increasing in each argument on $(0, 1]^2$ and equal 1 iff $\mathrm{TPR} = \mathrm{TNR} = 1$.
The geometric mean G is widely adopted in imbalanced-learning research because it heavily penalizes uneven performance: if either sensitivity or specificity is small, G collapses accordingly; conversely, it rewards classifiers that simultaneously maintain both rates high [10,41]. Comprehensive surveys on imbalanced data explicitly recommend G-mean as a robust, threshold-based indicator that captures the joint behavior of minority- and majority-class sensitivities without being biased by class priors [15].

4.3. Harmonic Mean (H)

$$H = \frac{2}{\frac{1}{\mathrm{TPR}} + \frac{1}{\mathrm{TNR}}} = \frac{2\,\mathrm{TPR}\cdot\mathrm{TNR}}{\mathrm{TPR} + \mathrm{TNR}},$$
with $\mathrm{TPR} + \mathrm{TNR} > 0$, and define $H = 0$ at $(0, 0)$ by continuity. A weighted harmonic mean (emphasizing TPR versus TNR) is
$$H_\beta = \frac{(1 + \beta^2)\,\mathrm{TPR}\cdot\mathrm{TNR}}{\beta^2\,\mathrm{TPR} + \mathrm{TNR}}, \quad \beta > 0,$$
which reduces to H at β = 1 and assigns more weight to TNR as β increases (and to TPR when β < 1 ). While mathematically most conservative among the three classical means, the harmonic mean of ( TPR , TNR ) is notably less common in mainstream machine learning reporting than A -mean and G -mean. Contemporary surveys typically favor A or G for class-symmetric summaries [10,29], but H arises in scientific applications where one demands a symmetric and strongly conservative aggregation of TPR and TNR . In biomedical prediction, H has been explicitly adopted to evaluate classifiers under class imbalance, precisely because it treats both groups equally and punishes departures in either sensitivity or specificity.
Historically, one of the earliest explicit ML uses appears in medical informatics when developing predictors of diabetic nephropathy under irregular and imbalanced clinical data [23], with subsequent adoption in traffic-safety Bayesian networks [24,42], genome-wide psoriasis susceptibility prediction [25], seizure prediction [28], neurodevelopmental EEG [43], and clinical pharmacovigilance for QTc risk [30], all defining or employing H as the harmonic mean of sensitivity and specificity to weight both error types equally and to deter degenerate single-class solutions under skew.
Methodologically, $H$ is a symmetric, reciprocal aggregator that penalizes dispersion between class-conditional accuracies and remains prevalence-insensitive because it is built from within-class rates (TPR and TNR computed conditional on the true class). In multiclass settings, applying the same principle to the per-class sensitivities $\{\mathrm{TPR}_k\}$ yields a stringent scalar summary that typically penalizes dispersion more than the arithmetic or geometric mean; a detailed analysis is provided in Section 6. A short sketch of $H_\beta$ follows.
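This minimal sketch (ours; the rate values are arbitrary) makes the β-weighting of $H_\beta$ tangible:

```python
def weighted_harmonic(tpr, tnr, beta=1.0):
    """H_beta: weighs TNR more heavily as beta grows, TPR more when beta < 1."""
    return (1 + beta**2) * tpr * tnr / (beta**2 * tpr + tnr)

print(round(weighted_harmonic(0.9, 0.6), 4))            # 0.72   (plain H)
print(round(weighted_harmonic(0.9, 0.6, beta=2.0), 4))  # 0.6429 (pulled toward TNR)
```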

5. Geometric Characterization of A, G, and H Isocurves

Let $x = \mathrm{TPR} \in [0, 1]$ and $y = \mathrm{TNR} \in [0, 1]$. For a fixed level $c \in (0, 1)$, the level sets (isocurves) of the three means in the $(x, y)$-plane are
$$A:\ \frac{x + y}{2} = c \iff x + y = 2c, \qquad G:\ \sqrt{xy} = c \iff xy = c^2, \qquad H:\ \frac{2}{x^{-1} + y^{-1}} = c \iff (2x - c)(2y - c) = c^2.$$
Hence, $A$ has straight isocurves (lines of slope $-1$); $G$ has rectangular hyperbolas with asymptotes $x = 0$ and $y = 0$; and $H$ also yields rectangular hyperbolas but translated, with asymptotes $x = \frac{c}{2}$ and $y = \frac{c}{2}$, and center at $(\frac{c}{2}, \frac{c}{2})$.
Equivalently, as shown in Figure 1, in the $(F, T) = (\mathrm{FPR}, \mathrm{TPR})$ plane—as in the ROC representation—where $y = \mathrm{TNR} = 1 - F$, the same level $c$ takes the forms
$$A:\ T = F + 2c - 1, \qquad G:\ T = \frac{c^2}{1 - F}, \qquad H:\ T = \frac{c(1 - F)}{2 - c - 2F}.$$
In these $(F, T)$-coordinates, $A$ isocurves are again lines (now of slope $+1$); $G$ isocurves are rectangular hyperbolas with asymptotes $T = 0$ and $F = 1$; and $H$ isocurves are rectangular hyperbolas with asymptotes at $F = 1 - \frac{c}{2}$ and $T = \frac{c}{2}$. Geometrically, $A$ admits linear trade-offs along straight indifference lines; $G$ enforces multiplicative balance ($xy = \text{const}$) that pulls level sets toward the axes; and $H$ is the most stringent, with value-dependent asymptotes that require each coordinate to exceed $c/2$ to attain level $c$.
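The closed forms in the $(F, T)$ plane are easy to visualize; the following matplotlib sketch (ours, for a single level $c = 0.8$) traces the portion of each isocurve inside the unit square, reproducing the qualitative picture of Figure 1:

```python
import numpy as np
import matplotlib.pyplot as plt

c = 0.8                              # common target level for the three means
F = np.linspace(0.0, 0.999, 400)     # FPR axis

curves = {
    "A": F + 2 * c - 1,                  # line of slope +1
    "G": c**2 / (1 - F),                 # hyperbola, asymptote F = 1
    "H": c * (1 - F) / (2 - c - 2 * F),  # hyperbola, asymptote F = 1 - c/2
}
for name, T in curves.items():
    valid = (T >= 0) & (T <= 1)      # keep only the segment inside the unit square
    plt.plot(F[valid], T[valid], label=f"{name} = {c}")
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.legend(); plt.show()
```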

6. Harmonic Mean of Per-Class Sensitivity

In the binary case, the harmonic mean of sensitivity and specificity has been used—albeit sparingly—in ML as a symmetric, prevalence-insensitive trade-off between the true positive rate and the true negative rate. In the multiclass setting, this extends naturally by taking the harmonic mean of the per-class sensitivities. Let $\mathrm{TPR}_k = \Pr(\hat{Y} = k \mid Y = k)$ denote the true positive rate of class $k$ for $k = 1, \dots, K$; then
$$H = \frac{K}{\sum_{k=1}^{K} \frac{1}{\mathrm{TPR}_k}}$$
To illustrate how an imbalanced class distribution can distort overall performance while masking weak per-class behavior, consider a four-class problem with supports $|A| = 800$, $|B| = 600$, $|C| = 500$, $|D| = 100$ (total $N = 2000$). This yields prevalences $(0.40, 0.30, 0.25, 0.05)$, a long-tailed pattern typical of safety-critical multiclass tasks in which some classes—though equally important—are rare. We will compare the standard metrics with class-symmetric summaries (arithmetic, geometric, and harmonic means of the per-class sensitivities). If misclassifications concentrate in the minority class D, the standard metrics remain dominated by the frequent classes A–C and can appear deceptively high, whereas the means—especially the harmonic mean—drop to reflect the poor recall of D. With this in mind, take the following confusion matrix (rows = true classes, columns = predicted classes):

        Â     B̂     Ĉ     D̂
A     800     0     0     0
B       0   600     0     0
C       0     0   500     0
D      40    24    20    16
For this $4 \times 4$ confusion matrix, the classical performance scores are numerically high—accuracy = 0.958, macro-$F_1 \approx 0.803$, weighted-$F_1 \approx 0.943$, multiclass Matthews Correlation Coefficient $\mathrm{MCC} \approx 0.940$ [44,45], and Cohen's $\kappa \approx 0.938$. Yet these aggregates conceal a severe weakness: class D has $\mathrm{TPR}_D = 0.16$ (many false negatives). This happens because averaging and global agreement/correlation summaries allow substantial compensation across classes, so high overall scores can coexist with an underrepresented minority.
The per-class sensitivities (true positive rates) are
$$t = (t_A, t_B, t_C, t_D) = \left(\frac{800}{800}, \frac{600}{600}, \frac{500}{500}, \frac{16}{100}\right) = (1, 1, 1, 0.16).$$
The three Pythagorean means over $t$ are
$$A = \frac{1}{4}\sum_{k=1}^{4} t_k = 0.79, \qquad G = \Big(\prod_{k=1}^{4} t_k\Big)^{1/4} \approx 0.6325, \qquad H = \frac{4}{\sum_{k=1}^{4} 1/t_k} \approx 0.4324.$$
Thus, while $A$ and $G$ may remain high, $H$ can drop sharply—correctly signaling that at least one class has very low sensitivity. This behavior is structural: (i) $H$ is prevalence-insensitive because it aggregates within-class recalls; (ii) it is strictly more stringent than $G$ and $A$ ($H \leq G \leq A$, with equality only when all $\mathrm{TPR}_k$ are equal); (iii) it penalizes dispersion by Schur-concavity (any increase in inter-class inequality lowers $H$); and (iv) it is quantitatively constrained by underperforming classes (Proposition 1).
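These numbers can be checked mechanically from the confusion matrix above; the following sketch (ours) recomputes the accuracy, the per-class sensitivities, and the three means:

```python
import numpy as np

# Rows = true classes A..D, columns = predicted classes (matrix from Section 6).
cm = np.array([[800,   0,   0,  0],
               [  0, 600,   0,  0],
               [  0,   0, 500,  0],
               [ 40,  24,  20, 16]])

acc = np.trace(cm) / cm.sum()        # 0.958
tpr = np.diag(cm) / cm.sum(axis=1)   # [1.0, 1.0, 1.0, 0.16]
A = tpr.mean()                       # 0.79
G = np.prod(tpr) ** (1 / len(tpr))   # ~0.6325
H = len(tpr) / np.sum(1 / tpr)       # ~0.4324
print(acc, tpr, A, G, H)
```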
Proposition 1 
(Bounds for the harmonic mean of per-class TPRs). Let $r = (r_1, \dots, r_K) \in (0, 1]^K$ be the per-class sensitivities, and let $r_{\min} = \min_k r_k$ and $r_{\max} = \max_k r_k$. Then:
1. 
Universal bounds.
$$r_{\min} \leq H \leq r_{\max},$$
with equality on either side if and only if $r_1 = \cdots = r_K$.
2. 
Upper bound with $m$ weak classes. Assume at least $m$ classes satisfy $r_k \leq \tau$ with $0 < \tau \leq r_{\max} \leq 1$, and that $r_k \leq r_{\max}$ for all classes. Then
$$H \;\leq\; \frac{K}{\frac{m}{\tau} + \frac{K - m}{r_{\max}}} \;\leq\; r_{\max},$$
and the first inequality is attained at the extremal vector $(\underbrace{\tau, \dots, \tau}_{m},\ \underbrace{r_{\max}, \dots, r_{\max}}_{K - m})$.
Proof. 
(1) Since $r_{\min} \leq r_k \leq r_{\max}$ and $x \mapsto 1/x$ is decreasing on $(0, \infty)$, we have $\frac{1}{r_{\max}} \leq \frac{1}{r_k} \leq \frac{1}{r_{\min}}$. Summing over $k$ yields
$$\frac{K}{r_{\max}} \leq \sum_{k=1}^{K} \frac{1}{r_k} \leq \frac{K}{r_{\min}} \implies r_{\min} \leq H \leq r_{\max}.$$
Equality on either side forces all $r_k$ to be equal, and conversely.
(2) For the $m$ weak classes, $r_k \leq \tau$ implies $\frac{1}{r_k} \geq \frac{1}{\tau}$; for the remaining $K - m$ classes, the global ceiling $r_k \leq r_{\max}$ implies $\frac{1}{r_k} \geq \frac{1}{r_{\max}}$. Therefore
$$\sum_{k=1}^{K} \frac{1}{r_k} \geq \frac{m}{\tau} + \frac{K - m}{r_{\max}} \implies H = \frac{K}{\sum_k 1/r_k} \leq \frac{K}{\frac{m}{\tau} + \frac{K - m}{r_{\max}}}.$$
To prove that this upper bound is itself at most $r_{\max}$, observe
$$\frac{K}{\frac{m}{\tau} + \frac{K - m}{r_{\max}}} \leq r_{\max} \iff K \leq \frac{m\,r_{\max}}{\tau} + K - m \iff m \leq \frac{m\,r_{\max}}{\tau},$$
which holds because $\tau \leq r_{\max}$. □
Proposition 1 formalizes the nature of the harmonic mean when it aggregates class-conditional true-positive rates. Geometrically, along harmonic-mean isocurves in the binary case, achieving any target level requires both sensitivity and specificity to be sufficiently far from zero; a very low value on one axis cannot be offset by arbitrarily increasing the other. In the multiclass setting, the same principle holds: a subset of classes with low sensitivity constrains the overall harmonic mean, regardless of how well the remaining classes perform. This is precisely the behavior sought in safety-critical evaluation: the score remains conservative unless every class is adequately detected, preventing high aggregate numbers from obscuring poorly served classes.
The upper bound for $H$ entails two practical consequences: (a) monotonicity: strengthening the weak classes or improving the best classes can raise the attainable value of $H$, whereas adding more weak classes lowers it; (b) meeting a target $H$: e.g., with $K = 35$, $m = 1$, $r_{\max} = 1$, and a target $H = 0.80$, the critical $\tau \approx 0.103$, which means that a single class with $\mathrm{TPR} < 0.103$ forces $H < 0.80$, regardless of how well the other 34 classes perform.
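Both consequences follow from inverting the bound in Proposition 1; the helper below (function names are ours) computes the bound and the critical τ for a target H, reproducing τ ≈ 0.103 for K = 35, m = 1, r_max = 1:

```python
def h_upper_bound(tau, K, m, r_max=1.0):
    """Upper bound of Proposition 1 when m classes have TPR <= tau."""
    return K / (m / tau + (K - m) / r_max)

def critical_tau(h_target, K, m=1, r_max=1.0):
    """Solve K / (m/tau + (K - m)/r_max) = h_target for tau."""
    return m / (K / h_target - (K - m) / r_max)

print(round(critical_tau(0.80, K=35), 4))     # 0.1026
print(round(h_upper_bound(0.103, 35, 1), 4))  # 0.8008: just above the H = 0.80 target
```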
Figure 2a illustrates Proposition 1 in a stylized four-class scenario (three classes at $\mathrm{TPR} = 1$, one varying). The curve $H(r) = \frac{4r}{1 + 3r}$ coincides with the upper bound (with $K = 4$, $m = 1$, $r_{\max} = 1$), while $A(r) = \frac{r + 3}{4}$ and $G(r) = r^{1/4}$ lie strictly above it for $r \in (0, 1)$. The implication is explicit: even when most classes are perfect, a single weak class bounds $H$ from above and induces a concave, saturating response. Hence, while $A$ and $G$ remain comparatively optimistic, $H$ exposes—and quantitatively constrains—rare-class underperformance, which is decisive in safety-critical evaluation. Section 7.2 analyzes a multiclass cancer dataset in which a few low-sensitivity tumor types exert a substantial influence on $H$, producing a pronounced gap with global metrics (e.g., accuracy, $\kappa$) and making the rare-class deficits explicit.
Figure 2b plots Accuracy as a function of the varying class: three classes have perfect sensitivity $\mathrm{TPR} = 1$, and the remaining (varying) class has $\mathrm{TPR} = r \in [0, 1]$. Denote by $\pi_{\mathrm{var}}$ the prevalence of the varying class and by $1 - \pi_{\mathrm{var}}$ the total prevalence of the three perfect classes. Accuracy depends linearly on $r$: $\mathrm{Acc}(r) = (1 - \pi_{\mathrm{var}}) + \pi_{\mathrm{var}}\,r = 1 - \pi_{\mathrm{var}}(1 - r)$, so the slope equals $\pi_{\mathrm{var}}$ and the intercept at $r = 0$ is $1 - \pi_{\mathrm{var}}$. Thus, for $(25, 25, 25, 25)$ the slope is 0.25, for $(40, 30, 20, 10)$ it is 0.10, and for $(80, 10, 5, 5)$ it is 0.05. As the class becomes rarer, the line flattens and Accuracy becomes largely insensitive to $r$; even poor sensitivity of the rare class barely changes the overall score. This linearity follows directly from the class-wise decomposition $\mathrm{Acc} = \sum_k \pi_k\,\mathrm{TPR}_k$ and does not depend on how errors are distributed among the other classes. The $H$ curve (red, identical to panel (a)) shows the opposite behavior: it penalizes small $r$ strongly, making rare-class underperformance visible where Accuracy may mask it.
The harmonic mean offers two practical advantages over $A$ and $G$ in safety-critical evaluation. First, it is non-compensatory: a single weak class exerts decisive influence, so high scores are impossible unless every class attains adequate sensitivity. Second, it is the most conservative of the Pythagorean means, providing a stable, prevalence-insensitive summary that aligns with worst-class reliability requirements. By contrast, $A$ and $G$ are progressively more permissive, allowing strong performance on frequent or easy classes to offset weaknesses elsewhere. Therefore, $H$ may not be preferable when stakeholders explicitly accept compensatory trade-offs (e.g., screening pipelines where aggregate throughput is prioritized), or when some classes have very scarce or noisy labels, where the strong penalty of $H$ near zero can overreact to estimation noise; in such cases $G$ or $A$ provide smoother summaries consistent with the application goals. However, if $H$ is set aside for either reason, the choice should be explicit and well justified, as it means accepting either compensatory masking of weak classes or reduced protection against rare-class failures—trade-offs with significant consequences in safety-critical settings.
Section 6 formalized the harmonic mean H as a prevalence-insensitive, class-symmetric aggregator of per-class true-positive rates and established that it is the most conservative of the Pythagorean means; in particular, Proposition 1 proves that even a small set of weak classes places a hard upper bound on H , while A and G can remain comparatively high. Building on these theoretical considerations, Section 7 empirically validates the corresponding claims.

7. Experimental Analysis

The experimental analysis is designed to evaluate the benefits of using the Pythagorean means of true positive rates compared to the standard, commonly used classification performance measures, and to analyze the contribution of the particular behavior of the harmonic mean.
As a baseline classifier, the Random Forest (RF) algorithm [46] was chosen because it is an ensemble method rather than a simple model, typically achieves good results (highly competitive), is computationally efficient (built on fast decision trees), and can be considered a black-box model—similar to deep learning approaches, though without large-scale parameterization. Concretely, the settings were: n_estimators = 100 (number of trees); max_depth = None (unbounded); criterion = “gini”; max_features = “sqrt”; bootstrap = True; min_samples_split = 2; min_samples_leaf = 1; class_weight = None; random_state = 0. These are the library defaults; no hyperparameter tuning or resampling was performed. However, the choice of classifier is purely illustrative, i.e., the experiments are model-agnostic, allowing readers to apply any classifier.
All experiments were carried out with stratified 10-fold cross-validation. We evaluated two multiclass performance measures: accuracy and the area under the MCP (Multi-class Classification Performance) curve [18,47]. The MCP curve is computed directly from the class probabilities output by the classifier—analogous to the ROC curve in the binary case—and, unlike accuracy, its area reflects the probabilistic nature of the predictions [48].
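A minimal reproduction of this protocol might look as follows; this is our own sketch (the helper name and shuffling seed are assumptions, and the probability-based MCP area would additionally require the imcp package cited above), computing the label-level summaries reported in Table 2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def evaluate(X, y, seed=0):
    """Acc and the A/G/H means of per-class TPRs under stratified 10-fold CV."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)  # library defaults
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    y_hat = cross_val_predict(clf, X, y, cv=cv)
    cm = confusion_matrix(y, y_hat)
    tpr = np.diag(cm) / cm.sum(axis=1)  # per-class sensitivities
    return {
        "Acc": accuracy_score(y, y_hat),
        "A": tpr.mean(),
        "G": np.prod(tpr) ** (1 / len(tpr)),
        "H": len(tpr) / np.sum(1 / tpr),  # inf in the sum -> H = 0 if a class has TPR = 0
    }
```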

7.1. UCI Datasets

The study involves 18 datasets from the UCI Machine Learning Repository [49], varying in the number of samples (150 to 19,020), number of variables (4 to 166), and number of classes (2 to 11) (see Table 2).
Across the 18 UCI datasets, accuracy exceeds the harmonic mean of per-class sensitivities by a modest margin on average (mean row: Acc = 0.886 vs. H = 0.863, Δ = +0.023, ≈2.7% relative to H), but the gap is highly dataset-dependent. On several skewed or multiclass problems the difference is pronounced: cardiotocography_morphologic shows Acc = 0.896 vs. H = 0.803 (absolute +0.093, 11.6% relative), vertebral_column 0.842 vs. 0.767 (+0.075, 9.8%), customer_churn 0.956 vs. 0.895 (+0.061, 6.8%), landsat_satellite 0.916 vs. 0.869 (+0.047, 5.4%), and musk_v2 0.979 vs. 0.931 (+0.048, 5.2%). These are precisely the settings where per-class sensitivities are uneven: A ≥ G ≥ H, with H dropping most when some classes are weak, while Acc remains buoyed by majority classes. By contrast, on nearly balanced/easy tasks—e.g., rice_cammeo_osmancik, data_banknote_authentication, divorce, phishing_websites, or the classic iris—all three means nearly coincide with accuracy, indicating limited dispersion across per-class recalls.
The last row highlights a striking contrast between class-label summaries and the probability-based MCP: the mean MCP is 0.761, substantially below H (0.863), G (0.866), and A (0.869). This systematic drop reflects that MCP assesses the full probability vectors, not just the final class labels; consequently it is sensitive to weak margins or miscalibration even when Acc and H look high. The effect is pronounced on vowel (11 classes): Acc = A = G = H = 0.973 but MCP = 0.605, revealing limited probability separation despite excellent label accuracy. Similar patterns appear in musk_v1 (0.899 vs. 0.894 for labels, MCP = 0.653), sonar (0.822 vs. 0.808, MCP = 0.607), and heart (0.833 vs. 0.824, MCP = 0.643). Overall, the mean row (Acc/A/G/H ≈ 0.86–0.89 vs. MCP = 0.761) supports a dual recommendation: use H to guard against rare-class underperformance in label space, and report MCP to capture probability-level quality. Together they reduce optimism from prevalence-sensitive aggregates and expose weaknesses that accuracy alone can obscure.

7.2. Case Study: Predicting 35 Cancer Types

To illustrate the effect of imbalance in multiclass settings, we adopt the cancer prediction dataset of Nguyen et al. [50]. That study applies machine learning to infer the tumor tissue of origin in advanced-stage metastatic cases—a clinically critical task because a substantial subset of patients lack definitive histopathologic diagnoses and, since therapy choices depend on the primary site, face limited treatment options. The source compendium aggregates multiple molecular assays and initially provides 4131 features designed to discriminate 35 cancer types based on driver/passenger status and simple/complex mutation patterns. It contains 6756 samples drawn from the Hartwig Medical Foundation (metastatic tumors) and the Pan-Cancer Analysis of Whole Genomes consortium (primary tumors). The authors then perform univariate feature selection to retain 463 variables and derive 48 additional regional mutational-density features via nonnegative matrix factorization, yielding the version used here: 6756 samples, 511 predictors, and a 35-class target (biliary, breast, cervix, liver, pancreas, thyroid, etc.).
The multiclass nature of the task makes some binary measures (e.g., ROC) less standardized, since several non-equivalent generalizations exist (one-vs-rest, one-vs-one, macro/micro AUC, volume under the ROC surface). We therefore emphasize prevalence-insensitive, class-conditional summaries and the MCP curve. The class distribution is markedly imbalanced: among 35 tumor types, Breast has 996 samples (14.7%) while Skin_Carcinoma has 25 (0.4%), a ∼40:1 frequency ratio. Such skew can make overall accuracy appear high while concealing weak per-class sensitivities, underscoring the need for metrics that reflect class-wise performance (class-independent vs. class-specific perspective [33]).
On this high-dimensional, many-class dataset, global agreement indices are high (Acc = 0.906, Cohen's κ = 0.899), indicating that most predictions match the true labels and that agreement remains strong even after correcting for chance under a large label space. However, the prevalence-insensitive, aggregated means of the per-class true positive rates reveal substantial heterogeneity across classes: the arithmetic mean is A = 0.795, the geometric mean G = 0.743, and the harmonic mean H = 0.658, with the strict ordering H < G < A < Acc = 0.906. The large gap between A and H (about 0.137) signals marked dispersion in class sensitivities: the harmonic mean is disproportionately penalized by low per-class sensitivities, so even a small subset of under-served classes (with very low TPR) can depress H while leaving Accuracy and κ relatively unaffected if those classes are rare. In fact, the value of Acc is about 38% higher than the value of H, which reveals how important it is to analyze the class-specific behavior of a system intended to make predictions on specific tumor classes.
Consistently, the area under the MCP curve is moderate (MCP = 0.564), indicating that few operating-point configurations achieve uniformly high per-class sensitivity; this aligns with the low H, which diagnoses a long tail of weak classes. The larger the area under the MCP curve, the lower the uncertainty of the predictive system. The MCP curve for this multiclass dataset (see Figure 6 in [47]) shows that probabilistic measures are aligned with H—and can be even stricter—in highlighting poor per-class behavior of the diagnostic system, which might be dramatic in medical realms.
Therefore, what appeared to be a sound diagnostic system (Acc = 0.906) may not be so reliable (H = 0.658 and AU(MCP) = 0.564). In sum, while overall agreement is strong, the prevalence-insensitive summaries—especially the harmonic mean—expose uneven class-wise performance that would be obscured by prevalence-sensitive measures such as Accuracy or Cohen's κ.

8. Conclusions

This work clarifies why widely used global indices—accuracy, F1-score, Cohen's κ, MCC—are intrinsically prevalence-sensitive and can therefore distort performance assessment under class imbalance. We formalized this dependence, advocated prevalence-insensitive evaluation built solely from class-conditional rates, and analyzed the arithmetic, geometric, and harmonic means as class-symmetric aggregators, including a geometric account of their isocurves and stringency. Beyond the well-known ordering H ≤ G ≤ A, we established bounds that clarify when H must decrease and by how much: Proposition 1 quantifies the impact of m weak classes and yields actionable constraints. The complementary visual analyses (Figure 2) make these effects explicit: while Accuracy may remain almost flat when the weak class is rare, H reacts strongly to small per-class sensitivities. Together, these results supply a coherent toolbox for risk-aware evaluation in safety-critical diagnosis.
Empirically, the UCI study shows that the mean gap between Accuracy and H is modest on average but can be substantial on skewed datasets, confirming that prevalence-sensitive aggregates may overstate performance. The cancer case study further demonstrates that high global agreement can coexist with widely dispersed per-class sensitivities, where H exposes rare-class deficits that Accuracy, κ , or MCC may mask. Probability-based MCP complements these label-level summaries by reflecting separation in predictive probabilities, revealing weaknesses even when label metrics look strong.
In general, reliance on global prevalence-sensitive measures can overstate safety precisely where failures are most consequential—rare classes—thereby inflating confidence and masking operational risk. Prevalence-insensitive, class-symmetric means—particularly the harmonic mean of per-class sensitivity—provide conservative, comparable summaries that better track real diagnostic risk.
The theory naturally extends along two axes: (a) families of means (Gini, Kolmogorov–Nagumo–de Finetti, Lehmer, Stolarsky), enabling principled control of sensitivity to dispersion while remaining prevalence-insensitive; (b) majorization tools (Schur-convexity, T-transforms, Karamata) to sharpen guarantees under distributional shifts of { TPR k } . These future work directions aim at evaluation that remains reliable when rare classes matter most.

Funding

This work was supported by Grant PID2023-152660NB-I00 funded by the Ministry of Science, Innovation and Universities, Spain.

Data Availability Statement

All benchmark datasets are publicly available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ (accessed on 22 February 2025). The pan–cancer case study uses the dataset released with the corresponding Nature Communications article [50]; the data are accessible via the article’s Data Availability/Supplementary Information pages. This dataset is publicly and freely accessible for academic purposes at https://www.nature.com/articles/s41467-022-31666-w (accessed on 25 June 2023). To generate the IMCP curves, all experiments used the Python package [51] publicly available at https://github.com/adaa-polsl/imcp (accessed on 11 May 2025).

Conflicts of Interest

The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Kolmogorov–Nagumo–de Finetti Mean

Let $x = (x_1, \dots, x_n) \in (0, \infty)^n$. For a continuous strictly monotone generator $\phi : (0, \infty) \to \mathbb{R}$, the (unweighted) Kolmogorov–Nagumo–de Finetti mean is
$$M_\phi(x) = \phi^{-1}\!\left(\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\right)$$
It is continuous, strictly increasing in each coordinate, symmetric, and idempotent: $\min_i x_i \leq M_\phi(x) \leq \max_i x_i$, with equality iff all $x_i$ coincide. Historical sources include Kolmogorov's seminal work [34], its English reprint [52], Nagumo's 1930 paper [35], and de Finetti's treatment [36].
The classical arithmetic ($A$), geometric ($G$), and harmonic ($H$) means are recovered by specific generators $\varphi$:
$$\varphi(x) = x \implies M_\varphi = A = \frac{1}{n}\sum_i x_i, \qquad \varphi(x) = \ln x \ (x > 0) \implies M_\varphi = G = \Big(\prod_i x_i\Big)^{1/n}, \qquad \varphi(x) = 1/x \ (x > 0) \implies M_\varphi = H = \frac{n}{\sum_i 1/x_i}.$$
Thus, the Pythagorean triad $(A, G, H)$ appears as the triple $(M_{\mathrm{id}}, M_{\ln}, M_{1/x})$ within the Kolmogorov–Nagumo–de Finetti scheme [21,22].

Appendix B. Gini Mean

For $x \in (0, \infty)^n$ and parameters $(p, q) \in \mathbb{R}^2$ with $p \neq q$, the (unweighted) Gini mean is
$$I(p, q; x) = \left(\frac{\sum_{i=1}^{n} x_i^p}{\sum_{i=1}^{n} x_i^q}\right)^{\frac{1}{p - q}}$$
It is symmetric in $(p, q)$, continuous (with the diagonal limit well defined), homogeneous of degree 1, and satisfies $\min_i x_i \leq I(p, q; x) \leq \max_i x_i$. Using symmetry in $(p, q)$ and the diagonal limit,
$$A = I(1, 0) = I(0, 1), \qquad G = I(0, 0) = \lim_{p \to 0} I(p, 0), \qquad H = I(0, -1) = I(-1, 0).$$
Interestingly, if $q = 0$, then $I(r, 0; x) = M_r(x)$ is the power mean of order $r$, so that $M_1 = A$, $M_0 = G$, $M_{-1} = H$. Hence the one-parameter power-mean curve $\{(p, q) = (r, 0)\}$ inside the Gini plane passes through $(H, G, A)$ at $r = -1, 0, 1$. In addition, setting $(p, q) = (r + 1, r)$ yields
$$I(r + 1, r; x) = \frac{\sum_i x_i^{r+1}}{\sum_i x_i^{r}} = L_r(x),$$
the Lehmer mean, showing that $L_r$ is a Gini subfamily [21]. In fact, $L_1$ is the contraharmonic mean, which satisfies $L_1 \geq A \geq G \geq H$, with equality iff all $x_i$ are equal.
A notable structural result is that the intersection of the Gini and Stolarsky two-parameter families consists exactly of the power means (for n = 2 ) and reduces to the Pythagorean triad ( A , G , H ) for n 3 [37]; see also [38,53] for concise formulas and extensions.

References

  1. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
  2. Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
  3. Van Rijsbergen, C.J. Foundation of Evaluation. J. Doc. 1974, 30, 365–373.
  4. Van Rijsbergen, C.J. Information Retrieval, 2nd ed.; Butterworths: London, UK, 1979.
  5. Matthews, B.W. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochim. Biophys. Acta (BBA) Protein Struct. 1975, 405, 442–451.
  6. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
  7. Metz, C.E. Basic principles of ROC analysis. Semin. Nucl. Med. 1978, 8, 283–298.
  8. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
  9. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  10. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
  11. Davis, J.; Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA, 2006; pp. 233–240.
  12. Hand, D.J.; Till, R.J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 2001, 45, 171–186.
  13. Kovács, G.; Fazekas, A. mlscorecheck: Testing the consistency of reported performance scores and experiments in machine learning. Neurocomputing 2024, 583, 127556.
  14. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
  15. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  16. Branco, P.; Torgo, L.; Ribeiro, R.P. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 2016, 49, 31.
  17. Weiss, G.M. Mining with Rarity: A Unifying Framework. SIGKDD Explor. 2004, 6, 7–19.
  18. Aguilar-Ruiz, J.S.; Michalak, M. Multiclass Classification Performance Curve. IEEE Access 2022, 10, 68915–68921.
  19. Hernández-Orallo, J.; Flach, P.A.; Ferri, C. A Unified View of Performance Metrics for Classification. Pattern Recognit. Lett. 2012, 33, 1–13.
  20. Pepe, M.S. The Statistical Evaluation of Medical Tests for Classification and Prediction; Oxford University Press: Oxford, UK, 2003.
  21. Bullen, P.S. Handbook of Means and Their Inequalities. In Mathematics and Its Applications; Springer: Dordrecht, The Netherlands, 2003; Volume 560.
  22. Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities, 2nd ed.; Cambridge University Press: Cambridge, UK, 1952.
  23. Cho, B.H.; Yu, H.; Kim, K.W.; Kim, T.H.; Kim, I.Y.; Kim, S.I. Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods. Artif. Intell. Med. 2008, 42, 37–53.
  24. De Oña, J.; Mujalli, R.O.; Calvo, F.J. Analysis of traffic accident injury severity on Spanish rural highways using Bayesian networks. Accid. Anal. Prev. 2011, 43, 402–411.
  25. Fang, S.; Fang, X.; Xiong, M. Psoriasis prediction from genome-wide SNP profiles. BMC Dermatol. 2011, 11, 1.
  26. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The Balanced Accuracy and Its Posterior Distribution. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 3121–3124.
  27. Youden, W.J. Index for Rating Diagnostic Tests. Cancer 1950, 3, 32–35.
  28. Moghim, N.; Corne, D.W. Predicting Epileptic Seizures in Advance. PLoS ONE 2014, 9, e99334.
  29. Opitz, J. A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice. Trans. Assoc. Comput. Linguist. 2024, 12, 820–836.
  30. Van Laere, S.; Muylle, K.M.; Dupont, A.G.; Cornu, P. Machine Learning Techniques Outperform Conventional Statistical Methods in the Prediction of High Risk QTc Prolongation Related to a Drug-Drug Interaction. J. Med. Syst. 2022, 46, 100.
  31. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432.
  32. Byrt, T.; Bishop, J.; Carlin, J.B. Bias, Prevalence and Kappa. J. Clin. Epidemiol. 1993, 46, 423–429.
  33. Aguilar-Ruiz, J.S. Class-specific feature selection for enhancing explainability in ensemble classification models. Int. J. Data Sci. Anal. 2025, 20, 3771–3780.
  34. Kolmogorov, A.N. Sur la notion de la moyenne. Atti Della R. Accad. Naz. Dei Lincei. Rend. Cl. Di Sci. Fis. Mat. E Nat. 1930, 12, 388–391.
  35. Nagumo, M. Über eine Klasse der Mittelwerte. Jpn. J. Math. 1930, 7, 71–79.
  36. De Finetti, B. Sul concetto di media. G. Dell'Istituto Ital. Degli Attuari 1931, 2, 369–396.
  37. Alzer, H.; Ruscheweyh, S. On the Intersection of Two-Parameter Mean Value Families. Proc. Am. Math. Soc. 2001, 129, 2655–2662.
  38. Czinder, P.; Páles, Z. Some Comparison Inequalities for Gini and Stolarsky Means. Math. Inequalities Appl. 2006, 9, 607–616.
  39. Neuman, E.; Páles, Z. On Comparison of Stolarsky and Gini Means. J. Math. Anal. Appl. 2003, 278, 274–284.
  40. Hand, D.J.; Anagnostopoulos, C. Notes on the H-measure of classifier performance. Adv. Data Anal. Classif. 2023, 17, 109–124.
  41. Barandela, R.; Sánchez, J.S.; García, V.; Rangel, E. Strategies for Learning in Class Imbalance Problems. Pattern Recognit. 2003, 36, 849–851.
  42. Mujalli, R.O.; de Oña, J. A method for simplifying the analysis of traffic accidents injury severity on two-lane highways using Bayesian networks. J. Saf. Res. 2011, 42, 317–326.
  43. Garcés, P.; Baumeister, S.; Mason, L.; Chatham, C.H.; Holiga, S.; Dukart, J.; Jones, E.J.H.; Banaschewski, T.; Baron-Cohen, S.; Bölte, S.; et al. Resting state EEG power spectrum and functional connectivity in autism: A cross-sectional analysis. Mol. Autism 2022, 13, 22.
  44. Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 2004, 28, 367–374.
  45. Jurman, G.; Riccadonna, S.; Furlanello, C. A Comparison of MCC and CEN Error Measures in Multi-Class Prediction. PLoS ONE 2012, 7, e41882.
  46. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  47. Aguilar-Ruiz, J.S.; Michalak, M. Classification performance assessment for imbalanced multiclass data. Sci. Rep. 2024, 14, 10759.
  48. Aguilar-Ruiz, J.S. Beyond the ROC Curve: The IMCP Curve. Analytics 2024, 3, 221–224.
  49. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, Irvine, School of Information and Computer Sciences: Irvine, CA, USA, 2017.
  50. Nguyen, L.; Van Hoeck, A.; Cuppen, E. Machine learning–based tissue of origin classification for cancer of unknown primary diagnostics using genome–wide mutation features. Nat. Commun. 2022, 13, 4013.
  51. Aguilar-Ruiz, J.S.; Michalak, M.; Wróbel, Ł. IMCP: A Python package for imbalanced and multiclass data classifier performance comparison. SoftwareX 2024, 28, 101877.
  52. Kolmogorov, A.N. On the Notion of Mean. In Selected Works of A. N. Kolmogorov, Volume I: Mathematics and Mechanics; Tikhomirov, V.M., Ed.; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1991; pp. 144–146.
  53. Witkowski, A. On the Intersection of Some Families of Means. RGMIA Res. Rep. Collect. 2006, 9, 2.
Figure 1. Isocurves for the arithmetic, geometric, and harmonic means. The X-axis and Y-axis represent the False Positive Rate (FPR) and the True Positive Rate (TPR), respectively.
Figure 2. Visual comparison of H with the other means A and G (left) and with Accuracy under three class distributions (right). Under equal class prevalence (25, 25, 25, 25), A (brown, left) coincides exactly with Accuracy (brown, right), and H (red) is identical in both panels, providing a common reference. (a) Means of per-class sensitivity for K = 4 with three classes fixed at TPR = 1 and one varying class with TPR = r ∈ [0, 1]; curves: A(r) = (r + 3)/4 (arithmetic), G(r) = r^{1/4} (geometric), and H(r) = 4r/(1 + 3r) (harmonic). (b) Accuracy for three class-distribution scenarios: (25, 25, 25, 25) (balanced), (40, 30, 20, 10) (mild imbalance), and (80, 10, 5, 5) (severe imbalance).
Table 1. Confusion Matrix.

          Ŷ = 1    Ŷ = 0
Y = 1     TP       FN
Y = 0     FP       TN
Table 2. Results across datasets from the UCI Repository: dataset name, number of samples (#s), number of variables (#v) and number of class labels (#c), respectively; Acc = Accuracy; A, G, H denote the arithmetic, geometric, and harmonic mean of per-class TPRs; MCP is the area under the MCP curve.

Dataset                        #s      #v   #c   Acc    A      G      H      MCP
vertebral_column               310     6    3    0.842  0.792  0.780  0.767  0.702
hepatitis_c                    1385    28   4    0.267  0.266  0.262  0.259  0.293
rice_cammeo_osmancik           3810    7    2    0.921  0.919  0.919  0.919  0.848
data_banknote_authentication   1372    4    2    0.993  0.993  0.993  0.993  0.945
musk_v2                        6598    166  2    0.979  0.935  0.933  0.931  0.895
magic_gamma_telescope          19,020  10   2    0.882  0.856  0.852  0.847  0.726
cardiotocography_morphologic   2126    23   10   0.896  0.828  0.816  0.803  0.683
divorce                        170     54   2    0.977  0.976  0.976  0.976  0.925
australian_credit              690     14   2    0.873  0.871  0.871  0.871  0.701
musk_v1                        476     166  2    0.899  0.895  0.895  0.894  0.653
sonar                          208     60   2    0.822  0.816  0.812  0.808  0.607
heart                          270     13   2    0.833  0.828  0.826  0.824  0.643
iris                           150     4    3    0.940  0.940  0.939  0.938  0.905
customer_churn                 3150    13   2    0.956  0.902  0.899  0.895  0.883
phishing_websites              11,055  30   2    0.972  0.971  0.971  0.971  0.914
room_occupancy                 10,129  16   4    0.998  0.992  0.992  0.992  0.983
landsat_satellite              6435    36   6    0.916  0.891  0.881  0.869  0.780
vowel                          990     13   11   0.973  0.973  0.973  0.973  0.605
Mean                           -       -    -    0.886  0.869  0.866  0.863  0.761