A Bayesian Feature Weighting Model with Simplex-Constrained Dirichlet and Contamination-Aware Priors for Noisy Medical Data

Cengiz, Mehmet Ali; Öztürk, Zeynep; Alharthi, Abdulmohsen

doi:10.3390/math14081243

Open Access Article


by Mehmet Ali Cengiz ¹,*, Zeynep Öztürk ² and Abdulmohsen Alharthi ¹

¹ Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 13318, Saudi Arabia

² Hopa Faculty of Economics and Administrative Sciences, Artvin Çoruh University, Hopa 08010, Türkiye

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(8), 1243; https://doi.org/10.3390/math14081243

Submission received: 8 February 2026 / Revised: 1 April 2026 / Accepted: 3 April 2026 / Published: 8 April 2026

(This article belongs to the Special Issue Statistical Machine Learning: Models and Its Applications)


Abstract

Feature weighting plays a central role in medical classification by enhancing predictive accuracy, interpretability, and clinical trust through the explicit quantification of variable relevance. Despite their widespread use, existing filter-, wrapper-, and embedded-based feature weighting methods are predominantly deterministic and exhibit pronounced sensitivity to label noise and outliers, which are pervasive in real-world medical data. This often results in unstable importance estimates and unreliable clinical interpretations. In this work, we introduce a novel Bayesian feature weighting model that fundamentally departs from existing approaches by jointly integrating simplex-constrained Dirichlet priors for global feature weights, hierarchical shrinkage priors for coefficient regularization, and contamination-aware priors for explicit modeling of label noise within a single coherent probabilistic framework. Unlike conventional Bayesian feature selection or robust classification models, the proposed formulation yields globally interpretable feature weights defined on the probability simplex, while simultaneously providing full posterior uncertainty quantification and robustness to both mislabeled observations and aberrant feature values through principled influence control. Comprehensive simulation studies across diverse contamination scenarios, together with applications to multiple real-world medical datasets, demonstrate that the proposed model consistently outperforms classical and state-of-the-art baselines in terms of discrimination, probabilistic calibration, and stability of feature-importance estimates. These results highlight the practical and methodological significance of the proposed framework as a robust, uncertainty-aware, and interpretable solution for medical decision making under noisy data conditions.

Keywords: Bayesian feature weighting; label noise; outliers; uncertainty quantification; horseshoe prior

MSC: 62F15

1. Introduction

Feature weighting plays a central role in supervised learning because it directly influences both the interpretability and predictive performance of statistical and machine learning models. By assigning relative importance scores to input variables, feature weighting methods enhance the transparency of model decisions while mitigating the adverse effects of irrelevant or redundant predictors. Such mechanisms are particularly critical in high-dimensional domains, including genomics, medical diagnostics, text mining, and sensor-based systems, where large numbers of noisy variables often obscure the underlying predictive signal [1,2].

Despite their practical importance, most existing feature weighting approaches remain fundamentally deterministic and fragile in the presence of data contamination. In real-world medical datasets, outliers and mislabeled observations frequently arise due to measurement errors, imperfect annotations, or patient-specific variability. Classical feature weighting techniques—often based on point estimates from logistic regression or margin-based classifiers—can be disproportionately influenced by such anomalies. As a result, the estimated feature weights may fail to reflect the true predictive relevance of variables, leading to unstable models, degraded accuracy, and unreliable clinical interpretations [3,4].

To mitigate these limitations, recent research has explored robust extensions of supervised learning models, including regularized logistic regression with heavy-tailed priors, noise-tolerant boosting algorithms, and robust margin-based feature selection methods. Although these approaches improve resilience to data irregularities, they typically do not provide a full probabilistic characterization of feature importance. In particular, deterministic weighting schemes lack uncertainty quantification, which severely limits their interpretability and usefulness in high-stakes domains such as healthcare and financial risk modeling [5,6].

Bayesian methods offer a principled framework for addressing these shortcomings by integrating feature weighting, uncertainty quantification, and robustness within a unified hierarchical formulation. By treating feature weights as probability distributions rather than fixed values, Bayesian models capture both the expected relevance of predictors and the uncertainty surrounding these estimates. Moreover, hierarchical shrinkage priors, such as the horseshoe prior, enable automatic regularization and protection against overfitting in high-dimensional settings. Robustness to contamination can be further enhanced through explicit modeling of label noise and heavy-tailed latent structures, providing principled defenses against outliers and misclassified samples [7,8,9,10].

However, existing Bayesian approaches typically address these components in isolation. In particular, current models often focus either on coefficient shrinkage or on robust likelihood formulations, while global feature weighting with explicit probability-simplex constraints and contamination-aware priors has received limited attention. As a result, there remains a methodological gap between uncertainty-aware Bayesian modeling and interpretable, globally normalized feature-importance estimation under noisy conditions.

In this study, we propose a novel Bayesian robust feature weighting framework (Bayes_FW) that explicitly addresses this gap. The proposed model introduces global feature weights constrained to the probability simplex via a Dirichlet prior, ensuring normalized and directly interpretable importance scores. Simultaneously, regression coefficients are regularized using hierarchical horseshoe priors, while robustness is achieved through an explicit label-noise contamination mechanism and heavy-tailed error components. This unified formulation enables principled influence control over mislabeled observations and aberrant feature values, while providing full Bayesian uncertainty quantification through posterior inference via Markov Chain Monte Carlo (MCMC).

The experimental evaluation employs four benchmark medical datasets representing increasing levels of discrimination difficulty, calibration instability, and data heterogeneity. The Breast Cancer Wisconsin dataset, characterized by well-separated feature distributions and low noise, represents a setting with easy discrimination and high calibration stability. The Pima Indians Diabetes dataset poses a moderate challenge due to overlapping class boundaries and moderate imbalance, introducing calibration complexity. The South African Heart Disease dataset further increases difficulty by combining categorical and continuous predictors with mixed uncertainty sources. Finally, the Cleveland Heart Disease dataset represents a highly challenging scenario due to its small sample size, heterogeneous structure, and known sensitivity to label noise and calibration instability.

To ensure the reliability of the proposed Bayesian framework, comprehensive posterior diagnostic analyses were conducted for both simulation studies and real-data applications. In all experimental settings, standard Markov Chain Monte Carlo (MCMC) diagnostics including trace plots, posterior density estimates, effective sample size ($N_{eff}$), potential scale reduction factors ($\hat{R}$), and posterior predictive checks were systematically evaluated. These diagnostics consistently indicated stable convergence, efficient sampling, and well-behaved posterior distributions, with no evidence of pathological behavior such as non-convergence, poor mixing, or degeneracy. For illustrative purposes, the full set of diagnostic plots corresponding to the Pima Indians Diabetes dataset is provided in Appendix A. These results are representative of the overall behavior observed across all experiments and confirm the robustness and computational reliability of the proposed modeling approach.

Across both synthetic simulations and real-world medical datasets, the proposed Bayesian framework demonstrates superior discrimination (AUC, F1) and calibration (LogLoss, Brier score, ECE) compared to classical logistic and ensemble-based models. Importantly, the model yields more stable posterior estimates of feature relevance, enabling accurate identification of key biomarkers while explicitly quantifying uncertainty in their effects. These findings collectively establish the proposed approach as a robust, uncertainty-aware, and interpretable feature weighting solution for reliable medical decision making under noisy data conditions.

The main contributions of this paper can be summarized as follows:

We propose a novel Bayesian feature weighting model that incorporates a simplex-constrained Dirichlet prior, ensuring normalized and interpretable feature importance.
We introduce contamination-aware priors to enhance robustness against noisy and potentially corrupted medical data.
We develop an efficient inference framework based on Hamiltonian Monte Carlo (HMC), enabling stable posterior estimation under simplex constraints.
We validate the proposed model through extensive simulation studies and real-data analysis demonstrating its effectiveness, robustness, and practical applicability.

2. Related Work

Feature selection and feature weighting are fundamental preprocessing techniques in supervised learning, both of which aim to improve model performance, interpretability, and generalization by emphasizing relevant variables. Traditionally, feature selection has been tackled through filter-based methods such as mutual information ranking [1], wrapper methods such as recursive feature elimination [2], and embedded approaches using penalization techniques (e.g., L1-regularized logistic regression [3]). Although effective for structured and clean datasets, these deterministic methods tend to be fragile in the presence of noise and outliers, often resulting in unstable or misleading importance scores.

Robust supervised learning approaches have emerged as a response to this challenge. For instance, heavy-tailed distributions such as Student-t have been incorporated into generalized linear models to mitigate the influence of extreme observations [4]. Additionally, boosting algorithms (e.g., AdaBoost) adaptively reweight samples to reduce the impact of mislabeled data [5]. More recent contributions include robust loss functions and methods explicitly designed to handle label noise [6]. However, while these strategies increase resilience, they often lack mechanisms for quantifying the uncertainty of feature weights, which is a crucial requirement in high-stakes or noisy environments.

The Bayesian learning paradigm has become increasingly popular for addressing these limitations, offering both model regularization and probabilistic uncertainty quantification. Traditional Bayesian feature selection methods, such as the horseshoe prior [7] and spike-and-slab models [8], adaptively shrink coefficients and provide interpretable inclusion probabilities. Nonparametric extensions [11] enable greater flexibility in high-dimensional regimes.

Recent advances have extended these foundations to more explicitly address robustness. Bayesian deep metric learning approaches have demonstrated theoretical robustness under label noise using variational inference frameworks [12]. Hierarchical probabilistic models such as WarPI have shown measurable improvements, achieving 3.73% accuracy gains over baseline methods on CIFAR-100 under 40% asymmetric noise conditions [13]. These approaches utilize hierarchical probabilistic modeling to quantify both epistemic and aleatoric uncertainty while maintaining robustness to various noise types including uniform, asymmetric, and instance-dependent noise [13].

Variational Bayesian approaches have also proven effective for feature weighting tasks. Methods employing Laplace priors with variational inference provide better uncertainty estimates while retaining correlated features and stability with respect to hyperparameter choices [14]. Optimal Bayesian feature filtering techniques have demonstrated outstanding performance relative to traditional feature selection methods using hierarchical models that provide closed-form solutions for high-dimensional data [15].

Contemporary research has explored meta-learning approaches for robust feature weighting. Probabilistic meta-weighting methods, such as PMW-Net, address the limitations of deterministic weighting functions by incorporating probabilistic treatments that handle both epistemic and aleatoric uncertainties [16]. Uncertainty-aware label correction frameworks combine Bayesian neural networks with Gaussian modeling to identify trustworthy samples and correct mislabeled data, showing superior performance compared with methods such as Co-teaching+ and DivideMix [17].

Hybrid approaches have also emerged that combine Bayesian and frequentist elements. Subjective logic-based methods utilize Dirichlet distributions and neural network parameterization to handle partial-label learning scenarios with high noise levels, out-of-distribution examples, and adversarial perturbations [18]. These methods provide an explicit uncertainty representation while maintaining robustness across diverse contamination scenarios.

On a different axis of development, ref. [19] provides a comprehensive taxonomy of feature weighting (FW) methods, classifying them according to the learning paradigm (supervised vs. unsupervised), scope (global vs. local), and optimization strategy (filter vs. wrapper). In supervised settings, global filter-based FW methods compute feature importance independently using metrics such as Mutual Information, Information Gain, or Fisher Score, whereas wrapper methods leverage iterative optimization (e.g., via Genetic Algorithms, Gradient Ascent, or Particle Swarm Optimization) to improve model performance.

Despite these advances, several limitations persist. Traditional robust methods, including trimmed Bayesian information criterion approaches and maximum likelihood estimation with contamination modeling [20], often address outlier detection and label noise separately from feature weighting. Adaptive noise modeling techniques, which are effective for dimension-specific or group-specific noise handling, typically lack comprehensive uncertainty quantification mechanisms [21].

In light of these developments, recent advances in Bayesian robustness have introduced label-noise priors [9] and heavy-tailed priors [10] to enhance classification stability in contaminated settings. However, the existing approaches typically address either robustness or uncertainty quantification in isolation. The reviewed literature demonstrates that, although individual components such as hierarchical shrinkage [22], contamination-aware modeling [20] and probabilistic weighting [16] have shown promise, a unified framework that simultaneously provides robust classification and uncertainty-aware global feature weights has not been sufficiently explored.

The present work addresses this void by proposing a Bayesian model that combines simplex-constrained global feature weighting, hierarchical shrinkage priors, and contamination modeling. This framework provides both interpretable feature weights and principled uncertainty quantification, improving model reliability in the presence of outliers and noisy labels, building upon the theoretical foundations and empirical insights demonstrated across the spectrum of robust Bayesian learning approaches.

3. Methodology

3.1. Proposed Model

Assume that $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$ is the observed data, where $x_{i} \in R^{P}$ is a P-dimensional covariate vector and $y_{i} \in \{0, 1\}$ is a binary outcome. Our goal is to construct a classifier that (i) learns a global importance weight for each feature, (ii) is robust to label contamination and outlying covariates, and (iii) provides coherent uncertainty quantification.

Conditionally on the parameters $θ = \{α_{0}, w, β, τ, λ, ε\}$, we introduce the linear predictor

$$η_{i} = α_{0} + (w ⊙ β)^{⊤} x_{i}, \tag{1}$$

and define the baseline logistic probability $s_{i} = σ(η_{i})$ with $σ(z) = 1 / (1 + e^{-z})$. To explicitly model label noise, we use a mixture of the clean and flipped labels,

$$p(y_{i} = 1 ∣ x_{i}, θ) = (1 - ε) s_{i} + ε (1 - s_{i}). \tag{2}$$

Equivalently, with probability $1 - ε$ the observed label coincides with the latent logistic response, whereas with probability $ε$ it is flipped. The parameters have the following roles:

$α_{0} \in R$ is a global intercept;
$β = (β_{1}, \dots, β_{P})^{⊤} \in R^{P}$ are regression coefficients;
$w = (w_{1}, \dots, w_{P})^{⊤}$ are non-negative feature weights constrained to the simplex

$$w \in Δ^{P - 1} := \{w \in R_{+}^{P} ∣ 1^{⊤} w = 1\}; \tag{3}$$

$ε \in [0, 1]$ is the label-noise parameter controlling the probability of a label flip;
$τ > 0$ and $λ = (λ_{1}, \dots, λ_{P})^{⊤}$ are global and local scale parameters that induce shrinkage on $β$.
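As a minimal sketch of Equations (1) and (2), the following standard-library Python snippet evaluates the noise-adjusted probability for a single observation. The function names and the toy values of `x`, `w`, and `beta` are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def noisy_label_prob(x, alpha0, w, beta, eps):
    """P(y = 1 | x, theta) under the symmetric label-flip mixture of Eq. (2)."""
    # Eq. (1): linear predictor built from the element-wise product w * beta.
    eta = alpha0 + sum(wj * bj * xj for wj, bj, xj in zip(w, beta, x))
    s = sigmoid(eta)  # clean (latent) logistic probability
    return (1.0 - eps) * s + eps * (1.0 - s)

# Toy values chosen only for illustration (hypothetical, not from the paper).
x = [1.0, -0.5, 2.0]
w = [0.5, 0.3, 0.2]      # on the simplex: non-negative and sums to one
beta = [2.0, -1.0, 0.0]
p_clean = noisy_label_prob(x, 0.0, w, beta, eps=0.0)
p_noisy = noisy_label_prob(x, 0.0, w, beta, eps=0.1)
print(p_clean, p_noisy)
```

One consequence visible in the sketch is that the mixture confines the predicted probability to the interval between $ε$ and $1 - ε$ (for $ε \leq 0.5$), so a single confidently mislabeled observation cannot incur an unbounded log-likelihood penalty.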

The feature weight vector w is constrained to lie on the probability simplex, satisfying $w_{j} \geq 0$ and $\sum_{j = 1}^{P} w_{j} = 1$. Such constraints introduce a non-Euclidean geometry in the parameter space that may affect sampling efficiency in Hamiltonian Monte Carlo (HMC). In practice, this issue is addressed through an appropriate simplex parameterization, which maps the constrained weights into an unconstrained space while preserving the positivity and sum-to-one constraints. This transformation allows HMC to operate efficiently without violating the geometric structure of the simplex and ensures stable posterior exploration.
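To illustrate what such a simplex parameterization can look like, the sketch below uses a softmax with the last logit fixed at zero, which maps $R^{P-1}$ one-to-one onto the interior of the simplex. This particular map is an assumption chosen for simplicity; it is not necessarily the transform used by Stan's built-in simplex type, and a full HMC implementation would also need the log-Jacobian of the map, which is omitted here.

```python
import math

def to_simplex(u):
    """Map an unconstrained vector u in R^(P-1) to a point on the simplex in R^P."""
    logits = list(u) + [0.0]   # fix the last logit at 0 so the map is one-to-one
    m = max(logits)            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def from_simplex(w):
    """Inverse map: log-ratios of the first P-1 weights against the last one."""
    return [math.log(wj / w[-1]) for wj in w[:-1]]

u = [0.3, -1.2, 2.0]           # hypothetical unconstrained values
w = to_simplex(u)
print(w, sum(w))
```

Any real-valued `u` yields strictly positive weights that sum to one, so a sampler can move freely in the unconstrained space without ever leaving the simplex.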

The element-wise product $w ⊙ β$ therefore combines feature relevance ($w$) and effect size ($β$), facilitating interpretable global importance scores while remaining robust to mislabeled observations through $ε$.

The linear predictor in Equation (1) depends on the element-wise product $w ⊙ β$. In principle, such multiplicative parameterizations may introduce scale non-identifiability, since different parameter pairs $(w, β)$ can produce the same product $w ⊙ β$. For example, multiplying w by a constant $c > 0$ and dividing $β$ by the same constant yields an equivalent linear predictor, since

$$(c w) ⊙ (β / c) = w ⊙ β.$$

In the proposed model, however, this ambiguity is mitigated through the prior structure. The feature weight vector w is constrained to lie on the probability simplex

$$Δ^{P - 1} = \{w \in R_{+}^{P} : \sum_{j = 1}^{P} w_{j} = 1\},$$

which fixes its global scale by enforcing $\sum_{j = 1}^{P} w_{j} = 1$ and $w_{j} \geq 0$. This constraint prevents arbitrary rescaling of w. In addition, the regression coefficients $β$ are regularized using a hierarchical horseshoe prior, which strongly shrinks irrelevant coefficients toward zero while allowing large signals to remain. Together, the simplex constraint on w and the shrinkage structure on $β$ restrict the parameter space and provide practical identifiability of the model parameters in posterior inference. To further examine potential multiplicative non-identifiability, we inspected the joint posterior samples of $w_{j}$ and $β_{j}$, using the Breast Cancer Wisconsin dataset as a representative example. Ridge structures in these samples would indicate scale non-identifiability in the multiplicative parameterization. The posterior samples do not exhibit such ridge patterns, indicating that the simplex constraint on w together with the shrinkage prior on $β$ prevents this degeneracy. The corresponding diagnostic plots are provided in Figure A1.
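The global rescaling argument can be checked numerically. The snippet below is a small illustration with made-up values: it confirms that $(c w) ⊙ (β / c)$ reproduces $w ⊙ β$, and that the rescaled weights leave the simplex whenever $c \neq 1$, which is why the constraint rules out this particular ambiguity.

```python
def elementwise(a, b):
    """Element-wise (Hadamard) product of two equal-length lists."""
    return [ai * bi for ai, bi in zip(a, b)]

# Hypothetical values for illustration only.
w = [0.5, 0.3, 0.2]
beta = [2.0, -1.0, 4.0]
c = 2.0

w_scaled = [c * wj for wj in w]
beta_scaled = [bj / c for bj in beta]

# The product, and hence the linear predictor, is unchanged by the rescaling.
same_product = all(
    abs(p - q) < 1e-12
    for p, q in zip(elementwise(w, beta), elementwise(w_scaled, beta_scaled))
)
# The rescaled weights sum to c, not 1, so they violate the simplex constraint.
on_simplex = abs(sum(w_scaled) - 1.0) < 1e-12
print(same_product, on_simplex)
```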

A summary of the main symbols used in the proposed model is given in Table 1.

To enhance interpretability, we provide intuitive explanations for the key model components. The feature weights w represent the relative importance of each predictor in explaining the response variable, constrained to lie on the simplex to ensure that they are non-negative and sum to one. The parameter $α_{0}$ captures the global location or baseline effect (the intercept), while $τ$ is the global scale parameter that controls the overall degree of shrinkage applied to $β$. The contamination-aware prior is designed to reduce the influence of noisy or corrupted observations by allowing heavier tails in the prior structure. Overall, this formulation enables both interpretability and robustness within a unified Bayesian framework.

For simplicity, we assume a symmetric label contamination mechanism, where the probability of label flipping is identical across classes. Although misclassification in medical datasets may often be class-dependent, the symmetric formulation provides a parsimonious representation of label noise and avoids introducing additional parameters that may be difficult to estimate when the amount of noisy labels is limited. The proposed framework can be readily extended to asymmetric contamination by introducing separate flipping probabilities for each class, allowing different misclassification rates for positive and negative labels.
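One way the asymmetric extension could be written is sketched below. The names `eps_pos` (probability that a latent positive is recorded as negative) and `eps_neg` (probability that a latent negative is recorded as positive) are hypothetical and do not appear in the paper; setting both to the same value recovers Equation (2).

```python
def symmetric_noisy_prob(s, eps):
    """Eq. (2): one flip probability shared by both classes."""
    return (1.0 - eps) * s + eps * (1.0 - s)

def asymmetric_noisy_prob(s, eps_pos, eps_neg):
    """Class-dependent flips (an assumed extension, not the paper's fitted model).

    eps_pos: P(observed 0 | latent 1), eps_neg: P(observed 1 | latent 0).
    """
    return (1.0 - eps_pos) * s + eps_neg * (1.0 - s)

s = 0.7  # hypothetical clean logistic probability
print(symmetric_noisy_prob(s, 0.1), asymmetric_noisy_prob(s, 0.2, 0.05))
```

The price of the added flexibility is one extra parameter, which, as noted above, may be weakly identified when few labels are actually noisy.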

3.2. Priors

To encourage robustness and interpretability, we assign the following hierarchical priors.

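Feature weights.

Simplex-constrained feature weights receive a symmetric Dirichlet prior,

$$w \sim Dirichlet(α 1), \tag{4}$$

where $α > 0$ is a common concentration parameter (distinct from the intercept $α_{0}$) and $1$ denotes the P-vector of ones. This prior yields normalized, uncertainty-aware global importance scores.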
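For intuition about Equation (4), a symmetric Dirichlet draw can be generated by normalizing independent Gamma variates. The sketch below assumes a concentration of 1.0 (uniform on the simplex) purely for illustration, since the value used in the paper is not stated in this section; under any symmetric Dirichlet the prior mean of each weight is $1/P$.

```python
import random

def sample_symmetric_dirichlet(P, alpha, rng):
    """Draw w ~ Dirichlet(alpha * 1) by normalizing P independent Gamma(alpha, 1) variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(P)]
    total = sum(g)
    return [gj / total for gj in g]

rng = random.Random(0)
P = 5
alpha = 1.0  # assumed value; alpha < 1 would favor sparser weight vectors
draws = [sample_symmetric_dirichlet(P, alpha, rng) for _ in range(2000)]

# Empirical prior mean of each weight; should be close to 1 / P = 0.2.
prior_mean = [sum(d[j] for d in draws) / len(draws) for j in range(P)]
print(prior_mean)
```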

Regression coefficients.

To down-weight irrelevant predictors while allowing a few large effects, we employ the horseshoe prior. Introducing auxiliary variables $z_{j}$, the hierarchy is

$$\begin{aligned} β_{j} &= z_{j} τ λ_{j}, \\ z_{j} &\sim N(0, 1), \quad τ \sim C^{+}(0, 1), \quad λ_{j} \sim C^{+}(0, 1), \end{aligned} \tag{5}$$

where $C^{+}(0, 1)$ denotes the standard half-Cauchy distribution. The heavy tails of this prior allow large signals to escape shrinkage while strongly shrinking noise coefficients towards zero.
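The hierarchy in Equation (5) can be simulated directly, which helps to see how a single global scale and heavy-tailed local scales combine. The following is a rough sketch using only the standard library; the helper names are illustrative, and the half-Cauchy draw uses the inverse-CDF identity that $|\tan(π(U - 1/2))|$ with $U \sim Uniform(0, 1)$ is standard half-Cauchy.

```python
import math
import random

def half_cauchy(rng):
    """Standard half-Cauchy draw via |tan(pi * (U - 1/2))|, U ~ Uniform(0, 1)."""
    return abs(math.tan(math.pi * (rng.random() - 0.5)))

def sample_horseshoe(P, rng):
    """One prior draw of (beta, tau, lam, z) following Eq. (5)."""
    tau = half_cauchy(rng)                       # global scale
    lam = [half_cauchy(rng) for _ in range(P)]   # local scales
    z = [rng.gauss(0.0, 1.0) for _ in range(P)]  # auxiliary standard normals
    beta = [zj * tau * lj for zj, lj in zip(z, lam)]
    return beta, tau, lam, z

rng = random.Random(42)
beta, tau, lam, z = sample_horseshoe(10, rng)
print(tau, beta)

# Sanity check on the sampler: the median of a standard half-Cauchy is 1.
hc = sorted(half_cauchy(rng) for _ in range(4001))
hc_median = hc[len(hc) // 2]
```

Sampling $z_{j}$ and the scales separately, rather than $β_{j}$ directly, is the non-centered form that Equation (5) already suggests.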

Intercept and label-noise parameter.

We place a weakly informative Gaussian prior on the intercept and a uniform prior on the noise level,

$$α_{0} \sim N(0, 5^{2}), \quad ε \sim Uniform(0, 1). \tag{6}$$

The $Uniform(0, 1)$ prior is adopted as a weakly informative prior for the contamination probability, reflecting the absence of strong prior knowledge about the level of label noise. To assess the robustness of this choice, we conducted a short sensitivity analysis by considering Beta priors, including $Beta(1, 1)$ (which coincides with the $Uniform(0, 1)$ baseline), $Beta(2, 2)$, and $Beta(0.5, 0.5)$. The resulting posterior estimates and predictive performance were found to be very similar across these specifications, suggesting that the proposed model is not sensitive to the specific prior choice for the contamination parameter.

Taken together, these choices jointly model feature relevance, robust regression coefficients, and label noise, enabling principled uncertainty quantification while mitigating the influence of outliers and mislabeled samples.

3.3. Posterior Inference

Let $p_{i} = p(y_{i} = 1 ∣ x_{i}, θ)$ denote the noise-adjusted probability in Equation (2). The joint posterior distribution of all unknowns is then

$$p(θ ∣ D) \propto \prod_{i = 1}^{N} Bernoulli(y_{i} ∣ p_{i}) \, p(α_{0}) \, p(w) \, p(β ∣ τ, λ) \, p(τ) \, p(λ) \, p(ε). \tag{7}$$

Since Equation (7) is analytically intractable, we perform Bayesian inference using Markov Chain Monte Carlo (MCMC), specifically Hamiltonian Monte Carlo (HMC) [23] with the No-U-Turn Sampler (NUTS) adaptation [24], as implemented in Stan [25].
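To make Equation (7) concrete, the sketch below evaluates the unnormalized log-posterior in plain Python. It is an illustration under stated assumptions rather than the authors' Stan program: it is written in terms of the auxiliary variables $z_{j}$ of Equation (5), works on the constrained scale (no Jacobian terms for the simplex or positivity transforms), drops additive constants, and uses an assumed Dirichlet concentration `dir_alpha`.

```python
import math

def log_posterior(y, X, alpha0, w, z, tau, lam, eps, dir_alpha=1.0):
    """Unnormalized log of Eq. (7), up to an additive constant."""
    beta = [zj * tau * lj for zj, lj in zip(z, lam)]           # Eq. (5)
    # Bernoulli log-likelihood with the noise-adjusted probability of Eq. (2).
    loglik = 0.0
    for yi, xi in zip(y, X):
        eta = alpha0 + sum(wj * bj * xj for wj, bj, xj in zip(w, beta, xi))
        s = 1.0 / (1.0 + math.exp(-eta))
        p = (1.0 - eps) * s + eps * (1.0 - s)
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    # Log-priors with theta-independent constants dropped.
    lp = -0.5 * (alpha0 / 5.0) ** 2                            # alpha0 ~ N(0, 5^2)
    lp += (dir_alpha - 1.0) * sum(math.log(wj) for wj in w)    # w ~ Dirichlet
    lp += -0.5 * sum(zj ** 2 for zj in z)                      # z_j ~ N(0, 1)
    lp += -math.log(1.0 + tau ** 2)                            # tau ~ C+(0, 1)
    lp += -sum(math.log(1.0 + lj ** 2) for lj in lam)          # lambda_j ~ C+(0, 1)
    # eps ~ Uniform(0, 1) contributes only a constant on (0, 1).
    return loglik + lp

# Hypothetical toy data: three observations, two features.
y = [1, 0, 1]
X = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]
value = log_posterior(y, X, alpha0=0.0, w=[0.5, 0.5], z=[0.0, 0.0],
                      tau=1.0, lam=[1.0, 1.0], eps=0.5)
print(value)
```

A gradient-based sampler such as HMC needs the gradient of this quantity with respect to the unconstrained parameters; in Stan this is obtained by automatic differentiation.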

HMC leverages gradient information to efficiently explore high-dimensional posterior landscapes, avoiding the random-walk behavior of standard Metropolis–Hastings algorithms [26,27], while NUTS automatically selects trajectory lengths, removing the need for manual tuning and improving convergence [28].

The posterior inference procedure is summarized in Algorithm 1.

Algorithm 1 Posterior inference via HMC-NUTS

1: Initialize parameters $θ \leftarrow (α_{0}, β, w, τ, λ, ε)$
2: for $t = 1, \dots, T$ do
3: Compute the linear predictors $η_{i}$ and probabilities $p_{i}$ for $i = 1, \dots, N$ using Equations (1) and (2)
4: Compute the log-posterior (up to an additive constant):

$\log p(θ ∣ D) = \sum_{i = 1}^{N} (y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i})) + \log p(α_{0}, w, β, τ, λ, ε)$

5: Compute gradients with respect to all parameters
6: Simulate Hamiltonian dynamics using leapfrog integration
7: Propose new state $θ'$ and accept/reject with probability

$α = \min(1, \exp(H(θ) - H(θ')))$

8: NUTS adaptively selects trajectory length to avoid manual tuning
9: end for
10: Return posterior samples after warm-up
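Steps 6 and 7 of Algorithm 1 can be illustrated on a toy problem. The sketch below is plain HMC with a fixed trajectory length, not NUTS, and it targets a one-dimensional standard normal rather than the paper's posterior; the step size and number of leapfrog steps are arbitrary choices. Here $H$ is the potential (negative log density) plus the kinetic energy of an auxiliary momentum `r`.

```python
import math
import random

def leapfrog(theta, r, grad_U, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics for potential U."""
    r = r - 0.5 * step * grad_U(theta)      # initial half step for momentum
    for i in range(n_steps):
        theta = theta + step * r            # full step for position
        if i < n_steps - 1:
            r = r - step * grad_U(theta)    # full step for momentum
    r = r - 0.5 * step * grad_U(theta)      # final half step for momentum
    return theta, r

def hmc_sample(U, grad_U, theta0, n_draws, step, n_steps, rng):
    theta, draws = theta0, []
    for _ in range(n_draws):
        r0 = rng.gauss(0.0, 1.0)            # resample momentum each iteration
        theta_new, r_new = leapfrog(theta, r0, grad_U, step, n_steps)
        h_old = U(theta) + 0.5 * r0 ** 2    # H = potential + kinetic energy
        h_new = U(theta_new) + 0.5 * r_new ** 2
        # Accept with probability min(1, exp(H_old - H_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta = theta_new
        draws.append(theta)
    return draws

def U(t):          # negative log density of N(0, 1), up to a constant
    return 0.5 * t * t

def grad_U(t):
    return t

rng = random.Random(1)
draws = hmc_sample(U, grad_U, theta0=0.0, n_draws=5000, step=0.2, n_steps=8, rng=rng)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, var)
```

NUTS replaces the fixed `n_steps` with a trajectory that is extended until it begins to double back on itself, which is what removes the need to tune the trajectory length by hand.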

3.4. Posterior Outputs

From the posterior draws $\{θ^{(m)}\}_{m = 1}^{M}$, we obtain several quantities of direct practical interest: (i) predictive probabilities for new samples, (ii) uncertainty-aware feature importance, (iii) an estimate of the label-noise rate, and (iv) posterior predictive checks and calibration diagnostics.

3.4.1. Predictive Probabilities for Unseen Samples

For a new observation $x^{*}$, the posterior predictive probability is approximated by the Monte Carlo average

$$\hat{p}(y^{*} = 1 ∣ x^{*}, D) = \frac{1}{M} \sum_{m = 1}^{M} p(y^{*} = 1 ∣ x^{*}, θ^{(m)}), \tag{8}$$

where each probability on the right-hand side is computed from the logistic link and label-noise adjustment in Equation (2). This yields both point predictions and a full posterior distribution over predictive probabilities, enabling uncertainty-aware decision making.
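A minimal sketch of the Monte Carlo average in Equation (8) follows. The two posterior draws are made-up placeholders standing in for sampler output, and the dictionary layout is an assumption for illustration.

```python
import math

def predictive_prob(x_new, draws):
    """Monte Carlo estimate of Eq. (8); also returns the per-draw probabilities."""
    probs = []
    for d in draws:
        eta = d["alpha0"] + sum(
            wj * bj * xj for wj, bj, xj in zip(d["w"], d["beta"], x_new)
        )
        s = 1.0 / (1.0 + math.exp(-eta))
        probs.append((1.0 - d["eps"]) * s + d["eps"] * (1.0 - s))  # Eq. (2)
    return sum(probs) / len(probs), probs

# Hypothetical posterior draws (in practice there would be thousands).
draws = [
    {"alpha0": 0.0, "w": [0.5, 0.5], "beta": [0.0, 0.0], "eps": 0.1},
    {"alpha0": 0.0, "w": [1.0, 0.0], "beta": [2.0, 0.0], "eps": 0.0},
]
p_hat, per_draw = predictive_prob([1.0, 3.0], draws)
print(p_hat, per_draw)
```

The spread of `per_draw` is what carries the predictive uncertainty: its quantiles give a credible interval for the predicted probability of the new observation.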

3.4.2. Uncertainty-Aware Feature Importance

The posterior distribution of the simplex weights $w$ quantifies the global relevance of each predictor. For feature j,

$$\begin{aligned} Imp(j) &= E[w_{j} ∣ D], \\ Uncertainty(j) &= Var[w_{j} ∣ D], \end{aligned} \tag{9}$$

so that larger posterior means indicate more influential features, whereas wider credible intervals reflect greater uncertainty in their relative importance. Unlike deterministic feature-selection methods, this yields a fully probabilistic interpretation of feature relevance.
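The summaries in Equation (9) are simple functions of the posterior draws of $w$. The sketch below computes the posterior mean, variance, and a nearest-rank 95% interval per feature; the four draws are invented for illustration only.

```python
def summarize_weights(w_draws):
    """Per-feature posterior mean, sample variance, and a nearest-rank 95% interval."""
    M, P = len(w_draws), len(w_draws[0])
    summary = []
    for j in range(P):
        col = sorted(d[j] for d in w_draws)
        mean = sum(col) / M
        var = sum((v - mean) ** 2 for v in col) / (M - 1)
        lo = col[round(0.025 * (M - 1))]
        hi = col[round(0.975 * (M - 1))]
        summary.append({"mean": mean, "var": var, "ci95": (lo, hi)})
    return summary

# Hypothetical posterior draws of w for P = 3 features (each row sums to one).
w_draws = [
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.7, 0.2, 0.1],
    [0.6, 0.2, 0.2],
]
summary = summarize_weights(w_draws)
for j, s in enumerate(summary):
    print(j, s)
```

Because every draw lies on the simplex, the posterior means themselves sum to one, so they can be read directly as shares of total importance.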

It is important to distinguish between global feature relevance and effective contribution to the linear predictor. In the proposed model, the simplex-constrained weights $w_{j}$ represent normalized and globally interpretable feature importance scores, reflecting the relative importance of predictors across the model, and are used as the primary measure of feature importance in our analysis. However, the actual contribution of each feature to the linear predictor is governed by the product $w_{j} β_{j}$, which combines feature relevance and effect size. Therefore, while $w_{j}$ provides a global importance ranking, the quantity $w_{j} β_{j}$ determines the effective predictive influence of each feature.

3.4.3. Robust Classification Under Noise

The inclusion of $ε$ in the likelihood explicitly models label contamination. Its posterior mean,

$$\hat{ε} = \frac{1}{M} \sum_{m = 1}^{M} ε^{(m)}, \tag{10}$$

provides an estimate of the noise rate in the training labels: posterior mass concentrated near zero corresponds to mostly clean labels, whereas mass away from zero indicates substantial mislabeling. Simultaneously, the heavy-tailed shrinkage prior on $β$ and the simplex constraint on $w$ attenuate the influence of outlying covariates and uninformative predictors.

3.4.4. Posterior Predictive Checks and Calibration

Finally, posterior samples facilitate model criticism and calibration assessment. They can be used to compute the distributions of log-loss and Brier scores, Expected Calibration Error (ECE) and maximum calibration gap, and to perform posterior predictive checks [29,30,31,32,33,34,35]. These diagnostics allow us to assess not only predictive performance but also the reliability of uncertainty estimates produced by the model.

4. Simulation and Experimental Results

4.1. Simulation

To evaluate the proposed Bayesian robust feature weighting framework systematically, we conducted an extensive simulation study across eight scenarios (S1–S8), each designed to reflect a distinct source of difficulty in supervised classification. The scenarios manipulate data characteristics such as dimensionality, correlation, class imbalance, outliers, and label noise, thereby enabling a comprehensive assessment of robustness and generalization. Table 2 summarizes these scenarios.

To ensure reproducibility, we provide detailed specifications for each simulation scenario. For all scenarios, the sample size, number of features, and data-generating mechanisms are explicitly defined. The covariates are generated from standard distributions, and the true regression coefficients are constructed to reflect varying levels of sparsity and signal strength. Noise contamination is introduced through a controlled mechanism, with a predefined contamination rate and flipping probability.

Specifically, each scenario differs in terms of the proportion of relevant features, the magnitude of regression coefficients, and the level of noise contamination. The random seed is fixed across all experiments to ensure replicability. All simulations are implemented using consistent preprocessing steps, including feature standardization. Detailed parameter settings for each scenario are provided to facilitate exact reproduction of the results.

Data generation: Predictors $X \in R^{n \times m}$ were drawn from multivariate Gaussian blocks with correlation parameter $ρ$, yielding both independent and correlated structures. In some scenarios, heavy-tailed covariates were introduced by replacing the first block with $t_{ν}$-distributed samples ($ν$ degrees of freedom), simulating covariate outliers. A sparse linear signal was imposed via coefficients $β$ with only s variables contributing to the decision boundary; scenario S8 additionally included nonlinear interactions (see Table 2).
Class labels: Latent scores were computed as $η_{i} = α_{0} + x_{i}^{⊤} β$, with $α_{0}$ calibrated such that the marginal probability of a positive label matched the target prevalence $π$. Observed labels $y_{i}$ were then drawn from a $Bernoulli(σ(η_{i}))$ distribution, where $σ(t) = 1 / (1 + e^{-t})$, and independently flipped with probability $ε$ to simulate misclassification.
Outlier injection: In scenarios involving feature contamination, a fraction $γ$ of rows in $X$ were shifted by multiples of the marginal standard deviation in randomly chosen dimensions, producing covariate outliers (cf. S5 in Table 2).
Evaluation metrics: Ten replications were performed for each simulation scenario. The training and testing splits were stratified to ensure both classes were proportionally represented. Model performance was evaluated across three complementary dimensions: discrimination, calibration, and accuracy.
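The label-generation step described above can be sketched as follows. This is a simplified illustration, not the authors' simulation code: covariates are independent standard Gaussians (the $ρ = 0$ case only), the intercept is calibrated by bisection so that the average clean probability matches the target prevalence before any flipping, and all numerical settings are assumptions.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    et = math.exp(t)
    return et / (1.0 + et)

def calibrate_intercept(scores, prevalence, lo=-20.0, hi=20.0, iters=60):
    """Bisection for alpha0 such that the mean of sigma(alpha0 + score) hits prevalence."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_p = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if mean_p < prevalence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(n, beta, prevalence, eps, rng):
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    alpha0 = calibrate_intercept(scores, prevalence)
    y_clean = [1 if rng.random() < sigmoid(alpha0 + s) else 0 for s in scores]
    flips = [rng.random() < eps for _ in range(n)]      # independent label flips
    y_obs = [1 - y if f else y for y, f in zip(y_clean, flips)]
    return X, y_clean, y_obs, alpha0

rng = random.Random(7)
beta = [1.5, -1.0, 0.0, 0.0, 0.0]   # sparse signal: two active features (hypothetical)
X, y_clean, y_obs, alpha0 = simulate(4000, beta, prevalence=0.3, eps=0.1, rng=rng)
flip_rate = sum(a != b for a, b in zip(y_clean, y_obs)) / len(y_obs)
print(alpha0, sum(y_clean) / len(y_clean), flip_rate)
```

Note that symmetric flipping pulls the observed prevalence toward 0.5, so the observed positive rate will differ slightly from the clean target.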

Model discrimination refers to the ability to correctly distinguish between positive and negative instances. It was quantified using the area under the receiver-operating-characteristic curve (AUC), the area under the precision–recall curve (PRAUC), and the F1-score [36,37,38]. AUC measures overall ranking ability, PRAUC provides a more informative assessment under class imbalance, and the F1-score balances precision and recall.
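For reference, two of these discrimination metrics can be computed in a few lines. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive scores above a randomly chosen negative, with ties counted as one half) and the usual F1 definition; the 0.5 decision threshold and the toy data are assumptions for illustration.

```python
def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """F1 = 2 TP / (2 TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5
print(auc(y_true, scores), f1_score(y_true, y_pred))
```

The pairwise AUC is quadratic in the sample size, which is fine for illustration but would be replaced by a sort-based computation on large test sets.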

Calibration assesses agreement between predicted probabilities and observed frequencies. We employed log-loss (cross-entropy loss), Brier score, expected calibration error (ECE), and maximum calibration gap (MCE) [31,39,40]. The Brier score represents a proper scoring rule capturing calibration and refinement, whereas the ECE summarizes the average deviation between predicted and empirical probabilities.

Overall classification accuracy, defined as the proportion of correctly classified observations, was reported for completeness [41], but it was interpreted alongside discrimination and calibration measures because it can be misleading under imbalance.
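As an illustration of the calibration measures described above, the following sketch computes log-loss, the Brier score, ECE, and MCE for binary predictions. The binning scheme is an assumption: ten equal-width bins on the predicted positive-class probability are used here, whereas the text does not state the number or type of bins, and a confidence-based variant of ECE would give different values. The function name `calibration_metrics` is hypothetical.

```python
import numpy as np

def calibration_metrics(y_true, p_hat, n_bins=10, clip=1e-12):
    """Log-loss, Brier score, ECE and MCE for binary probabilities.
    ECE/MCE use equal-width bins on the predicted positive-class probability (assumption)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_hat, dtype=float)

    # Clip probabilities so that log(0) cannot occur.
    p_safe = np.clip(p, clip, 1.0 - clip)
    log_loss = -np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe))
    brier = np.mean((p - y) ** 2)

    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_id = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_id == b
        if not mask.any():
            continue
        # Gap between mean predicted probability and observed frequency in the bin.
        gap = abs(p[mask].mean() - y[mask].mean())
        ece += mask.mean() * gap   # weighted by the share of samples in the bin
        mce = max(mce, gap)        # worst-case bin
    return {"log_loss": log_loss, "brier": brier, "ece": ece, "mce": mce}
```

Because MCE is driven by a single bin, it can be unstable when some bins contain very few observations, which is one reason to report it together with ECE.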

Comparative models: In addition to the proposed Bayesian framework, benchmarks included logistic regression (with and without L1/elastic-net penalties), random forests, gradient boosting, and class-balanced stochastic-gradient descent.

Figure 1 and Figure 2 show the comparative performance trends of all models across the eight experimental scenarios (S1–S8). Each subplot presents key evaluation metrics: AUC, PRAUC, F1-score, and accuracy capture discriminative ability, whereas log-loss, Brier score, expected calibration error (ECE), and maximum calibration gap (MCE) assess probability calibration.

For discrimination metrics (AUC, PRAUC, F1, Accuracy), higher values indicate better performance (Figure 1); for calibration metrics (LogLoss, Brier, ECE, MCE), lower values are desirable (Figure 2). The Bayesian model (Bayes_FW) is compared against classical machine learning methods, including logistic regression (and its L1 and elastic-net variants), gradient boosting, random forest, and SGD with balanced weights.

Figure 1 and Figure 2 together demonstrate the comparative behavior of all models across the eight experimental scenarios (S1–S8). Overall, the Bayesian model maintains competitive or superior performance in probabilistic metrics such as LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Gap (MCE), underscoring its strength in uncertainty quantification and calibration. In terms of discriminative performance (AUC, PRAUC, Accuracy, and F1), logistic regression variants achieve the highest AUC and PRAUC on the clean and balanced dataset (S1), although the Bayesian model still provides strong results. Under more challenging conditions, such as the imbalance setting (S7) and the hard mixture scenario (S8), the Bayesian approach demonstrates robustness, preserving relatively stable AUC and F1 compared to other methods that exhibit sharper drops.

With respect to calibration, the Bayesian model consistently attains the lowest (and thus best) values, particularly in S1, S2, and S4, reflecting not only accurate predictions but also reliable probability estimates, a property of particular importance in risk-sensitive applications. In contrast, models such as SGD and Random Forest show higher variability, with occasional large calibration errors. Furthermore, in high-dimensional (S3) and correlated (S2) scenarios, performance differences among models tend to narrow, yet the Bayesian method continues to retain its calibration advantage. Similarly, in the presence of label noise (S4) and outliers (S5), the uncertainty-aware structure of the Bayesian model prevents the extreme degradation observed in tree-based models.

Taken together, these findings highlight that while traditional models can occasionally surpass Bayesian approaches in terms of pure discriminative accuracy (as in S1), Bayesian modeling provides more reliable probability calibration and demonstrates greater robustness across diverse and adverse data conditions.

Table 3 summarizes the convergence diagnostics and key hyperparameter estimates across eight simulation scenarios, each designed to test different data complexities and contamination settings for the proposed Bayesian Feature Weighting (Bayes_FW) model. The parameter $α_{0}$ represents the global intercept of the linear predictor, capturing the baseline log-odds of the outcome. The mean values vary moderately across scenarios (from $-0.18$ in S7_imbalance to $0.22$ in S6_both), reflecting adaptive shrinkage behavior under different data conditions. The global shrinkage behavior is governed by the parameter $τ$ in the hierarchical horseshoe prior, while $λ$ controls local shrinkage at the feature level. The parameter $ε$ denotes the estimated label-noise proportion, which increases notably in noise-heavy settings such as S4_lblnoise (0.09), S6_both (0.10), and S8_hardmix (0.12), confirming that the model successfully captured and quantified contamination in the data.
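To make the role of $ε$ concrete, the sketch below shows how a label-flip probability enters a Bernoulli likelihood. It assumes symmetric flips, mirroring the way labels are corrupted in the simulation design; the exact contamination-aware likelihood of Bayes_FW may differ, and the function names are hypothetical.

```python
import numpy as np

def noisy_label_prob(eta, eps):
    """P(observed y = 1) when a clean Bernoulli(sigmoid(eta)) label is flipped
    independently with probability eps (symmetric-flip assumption)."""
    p_clean = 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))
    return (1.0 - eps) * p_clean + eps * (1.0 - p_clean)

def noisy_log_lik(y, eta, eps):
    """Bernoulli log-likelihood of the observed labels under the flip mixture."""
    q = noisy_label_prob(eta, eps)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * np.log(q) + (1.0 - y) * np.log(1.0 - q)))
```

Under this form the observed-label probability is confined to $[ε, 1 - ε]$, so a single confidently mislabeled observation contributes a log-likelihood penalty of at most $\log ε$ in magnitude, which is one way to read the bounded influence of mislabeled cases.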

Convergence diagnostics based on the Gelman–Rubin statistic (mean $\hat{R} \approx 1.00$–$1.01$, maximum $\hat{R} \leq 1.03$) demonstrated excellent mixing and stability of the Markov chains across all scenarios. The proportion of parameters with $\hat{R} > 1.01$ remained below 10% in every case, further confirming reliable convergence. Effective sample sizes ($n_{eff}$) were generally high (median values between 1750 and 3900), ensuring that posterior estimates were based on sufficiently independent draws. Table 3 summarizes the posterior diagnostics for all simulation settings.
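For reference, the classic Gelman–Rubin statistic for a single scalar parameter can be computed as in the sketch below. This is the textbook between/within-chain variance ratio; the sampler used for the reported diagnostics may implement a refined variant (for example, a split or rank-normalized $\hat{R}$), so the values need not match exactly. The function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for one scalar parameter.
    `chains` has shape (n_chains, n_draws), warm-up draws already removed."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n_draws * chain_means.var(ddof=1)      # B: between-chain variance
    # Pooled estimate of the posterior variance.
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))
```

Values close to 1 indicate that the chains agree with each other; chains centered at different locations inflate the between-chain term and push the statistic above 1.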

Overall, Table 3 indicates that the proposed Bayesian model achieved stable convergence and consistent inference across varying levels of correlation, label noise, imbalance, and dimensionality. The results validate the robustness and computational reliability of the MCMC implementation, even under challenging conditions such as high-dimensional noise mixtures (S8) and concurrent outlier–label-noise contamination (S6).

Figure 3 presents a comparative evaluation of the proposed Bayesian Feature Weighting (Bayes_FW) model against a diverse set of benchmark classifiers across the eight data scenarios. These include standard clean data (S1), correlated predictors (S2), high-dimensional features (S3), label noise (S4), the presence of feature outliers (S5), simultaneous label and feature noise (S6), class imbalance (S7), and a challenging setting with both severe label noise and imbalance (S8). The benchmark models considered encompass logistic regression and its regularized variants (L1 and elastic net), as well as ensemble-based methods such as random forest and gradient boosting. In addition, a robust linear baseline, stochastic gradient descent with balanced class weights (SGD-balanced), is included to account for class imbalance. This setup enables a comprehensive assessment of predictive robustness and generalization across diverse data conditions.

The proposed Bayes_FW model consistently outperforms or matches the benchmarks in challenging settings, particularly under label noise (S4), feature outliers (S5), and compound corruption (S6, S8). Simpler models such as logistic regression perform competitively in clean scenarios (e.g., S1) but degrade in the presence of noise. In contrast, Bayes_FW achieves the best F1 and PRAUC scores in most corrupted settings, demonstrating superior robustness and predictive reliability.

Figure 4 reports metrics related to probabilistic calibration and uncertainty estimation for Bayes_FW and benchmark models. To assess the probabilistic calibration performance of the models, four complementary metrics are employed. Logloss measures the negative log-likelihood of the predicted class probabilities, penalizing overconfident and incorrect predictions. Brier Score captures the mean squared error between predicted probabilities and actual class labels, providing a direct measure of overall probabilistic accuracy. Expected Calibration Error (ECE) quantifies the average deviation between predicted confidence and observed accuracy across confidence bins, reflecting the alignment between model confidence and correctness. Lastly, Max Calibration Gap reports the largest observed discrepancy between confidence and accuracy, indicating the worst-case calibration error. Together, these metrics offer a comprehensive evaluation of both average and extreme calibration behavior.

The proposed Bayes_FW model shows consistently better or competitive performance across calibration metrics, particularly under noisy or imbalanced conditions (S2, S4, S6, S8). Some benchmark models achieve low classification error but exhibit poor calibration (e.g., Random Forest, SGD-Balanced). Bayes_FW combines strong predictive performance with principled uncertainty estimates, making it especially well suited for applications where reliability and trust in model output are critical, such as healthcare.

4.2. Real Medical Dataset Applications

In this study, we evaluate the performance of the proposed model using four benchmark medical datasets, namely the Breast Cancer Wisconsin, Pima Indians Diabetes, South African Heart Disease, and Cleveland Heart Disease datasets. For each dataset, we report key statistical characteristics, including the number of samples, number of features, and class distribution, to ensure transparency and reproducibility.

The Pima Indians Diabetes dataset consists of 768 samples with 8 clinical features. The target variable indicates whether a patient is diagnosed with diabetes. Approximately 35% of the samples belong to the positive class, while 65% correspond to the negative class, indicating a moderately imbalanced classification problem. All features are standardized to have zero mean and unit variance prior to model training. To ensure a reliable evaluation, we employ a stratified 5-fold cross-validation scheme, preserving class proportions across training and test splits.

The Cleveland Heart Disease dataset consists of 297 samples with 13 clinical features. The target variable represents the presence or absence of heart disease. Approximately 54% of the samples correspond to the positive class, indicating a relatively balanced dataset. Similar to the Pima dataset, all features are standardized prior to analysis, and a stratified 5-fold cross-validation procedure is used to divide the dataset into training and test sets, ensuring consistency and comparability across experiments.

The Breast Cancer Wisconsin dataset and the South African Heart Disease dataset are also included in the experimental evaluation to provide a comprehensive assessment across datasets with varying levels of class imbalance, feature characteristics, and noise sensitivity.

This standardized evaluation protocol allows for a fair and consistent comparison between the proposed method and competing approaches across diverse medical data settings.
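The evaluation protocol described above (standardization plus stratified 5-fold cross-validation) can be sketched as follows. The text does not state whether scaling statistics are computed on the full dataset or within each training fold; the sketch assumes the latter, which avoids leaking test-fold information into the scaler. The helper names and the random placeholder data are hypothetical and stand in for a real dataset.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=0):
    """Assign each sample a fold index so that class proportions are preserved."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        # Shuffle the indices of this class and deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % n_splits
    return fold

def standardize_in_fold(X_train, X_test):
    """Scale both splits with the mean and SD estimated on the training split only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sd, (X_test - mu) / sd

# Placeholder data with the shape and prevalence of the Pima dataset (not the real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(768, 8))
y = (rng.random(768) < 0.35).astype(int)

fold = stratified_folds(y, n_splits=5)
for k in range(5):
    train, test = fold != k, fold == k
    X_tr, X_te = standardize_in_fold(X[train], X[test])
    # Fit any classifier on (X_tr, y[train]) and score it on (X_te, y[test]).
```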

4.2.1. Breast Cancer Wisconsin Dataset

The Breast Cancer Wisconsin (Original) dataset was obtained from the UCI Machine Learning Repository. It consists of nine predictor variables describing cellular characteristics, along with a binary class label indicating whether a tumor is malignant (1) or benign (0). The dataset originally contains 699 observations; after removing instances with missing values, 683 cases remained for analysis. The class distribution is moderately imbalanced, with approximately 65% benign and 35% malignant samples. Prior to model training, all predictors were standardized to have zero mean and unit variance, and the ID attribute, which carries no predictive information, was discarded.

An overview of the predictor variables in the Breast Cancer Wisconsin dataset is provided in Table 4. These features represent morphological and nuclear characteristics extracted from digitized cell images and form the basis for malignancy classification.

Table 5 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).

Across all metrics, Bayes_FW consistently achieved the strongest overall performance. It obtained the highest mean AUC (0.9975) and PRAUC (0.9931), along with superior F1-score (0.9587) and accuracy (0.9699), while maintaining low variability (SD < 0.006). These results highlight the robustness and discriminative strength of the Bayesian approach in identifying malignant cases. Among the frequentist baselines, Logistic_EN and Logistic_L1 also performed competitively, with AUC values around 0.996 and balanced F1-scores near 0.947, suggesting that regularization contributes to slight gains over the standard logistic model. Random Forest and GradBoost delivered marginally lower performance, reflecting the limited benefit of nonlinear tree-based methods for this dataset, which primarily consists of moderately correlated numeric predictors.

The Bayesian Feature Weighting model outperformed all other approaches, demonstrating superior predictive accuracy and stability, and confirming its effectiveness for robust and interpretable tumor classification.

Table 6 summarizes the calibration and reliability metrics for the six classification models evaluated on the Breast Cancer Wisconsin dataset. The reported measures include the LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with their mean and standard deviation (SD) across repeated cross-validation runs.

Among all compared models, the Bayesian Feature Weighting (Bayes_FW) model achieved the most reliable probabilistic predictions, as evidenced by the lowest LogLoss (0.0912) and lowest Brier score (0.0243). These results indicate that the Bayesian approach produces well-calibrated probability estimates with minimal deviation from true outcome probabilities. The model also yielded the smallest ECE (0.0321), suggesting excellent overall calibration across probability bins.

In comparison, Random Forest exhibited slightly higher LogLoss (0.0922) but maintained strong calibration (ECE = 0.0365). Traditional logistic models—Logistic_EN, Logistic_L1, and Logistic—performed reasonably well but showed modestly higher LogLoss and Brier scores, indicating slightly less accurate probability estimation. The Gradient Boosting (GradBoost) model performed the weakest in terms of calibration, with the highest LogLoss (0.1178) and Brier score (0.0292), reflecting some degree of overconfidence in its probability predictions.

The Bayes_FW model outperformed all alternatives across all four calibration metrics, confirming its superiority not only in predictive discrimination (as shown in Table 5) but also in probability reliability and uncertainty quantification—a key advantage of the Bayesian framework.

Table 7 and Figure 5 present the posterior mean feature weights and their 90% credible intervals estimated using the Bayesian Feature Weighting (Bayes_FW) model. These results quantify the relative contribution of each cellular attribute to malignancy prediction while incorporating model uncertainty through the Bayesian posterior distribution.

It is important to note that the feature weights $w_{j}$ represent the relative relevance of predictors within the weighting structure. The actual contribution of a predictor to the linear predictor depends on the product $w_{j} β_{j}$. Since the hierarchical horseshoe prior strongly shrinks irrelevant coefficients toward zero, predictors with near-zero $β_{j}$ have negligible predictive influence even if their corresponding weights are moderately large.
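A small numerical example illustrates this point. It assumes that the linear predictor takes the additive form $η = α_{0} + \sum_{j} w_{j} β_{j} x_{j}$, which is one reading of the statement above rather than a quoted model equation, and all numbers are hypothetical.

```python
import numpy as np

w = np.array([0.40, 0.35, 0.25])    # hypothetical simplex weights (sum to 1)
beta = np.array([1.5, 0.0, -0.8])   # hypothetical coefficients; beta[1] is shrunk to zero
x = np.array([1.0, 2.0, -1.0])      # one standardized observation
alpha0 = -0.2                       # hypothetical intercept

# Per-feature contribution to the linear predictor under the assumed additive form.
contribution = w * beta * x
eta = alpha0 + contribution.sum()

print(contribution)  # the second feature contributes nothing despite w[1] = 0.35
print(eta)
```

The second feature carries the second-largest weight, yet its contribution is exactly zero because its coefficient has been shrunk to zero.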

According to the results, Bare nuclei, Clump thickness, and Mitoses emerge as the most influential predictors, showing the highest posterior mean weights (0.1467, 0.1339, and 0.1253, respectively). Their wider yet consistently positive credible intervals indicate both strong and stable associations with malignancy likelihood. Intermediate importance is observed for Normal nucleoli, Cell size, and Bland chromatin, which also contribute meaningfully but with slightly lower average weights. Finally, Marginal adhesion, Cell shape, and Epithelial cell size have the smallest posterior means, suggesting relatively weaker influence in distinguishing malignant from benign tumors.

Figure 5 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The clear separation of higher-weighted features at the top highlights the discriminative power of nuclear irregularities and cellular cohesion, which are biologically consistent with pathological observations in breast cancer diagnosis.
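Summaries of the kind reported in Table 7 can be obtained from posterior draws of the weight vector, as in the sketch below. It assumes equal-tailed intervals (the text does not say whether equal-tailed or highest-density intervals were used), and the Dirichlet draws and feature names are placeholders standing in for real MCMC output.

```python
import numpy as np

def summarize_weights(draws, names, level=0.90):
    """Posterior mean and equal-tailed credible interval for each feature weight.
    `draws` has shape (n_draws, n_features); each row lies on the simplex."""
    draws = np.asarray(draws, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail], axis=0)
    mean = draws.mean(axis=0)
    order = np.argsort(-mean)  # rank features by posterior mean, largest first
    return [(names[j], mean[j], lower[j], upper[j]) for j in order]

# Placeholder draws from a Dirichlet distribution (not real posterior output).
rng = np.random.default_rng(0)
draws = rng.dirichlet([8.0, 4.0, 2.0], size=4000)
for name, m, lo, hi in summarize_weights(draws, ["f1", "f2", "f3"]):
    print(f"{name}: mean={m:.3f}, 90% CI=({lo:.3f}, {hi:.3f})")
```

Because every draw sums to one, the posterior means also sum to one, so the weights can be read as shares of total relevance.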

4.2.2. Pima Indians Diabetes Dataset

The Pima Indians Diabetes dataset is a widely used benchmark in medical machine learning, originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains clinical and physiological measurements from female patients of Pima Indian heritage aged 21 years or older. The dataset includes eight predictor variables—such as glucose concentration, body mass index (BMI), and number of pregnancies—that are important risk factors for type 2 diabetes. The binary outcome variable indicates whether an individual shows signs of diabetes (1) or not (0), based on established diagnostic criteria. The dataset is frequently used to evaluate predictive models in healthcare, as it combines demographic, genetic, and lifestyle-related risk indicators with measurable biomedical parameters.

An overview of the predictor variables in the Pima Indians Diabetes dataset is provided in Table 8. These features represent demographic, physiological, and biochemical risk factors commonly associated with type 2 diabetes.

To further validate the proposed Bayesian framework, comprehensive posterior diagnostics are provided in Appendix A. The marginal posterior distributions of key global parameters (Figure A2) indicate well-defined and unimodal behavior. In particular, the intercept ($α$) is tightly concentrated, while the global shrinkage parameter ($τ$) exhibits a right-skewed distribution, reflecting the adaptive sparsity induced by the horseshoe prior. The label-noise parameter ($ϵ$) is centered around low values, suggesting limited but non-negligible noise in the dataset.

Sampling diagnostics confirm the reliability of inference. As shown in Figure A3, the effective sample size ratios ($N_{eff} / N$) are consistently high, indicating efficient exploration of the posterior space. Similarly, the $\hat{R}$ statistics (Figure A3) are tightly concentrated around 1, providing strong evidence of convergence across all chains.

Trace plots for both global parameters and regression coefficients (Figure A5, Figure A6 and Figure A7) demonstrate good mixing behavior with no visible trends or chain separation, further supporting stable MCMC performance. Occasional spikes are observed in the coefficient traces because of the heavy-tailed prior, but these do not indicate pathological sampling behavior.

The joint posterior structure (Figure A8) reveals mild dependencies among parameters, particularly between $α$ and $τ$, which is expected in hierarchical shrinkage models. Importantly, no pathological correlations or funnel-shaped geometries are observed.

Finally, the posterior predictive check (Figure A9) shows strong agreement between observed and model-generated distributions, indicating that the proposed model successfully captures the underlying data-generating process. Minor deviations at extreme probability regions suggest slight calibration imperfections but do not materially affect predictive performance.

Overall, these diagnostics confirm that the proposed Bayesian feature weighting model achieves stable convergence, efficient sampling, and reliable uncertainty quantification on the Pima Indians Diabetes dataset.

Table 9 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).

Across all discrimination metrics, the Bayes_FW method achieved the highest overall performance, with an AUC mean of 0.8426, PRAUC mean of 0.7851, and F1-score mean of 0.6881, outperforming both conventional logistic regression variants and ensemble-based methods. These results highlight the model’s ability to capture uncertainty in feature contributions while maintaining high discriminative power. Logistic_L1 and standard Logistic regression followed closely, exhibiting comparable AUC values (0.8338 and 0.8324, respectively) but slightly lower precision–recall and F1-scores. Ensemble models such as Random Forest and Gradient Boosting demonstrated lower AUC and PRAUC values, suggesting less stable performance for this moderately imbalanced dataset. The higher SD observed for Gradient Boosting indicates greater variability across runs, potentially due to hyperparameter sensitivity or overfitting in smaller training subsets.

Table 10 reports the calibration and reliability metrics for the same models. The measures include LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with mean and SD across repeated cross-validation runs.

Among the compared methods, the Bayesian Feature Weighting (Bayes_FW) approach achieved the lowest LogLoss (0.3404) and Brier score (0.1468), indicating superior probabilistic accuracy and overall calibration. It also produced the smallest ECE (0.0700) and MCE (0.2783), suggesting that the Bayesian model provides well-calibrated probability estimates that closely align with observed outcomes. In contrast, the Gradient Boosting (GradBoost) model showed the weakest calibration, with the highest LogLoss and ECE values, implying overconfident predictions and larger deviations from true event frequencies.

Table 11 and Figure 6 present the posterior mean feature weights and their 90% credible intervals estimated by the Bayesian Feature Weighting model. These results quantify the relative contribution of each clinical variable to diabetes prediction while incorporating model uncertainty through the Bayesian posterior distribution.

According to the results, glucose, body mass index, and number of pregnancies are the most influential predictors, exhibiting the highest posterior mean weights (0.1878, 0.1654, and 0.1471, respectively). These features show strong and stable associations with diabetes risk, as indicated by their positive and moderately wide credible intervals. Pedigree function, reflecting genetic predisposition, also ranks among the top predictors. Lower but meaningful contributions are observed for blood pressure, insulin, and triceps skinfold thickness, suggesting secondary influence in the model’s classification process.

Figure 6 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The dominance of glucose concentration and body mass index underscores their well-established roles as primary determinants of type 2 diabetes, while the remaining features capture secondary but biologically consistent effects.

4.2.3. South African Heart Disease Dataset

The South African Heart Disease (SAHeart) dataset originates from a South African study on risk factors associated with coronary heart disease (CHD). It includes demographic, clinical, and lifestyle-related variables commonly linked to cardiovascular outcomes. The dataset combines biochemical measures (e.g., LDL cholesterol) with behavioral indicators (e.g., tobacco and alcohol use) and psychosocial factors (Type-A behavior). The binary outcome variable indicates the presence (1) or absence (0) of CHD. This dataset is widely used in statistical learning and medical data analysis because it provides a comprehensive mix of physiological, behavioral, and hereditary risk factors.

An overview of the predictor variables in the SAHeart dataset is provided in Table 12.

Table 13 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), L1-regularized logistic regression (Logistic_L1), standard logistic regression (Logistic), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).

Across discrimination metrics, Bayes_FW achieved the best overall performance (AUC = 0.7903; PRAUC = 0.6933), with competitive F1-score (0.5494) and accuracy (0.7400). Logistic_EN and Logistic_L1 followed closely in AUC and PRAUC. Tree-based methods (Random Forest, GradBoost) showed lower discrimination, consistent with potential overfitting or noise sensitivity in smaller biomedical datasets.

Table 14 summarizes calibration and loss metrics: LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each reported with mean and SD across repeated runs.

Bayes_FW achieved the lowest LogLoss (0.5245) and Brier score (0.1645), indicating strong probabilistic accuracy. Although its ECE was higher than some logistic baselines, Bayes_FW maintained competitive calibration overall, while GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).

Table 15 and Figure 7 present posterior mean feature weights and 90% credible intervals estimated by Bayes_FW.

According to the results, age, family history (famhist), and tobacco exhibit the largest posterior mean weights, indicating the strongest association with CHD risk in this cohort. Biochemical and physiological indicators such as ldl and bp show moderate influence, while lifestyle/anthropometric variables (obesity, adiposity, alcohol) contribute more weakly, with wider credible intervals reflecting greater uncertainty.

4.2.4. Heart Disease (Cleveland) Dataset

The Heart Disease (Cleveland) dataset is a widely used benchmark in cardiovascular research and machine learning. It contains 303 patient records collected at the Cleveland Clinic Foundation; after preprocessing (removing missing values and encoding categorical variables), 297 samples remain. The outcome variable is binary: presence of heart disease (1) versus absence (0).

The dataset includes clinical, demographic, and exercise-related attributes (e.g., age, blood pressure, cholesterol, thalassemia test results, ECG findings). Categorical features (e.g., cp, thal, slope, restecg, ca, exang) were expanded to one-hot indicators for compatibility with the Bayesian feature weighting framework.
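The one-hot expansion mentioned above can be sketched as follows. The naming pattern `prefix_level` follows the feature names that appear in the results (e.g., cp_3, thal_3); whether a reference level is dropped for each variable is not stated, so this sketch keeps all levels, and the helper name and example values are hypothetical.

```python
import numpy as np

def one_hot(values, prefix):
    """Expand one categorical column into 0/1 indicator columns named prefix_level."""
    values = np.asarray(values)
    levels = np.unique(values)                       # sorted distinct categories
    names = [f"{prefix}_{lvl}" for lvl in levels]
    # Compare every value with every level to obtain the indicator matrix.
    indicators = (values[:, None] == levels[None, :]).astype(int)
    return names, indicators

names, Z = one_hot([1, 3, 4, 3], prefix="cp")
print(names)  # ['cp_1', 'cp_3', 'cp_4']
print(Z)
```

Each indicator column then receives its own weight, which is why the feature-weight results for this dataset are reported at the level of individual categories.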

An overview of the predictor variables is provided in Table 16.

Table 17 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).

Across discrimination metrics, Bayes_FW achieved the strongest overall performance (AUC = 0.9079; PRAUC = 0.9076; F1 = 0.8264; ACC = 0.8440). Penalized logistic baselines (Logistic_EN, Logistic_L1) were competitive, while tree-based methods (Random Forest, GradBoost) trailed on average, consistent with smaller sample sizes and mixed continuous/categorical predictors.

Table 18 summarizes calibration and loss metrics—LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE)—each reported with mean and SD across repeated runs.

Bayes_FW achieved the lowest LogLoss (0.3365) and Brier score (0.1103), indicating strong probabilistic calibration. Logistic baselines were competitive but less precise, whereas GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).

Table 19 and Figure 8 present the posterior mean feature weights and their 90% credible intervals estimated using Bayes_FW.

According to the results, cp_3, ca_1, and thal_3 exhibit the largest posterior mean weights, indicating the strongest association with heart disease risk in this cohort. Moderately important features include oldpeak, slope_1, sex_1, and thalach. Variables such as fbs_1, chol, and age receive smaller weights after accounting for correlation among predictors. Figure 8 visually confirms these findings by showing posterior means with 90% credible intervals.

5. Results and Discussion

The proposed Bayesian Feature Weighting (Bayes_FW) model was comprehensively evaluated against six benchmark classifiers (standard Logistic Regression, L1- and Elastic-Net-regularized logistic models, Random Forest, Gradient Boosting, and a balanced SGD baseline) across four real-world biomedical datasets. Figure 9 and Figure 10 summarize the comparative outcomes in terms of calibration and discrimination, respectively, while Tables 5–19 report dataset-specific numerical results and posterior feature weight analyses. Figure 9 and Figure 10 are each composed of four panels: (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease datasets.

Across all datasets, Bayes_FW achieved superior or comparable predictive accuracy while demonstrating the most stable probability calibration. In Figure 10, which presents normalized discrimination metrics (AUC, PRAUC, F1, and Accuracy), Bayes_FW consistently attains the highest normalized scores across panels (a–d), confirming its strong and robust ranking. This advantage is most pronounced for the Breast Cancer Wisconsin and Cleveland Heart Disease datasets, where AUC and PRAUC exceed 0.90 and F1-scores remain near or above 0.80. Regularized logistic models (Logistic_EN and Logistic_L1) follow closely, whereas tree-based ensembles show greater variability, particularly under smaller sample sizes.

Figure 9 reports normalized calibration metrics (LogLoss, Brier, ECE, and MCE). Bayes_FW yields the lowest average losses across all four datasets, indicating more reliable uncertainty estimation and reduced overconfidence. In the Breast Cancer and Pima Indians Diabetes datasets (Figure 9a,b), Bayes_FW achieves the smallest LogLoss and Brier scores, confirming accurate probability estimation. Even under noisier or more heterogeneous settings (South African Heart Disease and Cleveland datasets; Figure 9c,d), calibration remains stable with minimal degradation, unlike Gradient Boosting, which exhibits larger ECE/MCE values.

Numerically, Bayes_FW shows consistent gains across datasets. For the Breast Cancer Wisconsin dataset, Bayes_FW achieves the highest AUC (0.9975) and F1 (0.9587), together with the lowest LogLoss (0.0912) and Brier score (0.0243), reflecting near-perfect discrimination and excellent calibration. For the Pima Indians Diabetes dataset, Bayes_FW attains the lowest LogLoss (0.340) and the smallest ECE (0.070), improving calibration over the best logistic baseline. For the South African Heart Disease dataset, Bayes_FW maintains balanced discrimination (AUC = 0.790) and competitive calibration (LogLoss = 0.525), outperforming both regularized logistic models and ensemble learners on average. Finally, for the Cleveland Heart Disease dataset, Bayes_FW attains the highest AUC (0.908) and the smallest LogLoss (0.337), demonstrating reliable predictive power and robust uncertainty quantification in a mixed categorical–continuous setting.

Posterior feature weight analyses (Table 7, Table 11, Table 15 and Table 19) reveal coherent, domain-consistent patterns. Dominant predictors such as Bare nuclei (breast cancer), glucose and BMI (diabetes), age and family history (cardiovascular risk), and chest-pain/angiographic indicators (Cleveland dataset) receive the highest posterior means with credible intervals that remain well separated from less informative variables. The credible interval visualizations (Figure 5, Figure 6, Figure 7 and Figure 8) further confirm that Bayes_FW not only ranks key features effectively but also provides uncertainty bounds that quantify their relative stability.

Overall, Figure 9 and Figure 10 demonstrate that Bayes_FW achieves a strong balance of high discrimination and superior calibration across diverse biomedical domains. It produces well-calibrated probability estimates, stable performance under noise and imbalance, and interpretable uncertainty-aware feature importance, supporting its use as a robust and generalizable probabilistic learning framework.

To further strengthen the interpretability of the proposed Bayesian feature weighting framework, the estimated feature importance results are examined in conjunction with established clinical domain knowledge. In the case of the Pima Indians Diabetes dataset, the model consistently assigns higher posterior weights to variables such as plasma glucose concentration, body mass index (BMI), and age. These variables are well-documented in the medical literature as primary risk factors for the development of type 2 diabetes, thereby providing strong external validation for the model’s findings.

More specifically, elevated plasma glucose levels are directly indicative of impaired glucose metabolism, which is a defining characteristic of diabetes. Similarly, higher BMI values are associated with obesity-related insulin resistance, while increasing age is known to correlate with a higher prevalence of metabolic disorders. The alignment of these clinically meaningful variables with high posterior feature weights suggests that the proposed model is not only statistically effective but also capable of capturing medically relevant patterns in the data.

In addition, features such as insulin levels, skin thickness, and blood pressure receive moderate importance scores. Although their individual effects may be less pronounced or more variable across patients, these variables are still recognized as contributing factors in the broader pathophysiology of diabetes. Their inclusion among the relevant predictors further supports the model’s ability to reflect complex, multifactorial relationships inherent in medical data.

Importantly, the simplex-constrained weighting structure enables a clear and interpretable ranking of features, while the incorporation of shrinkage priors mitigates the influence of noisy or less informative variables. This combination allows the model to balance sparsity and flexibility, leading to feature importance estimates that are both stable and clinically meaningful.
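To make the role of the simplex weights concrete, the sketch below evaluates the baseline probability $s_{i} = σ(α_{0} + {(w ⊙ β)}^{⊤} x_{i})$ using the notation of Table 1. The function name `baseline_probability` and the numeric values are hypothetical and serve only to show how each coefficient is scaled by its global weight.

```python
import numpy as np

def baseline_probability(X, w, beta, alpha0):
    """s_i = sigmoid(alpha0 + (w * beta)^T x_i), following Table 1."""
    w = np.asarray(w, dtype=float)
    # w must be non-negative and sum to one (probability simplex)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must lie on the probability simplex")
    effective = w * np.asarray(beta, dtype=float)   # elementwise product w ⊙ β
    eta = alpha0 + np.asarray(X, dtype=float) @ effective
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical example with three features
X = np.array([[1.0, 2.0, 0.0]])
s = baseline_probability(X, w=[0.5, 0.3, 0.2], beta=[2.0, -1.0, 4.0], alpha0=0.0)
```

Because the weights sum to one, increasing the weight of one feature necessarily decreases the weight available to the others, which is what makes the resulting ranking directly comparable across features.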

Overall, the strong agreement between the model-derived feature importance and established medical evidence enhances the credibility of the proposed approach and highlights its potential utility in real-world healthcare applications, where interpretability and domain consistency are essential.

In addition to the empirical results, it is important to position the proposed model within the broader context of state-of-the-art Bayesian robust modeling approaches. Compared to existing methods such as Bayesian logistic regression with heavy-tailed priors and Dirichlet process–based models, the proposed framework provides a unified mechanism that simultaneously achieves feature weighting, robustness to contamination, and interpretability through the simplex constraint. Whereas heavy-tailed priors primarily address outliers and nonparametric Bayesian approaches focus on distributional flexibility, our method integrates these aspects with an explicit feature importance structure.

Despite these advantages, several limitations should be acknowledged. First, the assumption of a symmetric contamination mechanism may not fully capture class-dependent noise patterns commonly observed in medical datasets. Second, the computational cost associated with HMC-based inference can become significant in high-dimensional settings. Third, the simplex constraint, while improving interpretability, may introduce additional geometric complexity in posterior sampling.

These limitations suggest several promising directions for future research. Extending the model to asymmetric or class-dependent noise structures would improve its applicability in real-world clinical settings. Additionally, scalable inference techniques such as variational approximations or stochastic gradient-based methods could be explored to enhance computational efficiency. Finally, integrating the proposed framework with nonparametric priors or deep learning architectures may further improve flexibility and predictive performance.
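The difference between symmetric and class-dependent label noise can be sketched as follows. This is a minimal illustration that assumes the noise-adjusted probability takes the standard label-flip form, $p_{i} = (1 - ε) s_{i} + ε (1 - s_{i})$ in the symmetric case; the two-rate generalization and the names `eps0` and `eps1` are assumptions introduced here, not part of the proposed model.

```python
import numpy as np

def noisy_label_probability(s, eps0, eps1=None):
    """P(observed y = 1) given clean probability s under label flipping.

    eps0: P(observed 1 | true 0); eps1: P(observed 0 | true 1).
    If eps1 is None, the noise is symmetric (eps1 = eps0).
    """
    s = np.asarray(s, dtype=float)
    if eps1 is None:
        eps1 = eps0
    # True positives that survive flipping plus true negatives flipped to 1
    return (1.0 - eps1) * s + eps0 * (1.0 - s)

s = np.array([0.0, 0.9, 1.0])
p_sym = noisy_label_probability(s, eps0=0.1)              # symmetric noise
p_asym = noisy_label_probability(s, eps0=0.05, eps1=0.2)  # class-dependent noise
```

Under symmetric noise the adjusted probability is confined to $[ε, 1 - ε]$, whereas the class-dependent version allows different floors and ceilings for the two classes, for example when missed diagnoses are more frequent than false alarms.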

6. Conclusions

This study introduced a novel Bayesian Feature Weighting (Bayes_FW) framework that fundamentally redefines how feature importance is modeled and interpreted in the presence of noisy medical data. Unlike conventional deterministic feature weighting techniques and existing Bayesian approaches that treat robustness, shrinkage, and uncertainty in isolation, the proposed model unifies simplex-constrained global feature weighting, hierarchical shrinkage priors, and contamination-aware noise modeling within a single coherent probabilistic framework. This integration constitutes a key methodological contribution, enabling globally normalized and interpretable feature weights while explicitly controlling the influence of mislabeled observations and aberrant feature values.

Through extensive empirical evaluation on four benchmark medical datasets (Breast Cancer Wisconsin, Pima Indians Diabetes, South African Heart Disease, and Cleveland Heart Disease), the proposed framework consistently demonstrated superior performance over classical logistic regression variants and ensemble-based learners. Importantly, the gains were not limited to discrimination metrics such as AUC, F1, and accuracy, but extended to probabilistic calibration measures, including LogLoss, Brier score, and Expected Calibration Error (ECE). These results underscore the ability of the proposed Bayesian formulation to deliver reliable probability estimates, which are essential for risk-sensitive medical decision making.

Beyond predictive performance, a central advantage of the Bayes_FW framework lies in its ability to produce uncertainty-aware and globally interpretable feature-importance estimates. By constraining feature weights to the probability simplex and modeling them probabilistically, the proposed approach provides a principled representation of global feature relevance that remains stable under data contamination. The posterior distributions of feature weights enable transparent quantification of uncertainty, offering clinically meaningful insights into variable importance rather than relying on fragile point estimates.

The feature importance results obtained from the proposed model are consistent with established clinical evidence. For instance, in the Pima Indians Diabetes dataset, glucose concentration emerges as the most influential predictor, which aligns with its central role in diabetes diagnosis and progression. Body mass index (BMI) and age are also identified as key contributors, reflecting their well-documented association with metabolic risk. This agreement between the model outputs and domain knowledge provides additional validation of the proposed framework and highlights its potential for interpretable and clinically relevant machine learning applications.

Collectively, these findings demonstrate that incorporating Bayesian inference into feature weighting, when combined with simplex-based normalization and contamination-aware priors, yields a robust, interpretable, and uncertainty-aware modeling paradigm for medical classification. The proposed framework bridges a critical methodological gap between robust Bayesian learning and interpretable feature weighting, offering a scalable and theoretically grounded solution for noisy biomedical data. Future research directions include extending the proposed model to multiclass and longitudinal outcomes, incorporating structured or group-wise priors to capture hierarchical biomedical relationships, and integrating the framework with Bayesian deep learning architectures. Such extensions would further enhance the applicability of the proposed approach to complex, high-dimensional clinical data environments where robustness, interpretability, and uncertainty quantification are simultaneously required.

Author Contributions

Conceptualization, M.A.C.; methodology, M.A.C. and Z.Ö.; software, M.A.C.; validation, M.A.C. and A.A.; formal analysis, M.A.C.; investigation, M.A.C. and Z.Ö.; resources, M.A.C.; data curation, M.A.C.; writing—original draft preparation, M.A.C. and A.A.; writing—review and editing, M.A.C., Z.Ö. and A.A.; visualization, A.A.; supervision, M.A.C.; project administration, M.A.C.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2601).

Institutional Review Board Statement

Ethical approval was waived because this study is based entirely on the analysis of publicly available datasets. The authors did not collect any primary data from human participants, nor did they have any direct contact with patients or their medical records.

Informed Consent Statement

Informed consent was waived due to the retrospective nature of the study.

Data Availability Statement

The datasets used and analyzed during the current study are publicly available on Zenodo at DOI: https://doi.org/10.5281/zenodo.17559308.

Acknowledgments

The authors confirm that ChatGPT 5.2 (OpenAI) was used exclusively to assist with English language editing. The tool was not involved in creating original scientific content, analyzing data, or drawing conclusions. All research findings and interpretations are entirely the work of the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Simulation Study: Additional Performance Tables and Figures

Table A1. Discrimination metrics by model and scenario: Mean and standard deviation (SD) for AUC, PRAUC, F1, and ACC across eight scenarios and all models.

Scenario	Model	AUC Mean	AUC SD	PRAUC Mean	PRAUC SD	F1 Mean	F1 SD	ACC Mean	ACC SD
S1	GradBoost	0.736	0.051	0.731	0.065	0.647	0.067	0.668	0.039
S1	Logistic	0.797	0.041	0.786	0.061	0.718	0.068	0.727	0.048
S1	Logistic_EN	0.799	0.037	0.790	0.059	0.717	0.069	0.725	0.046
S1	Logistic_L1	0.801	0.035	0.793	0.055	0.719	0.069	0.728	0.048
S1	RandomForest	0.768	0.041	0.758	0.058	0.676	0.080	0.695	0.060
S1	SGD_Balanced	0.753	0.035	0.741	0.075	0.650	0.047	0.658	0.031
S1	Bayes_FW	0.820	0.052	0.831	0.042	0.725	0.062	0.738	0.061
S2	GradBoost	0.636	0.103	0.637	0.098	0.594	0.086	0.613	0.068
S2	Logistic	0.673	0.106	0.671	0.102	0.587	0.117	0.612	0.100
S2	Logistic_EN	0.679	0.107	0.676	0.102	0.592	0.140	0.615	0.114
S2	Logistic_L1	0.679	0.107	0.678	0.106	0.601	0.137	0.620	0.115
S2	RandomForest	0.657	0.084	0.647	0.083	0.602	0.088	0.625	0.073
S2	SGD_Balanced	0.590	0.064	0.572	0.070	0.573	0.047	0.578	0.049
S2	Bayes_FW	0.861	0.043	0.873	0.038	0.786	0.049	0.787	0.050
S3	GradBoost	0.585	0.083	0.603	0.086	0.536	0.068	0.558	0.065
S3	Logistic	0.593	0.057	0.609	0.062	0.592	0.058	0.577	0.045
S3	Logistic_EN	0.608	0.064	0.628	0.060	0.587	0.072	0.573	0.066
S3	Logistic_L1	0.622	0.070	0.647	0.067	0.597	0.067	0.595	0.063
S3	RandomForest	0.579	0.078	0.607	0.082	0.541	0.070	0.542	0.046
S3	SGD_Balanced	0.566	0.040	0.545	0.034	0.576	0.056	0.567	0.037
S3	Bayes_FW	0.709	0.088	0.711	0.089	0.649	0.073	0.657	0.069
S4	GradBoost	0.572	0.095	0.583	0.076	0.570	0.088	0.565	0.070
S4	Logistic	0.620	0.072	0.615	0.046	0.607	0.059	0.593	0.064
S4	Logistic_EN	0.621	0.072	0.614	0.045	0.602	0.061	0.595	0.061
S4	Logistic_L1	0.621	0.072	0.611	0.043	0.608	0.058	0.600	0.057
S4	RandomForest	0.567	0.088	0.578	0.050	0.573	0.082	0.570	0.064
S4	SGD_Balanced	0.598	0.058	0.583	0.074	0.604	0.044	0.570	0.044
S4	Bayes_FW	0.677	0.072	0.699	0.059	0.611	0.075	0.631	0.048
S5	GradBoost	0.643	0.061	0.634	0.069	0.602	0.082	0.607	0.053
S5	Logistic	0.736	0.088	0.739	0.104	0.647	0.092	0.655	0.095
S5	Logistic_EN	0.740	0.090	0.745	0.104	0.653	0.095	0.660	0.096
S5	Logistic_L1	0.743	0.091	0.748	0.105	0.658	0.093	0.663	0.095
S5	RandomForest	0.636	0.061	0.615	0.072	0.589	0.066	0.595	0.047
S5	SGD_Balanced	0.684	0.057	0.656	0.082	0.629	0.060	0.640	0.057
S5	Bayes_FW	0.767	0.066	0.803	0.057	0.711	0.046	0.677	0.071
S6	GradBoost	0.512	0.096	0.395	0.084	0.254	0.099	0.575	0.064
S6	Logistic	0.596	0.081	0.435	0.064	0.416	0.117	0.592	0.074
S6	Logistic_EN	0.595	0.075	0.440	0.054	0.431	0.119	0.608	0.060
S6	Logistic_L1	0.596	0.079	0.441	0.051	0.429	0.105	0.607	0.053
S6	RandomForest	0.543	0.068	0.406	0.063	0.086	0.100	0.650	0.054
S6	SGD_Balanced	0.572	0.090	0.392	0.078	0.448	0.116	0.598	0.086
S6	Bayes_FW	0.652	0.073	0.527	0.124	0.524	0.227	0.689	0.056
S7	GradBoost	0.625	0.088	0.285	0.076	0.125	0.056	0.798	0.029
S7	Logistic	0.647	0.067	0.331	0.085	0.316	0.082	0.779	0.028
S7	Logistic_EN	0.643	0.069	0.336	0.090	0.329	0.064	0.785	0.025
S7	Logistic_L1	0.637	0.070	0.338	0.091	0.320	0.080	0.789	0.033
S7	RandomForest	0.612	0.074	0.274	0.070	0.000	0.000	0.782	0.008
S7	SGD_Balanced	0.618	0.052	0.253	0.043	0.346	0.060	0.715	0.040
S7	Bayes_FW	0.656	0.079	0.344	0.094	0.479	0.018	0.810	0.014
S8	GradBoost	0.615	0.042	0.535	0.052	0.473	0.067	0.609	0.031
S8	Logistic	0.518	0.047	0.446	0.038	0.459	0.050	0.527	0.038
S8	Logistic_EN	0.518	0.051	0.448	0.043	0.453	0.055	0.523	0.048
S8	Logistic_L1	0.522	0.051	0.452	0.044	0.451	0.064	0.517	0.052
S8	RandomForest	0.648	0.059	0.570	0.050	0.384	0.103	0.634	0.025
S8	SGD_Balanced	0.516	0.027	0.431	0.023	0.460	0.044	0.528	0.020
S8	Bayes_FW	0.694	0.049	0.649	0.066	0.575	0.027	0.679	0.044

Table A2. Calibration metrics by model and scenario: Mean and standard deviation (SD) for LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Gap (Max CalibGap) across eight experimental scenarios and all evaluated models.

Scenario	Model	LogLoss Mean	LogLoss SD	Brier Mean	Brier SD	ECE Mean	ECE SD	Max CalibGap Mean	Max CalibGap SD
S1	GradBoost	0.777	0.083	0.241	0.026	0.223	0.042	0.651	0.147
S1	Logistic	0.594	0.081	0.197	0.028	0.165	0.042	0.444	0.108
S1	Logistic_EN	0.585	0.076	0.194	0.027	0.155	0.028	0.489	0.132
S1	Logistic_L1	0.577	0.071	0.192	0.025	0.143	0.034	0.471	0.160
S1	RandomForest	0.589	0.028	0.201	0.013	0.120	0.033	0.307	0.188
S1	SGD_Balanced	5.729	1.102	0.331	0.030	0.337	0.029	0.724	0.183
S1	Bayes_FW	0.525	0.060	0.175	0.025	0.093	0.022	0.341	0.170
S2	GradBoost	0.828	0.161	0.272	0.051	0.220	0.065	0.532	0.135
S2	Logistic	0.666	0.097	0.234	0.044	0.179	0.048	0.546	0.256
S2	Logistic_EN	0.658	0.095	0.231	0.043	0.172	0.048	0.550	0.240
S2	Logistic_L1	0.652	0.092	0.230	0.041	0.178	0.048	0.504	0.233
S2	RandomForest	0.654	0.058	0.231	0.026	0.122	0.052	0.295	0.059
S2	SGD_Balanced	7.794	1.621	0.410	0.046	0.416	0.045	0.707	0.200
S2	Bayes_FW	0.483	0.053	0.155	0.022	0.133	0.033	0.359	0.113
S3	GradBoost	0.909	0.157	0.303	0.048	0.265	0.069	0.611	0.177
S3	Logistic	1.436	0.259	0.353	0.044	0.347	0.048	0.607	0.128
S3	Logistic_EN	1.257	0.255	0.333	0.050	0.326	0.062	0.623	0.148
S3	Logistic_L1	1.161	0.257	0.319	0.052	0.301	0.070	0.605	0.162
S3	RandomForest	0.683	0.022	0.245	0.011	0.069	0.026	0.295	0.099
S3	SGD_Balanced	14.321	1.312	0.433	0.036	0.434	0.035	0.451	0.033
S3	Bayes_FW	0.627	0.065	0.218	0.029	0.127	0.038	0.286	0.154
S4	GradBoost	0.867	0.130	0.296	0.044	0.242	0.051	0.592	0.118
S4	Logistic	0.726	0.064	0.256	0.023	0.186	0.039	0.471	0.143
S4	Logistic_EN	0.716	0.060	0.253	0.022	0.174	0.038	0.446	0.180
S4	Logistic_L1	0.707	0.055	0.251	0.021	0.154	0.033	0.419	0.078
S4	RandomForest	0.692	0.028	0.249	0.013	0.117	0.035	0.411	0.208
S4	SGD_Balanced	7.482	1.754	0.416	0.039	0.421	0.039	0.687	0.191
S4	Bayes_FW	0.653	0.043	0.230	0.020	0.103	0.048	0.279	0.149
S5	GradBoost	0.863	0.119	0.279	0.035	0.238	0.065	0.525	0.129
S5	Logistic	0.623	0.122	0.215	0.046	0.174	0.056	0.400	0.107
S5	Logistic_EN	0.613	0.116	0.212	0.045	0.145	0.056	0.473	0.197
S5	Logistic_L1	0.605	0.111	0.210	0.043	0.150	0.052	0.445	0.186
S5	RandomForest	0.672	0.030	0.238	0.014	0.115	0.053	0.527	0.283
S5	SGD_Balanced	6.173	1.781	0.348	0.062	0.355	0.061	0.641	0.191
S5	Bayes_FW	0.592	0.055	0.203	0.025	0.109	0.029	0.306	0.133
S6	GradBoost	0.915	0.152	0.298	0.043	0.260	0.049	0.733	0.142
S6	Logistic	1.302	0.226	0.321	0.054	0.319	0.068	0.722	0.080
S6	Logistic_EN	1.174	0.176	0.310	0.047	0.294	0.062	0.700	0.137
S6	Logistic_L1	1.060	0.165	0.300	0.042	0.283	0.061	0.697	0.097
S6	RandomForest	0.642	0.032	0.225	0.015	0.092	0.038	0.363	0.223
S6	SGD_Balanced	13.047	2.792	0.400	0.087	0.401	0.087	0.644	0.124
S6	Bayes_FW	0.615	0.036	0.212	0.017	0.057	0.044	0.220	0.180
S7	GradBoost	0.518	0.066	0.158	0.021	0.114	0.033	0.707	0.177
S7	Logistic	0.716	0.129	0.181	0.025	0.164	0.034	0.733	0.151
S7	Logistic_EN	0.651	0.112	0.175	0.024	0.155	0.039	0.760	0.169
S7	Logistic_L1	0.587	0.093	0.167	0.023	0.143	0.035	0.735	0.177
S7	RandomForest	0.459	0.018	0.143	0.006	0.046	0.018	0.182	0.100
S7	SGD_Balanced	8.647	1.466	0.281	0.039	0.282	0.039	0.731	0.068
S7	Bayes_FW	0.446	0.031	0.137	0.011	0.036	0.033	0.117	0.219
S8	GradBoost	0.726	0.052	0.254	0.016	0.148	0.026	0.525	0.251
S8	Logistic	1.819	0.200	0.394	0.034	0.375	0.042	0.626	0.072
S8	Logistic_EN	1.601	0.165	0.385	0.035	0.365	0.047	0.562	0.102
S8	Logistic_L1	1.439	0.139	0.374	0.034	0.351	0.039	0.570	0.085
S8	RandomForest	0.652	0.022	0.230	0.010	0.072	0.021	0.273	0.144
S8	SGD_Balanced	15.336	0.775	0.471	0.020	0.473	0.020	0.601	0.108
S8	Bayes_FW	0.624	0.023	0.216	0.011	0.044	0.029	0.205	0.100

Figure A1. Breast Cancer Wisconsin Features: Joint posterior samples of the feature weights $w_{j}$ and regression coefficients $β_{j}$ for selected predictors in the Breast Cancer Wisconsin dataset. Each point represents one posterior draw obtained from the HMC–NUTS sampler. The absence of ridge-like structures indicates that the parameters are not affected by multiplicative non-identifiability.

Figure A2. Posterior distributions of key global parameters ($α$, $τ$, $ϵ$) for the Pima Indians Diabetes dataset. The x-axis represents the parameter values, while the y-axis shows the corresponding posterior density estimated from the MCMC samples. The shaded blue regions represent the posterior density distributions of the model parameters obtained from the MCMC samples. These distributions illustrate the uncertainty associated with each parameter and indicate the range of plausible values given the observed data.

Figure A3. Effective sample size ratios ($N_{eff} / N$) for all model parameters, indicating sampling efficiency. The x-axis represents the effective sample size ratio $N_{eff} / N$, which quantifies the sampling efficiency of the MCMC algorithm. The y-axis corresponds to the ordered model parameters, sorted according to their $N_{eff} / N$ values, and is used for visualization purposes only.

Figure A4. Pairwise joint posterior distributions of global parameters ($α$, $τ$, $ϵ$), illustrating dependency structure.

Figure A5. Posterior predictive check comparing observed and model-predicted outcome distributions.

Figure A6. Potential scale reduction factors ($\hat{R}$) for all parameters, demonstrating MCMC convergence. The x-axis displays the potential scale reduction factor ($\hat{R}$), a convergence diagnostic that compares within-chain and between-chain variability. The y-axis denotes the ordered indices of the model parameters after sorting them based on their $\hat{R}$ values, providing a visual summary of convergence across all parameters. The solid vertical line marks the ideal $\hat{R}$ value, corresponding to perfect convergence, where within-chain and between-chain variances are equal. The vertical dashed line marks a commonly used $\hat{R}$ convergence threshold; values below this threshold suggest satisfactory convergence of the MCMC chains, whereas values above it may indicate lack of convergence or insufficient mixing.

Figure A7. Trace plots of selected regression coefficients ($β$) across MCMC chains.

Figure A8. Trace plots of global parameters ($α$, $τ$, $ϵ$) showing mixing behavior across chains.

Figure A9. Trace plots of feature weights (w) across MCMC chains.

Figure 1. Discrimination trend plots of model performance across eight experimental scenarios (S1–S8).

Figure 2. Calibration trend plots of model performance across eight experimental scenarios (S1–S8).

Figure 3. Comparative performance across models and data scenarios: Heat maps show mean performance metrics by model (rows) and scenario (columns) for area under the ROC curve (AUC), area under the precision–recall curve (PRAUC), F1-score, and accuracy. Cell color indicates standardized scores; cell numbers represent mean metric values. For more information, see Table A1.

Figure 4. Calibration performance across scenarios: Heat maps show mean performance metrics by model (rows) and scenario (columns) for logloss, Brier score, expected calibration error (ECE), and maximum calibration gap. Cell color indicates standardized scores; cell numbers represent mean metric values. For more information, see Table A2.

Figure 5. Feature importance estimates for the Breast Cancer Wisconsin dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.

Figure 6. Posterior mean weights and 90% credible intervals for feature importance: Feature-wise posterior means and 90% credible intervals estimated by the Bayesian Feature Weighting model for the Pima Indians Diabetes dataset.

Figure 7. Feature importance estimates for the SAHeart dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.

Figure 8. Feature importance estimates for the Cleveland dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.

Figure 9. Calibration performance comparison (LogLoss, Brier, ECE, and MCE) of the Bayes_FW model against benchmark classifiers across four datasets, (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease datasets.

Figure 10. Discrimination performance comparison (AUC, PRAUC, F1, and Accuracy) of the Bayes_FW model against benchmark classifiers across four datasets: (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease.

Table 1. Key notation for the Bayesian feature weighting model. Summary of symbols, domains, and interpretations used in the proposed model.

Symbol	Support	Interpretation
$x_{i}$	$\mathbb{R}^{P}$	Covariate vector for observation $i$
$y_{i}$	$\{0, 1\}$	Binary label for observation $i$
$η_{i}$	$\mathbb{R}$	Linear predictor $α_{0} + {(w ⊙ β)}^{⊤} x_{i}$
$s_{i}$	$(0, 1)$	Baseline logistic probability $σ(η_{i})$
$p_{i}$	$(0, 1)$	Noise-adjusted success probability in Equation (2)
$α_{0}$	$\mathbb{R}$	Global intercept
$β$	$\mathbb{R}^{P}$	Regression coefficients
$w$	$Δ^{P - 1}$	Simplex-constrained feature weights; global relevance of predictors
$τ$	$(0, \infty)$	Global horseshoe scale parameter
$λ_{j}$	$(0, \infty)$	Local horseshoe scale for coefficient $β_{j}$
$ε$	$[0, 1]$	Label-noise rate (probability of label flip)
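
The quantities in Table 1 chain together as $η_{i} → s_{i} → p_{i}$. The following minimal Python sketch illustrates that chain for a single observation. Equation (2) is not reproduced in this part of the article, so the noise adjustment below assumes the standard symmetric label-flip form $p_{i} = (1 - ε) s_{i} + ε (1 - s_{i})$, which is consistent with the description of $ε$ as a flip probability but may differ from the authors' exact formulation. The function names and numeric values are illustrative only.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function sigma(z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_predictor(alpha0, w, beta, x):
    """eta = alpha0 + (w ⊙ beta)^T x, with w on the simplex."""
    return alpha0 + sum(wj * bj * xj for wj, bj, xj in zip(w, beta, x))

def noise_adjusted_prob(s, eps):
    """ASSUMED symmetric flip model: p = (1 - eps) * s + eps * (1 - s)."""
    return (1.0 - eps) * s + eps * (1.0 - s)

# Illustrative values (not taken from the paper).
w = [0.5, 0.3, 0.2]          # simplex weights, sum to 1
beta = [2.0, -1.0, 0.5]      # regression coefficients
x = [1.0, 0.5, -2.0]         # one covariate vector

eta = linear_predictor(0.1, w, beta, x)
s = sigmoid(eta)
p = noise_adjusted_prob(s, eps=0.1)
print(f"eta={eta:.3f}  s={s:.3f}  p={p:.3f}")
```

Under this assumed form, $p_{i}$ is confined to $[ε, 1 - ε]$, so a single confidently mislabeled observation cannot drive the likelihood contribution to zero.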

Table 2. Simulation scenarios: Summary of the eight simulated data scenarios (S1–S8) representing diverse sources of classification difficulty.

Scenario	Description	Key Characteristics
S1_basic	Baseline clean data	Moderate $n, m$; no noise/outliers
S2_corr	Correlated features	Strong within-block correlation ($ρ = 0.8$)
S3_highdim	High-dimensional	$m ≫ n$ with sparsity
S4_lblnoise	Label noise	Label corruption ($ε = 0.1$)
S5_outlierX	Covariate outliers	Feature contamination ($γ = 0.1$)
S6_both	Noise and outliers	Joint label and covariate corruption
S7_imbalance	Class imbalance	Rare positive class ($π = 0.15$)
S8_hardmix	Hard mixed effects	Nonlinearity, heavy tails, and imbalance
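
To make the corruption rates in Table 2 concrete, the sketch below shows one way label noise ($ε = 0.1$) and covariate contamination ($γ = 0.1$) could be injected into clean data. The contamination mechanism is an assumption: the table gives only the rates, so here each label is flipped independently with probability $ε$, and each row is shifted by a fixed offset with probability $γ$. The helper names `flip_labels` and `contaminate_rows` and the `shift` value are hypothetical.

```python
import random

def flip_labels(y, eps, rng):
    """Flip each binary label independently with probability eps."""
    return [1 - yi if rng.random() < eps else yi for yi in y]

def contaminate_rows(X, gamma, rng, shift=10.0):
    """ASSUMED mechanism: with probability gamma, shift every feature of a row."""
    return [
        [v + shift for v in row] if rng.random() < gamma else list(row)
        for row in X
    ]

rng = random.Random(0)  # fixed seed for reproducibility
n = 10_000
y_clean = [rng.randint(0, 1) for _ in range(n)]
X_clean = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]

y_noisy = flip_labels(y_clean, eps=0.1, rng=rng)
X_noisy = contaminate_rows(X_clean, gamma=0.1, rng=rng)

# Empirical corruption rates should be close to the nominal 0.1.
flip_rate = sum(a != b for a, b in zip(y_clean, y_noisy)) / n
outlier_rate = sum(a != b for a, b in zip(X_clean, X_noisy)) / n
print(f"flipped labels: {flip_rate:.3f}, contaminated rows: {outlier_rate:.3f}")
```

Applying both functions to the same dataset would correspond to a joint-corruption setting in the spirit of S6_both.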

Table 3. Posterior diagnostics by scenario for the Bayes_FW simulations: Columns show the global intercept $α_{0}$ and label-noise rate $ε$ (means/SDs), convergence summaries for the Gelman–Rubin statistic $\hat{R}$ (mean, max, and percent of parameters with $\hat{R} > 1.01$), and effective sample size $n_{eff}$ (min/median).

	$α_{0}$		$ε$		$\hat{R}$			$n_{eff}$
Scenario	Mean	SD	Mean	SD	Mean	Max	>1.01	Min	Med.
S1	0.12	0.21	0.03	0.02	1.00	1.01	2%	1450	3900
S2	0.08	0.25	0.02	0.02	1.00	1.01	3%	1300	3600
S3	−0.04	0.29	0.01	0.01	1.01	1.02	6%	780	2450
S4	0.15	0.24	0.09	0.03	1.00	1.01	3%	1250	3300
S5	0.10	0.26	0.02	0.02	1.00	1.01	3%	1180	3050
S6	0.22	0.27	0.10	0.03	1.01	1.03	9%	620	1980
S7	−0.18	0.23	0.05	0.02	1.00	1.02	4%	1040	2900
S8	0.05	0.31	0.12	0.04	1.01	1.03	8%	540	1750
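
As a reminder of what the $\hat{R}$ columns measure, the following sketch computes the classic Gelman–Rubin statistic for one scalar parameter from several chains of equal length. This is an illustration under stated assumptions, not the authors' pipeline: modern samplers typically report a split, rank-normalized variant, and the article does not state which version was used. The function name `gelman_rubin` and the synthetic chains are hypothetical.

```python
import random
import statistics

def gelman_rubin(chains):
    """Classic (non-split) R-hat for a list of equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: mean within-chain variance; B: between-chain variance scaled by n.
    W = statistics.fmean([statistics.variance(c) for c in chains])
    B = n * statistics.variance(chain_means)
    var_plus = (n - 1) / n * W + B / n   # pooled posterior variance estimate
    return (var_plus / W) ** 0.5

rng = random.Random(42)

# Four well-mixed chains sampling the same N(0, 1) target.
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Same setup, but one chain is stuck around a different mode.
bad = [list(c) for c in good[:3]] + [[rng.gauss(5.0, 1.0) for _ in range(1000)]]

rhat_good = gelman_rubin(good)
rhat_bad = gelman_rubin(bad)
print(f"R-hat (mixed): {rhat_good:.3f}, R-hat (one stuck chain): {rhat_bad:.3f}")
```

Values near 1 indicate that between-chain and within-chain variability agree, which is the sense in which the 1.01 threshold in Table 3 flags parameters that may not have converged.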

Table 4. Predictor variables in the Breast Cancer Wisconsin dataset: List of diagnostic cytological features used for tumor classification.

Feature	Description (Biological Meaning)
Cl.thickness	Clump thickness (uniformity of cell thickness)
Cell.size	Uniformity of cell size
Cell.shape	Uniformity of cell shape
Marg.adhesion	Marginal adhesion of cells
Epith.c.size	Single epithelial cell size
Bare.nuclei	Presence of bare nuclei
Bl.cromatin	Bland chromatin (nuclear texture)
Normal.nucleoli	Normal nucleoli count
Mitoses	Number of mitoses (cell divisions)
Class (target)	Tumor diagnosis: malignant (1)/benign (0)

Table 5. Discrimination and classification performance on the Breast Cancer Wisconsin dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.

Model	AUC		PRAUC		F1		ACC
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.9960	0.0032	0.9916	0.0077	0.9452	0.0098	0.9619	0.0063
Logistic_L1	0.9957	0.0034	0.9909	0.0085	0.9470	0.0143	0.9634	0.0089
Logistic	0.9954	0.0038	0.9901	0.0100	0.9470	0.0143	0.9634	0.0089
RandomForest	0.9949	0.0030	0.9897	0.0065	0.9455	0.0090	0.9619	0.0063
Bayes_FW	0.9975	0.0026	0.9931	0.0043	0.9587	0.0103	0.9699	0.0051
GradBoost	0.9935	0.0035	0.9868	0.0088	0.9541	0.0160	0.9678	0.0112

Table 6. Calibration and loss performance on the Breast Cancer Wisconsin dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.

Model	LogLoss		Brier		ECE		MCE
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.0948	0.0191	0.0255	0.0032	0.0368	0.0075	0.7282	0.0914
Logistic_L1	0.0978	0.0247	0.0260	0.0036	0.0349	0.0085	0.6531	0.1419
Logistic	0.1030	0.0351	0.0268	0.0043	0.0344	0.0067	0.7672	0.0844
RandomForest	0.0922	0.0170	0.0265	0.0060	0.0365	0.0052	0.4977	0.1754
Bayes_FW	0.0912	0.0126	0.0243	0.0028	0.0321	0.0046	0.4177	0.1038
GradBoost	0.1178	0.0346	0.0292	0.0083	0.0348	0.0084	0.6767	0.2187
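
The four calibration metrics reported in Table 6 (and in the corresponding tables for the other datasets) can be computed from predicted probabilities and binary labels as sketched below. The binning scheme is an assumption: ECE and MCE depend on how predictions are grouped, and this sketch uses 10 equal-width bins, which may not match the authors' choice. Function names are illustrative.

```python
import math

def log_loss(y, p, clip=1e-15):
    """Mean negative Bernoulli log-likelihood, with clipping to avoid log(0)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, clip), 1.0 - clip)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)

def brier(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def calibration_errors(y, p, n_bins=10):
    """Return (ECE, MCE) using ASSUMED equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        idx = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((yi, pi))
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(yi for yi, _ in b) / len(b)    # empirical event rate
        predicted = sum(pi for _, pi in b) / len(b)   # mean predicted probability
        gap = abs(observed - predicted)
        ece += len(b) / len(y) * gap                  # size-weighted average gap
        mce = max(mce, gap)                           # worst single-bin gap
    return ece, mce

# Tiny illustrative example (not data from the paper).
y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9]
ece, mce = calibration_errors(y, p)
print(f"logloss={log_loss(y, p):.4f} brier={brier(y, p):.4f} ece={ece:.4f} mce={mce:.4f}")
```

Because MCE is a maximum over bins, it is sensitive to sparsely populated bins, which may help explain why its standard deviations in the tables are much larger than those of ECE.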

Table 7. Posterior mean feature weights and 90% credible intervals estimated by the Bayes_FW model: Results are based on the Breast Cancer Wisconsin dataset. The 5% and 95% quantiles define the lower and upper credible interval bounds.

	w
Feature	Mean	5%	95%
Bare.nuclei	0.1467	0.0297	0.3338
Cl.thickness	0.1339	0.0222	0.3233
Mitoses	0.1253	0.0073	0.3341
Normal.nucleoli	0.1076	0.0115	0.2676
Cell.size	0.1033	0.0086	0.2864
Bl.cromatin	0.1018	0.0114	0.2728
Marg.adhesion	0.0992	0.0092	0.2722
Cell.shape	0.0973	0.0073	0.2691
Epith.c.size	0.0849	0.0051	0.2323
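
The summaries in Table 7 are per-feature posterior means and 5%/95% quantiles of simplex-constrained draws; because every draw sums to one, the posterior means also sum to one (the nine means in Table 7 add up to 1.0000). The sketch below shows how such a table could be produced from a matrix of posterior draws. The draws here are synthetic Dirichlet samples standing in for MCMC output, and the feature names, concentration values, and the linear-interpolation quantile rule are all assumptions for illustration.

```python
import random

def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac

def summarize_weights(draws, names):
    """Return (name, mean, q05, q95) per feature, sorted by descending mean."""
    rows = []
    for j, name in enumerate(names):
        col = sorted(d[j] for d in draws)
        rows.append((name, sum(col) / len(col), quantile(col, 0.05), quantile(col, 0.95)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

rng = random.Random(1)

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) by normalizing independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

# Placeholder "posterior" draws for three hypothetical features.
names = ["f1", "f2", "f3"]
draws = [dirichlet([4.0, 2.0, 1.0]) for _ in range(2000)]

summary = summarize_weights(draws, names)
for name, mean, q05, q95 in summary:
    print(f"{name}\t{mean:.4f}\t{q05:.4f}\t{q95:.4f}")
```

Under the simplex constraint, a uniform allocation over $P$ features corresponds to a weight of $1/P$ each, which gives a natural reference point when reading the posterior means.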

Table 8. Predictor variables in the Pima Indians Diabetes dataset: List of diagnostic and physiological variables used for diabetes classification.

Feature	Description (Biological Meaning)
Pregnant	Number of times pregnant
Glucose	Plasma glucose concentration (2-h oral glucose tolerance test)
Pressure	Diastolic blood pressure (mm Hg)
Triceps	Triceps skin fold thickness (mm)
Insulin	2-h serum insulin ($μ$U/mL)
Mass (BMI)	Body mass index (weight in kg/height in m²)
Pedigree	Diabetes pedigree function (genetic risk measure)
Age	Age in years
Outcome	Diabetes diagnosis: positive (1)/negative (0)

Table 9. Discrimination and classification performance on the Pima Indians Diabetes dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.

Model	AUC		PRAUC		F1		ACC
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.8323	0.0415	0.7196	0.0537	0.6378	0.0686	0.7787	0.0315
Logistic_L1	0.8338	0.0411	0.7220	0.0541	0.6266	0.0676	0.7722	0.0310
Logistic	0.8324	0.0405	0.7189	0.0532	0.6453	0.0664	0.7813	0.0324
RandomForest	0.8324	0.0408	0.7133	0.0478	0.6462	0.0789	0.7696	0.0516
Bayes_FW	0.8426	0.0325	0.7851	0.0425	0.6881	0.0577	0.7919	0.0311
GradBoost	0.8039	0.0471	0.6644	0.0685	0.6040	0.0922	0.7344	0.0567

Table 10. Calibration and loss performance on the Pima Indians Diabetes dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.

Model	LogLoss		Brier		ECE		MCE
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.4804	0.0492	0.1562	0.0179	0.0805	0.0214	0.2880	0.1222
Logistic_L1	0.4805	0.0499	0.1562	0.0181	0.0742	0.0183	0.3058	0.1496
Logistic	0.4808	0.0523	0.1562	0.0185	0.0844	0.0207	0.3012	0.1357
RandomForest	0.4782	0.0499	0.1586	0.0203	0.0821	0.0197	0.3767	0.3019
Bayes_FW	0.3404	0.0411	0.1468	0.0169	0.0700	0.0168	0.2783	0.1101
GradBoost	0.5980	0.1223	0.1861	0.0367	0.1322	0.0556	0.3670	0.1782

Table 11. Posterior mean feature weights and 90% credible intervals estimated by the Bayes_FW model: Results are based on the Pima Indians Diabetes dataset. The 5% and 95% quantiles define the lower and upper credible interval bounds.

	w
Feature	Mean	5%	95%
glucose	0.1878	0.0392	0.4044
mass	0.1654	0.0322	0.3658
pregnant	0.1471	0.0255	0.3459
pedigree	0.1409	0.0236	0.3348
pressure	0.1041	0.0091	0.2819
age	0.0881	0.0046	0.2542
insulin	0.0859	0.0047	0.2539
triceps	0.0806	0.0036	0.2491

Table 12. Predictor variables in the South African Heart Disease (SAHeart) dataset: List of demographic, behavioral, and clinical variables used for CHD classification.

Feature	Description (Biological/Clinical Meaning)
age	Age of the patient (years)
famhist	Family history of heart disease (Present/Absent)
tobacco	Cumulative tobacco consumption (kg)
ldl	Low-density lipoprotein cholesterol
typea	Type-A behavior score (psychosocial risk factor)
sbp	Systolic blood pressure (mm Hg)
obesity	Obesity index
adiposity	Adiposity (body fat measure)
alcohol	Alcohol consumption (liters per day)
chd	Coronary heart disease status (1 = present, 0 = absent)

Table 13. Discrimination and classification performance on the SAHeart dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.

Model	AUC		PRAUC		F1		ACC
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.7853	0.0655	0.6578	0.0891	0.5470	0.0283	0.7229	0.0481
Logistic_L1	0.7844	0.0627	0.6564	0.0792	0.5350	0.0550	0.7186	0.0468
Logistic	0.7798	0.0674	0.6665	0.0984	0.5514	0.0389	0.7164	0.0553
Bayes_FW	0.7903	0.0598	0.6933	0.0799	0.5494	0.0215	0.7400	0.0371
RandomForest	0.7246	0.0742	0.5744	0.0912	0.4963	0.0240	0.6817	0.0540
GradBoost	0.7004	0.0753	0.5343	0.0869	0.4911	0.0658	0.6687	0.0616

Table 14. Calibration and loss performance on the SAHeart dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.

Model	LogLoss		Brier		ECE		MCE
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.5322	0.0509	0.1790	0.0202	0.1087	0.0326	0.2935	0.1241
Logistic_L1	0.5321	0.0479	0.1788	0.0188	0.0906	0.0490	0.2646	0.0844
Logistic	0.5347	0.0641	0.1798	0.0252	0.1057	0.0410	0.2720	0.0937
Bayes_FW	0.5245	0.0413	0.1645	0.0183	0.1428	0.0310	0.2612	0.0875
RandomForest	0.5738	0.0635	0.1974	0.0226	0.0976	0.0070	0.2656	0.1662
GradBoost	0.7423	0.1107	0.2379	0.0381	0.1951	0.0407	0.5504	0.2081

Table 15. Posterior feature weights and 90% credible intervals estimated by Bayes_FW: Results are based on the SAHeart dataset; 5% and 95% columns give the credible interval bounds.

	w
Feature	Mean	5%	95%
age	0.1472	0.0281	0.3293
famhist	0.1357	0.0254	0.3127
tobacco	0.1204	0.0166	0.2926
ldl	0.1181	0.0153	0.2888
typea	0.1057	0.0111	0.2707
sbp	0.0790	0.0048	0.2281
obesity	0.0776	0.0041	0.2285
adiposity	0.0732	0.0043	0.2176
alcohol	0.0694	0.0034	0.2114

Table 16. Predictor variables in the Heart Disease (Cleveland) dataset: List of demographic, clinical, and exercise-related variables used for heart disease classification.

Feature	Description (Units/Encoding)
age	Age in years (continuous)
sex	Sex (1 = male, 0 = female; binary)
cp	Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal; 4 = asymptomatic). Encoded as `cp_1`, `cp_2`, `cp_3`.
trestbps	Resting blood pressure (mm Hg; continuous)
chol	Serum cholesterol (mg/dL; continuous)
fbs	Fasting blood sugar > 120 mg/dL (1/0; binary)
restecg	Resting ECG (0 = normal; 1 = ST–T abnormality; 2 = LV hypertrophy). Encoded as `restecg_1`, `restecg_2`.
thalach	Maximum heart rate achieved (continuous)
exang	Exercise-induced angina (1/0; binary; encoded as `exang_1`)
oldpeak	ST depression induced by exercise relative to rest (continuous)
slope	Slope of peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping). Encoded as `slope_1`, `slope_2`.
ca	Number of major vessels (0–3) colored by fluoroscopy. Encoded as `ca_1`, `ca_2`, `ca_3`.
thal	Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect). Encoded as `thal_3`, `thal_6`, `thal_7`.
target	Heart disease status (1 = present, 0 = absent; binary)

Table 17. Discrimination and classification performance on the Cleveland dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.

Model	AUC		PRAUC		F1		ACC
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.9034	0.0353	0.8986	0.0427	0.8053	0.0482	0.8284	0.0398
Logistic	0.9004	0.0384	0.8859	0.0561	0.8116	0.0403	0.8350	0.0331
Logistic_L1	0.9001	0.0342	0.8986	0.0367	0.7993	0.0418	0.8217	0.0340
Bayes_FW	0.9079	0.0313	0.9076	0.0269	0.8264	0.0353	0.8440	0.0294
RandomForest	0.8854	0.0375	0.8775	0.0360	0.7599	0.0374	0.7921	0.0302
GradBoost	0.8641	0.0438	0.8613	0.0336	0.7648	0.0610	0.7890	0.0547

Table 18. Calibration and loss performance on the Cleveland dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.

Model	LogLoss		Brier		ECE		MCE
	Mean	SD	Mean	SD	Mean	SD	Mean	SD
Logistic_EN	0.3892	0.0650	0.1227	0.0222	0.0912	0.0178	0.4049	0.2840
Logistic	0.4344	0.1296	0.1241	0.0242	0.1064	0.0167	0.5342	0.2285
Logistic_L1	0.3941	0.0681	0.1241	0.0213	0.0916	0.0185	0.5894	0.1796
Bayes_FW	0.3365	0.0527	0.1103	0.0185	0.0931	0.0181	0.3769	0.1647
RandomForest	0.4284	0.0543	0.1383	0.0199	0.0955	0.0335	0.3882	0.1667
GradBoost	0.5821	0.1541	0.1684	0.0398	0.1666	0.0329	0.6160	0.1846

Table 19. Posterior feature weights and 90% credible intervals estimated by Bayes_FW: Results are based on the Cleveland dataset; 5% and 95% columns give the credible interval bounds.

	w
Feature	Mean	5%	95%
cp_3	0.0734	0.0131	0.1701
ca_1	0.0725	0.0132	0.1733
thal_3	0.0703	0.0113	0.1668
ca_2	0.0681	0.0111	0.1623
slope_1	0.0613	0.0069	0.1546
oldpeak	0.0601	0.0072	0.1525
ca_3	0.0565	0.0064	0.1474
sex_1	0.0561	0.0055	0.1469
trestbps	0.0473	0.0035	0.1332
exang_1	0.0456	0.0033	0.1264
thalach	0.0445	0.0030	0.1283
cp_1	0.0435	0.0027	0.1261
cp_2	0.0413	0.0022	0.1233
restecg_2	0.0392	0.0023	0.1156
slope_2	0.0386	0.0020	0.1144
restecg_1	0.0378	0.0020	0.1142
chol	0.0369	0.0019	0.1134
age	0.0361	0.0019	0.1097
thal_2	0.0361	0.0017	0.1114
fbs_1	0.0349	0.0015	0.1086

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
