Article

A Bayesian Feature Weighting Model with Simplex-Constrained Dirichlet and Contamination-Aware Priors for Noisy Medical Data

by Mehmet Ali Cengiz 1,*, Zeynep Öztürk 2 and Abdulmohsen Alharthi 1

1 Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 13318, Saudi Arabia
2 Hopa Faculty of Economics and Administrative Sciences, Artvin Çoruh University, Hopa 08010, Türkiye
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(8), 1243; https://doi.org/10.3390/math14081243
Submission received: 8 February 2026 / Revised: 1 April 2026 / Accepted: 3 April 2026 / Published: 8 April 2026
(This article belongs to the Special Issue Statistical Machine Learning: Models and Its Applications)

Abstract

Feature weighting plays a central role in medical classification by enhancing predictive accuracy, interpretability, and clinical trust through the explicit quantification of variable relevance. Despite their widespread use, existing filter-, wrapper-, and embedded-based feature weighting methods are predominantly deterministic and exhibit pronounced sensitivity to label noise and outliers, which are pervasive in real-world medical data. This often results in unstable importance estimates and unreliable clinical interpretations. In this work, we introduce a novel Bayesian feature weighting model that fundamentally departs from existing approaches by jointly integrating simplex-constrained Dirichlet priors for global feature weights, hierarchical shrinkage priors for coefficient regularization, and contamination-aware priors for explicit modeling of label noise within a single coherent probabilistic framework. Unlike conventional Bayesian feature selection or robust classification models, the proposed formulation yields globally interpretable feature weights defined on the probability simplex, while simultaneously providing full posterior uncertainty quantification and robustness to both mislabeled observations and aberrant feature values through principled influence control. Comprehensive simulation studies across diverse contamination scenarios, together with applications to multiple real-world medical datasets, demonstrate that the proposed model consistently outperforms classical and state-of-the-art baselines in terms of discrimination, probabilistic calibration, and stability of feature-importance estimates. These results highlight the practical and methodological significance of the proposed framework as a robust, uncertainty-aware, and interpretable solution for medical decision making under noisy data conditions.

1. Introduction

Feature weighting plays a central role in supervised learning because it directly influences both the interpretability and predictive performance of statistical and machine learning models. By assigning relative importance scores to input variables, feature weighting methods enhance the transparency of model decisions while mitigating the adverse effects of irrelevant or redundant predictors. Such mechanisms are particularly critical in high-dimensional domains, including genomics, medical diagnostics, text mining, and sensor-based systems, where large numbers of noisy variables often obscure the underlying predictive signal [1,2].
Despite their practical importance, most existing feature weighting approaches remain fundamentally deterministic and fragile in the presence of data contamination. In real-world medical datasets, outliers and mislabeled observations frequently arise due to measurement errors, imperfect annotations, or patient-specific variability. Classical feature weighting techniques—often based on point estimates from logistic regression or margin-based classifiers—can be disproportionately influenced by such anomalies. As a result, the estimated feature weights may fail to reflect the true predictive relevance of variables, leading to unstable models, degraded accuracy, and unreliable clinical interpretations [3,4].
To mitigate these limitations, recent research has explored robust extensions of supervised learning models, including regularized logistic regression with heavy-tailed priors, noise-tolerant boosting algorithms, and robust margin-based feature selection methods. Although these approaches improve resilience to data irregularities, they typically do not provide a full probabilistic characterization of feature importance. In particular, deterministic weighting schemes lack uncertainty quantification, which severely limits their interpretability and usefulness in high-stakes domains such as healthcare and financial risk modeling [5,6].
Bayesian methods offer a principled framework for addressing these shortcomings by integrating feature weighting, uncertainty quantification, and robustness within a unified hierarchical formulation. By treating feature weights as probability distributions rather than fixed values, Bayesian models capture both the expected relevance of predictors and the uncertainty surrounding these estimates. Moreover, hierarchical shrinkage priors, such as the horseshoe prior, enable automatic regularization and protection against overfitting in high-dimensional settings. Robustness to contamination can be further enhanced through explicit modeling of label noise and heavy-tailed latent structures, providing principled defenses against outliers and misclassified samples [7,8,9,10].
However, existing Bayesian approaches typically address these components in isolation. In particular, current models often focus either on coefficient shrinkage or on robust likelihood formulations, while global feature weighting with explicit probability-simplex constraints and contamination-aware priors has received limited attention. As a result, there remains a methodological gap between uncertainty-aware Bayesian modeling and interpretable, globally normalized feature-importance estimation under noisy conditions.
In this study, we propose a novel Bayesian robust feature weighting framework (Bayes_FW) that explicitly addresses this gap. The proposed model introduces global feature weights constrained to the probability simplex via a Dirichlet prior, ensuring normalized and directly interpretable importance scores. Simultaneously, regression coefficients are regularized using hierarchical horseshoe priors, while robustness is achieved through an explicit label-noise contamination mechanism and heavy-tailed error components. This unified formulation enables principled influence control over mislabeled observations and aberrant feature values, while providing full Bayesian uncertainty quantification through posterior inference via Markov Chain Monte Carlo (MCMC).
The experimental evaluation employs four benchmark medical datasets representing increasing levels of discrimination difficulty, calibration instability, and data heterogeneity. The Breast Cancer Wisconsin dataset, characterized by well-separated feature distributions and low noise, represents a setting with easy discrimination and high calibration stability. The Pima Indians Diabetes dataset poses a moderate challenge due to overlapping class boundaries and moderate imbalance, introducing calibration complexity. The South African Heart Disease dataset further increases difficulty by combining categorical and continuous predictors with mixed uncertainty sources. Finally, the heart disease (Cleveland) dataset represents a highly challenging scenario due to its small sample size, heterogeneous structure, and known sensitivity to label noise and calibration instability.
To ensure the reliability of the proposed Bayesian framework, comprehensive posterior diagnostic analyses were conducted for both simulation studies and real-data applications. In all experimental settings, standard Markov Chain Monte Carlo (MCMC) diagnostics, including trace plots, posterior density estimates, effective sample size ($N_{\mathrm{eff}}$), potential scale reduction factors ($\hat{R}$), and posterior predictive checks, were systematically evaluated. These diagnostics consistently indicated stable convergence, efficient sampling, and well-behaved posterior distributions, with no evidence of pathological behavior such as non-convergence, poor mixing, or degeneracy. For illustrative purposes, the full set of diagnostic plots corresponding to the Pima Indians Diabetes dataset is provided in Appendix A. These results are representative of the overall behavior observed across all experiments and confirm the robustness and computational reliability of the proposed modeling approach.
Across both synthetic simulations and real-world medical datasets, the proposed Bayesian framework demonstrates superior discrimination (AUC, F1) and calibration (LogLoss, Brier score, ECE) compared to classical logistic and ensemble-based models. Importantly, the model yields more stable posterior estimates of feature relevance, enabling accurate identification of key biomarkers while explicitly quantifying uncertainty in their effects. These findings collectively establish the proposed approach as a robust, uncertainty-aware, and interpretable feature weighting solution for reliable medical decision making under noisy data conditions.
The main contributions of this paper can be summarized as follows:
  • We propose a novel Bayesian feature weighting model that incorporates a simplex-constrained Dirichlet prior, ensuring normalized and interpretable feature importance.
  • We introduce contamination-aware priors to enhance robustness against noisy and potentially corrupted medical data.
  • We develop an efficient inference framework based on Hamiltonian Monte Carlo (HMC), enabling stable posterior estimation under simplex constraints.
  • We validate the proposed model through extensive simulation studies and real-data analysis demonstrating its effectiveness, robustness, and practical applicability.

2. Related Work

Feature selection and feature weighting are fundamental preprocessing techniques in supervised learning, both of which aim to improve the model performance, interpretability, and generalization by emphasizing relevant variables. Traditionally, feature selection has been tackled through filter-based methods such as mutual information ranking [1], wrapper methods such as recursive feature elimination [2], and embedded approaches using penalization techniques (e.g., L1-regularized logistic regression [3]). Although effective for structured and clean datasets, these deterministic methods tend to be fragile in the presence of noise and outliers, often resulting in unstable or misleading importance scores.
Robust supervised learning approaches have emerged as a response to this challenge. For instance, heavy-tailed distributions such as Student-t have been incorporated into generalized linear models to mitigate the influence of extreme observations [4]. Additionally, boosting algorithms (e.g., AdaBoost) adaptively reweight samples to reduce the impact of mislabeled data [5]. More recent contributions include robust loss functions and methods explicitly designed to handle label noise [6]. However, while these strategies increase resilience, they often lack mechanisms for quantifying the uncertainty of feature weights, which is a crucial requirement in high-stakes or noisy environments.
The Bayesian learning paradigm has become increasingly popular for addressing these limitations because it offers both model regularization and probabilistic uncertainty quantification. Traditional Bayesian feature selection methods, such as the horseshoe prior [7] and spike-and-slab models [8], adaptively shrink coefficients and provide interpretable inclusion probabilities. Nonparametric extensions [11] enable greater flexibility in high-dimensional regimes.
Recent advances have extended these foundations to more explicitly address robustness. Bayesian deep metric learning approaches have demonstrated theoretical robustness under label noise using variational inference frameworks [12]. Hierarchical probabilistic models such as WarPI have shown measurable improvements, achieving 3.73% accuracy gains over baseline methods on CIFAR-100 under 40% asymmetric noise conditions [13]. These approaches utilize hierarchical probabilistic modeling to quantify both epistemic and aleatoric uncertainty while maintaining robustness to various noise types including uniform, asymmetric, and instance-dependent noise [13].
Variational Bayesian approaches have also proven effective for feature weighting tasks. Methods employing Laplace priors with variational inference provide better uncertainty estimates while retaining correlated features and stability with respect to hyperparameter choices [14]. Optimal Bayesian feature filtering techniques have demonstrated outstanding performance relative to traditional feature selection methods using hierarchical models that provide closed-form solutions for high-dimensional data [15].
Contemporary research has explored meta-learning approaches for robust feature weighting. Probabilistic meta-weighting methods, such as PMW-Net, address the limitations of deterministic weighting functions by incorporating probabilistic treatments that handle both epistemic and aleatoric uncertainties [16]. Uncertainty-aware label correction frameworks combine Bayesian neural networks with Gaussian modeling to identify trustworthy samples and correct mislabeled data, showing superior performance compared with methods such as Co-teaching+ and DivideMix [17].
Hybrid approaches have also emerged that combine Bayesian and frequentist elements. Subjective logic-based methods utilize Dirichlet distributions and neural network parameterization to handle partial-label learning scenarios with high noise levels, out-of-distribution examples, and adversarial perturbations [18]. These methods provide an explicit uncertainty representation while maintaining robustness across diverse contamination scenarios.
In a separate line of work, Ref. [19] provides a comprehensive taxonomy of feature weighting (FW) methods, classifying them according to the learning paradigm (supervised vs. unsupervised), scope (global vs. local), and optimization strategy (filter vs. wrapper). In supervised settings, global filter-based FW methods compute feature importance independently using metrics such as Mutual Information, Information Gain, or Fisher Score, whereas wrapper methods leverage iterative optimization (e.g., via Genetic Algorithms, Gradient Ascent, or Particle Swarm Optimization) to improve model performance.
Despite these advances, several limitations persist. Traditional robust methods, including trimmed Bayesian information criterion approaches and maximum likelihood estimation with contamination modeling [20], often address outlier detection and label noise separately from feature weighting. Adaptive noise modeling techniques, which are effective for dimension-specific or group-specific noise handling, typically lack comprehensive uncertainty quantification mechanisms [21].
In light of these developments, recent advances in Bayesian robustness have introduced label-noise priors [9] and heavy-tailed priors [10] to enhance classification stability in contaminated settings. However, the existing approaches typically address either robustness or uncertainty quantification in isolation. The reviewed literature demonstrates that, although individual components such as hierarchical shrinkage [22], contamination-aware modeling [20] and probabilistic weighting [16] have shown promise, a unified framework that simultaneously provides robust classification and uncertainty-aware global feature weights has not been sufficiently explored.
The present work addresses this gap by proposing a Bayesian model that combines simplex-constrained global feature weighting, hierarchical shrinkage priors, and contamination modeling. Building on the theoretical foundations and empirical insights of the robust Bayesian learning literature reviewed above, this framework provides both interpretable feature weights and principled uncertainty quantification, improving model reliability in the presence of outliers and noisy labels.

3. Methodology

3.1. Proposed Model

Assume that $D = \{(x_i, y_i)\}_{i=1}^{N}$ is the observed data, where $x_i \in \mathbb{R}^{P}$ is a $P$-dimensional covariate vector and $y_i \in \{0, 1\}$ is a binary outcome. Our goal is to construct a classifier that (i) learns a global importance weight for each feature, (ii) is robust to label contamination and outlying covariates, and (iii) provides coherent uncertainty quantification.
Conditionally on the parameters $\theta = \{\alpha_0, w, \beta, \tau, \lambda, \varepsilon\}$, we introduce the linear predictor
$$\eta_i = \alpha_0 + (w \odot \beta)^{\top} x_i, \tag{1}$$
and define the baseline logistic probability $s_i = \sigma(\eta_i)$ with $\sigma(z) = 1 / (1 + e^{-z})$. To explicitly model label noise, we use a mixture of the clean and flipped labels,
$$p(y_i = 1 \mid x_i, \theta) = (1 - \varepsilon)\, s_i + \varepsilon\, (1 - s_i). \tag{2}$$
Equivalently, with probability $1 - \varepsilon$ the observed label coincides with the latent logistic response, whereas with probability $\varepsilon$ it is flipped. The parameters have the following roles:
  • $\alpha_0 \in \mathbb{R}$ is a global intercept;
  • $\beta = (\beta_1, \ldots, \beta_P) \in \mathbb{R}^{P}$ are regression coefficients;
  • $w = (w_1, \ldots, w_P)$ are non-negative feature weights constrained to the simplex
    $$w \in \Delta^{P-1} := \{ w \in \mathbb{R}_{+}^{P} : \mathbf{1}^{\top} w = 1 \};$$
  • $\varepsilon \in [0, 1]$ is the label-noise parameter controlling the probability of a label flip;
  • $\tau > 0$ and $\lambda = (\lambda_1, \ldots, \lambda_P)$ are global and local scale parameters that induce shrinkage on $\beta$.
The feature weight vector $w$ is constrained to lie on the probability simplex, satisfying $w_j \ge 0$ and $\sum_{j=1}^{P} w_j = 1$. Such constraints introduce a non-Euclidean geometry in the parameter space that may affect sampling efficiency in Hamiltonian Monte Carlo (HMC). In practice, this issue is addressed through an appropriate simplex parameterization, which maps the constrained weights into an unconstrained space while preserving the positivity and sum-to-one constraints. This transformation allows HMC to operate efficiently without violating the geometric structure of the simplex and ensures stable posterior exploration.
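The article does not state which simplex parameterization was used. As a minimal sketch, assuming a stick-breaking map (one standard option), the function below sends an unconstrained vector in $\mathbb{R}^{P-1}$ to a point on the simplex. The function names are hypothetical, and the log-Jacobian term that a sampler would also need is omitted.

```python
import math

def inv_logit(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def stick_breaking(y):
    """Map unconstrained y in R^(P-1) to weights w on the simplex in R^P."""
    P = len(y) + 1
    w, remaining = [], 1.0
    for k, y_k in enumerate(y):
        # The offset log(P - k - 1) centres the map: y = 0 gives uniform weights.
        z_k = inv_logit(y_k - math.log(P - k - 1))
        w.append(remaining * z_k)      # break off a fraction z_k of the stick
        remaining -= w[-1]
    w.append(remaining)                # the last weight takes what is left
    return w

w_uniform = stick_breaking([0.0, 0.0, 0.0])
w_skewed = stick_breaking([2.0, -1.0, 0.5])
```

Any real-valued input yields strictly positive weights that sum to one, so a gradient-based sampler can move freely in the unconstrained space.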
The element-wise product $w \odot \beta$ therefore combines feature relevance ($w$) and effect size ($\beta$), facilitating interpretable global importance scores while remaining robust to mislabeled observations through $\varepsilon$.
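The linear predictor and the label-noise mixture can be sketched in a few lines. The function name and the numerical values below are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noisy_label_probability(x, alpha0, w, beta, eps):
    """P(y = 1 | x, theta) under symmetric label flipping."""
    # Linear predictor: intercept plus the weighted coefficients w_j * beta_j.
    eta = alpha0 + sum(w_j * b_j * x_j for w_j, b_j, x_j in zip(w, beta, x))
    s = sigmoid(eta)
    # Mixture of the clean label (prob. 1 - eps) and the flipped label (prob. eps).
    return (1.0 - eps) * s + eps * (1.0 - s)

# Hypothetical inputs.
x = [1.2, -0.4, 0.0]
w = [0.5, 0.3, 0.2]
beta = [2.0, -1.0, 0.0]
p_clean = noisy_label_probability(x, alpha0=-0.5, w=w, beta=beta, eps=0.0)
p_noisy = noisy_label_probability(x, alpha0=-0.5, w=w, beta=beta, eps=0.1)
```

For a noise level below 0.5, the returned probability always lies between `eps` and `1 - eps`, so no single observation can contribute a log-likelihood penalty worse than `log(eps)`. This is one way to read the influence control that the contamination parameter provides.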
The linear predictor in Equation (1) depends on the element-wise product $w \odot \beta$. In principle, such multiplicative parameterizations may introduce scale non-identifiability, since different parameter pairs $(w, \beta)$ can produce the same product $w \odot \beta$. For example, multiplying $w$ by a constant $c > 0$ and dividing $\beta$ by the same constant yields an equivalent linear predictor, since
$$(c\, w) \odot (\beta / c) = w \odot \beta.$$
In the proposed model, however, this ambiguity is mitigated through the prior structure. The feature weight vector $w$ is constrained to lie on the probability simplex
$$\Delta^{P-1} = \Big\{ w \in \mathbb{R}_{+}^{P} : \sum_{j=1}^{P} w_j = 1 \Big\},$$
which fixes its global scale by enforcing $\sum_{j=1}^{P} w_j = 1$ and $w_j \ge 0$. This constraint prevents arbitrary rescaling of $w$. In addition, the regression coefficients $\beta$ are regularized using a hierarchical horseshoe prior, which strongly shrinks irrelevant coefficients toward zero while allowing large signals to remain. Together, the simplex constraint on $w$ and the shrinkage structure on $\beta$ restrict the parameter space and provide practical identifiability of the model parameters in posterior inference. To further examine potential multiplicative non-identifiability, we inspected the joint posterior samples of $w_j$ and $\beta_j$, using the Breast Cancer Wisconsin dataset as a representative example. Ridge structures in these samples would indicate scale non-identifiability in the multiplicative parameterization. The posterior samples do not exhibit such ridge patterns, indicating that the simplex constraint on $w$ together with the shrinkage prior on $\beta$ prevents this degeneracy. The corresponding diagnostic plots are provided in Figure A1.
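The global rescaling argument can be checked numerically. The values below are hypothetical; the sketch only confirms that a rescaled pair gives the same product while the rescaled weights leave the simplex.

```python
w = [0.5, 0.3, 0.2]        # hypothetical weights on the simplex
beta = [2.0, -1.0, 4.0]    # hypothetical coefficients
c = 4.0

product = [w_j * b_j for w_j, b_j in zip(w, beta)]
w_scaled = [c * w_j for w_j in w]
beta_scaled = [b_j / c for b_j in beta]
product_scaled = [w_j * b_j for w_j, b_j in zip(w_scaled, beta_scaled)]

# The linear predictor cannot tell the two parameter pairs apart ...
same_predictor = all(abs(a - b) < 1e-12 for a, b in zip(product, product_scaled))
# ... but the rescaled weights sum to c, not 1, so the simplex constraint excludes them.
still_on_simplex = abs(sum(w_scaled) - 1.0) < 1e-12
```

This check covers only a common rescaling of all weights. Coordinate-wise trade-offs between $w_j$ and $\beta_j$ are left to the priors, which is why the joint posterior samples are inspected for ridge patterns.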
A summary of the main symbols used in the proposed model is given in Table 1.
To enhance interpretability, we provide intuitive explanations for the key model components. The feature weights $w$ represent the relative importance of each predictor in explaining the response variable, constrained to lie on the simplex to ensure that they are non-negative and sum to one. The intercept $\alpha_0$ captures the global location or baseline effect, while $\tau$ controls the global scale of the shrinkage applied to the coefficients $\beta$. The contamination-aware components are designed to reduce the influence of noisy or corrupted observations, through the label-noise parameter $\varepsilon$ and the heavy tails of the prior structure. Overall, this formulation enables both interpretability and robustness within a unified Bayesian framework.
For simplicity, we assume a symmetric label contamination mechanism, where the probability of label flipping is identical across classes. Although misclassification in medical datasets may often be class-dependent, the symmetric formulation provides a parsimonious representation of label noise and avoids introducing additional parameters that may be difficult to estimate when the amount of noisy labels is limited. The proposed framework can be readily extended to asymmetric contamination by introducing separate flipping probabilities for each class, allowing different misclassification rates for positive and negative labels.
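One way to write the asymmetric extension mentioned above is sketched below. The symbols $\varepsilon_0$ and $\varepsilon_1$ are assumptions introduced here for illustration and do not appear in the original formulation: $\varepsilon_1$ is the probability that a latent positive is recorded as negative, and $\varepsilon_0$ the probability that a latent negative is recorded as positive.

```latex
\[
  p(y_i = 1 \mid x_i, \theta)
  = (1 - \varepsilon_1)\, s_i + \varepsilon_0\, (1 - s_i),
  \qquad \varepsilon_0, \varepsilon_1 \in [0, 1].
\]
```

Setting $\varepsilon_0 = \varepsilon_1 = \varepsilon$ recovers the symmetric mixture in Equation (2).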

3.2. Priors

To encourage robustness and interpretability, we assign the following hierarchical priors.
  • Feature weights.
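Simplex-constrained feature weights receive a symmetric Dirichlet prior,
$$w \sim \mathrm{Dirichlet}(\alpha \mathbf{1}),$$
where $\alpha > 0$ is the common concentration parameter and $\mathbf{1}$ is the $P$-vector of ones. This prior yields normalized, uncertainty-aware global importance scores.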
  • Regression coefficients.
To down-weight irrelevant predictors while allowing a few large effects, we employ the horseshoe prior. Introducing auxiliary variables $z_j$, the hierarchy is
$$\beta_j = z_j\, \tau\, \lambda_j, \qquad z_j \sim \mathcal{N}(0, 1), \qquad \tau \sim \mathcal{C}^{+}(0, 1), \qquad \lambda_j \sim \mathcal{C}^{+}(0, 1),$$
where $\mathcal{C}^{+}(0, 1)$ denotes the standard half-Cauchy distribution. The heavy tails of this prior allow large signals to escape shrinkage while strongly shrinking noise coefficients towards zero.
  • Intercept and label-noise parameter.
We place a weakly informative Gaussian prior on the intercept and a uniform prior on the noise level,
$$\alpha_0 \sim \mathcal{N}(0, 5^{2}), \qquad \varepsilon \sim \mathrm{Uniform}(0, 1).$$
The $\mathrm{Uniform}(0, 1)$ prior is adopted as a weakly informative prior for the contamination probability, reflecting the absence of strong prior knowledge about the level of label noise. To assess the robustness of this choice, we conducted a short sensitivity analysis by considering alternative Beta priors, including $\mathrm{Beta}(1, 1)$ (which coincides with the $\mathrm{Uniform}(0, 1)$ prior), $\mathrm{Beta}(2, 2)$, and $\mathrm{Beta}(0.5, 0.5)$. The resulting posterior estimates and predictive performance were found to be very similar across these specifications, suggesting that the proposed model is not sensitive to the specific prior choice for the contamination parameter.
Taken together, these choices jointly model feature relevance, robust regression coefficients, and label noise, enabling principled uncertainty quantification while mitigating the influence of outliers and mislabeled samples.
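To make the prior hierarchy concrete, the sketch below draws one joint sample from the priors listed above. The function names are hypothetical. The Dirichlet concentration value is not reported in this section, so the default of 1.0 (a flat prior on the simplex) is an assumption.

```python
import math
import random

def half_cauchy(rng):
    """Draw from the standard half-Cauchy C+(0, 1) by inverting its CDF."""
    return math.tan(0.5 * math.pi * rng.random())

def sample_prior(P, rng, concentration=1.0):
    """One joint prior draw; `concentration` is an assumed value."""
    # Symmetric Dirichlet: normalise independent Gamma(concentration, 1) variates.
    g = [rng.gammavariate(concentration, 1.0) for _ in range(P)]
    total = sum(g)
    w = [g_j / total for g_j in g]
    # Horseshoe: beta_j = z_j * tau * lambda_j with half-Cauchy scales.
    tau = half_cauchy(rng)
    lam = [half_cauchy(rng) for _ in range(P)]
    beta = [rng.gauss(0.0, 1.0) * tau * lam_j for lam_j in lam]
    alpha0 = rng.gauss(0.0, 5.0)       # N(0, 5^2) intercept
    eps = rng.random()                 # Uniform(0, 1) noise level
    return {"alpha0": alpha0, "w": w, "beta": beta,
            "tau": tau, "lam": lam, "eps": eps}

rng = random.Random(0)
draw = sample_prior(P=5, rng=rng)
```

Repeating such draws and pushing them through the likelihood gives a simple prior predictive check of whether the implied class probabilities are plausible before any data are used.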

3.3. Posterior Inference

Let $p_i = p(y_i = 1 \mid x_i, \theta)$ denote the noise-adjusted probability in Equation (2). The joint posterior distribution of all unknowns is then
$$p(\theta \mid D) \propto \prod_{i=1}^{N} \mathrm{Bernoulli}(y_i \mid p_i)\; p(\alpha_0)\, p(w)\, p(\beta \mid \tau, \lambda)\, p(\tau)\, p(\lambda)\, p(\varepsilon). \tag{7}$$
Since Equation (7) is analytically intractable, we perform Bayesian inference using Markov Chain Monte Carlo (MCMC), specifically Hamiltonian Monte Carlo (HMC) [23] with the No-U-Turn Sampler (NUTS) adaptation [24], as implemented in Stan [25].
HMC leverages gradient information to efficiently explore high-dimensional posterior landscapes, avoiding the random-walk behavior of standard Metropolis–Hastings algorithms [26,27], while NUTS automatically selects trajectory lengths, removing the need for manual tuning and improving convergence [28].
The posterior inference procedure is summarized in Algorithm 1.
Algorithm 1 Posterior inference via HMC-NUTS
  1: Initialize parameters $\theta = (\alpha_0, \beta, w, \tau, \lambda, \varepsilon)$
  2: for $t = 1, \ldots, T$ do
  3:     Compute the linear predictors $\eta_i$ and probabilities $p_i$ for $i = 1, \ldots, N$ using Equation (2)
  4:     Compute the log-posterior (up to an additive constant):
             $\log p(\theta \mid D) = \sum_{i=1}^{N} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big] + \log p(\alpha_0, w, \beta, \tau, \lambda, \varepsilon) + \text{const}$
  5:     Compute gradients with respect to all parameters
  6:     Simulate Hamiltonian dynamics using leapfrog integration
  7:     Propose new state $\theta'$ and accept/reject with probability
             $\alpha = \min\{ 1, \exp( H(\theta) - H(\theta') ) \}$
  8:     NUTS adaptively selects trajectory length to avoid manual tuning
  9: end for
 10: Return posterior samples after warm-up
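The leapfrog and accept/reject steps of Algorithm 1 can be illustrated with a minimal sketch of plain HMC. This is not the NUTS implementation in Stan: the trajectory length is fixed, and a toy standard-normal target stands in for $\log p(\theta \mid D)$. The function names, `step_size`, and `n_leapfrog` are hypothetical choices.

```python
import math
import random

def log_target(q):
    """Toy log-density (standard normal) standing in for log p(theta | D)."""
    return -0.5 * sum(q_i * q_i for q_i in q)

def grad_log_target(q):
    return [-q_i for q_i in q]

def hmc_step(q, rng, step_size=0.2, n_leapfrog=10):
    """One HMC transition with a fixed trajectory length (no NUTS adaptation)."""
    p = [rng.gauss(0.0, 1.0) for _ in q]                 # resample momentum
    h_old = -log_target(q) + 0.5 * sum(p_i * p_i for p_i in p)
    q_new, p_new = list(q), list(p)
    grad = grad_log_target(q_new)
    for _ in range(n_leapfrog):
        p_new = [p_i + 0.5 * step_size * g_i for p_i, g_i in zip(p_new, grad)]  # half step
        q_new = [q_i + step_size * p_i for q_i, p_i in zip(q_new, p_new)]       # full step
        grad = grad_log_target(q_new)
        p_new = [p_i + 0.5 * step_size * g_i for p_i, g_i in zip(p_new, grad)]  # half step
    h_new = -log_target(q_new) + 0.5 * sum(p_i * p_i for p_i in p_new)
    accept_prob = math.exp(min(0.0, h_old - h_new))      # min{1, exp(H(theta) - H(theta'))}
    if rng.random() < accept_prob:
        return q_new, True
    return q, False

rng = random.Random(1)
q, samples, n_accept = [3.0, -3.0], [], 0
for t in range(2000):
    q, accepted = hmc_step(q, rng)
    n_accept += accepted
    if t >= 500:                                         # discard warm-up draws
        samples.append(q)
accept_rate = n_accept / 2000
mean_0 = sum(s[0] for s in samples) / len(samples)
```

In the actual model, `log_target` would be the log-posterior of step 4 evaluated on the unconstrained parameterization, and the gradient would come from automatic differentiation rather than a hand-written function.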

3.4. Posterior Outputs

From the posterior draws $\{\theta^{(m)}\}_{m=1}^{M}$, we obtain several quantities of direct practical interest: (i) predictive probabilities for new samples, (ii) uncertainty-aware feature importance, (iii) an estimate of the label-noise rate, and (iv) posterior predictive checks and calibration diagnostics.

3.4.1. Predictive Probabilities for Unseen Samples

For a new observation $x^{*}$, the posterior predictive probability is approximated by the Monte Carlo average
$$\hat{p}(y^{*} = 1 \mid x^{*}, D) = \frac{1}{M} \sum_{m=1}^{M} p(y^{*} = 1 \mid x^{*}, \theta^{(m)}),$$
where each probability on the right-hand side is computed from the logistic link and label-noise adjustment in Equation (2). This yields both point predictions and a full posterior distribution over predictive probabilities, enabling uncertainty-aware decision making.
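The Monte Carlo average can be sketched as follows. The three posterior draws are hypothetical placeholders; in practice they would come from the sampler, and the dictionary layout is an assumption made for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predictive_draws(x_new, draws):
    """Per-draw values of p(y = 1 | x_new, theta^(m))."""
    probs = []
    for d in draws:
        eta = d["alpha0"] + sum(
            w * b * x for w, b, x in zip(d["w"], d["beta"], x_new))
        s = sigmoid(eta)
        probs.append((1.0 - d["eps"]) * s + d["eps"] * (1.0 - s))
    return probs

# Hypothetical posterior draws for a two-feature model.
draws = [
    {"alpha0": -0.2, "w": [0.6, 0.4], "beta": [1.5, -0.5], "eps": 0.05},
    {"alpha0": 0.1, "w": [0.5, 0.5], "beta": [1.0, -1.0], "eps": 0.10},
    {"alpha0": 0.0, "w": [0.7, 0.3], "beta": [2.0, 0.0], "eps": 0.02},
]
probs = predictive_draws([1.0, 2.0], draws)
p_hat = sum(probs) / len(probs)      # Monte Carlo estimate of the predictive probability
spread = max(probs) - min(probs)     # crude indication of predictive uncertainty
```

Keeping the per-draw probabilities, rather than only their mean, is what allows credible intervals to be reported for an individual prediction.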

3.4.2. Uncertainty-Aware Feature Importance

The posterior distribution of the simplex weights $w$ quantifies the global relevance of each predictor. For feature $j$,
$$\mathrm{Imp}(j) = \mathbb{E}[w_j \mid D], \qquad \mathrm{Uncertainty}(j) = \mathrm{Var}[w_j \mid D],$$
so that larger posterior means indicate more influential features, whereas wider credible intervals reflect greater uncertainty in their relative importance. Unlike deterministic feature-selection methods, this yields a fully probabilistic interpretation of feature relevance.
It is important to distinguish between global feature relevance and effective contribution to the linear predictor. In the proposed model, the simplex-constrained weights $w_j$ represent normalized and globally interpretable feature importance scores, reflecting the relative importance of predictors across the model, and are used as the primary measure of feature importance in our analysis. However, the actual contribution of each feature to the linear predictor is governed by the product $w_j \beta_j$, which combines feature relevance and effect size. Therefore, while $w_j$ provides a global importance ranking, the quantity $w_j \beta_j$ determines the effective predictive influence of each feature.
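The two summaries can be computed side by side from posterior draws. The draws below are hypothetical and chosen so that the arithmetic is easy to follow; the function and field names are assumptions.

```python
def summarize_importance(draws):
    """Posterior mean and variance of each w_j, plus the mean of w_j * beta_j."""
    M, P = len(draws), len(draws[0]["w"])
    summary = []
    for j in range(P):
        w_j = [d["w"][j] for d in draws]
        mean_w = sum(w_j) / M
        var_w = sum((v - mean_w) ** 2 for v in w_j) / (M - 1)   # sample variance
        mean_wb = sum(d["w"][j] * d["beta"][j] for d in draws) / M
        summary.append({"feature": j, "imp": mean_w,
                        "var": var_w, "effect": mean_wb})
    return summary

# Four hypothetical posterior draws for a three-feature model.
draws = [
    {"w": [0.5, 0.25, 0.25], "beta": [2.0, -4.0, 0.0]},
    {"w": [0.25, 0.5, 0.25], "beta": [2.0, -4.0, 0.0]},
    {"w": [0.5, 0.25, 0.25], "beta": [4.0, -2.0, 0.0]},
    {"w": [0.75, 0.0, 0.25], "beta": [4.0, -2.0, 0.0]},
]
summary = summarize_importance(draws)
ranking = sorted(summary, key=lambda r: r["imp"], reverse=True)
```

In this toy input, the third feature has the same posterior mean weight as the second but a zero mean product $w_j \beta_j$, which illustrates why the weight ranking and the effective contribution should be read together.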

3.4.3. Robust Classification Under Noise

The inclusion of $\varepsilon$ in the likelihood explicitly models label contamination. Its posterior mean,
$$\hat{\varepsilon} = \frac{1}{M} \sum_{m=1}^{M} \varepsilon^{(m)},$$
provides an estimate of the noise rate in the training labels: posterior mass concentrated near zero corresponds to mostly clean labels, whereas mass away from zero indicates substantial mislabeling. Simultaneously, the heavy-tailed shrinkage prior on $\beta$ and the simplex constraint on $w$ attenuate the influence of outlying covariates and uninformative predictors.

3.4.4. Posterior Predictive Checks and Calibration

Finally, posterior samples facilitate model criticism and calibration assessment. They can be used to compute the distributions of log-loss and Brier scores, Expected Calibration Error (ECE) and maximum calibration gap, and to perform posterior predictive checks [29,30,31,32,33,34,35]. These diagnostics allow us to assess not only predictive performance but also the reliability of uncertainty estimates produced by the model.
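For reference, the calibration quantities named above can be computed as in the sketch below. The binning scheme (ten equal-width bins on the predicted positive-class probability) is an assumption, since this section does not specify one, and the function names are hypothetical.

```python
import math

def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p_i - y_i) ** 2 for y_i, p_i in zip(y, p)) / len(y)

def log_loss(y, p, tiny=1e-12):
    """Average negative Bernoulli log-likelihood."""
    total = 0.0
    for y_i, p_i in zip(y, p):
        p_i = min(max(p_i, tiny), 1.0 - tiny)    # guard against log(0)
        total -= y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return total / len(y)

def calibration_errors(y, p, n_bins=10):
    """Expected and maximum calibration gap over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y_i, p_i in zip(y, p):
        b = min(int(p_i * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[b].append((y_i, p_i))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        freq = sum(y_i for y_i, _ in members) / len(members)   # observed frequency
        conf = sum(p_i for _, p_i in members) / len(members)   # mean prediction
        gap = abs(freq - conf)
        ece += len(members) / len(y) * gap
        mce = max(mce, gap)
    return ece, mce

# Tiny hypothetical example.
y = [0, 0, 1, 1]
p = [0.25, 0.25, 0.75, 0.75]
brier = brier_score(y, p)
ll = log_loss(y, p)
ece, mce = calibration_errors(y, p)
```

Applying these functions to each posterior draw of the predicted probabilities, rather than only to their mean, gives the distributions of log-loss and Brier score mentioned in the text.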

4. Simulation and Experimental Results

4.1. Simulation

To evaluate the proposed Bayesian robust feature weighting framework systematically, we conducted an extensive simulation study across eight scenarios (S1–S8), each designed to reflect a distinct source of difficulty in supervised classification. The scenarios manipulate data characteristics such as dimensionality, correlation, class imbalance, outliers, and label noise, thereby enabling a comprehensive assessment of robustness and generalization. Table 2 summarizes these scenarios.
To ensure reproducibility, we provide detailed specifications for each simulation scenario. For all scenarios, the sample size, number of features, and data-generating mechanisms are explicitly defined. The covariates are generated from standard distributions, and the true regression coefficients are constructed to reflect varying levels of sparsity and signal strength. Noise contamination is introduced through a controlled mechanism, with a predefined contamination rate and flipping probability.
Specifically, each scenario differs in terms of the proportion of relevant features, the magnitude of regression coefficients, and the level of noise contamination. The random seed is fixed across all experiments to ensure replicability. All simulations are implemented using consistent preprocessing steps, including feature standardization. Detailed parameter settings for each scenario are provided to facilitate exact reproduction of the results.
  • Data generation: Predictors $X \in \mathbb{R}^{n \times m}$ were drawn from multivariate Gaussian blocks with correlation parameter $\rho$, yielding both independent and correlated structures. In some scenarios, heavy-tailed covariates were introduced by replacing the first block with $t_{\nu}$-distributed samples ($\nu$ degrees of freedom), simulating covariate outliers. A sparse linear signal was imposed via coefficients $\beta$ with only $s$ variables contributing to the decision boundary; scenario S8 additionally included nonlinear interactions (see Table 2).
  • Class labels: Latent scores were computed as $\eta_i = \alpha_0 + x_i^{\top} \beta$, with $\alpha_0$ calibrated such that the marginal probability of a positive label matched the target prevalence $\pi$. Observed labels $y_i$ were then drawn from a $\mathrm{Bernoulli}(\sigma(\eta_i))$ distribution, where $\sigma(t) = 1 / (1 + e^{-t})$, and independently flipped with probability $\varepsilon$ to simulate misclassification.
  • Outlier injection: In scenarios involving feature contamination, a fraction $\gamma$ of rows in $X$ were shifted by multiples of the marginal standard deviation in randomly chosen dimensions, producing covariate outliers (cf. S5 in Table 2).
  • Evaluation metrics: Ten replications were performed for each simulation scenario. The training and testing splits were stratified to ensure both classes were proportionally represented. Model performance was evaluated across three complementary dimensions: discrimination, calibration, and accuracy.
Model discrimination refers to the ability to correctly distinguish between positive and negative instances. It was quantified using the area under the receiver-operating-characteristic curve (AUC), area under the precision–recall curve (PRAUC), and F1-score [36,37,38]. AUC measures overall ranking ability, PRAUC provides a more informative assessment under class imbalance, and the F1-score balances precision and recall.
Calibration assesses agreement between predicted probabilities and observed frequencies. We employed log-loss (cross-entropy loss), Brier score, expected calibration error (ECE), and maximum calibration gap (MCE) [31,39,40]. The Brier score represents a proper scoring rule capturing calibration and refinement, whereas the ECE summarizes the average deviation between predicted and empirical probabilities.
Overall classification accuracy, defined as the proportion of correctly classified observations, was reported for completeness [41], but it was interpreted alongside discrimination and calibration measures because it can be misleading under imbalance.
  • Comparative models: In addition to the proposed Bayesian framework, benchmarks included logistic regression (with and without L1/elastic-net penalties), random forests, gradient boosting, and class-balanced stochastic-gradient descent.
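The data-generating steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function name `simulate`, the equicorrelated block structure, the unit-valued active coefficients, the bisection used to calibrate $\alpha_0$, and all default values are hypothetical choices. The heavy-tailed $t_{\nu}$ block and the nonlinear interactions of S8 are omitted.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate(n=500, m=20, s=5, rho=0.5, block=5, prevalence=0.3,
             flip_prob=0.1, outlier_frac=0.05, shift=6.0, seed=0):
    """Sketch of the simulation design; every default here is an assumption."""
    rng = np.random.default_rng(seed)

    # Gaussian predictors with equicorrelated blocks (correlation rho within a block).
    cov = np.eye(m)
    for start in range(0, m, block):
        end = min(start + block, m)
        cov[start:end, start:end] = rho
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(m), cov, size=n)

    # Sparse linear signal: only the first s coefficients are non-zero.
    beta = np.zeros(m)
    beta[:s] = 1.0
    score = X @ beta

    # Calibrate the intercept by bisection so that mean(sigmoid(alpha0 + score))
    # matches the target prevalence (the mean is increasing in alpha0).
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sigmoid(mid + score).mean() < prevalence:
            lo = mid
        else:
            hi = mid
    alpha0 = 0.5 * (lo + hi)

    # Clean Bernoulli labels, then independent flips with probability flip_prob.
    y_clean = rng.binomial(1, sigmoid(alpha0 + score))
    flips = rng.random(n) < flip_prob
    y = np.where(flips, 1 - y_clean, y_clean)

    # Outlier injection: shift one random coordinate in a fraction of the rows
    # by a multiple of that column's standard deviation.
    X_obs = X.copy()
    rows = rng.choice(n, size=int(outlier_frac * n), replace=False)
    cols = rng.integers(0, m, size=rows.size)
    X_obs[rows, cols] += shift * X.std(axis=0)[cols]

    return X_obs, y, y_clean, alpha0, beta
```

Returning both `y` and `y_clean` makes it possible to check the realized flip rate against the nominal one, which is useful when comparing it with a model's estimated noise level.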
Figure 1 and Figure 2 show the comparative performance trends of all models across the eight experimental scenarios (S1–S8). Each subplot presents key evaluation metrics: AUC, PRAUC, F1-score, and accuracy capture discriminative ability, whereas log-loss, Brier score, expected calibration error (ECE), and maximum calibration gap (MCE) assess probability calibration.
For discrimination metrics (AUC, PRAUC, F1, Accuracy), higher values indicate better performance (Figure 1); for calibration metrics (LogLoss, Brier, ECE, MCE), lower values are desirable (Figure 2). The Bayesian model (Bayes_FW) is compared against classical machine learning methods, including logistic regression (and its L1 and elastic-net variants), gradient boosting, random forest, and SGD with balanced weights.
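As a small illustration of the discrimination metrics, the sketch below computes AUC from its pairwise-ranking definition and the F1-score at a fixed threshold. The function names and the 0.5 threshold are assumptions made for this example; the source does not state which threshold or software was used, and PRAUC is left out for brevity.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    diff = s[y][:, None] - s[~y][None, :]   # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 = 2TP / (2TP + FP + FN) after thresholding the predicted probabilities."""
    y = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(pred & y)
    fp = np.sum(pred & ~y)
    fn = np.sum(~pred & y)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0
```

The pairwise form needs memory proportional to the number of positive-negative pairs, so it suits small examples; a rank-based formula is preferable for large samples.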
Overall, Figure 1 and Figure 2 show that the Bayesian model maintains competitive or superior performance in probabilistic metrics such as LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Gap (MCE), underscoring its strength in uncertainty quantification and calibration. In terms of discriminative performance (AUC, PRAUC, Accuracy, and F1), logistic regression variants achieve the highest AUC and PRAUC on the clean and balanced dataset (S1), although the Bayesian model still provides strong results. Under more challenging conditions, such as the imbalance setting (S7) and the hard mixture scenario (S8), the Bayesian approach demonstrates robustness, preserving relatively stable AUC and F1 compared to other methods that exhibit sharper drops.
With respect to calibration, the Bayesian model consistently attains the lowest (and thus best) values, particularly in S1, S2, and S4, reflecting not only accurate predictions but also reliable probability estimates, a property of particular importance in risk-sensitive applications. In contrast, models such as SGD and Random Forest show higher variability, with occasional large calibration errors. Furthermore, in high-dimensional (S3) and correlated (S2) scenarios, performance differences among models tend to narrow, yet the Bayesian method continues to retain its calibration advantage. Similarly, in the presence of label noise (S4) and outliers (S5), the uncertainty-aware structure of the Bayesian model prevents the extreme degradation observed in tree-based models.
Taken together, these findings highlight that while traditional models can occasionally surpass Bayesian approaches in terms of pure discriminative accuracy (as in S1), Bayesian modeling provides more reliable probability calibration and demonstrates greater robustness across diverse and adverse data conditions.
Table 3 summarizes the convergence diagnostics and key hyperparameter estimates across eight simulation scenarios, each designed to test different data complexities and contamination settings for the proposed Bayesian Feature Weighting (Bayes_FW) model. The parameter $\alpha_0$ represents the global intercept of the linear predictor, capturing the baseline log-odds of the outcome. The mean values vary moderately across scenarios (from 0.18 in S7_imbalance to 0.22 in S6_both), reflecting adaptive shrinkage behavior under different data conditions. The global shrinkage behavior is governed by the parameter $\tau$ in the hierarchical horseshoe prior, while $\lambda$ controls local shrinkage at the feature level. The parameter $\varepsilon$ denotes the estimated label-noise proportion, which increases notably in noise-heavy settings such as S4_lblnoise (0.09), S6_both (0.10), and S8_hardmix (0.12), confirming that the model successfully captured and quantified contamination in the data.
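To make the role of the label-noise proportion concrete, the sketch below shows one common way a flip probability can enter a Bernoulli likelihood. It assumes symmetric flips, as in the simulation design, and is not taken from the model specification; the function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def observed_label_prob(eta, eps):
    """P(observed y = 1) when a clean Bernoulli(sigmoid(eta)) label is flipped with probability eps."""
    p = sigmoid(np.asarray(eta, dtype=float))
    return (1.0 - eps) * p + eps * (1.0 - p)

def noisy_log_lik(y, eta, eps):
    """Log-likelihood of observed labels under the symmetric-flip assumption."""
    y = np.asarray(y, dtype=float)
    q = observed_label_prob(eta, eps)
    return float(np.sum(y * np.log(q) + (1.0 - y) * np.log(1.0 - q)))
```

Under this form the observed-label probability is confined to the interval from `eps` to `1 - eps`, so a single confidently mislabeled observation contributes a bounded penalty to the log-likelihood rather than an arbitrarily large one.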
Convergence diagnostics based on the Gelman–Rubin statistic (mean $\hat{R}$ between 1.00 and 1.01, maximum $\hat{R}$ up to 1.03) demonstrated excellent mixing and stability of the Markov chains across all scenarios. The proportion of parameters with $\hat{R} > 1.01$ remained below 10% in every case, further confirming reliable convergence. Effective sample sizes ($n_{\mathrm{eff}}$) were generally high (median values between 1750 and 3900), ensuring that posterior estimates were based on sufficiently independent draws. Table 3 summarizes the posterior diagnostics for all simulation settings.
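For readers unfamiliar with the Gelman-Rubin statistic, the following sketch computes the classical potential scale reduction factor for one scalar parameter from several chains. It is an illustrative implementation of the textbook formula only; the source does not say which variant was used, and many current MCMC packages report a split or rank-normalized version instead.

```python
import numpy as np

def gelman_rubin(chains):
    """Classical R-hat for an array of shape (n_chains, n_draws)."""
    x = np.asarray(chains, dtype=float)
    n = x.shape[1]
    B = n * x.mean(axis=1).var(ddof=1)     # between-chain variance
    W = x.var(axis=1, ddof=1).mean()       # average within-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled estimate of the posterior variance
    return float(np.sqrt(var_hat / W))
```

Values close to 1 indicate that between-chain and within-chain variability agree; chains stuck in different regions inflate `B` and push the statistic well above 1.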
Overall, Table 3 indicates that the proposed Bayesian model achieved stable convergence and consistent inference across varying levels of correlation, label noise, imbalance, and dimensionality. The results validate the robustness and computational reliability of the MCMC implementation, even under challenging conditions such as high-dimensional noise mixtures (S8) and concurrent outlier–label-noise contamination (S6).
Figure 3 presents a comparative evaluation of the proposed Bayesian Feature Weighting (Bayes_FW) model against a diverse set of benchmark classifiers across the eight data scenarios. These include standard clean data (S1), correlated predictors (S2), high-dimensional features (S3), label noise (S4), feature outliers (S5), simultaneous label and feature noise (S6), class imbalance (S7), and a challenging hard-mixture setting with both severe label noise and imbalance (S8). The benchmark models encompass logistic regression and its regularized variants (L1 and elastic net), as well as ensemble-based methods such as random forest and gradient boosting. In addition, a linear baseline trained by stochastic gradient descent with balanced class weights (SGD-balanced) is included to account for class imbalance. This setup enables a comprehensive assessment of predictive robustness and generalization across diverse data conditions.
The proposed Bayes_FW model consistently outperforms or matches the benchmarks in challenging settings, particularly under label noise (S4), feature outliers (S5), and compound corruption (S6, S8). Simpler models such as Logistic Regression perform competitively in clean scenarios (e.g., S1) but degrade in the presence of noise. In contrast, Bayes_FW achieves the best F1 and PRAUC scores in most corrupted settings, demonstrating superior robustness and predictive reliability.
Figure 4 reports metrics related to probabilistic calibration and uncertainty estimation for Bayes_FW and benchmark models. To assess the probabilistic calibration performance of the models, four complementary metrics are employed. Logloss measures the negative log-likelihood of the predicted class probabilities, penalizing overconfident and incorrect predictions. Brier Score captures the mean squared error between predicted probabilities and actual class labels, providing a direct measure of overall probabilistic accuracy. Expected Calibration Error (ECE) quantifies the average deviation between predicted confidence and observed accuracy across confidence bins, reflecting the alignment between model confidence and correctness. Lastly, Max Calibration Gap reports the largest observed discrepancy between confidence and accuracy, indicating the worst-case calibration error. Together, these metrics offer a comprehensive evaluation of both average and extreme calibration behavior.
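The four calibration metrics described above can be computed as in the sketch below. The binning scheme is an assumption: ten equal-width bins on the predicted positive-class probability, with the gap in each bin taken between the mean predicted probability and the observed positive rate. The source does not specify its binning, so values from this sketch need not match the reported tables.

```python
import numpy as np

def calibration_metrics(y_true, p_pred, n_bins=10, eps=1e-12):
    """Log-loss, Brier score, ECE and MCE for binary predictions."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)

    log_loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    brier = np.mean((p - y) ** 2)

    # Assign each prediction to an equal-width probability bin.
    bin_ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(p[mask].mean() - y[mask].mean())
        ece += mask.mean() * gap      # gap weighted by the share of samples in the bin
        mce = max(mce, gap)           # worst-case gap over non-empty bins
    return {"logloss": float(log_loss), "brier": float(brier),
            "ece": float(ece), "mce": float(mce)}
```

Because MCE is a maximum over bins, it is sensitive to sparsely populated bins, which is one reason it tends to be far larger and more variable than ECE on small datasets.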
The proposed Bayes_FW model shows consistently better or competitive performance across calibration metrics, particularly under noisy or imbalanced conditions (S2, S4, S6, S8). Although some benchmark models achieve low classification error, they often exhibit poor calibration (e.g., Random Forest, SGD-Balanced). Among the compared models, Bayes_FW combines strong predictive performance with principled uncertainty estimates, making it well suited for applications where reliability and trust in model output are critical, such as healthcare.

4.2. Real Medical Dataset Applications

In this study, we evaluate the performance of the proposed model using four benchmark medical datasets, namely the Breast Cancer Wisconsin, Pima Indians Diabetes, South African Heart Disease, and Cleveland Heart Disease datasets. For each dataset, we report key statistical characteristics, including the number of samples, number of features, and class distribution, to ensure transparency and reproducibility.
The Pima Indians Diabetes dataset consists of 768 samples with 8 clinical features. The target variable indicates whether a patient is diagnosed with diabetes. Approximately 35% of the samples belong to the positive class, while 65% correspond to the negative class, indicating a moderately imbalanced classification problem. All features are standardized to have zero mean and unit variance prior to model training. To ensure a reliable evaluation, we employ a stratified 5-fold cross-validation scheme, preserving class proportions across training and test splits.
The Cleveland Heart Disease dataset consists of 297 samples with 13 clinical features. The target variable represents the presence or absence of heart disease. Approximately 54% of the samples correspond to the positive class, indicating a relatively balanced dataset. Similar to the Pima dataset, all features are standardized prior to analysis, and a stratified 5-fold cross-validation procedure is used to divide the dataset into training and test sets, ensuring consistency and comparability across experiments.
The Breast Cancer Wisconsin dataset and the South African Heart Disease dataset are also included in the experimental evaluation to provide a comprehensive assessment across datasets with varying levels of class imbalance, feature characteristics, and noise sensitivity.
This standardized evaluation protocol allows for a fair and consistent comparison between the proposed method and competing approaches across diverse medical data settings.
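A stratified 5-fold split with standardization can be sketched as follows. The helper names are hypothetical, and computing the mean and standard deviation on the training fold only is an assumption made here to avoid leaking test-fold information; the text states only that features were standardized prior to training.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=0):
    """Assign each sample a fold index so that class proportions are preserved."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.empty(y.shape[0], dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        folds[idx] = np.arange(idx.size) % n_splits   # deal class members round-robin
    return folds

def standardized_splits(X, y, n_splits=5, seed=0):
    """Yield (X_train, y_train, X_test, y_test) with train-fold standardization."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    folds = stratified_folds(y, n_splits, seed)
    for k in range(n_splits):
        train, test = folds != k, folds == k
        mu = X[train].mean(axis=0)
        sd = X[train].std(axis=0)
        sd[sd == 0] = 1.0                             # guard against constant columns
        yield (X[train] - mu) / sd, y[train], (X[test] - mu) / sd, y[test]
```

Each class is shuffled and dealt across folds separately, so the positive-class count differs by at most one between any two folds.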

4.2.1. Breast Cancer Wisconsin Dataset

The Breast Cancer Wisconsin (Original) dataset was obtained from the UCI Machine Learning Repository. It consists of nine predictor variables describing cellular characteristics, along with a binary class label indicating whether a tumor is malignant (1) or benign (0). The dataset originally contains 699 observations; after removing instances with missing values, 683 cases remained for analysis. The class distribution is moderately imbalanced, with approximately 65% benign and 35% malignant samples. Prior to model training, all predictors were standardized to have zero mean and unit variance, and the ID attribute, which carries no predictive information, was discarded.
An overview of the predictor variables in the Breast Cancer Wisconsin dataset is provided in Table 4. These features represent morphological and nuclear characteristics extracted from digitized cell images and form the basis for malignancy classification.
Table 5 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).
Across all metrics, Bayes_FW consistently achieved the strongest overall performance. It obtained the highest mean AUC (0.9975) and PRAUC (0.9931), along with superior F1-score (0.9587) and accuracy (0.9699), while maintaining low variability (SD < 0.006). These results highlight the robustness and discriminative strength of the Bayesian approach in identifying malignant cases. Among the frequentist baselines, Logistic_EN and Logistic_L1 also performed competitively, with AUC values around 0.996 and balanced F1-scores near 0.947, suggesting that regularization contributes to slight gains over the standard logistic model. Random Forest and GradBoost delivered marginally lower performance, reflecting the limited benefit of nonlinear tree-based methods for this dataset, which primarily consists of moderately correlated numeric predictors.
The Bayesian Feature Weighting model outperformed all other approaches, demonstrating superior predictive accuracy and stability, and confirming its effectiveness for robust and interpretable tumor classification.
Table 6 summarizes the calibration and reliability metrics for the six classification models evaluated on the Breast Cancer Wisconsin dataset. The reported measures include the LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with their mean and standard deviation (SD) across repeated cross-validation runs.
Among all compared models, the Bayesian Feature Weighting (Bayes_FW) model achieved the most reliable probabilistic predictions, as evidenced by the lowest LogLoss (0.0912) and lowest Brier score (0.0243). These results indicate that the Bayesian approach produces well-calibrated probability estimates with minimal deviation from true outcome probabilities. The model also yielded the smallest ECE (0.0321), suggesting excellent overall calibration across probability bins.
In comparison, Random Forest exhibited slightly higher LogLoss (0.0922) but maintained strong calibration (ECE = 0.0365). Traditional logistic models—Logistic_EN, Logistic_L1, and Logistic—performed reasonably well but showed modestly higher LogLoss and Brier scores, indicating slightly less accurate probability estimation. The Gradient Boosting (GradBoost) model performed the weakest in terms of calibration, with the highest LogLoss (0.1178) and Brier score (0.0292), reflecting some degree of overconfidence in its probability predictions.
The Bayes_FW model outperformed all alternatives across all four calibration metrics, confirming its superiority not only in predictive discrimination (as shown in Table 5) but also in probability reliability and uncertainty quantification—a key advantage of the Bayesian framework.
Table 7 and Figure 5 present the posterior mean feature weights and their 90% credible intervals estimated using the Bayesian Feature Weighting (Bayes_FW) model. These results quantify the relative contribution of each cellular attribute to malignancy prediction while incorporating model uncertainty through the Bayesian posterior distribution.
It is important to note that the feature weights $w_j$ represent the relative relevance of predictors within the weighting structure. The actual contribution of a predictor to the linear predictor depends on the product $w_j \beta_j$. Since the hierarchical horseshoe prior strongly shrinks irrelevant coefficients toward zero, predictors with near-zero $\beta_j$ have negligible predictive influence even if their corresponding weights are moderately large.
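A toy calculation makes this distinction concrete. The numbers below are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
import numpy as np

def effective_contributions(w, beta):
    """Per-feature multiplier w_j * beta_j for weights w on the probability simplex."""
    w = np.asarray(w, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must be non-negative and sum to 1")
    return w * beta

# Hypothetical values: the first feature has the largest weight, but its
# coefficient has been shrunk to zero, so it does not move the linear predictor.
w = np.array([0.5, 0.3, 0.2])
beta = np.array([0.0, 1.0, -2.0])
contrib = effective_contributions(w, beta)   # approximately [0.0, 0.3, -0.4]
```

Here the feature with the smallest weight has the largest effective contribution in absolute value, which is why weights and coefficients should be read together.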
According to the results, Bare nuclei, Clump thickness, and Mitoses emerge as the most influential predictors, showing the highest posterior mean weights (0.1467, 0.1339, and 0.1253, respectively). Their wider yet consistently positive credible intervals indicate both strong and stable associations with malignancy likelihood. Intermediate importance is observed for Normal nucleoli, Cell size, and Bland chromatin, which also contribute meaningfully but with slightly lower average weights. Finally, Marginal adhesion, Cell shape, and Epithelial cell size have the smallest posterior means, suggesting relatively weaker influence in distinguishing malignant from benign tumors.
Figure 5 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The clear separation of higher-weighted features at the top highlights the discriminative power of nuclear irregularities and cellular cohesion, which are biologically consistent with pathological observations in breast cancer diagnosis.
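Summaries of the kind plotted in such a figure can be obtained from posterior draws as sketched below. The use of equal-tailed intervals is an assumption, since the text does not say whether equal-tailed or highest-density intervals were reported, and the Dirichlet draws in the usage lines are a stand-in for real MCMC output.

```python
import numpy as np

def summarize_weights(draws, level=0.90):
    """Posterior mean and equal-tailed credible interval per feature.

    draws: array of shape (n_draws, n_features), each row on the simplex.
    """
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - level) / 2.0
    mean = draws.mean(axis=0)
    lower = np.quantile(draws, tail, axis=0)
    upper = np.quantile(draws, 1.0 - tail, axis=0)
    return mean, lower, upper

# Stand-in posterior: Dirichlet draws for three hypothetical features.
rng = np.random.default_rng(0)
fake_draws = rng.dirichlet([5.0, 3.0, 2.0], size=4000)
mean, lower, upper = summarize_weights(fake_draws)
```

Because every draw sums to one, the posterior means also sum to one, so a weight can only increase at the expense of the others.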

4.2.2. Pima Indians Diabetes Dataset

The Pima Indians Diabetes dataset is a widely used benchmark in medical machine learning, originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains clinical and physiological measurements from female patients of Pima Indian heritage aged 21 years or older. The dataset includes eight predictor variables—such as glucose concentration, body mass index (BMI), and number of pregnancies—that are important risk factors for type 2 diabetes. The binary outcome variable indicates whether an individual shows signs of diabetes (1) or not (0), based on established diagnostic criteria. The dataset is frequently used to evaluate predictive models in healthcare, as it combines demographic, genetic, and lifestyle-related risk indicators with measurable biomedical parameters.
An overview of the predictor variables in the Pima Indians Diabetes dataset is provided in Table 8. These features represent demographic, physiological, and biochemical risk factors commonly associated with type 2 diabetes.
To further validate the proposed Bayesian framework, comprehensive posterior diagnostics are provided in Appendix A. The marginal posterior distributions of key global parameters (Figure A2) indicate well-defined and unimodal behavior. In particular, the intercept ($\alpha$) is tightly concentrated, while the global shrinkage parameter ($\tau$) exhibits a right-skewed distribution, reflecting the adaptive sparsity induced by the horseshoe prior. The label-noise parameter ($\epsilon$) is centered around low values, suggesting limited but non-negligible noise in the dataset.
Sampling diagnostics confirm the reliability of inference. As shown in Figure A3, the effective sample size ratios ($N_{\mathrm{eff}}/N$) are consistently high, indicating efficient exploration of the posterior space. Similarly, the $\hat{R}$ statistics (Figure A3) are tightly concentrated around 1, providing strong evidence of convergence across all chains.
Trace plots for both global parameters and regression coefficients (Figure A5, Figure A6 and Figure A7) demonstrate good mixing behavior with no visible trends or chain separation, further supporting stable MCMC performance. Although occasional spikes are observed in the coefficient traces because of the heavy-tailed prior, these do not indicate pathological sampling behavior.
The joint posterior structure (Figure A8) reveals mild dependencies among parameters, particularly between $\alpha$ and $\tau$, which is expected in hierarchical shrinkage models. Importantly, no pathological correlations or funnel-shaped geometries are observed.
Finally, the posterior predictive check (Figure A9) shows strong agreement between observed and model-generated distributions, indicating that the proposed model successfully captures the underlying data-generating process. Minor deviations at extreme probability regions suggest slight calibration imperfections but do not materially affect predictive performance.
Overall, these diagnostics confirm that the proposed Bayesian feature weighting model achieves stable convergence, efficient sampling, and reliable uncertainty quantification on the Pima Indians Diabetes dataset.
Table 9 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).
Across all discrimination metrics, the Bayes_FW method achieved the highest overall performance, with an AUC mean of 0.8426, PRAUC mean of 0.7851, and F1-score mean of 0.6881, outperforming both conventional logistic regression variants and ensemble-based methods. These results highlight the model’s ability to capture uncertainty in feature contributions while maintaining high discriminative power. Logistic_L1 and standard Logistic regression followed closely, exhibiting comparable AUC values (0.8338 and 0.8324, respectively) but slightly lower precision–recall and F1-scores. Ensemble models such as Random Forest and Gradient Boosting demonstrated lower AUC and PRAUC values, suggesting less stable performance for this moderately imbalanced dataset. The higher SD observed for Gradient Boosting indicates greater variability across runs, potentially due to hyperparameter sensitivity or overfitting in smaller training subsets.
Table 10 reports the calibration and reliability metrics for the same models. The measures include LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with mean and SD across repeated cross-validation runs.
Among the compared methods, the Bayesian Feature Weighting (Bayes_FW) approach achieved the lowest LogLoss (0.3404) and Brier score (0.1468), indicating superior probabilistic accuracy and overall calibration. It also produced the smallest ECE (0.0700) and MCE (0.2783), suggesting that the Bayesian model provides well-calibrated probability estimates that closely align with observed outcomes. In contrast, the Gradient Boosting (GradBoost) model showed the weakest calibration, with the highest LogLoss and ECE values, implying overconfident predictions and larger deviations from true event frequencies.
Table 11 and Figure 6 present the posterior mean feature weights and their 90% credible intervals estimated by the Bayesian Feature Weighting model. These results quantify the relative contribution of each clinical variable to diabetes prediction while incorporating model uncertainty through the Bayesian posterior distribution.
According to the results, glucose, body mass index, and number of pregnancies are the most influential predictors, exhibiting the highest posterior mean weights (0.1878, 0.1654, and 0.1471, respectively). These features show strong and stable associations with diabetes risk, as indicated by their positive and moderately wide credible intervals. Pedigree function, reflecting genetic predisposition, also ranks among the top predictors. Lower but meaningful contributions are observed for blood pressure, insulin, and triceps skinfold thickness, suggesting secondary influence in the model’s classification process.
Figure 6 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The dominance of glucose concentration and body mass index underscores their well-established roles as primary determinants of type 2 diabetes, while the remaining features capture secondary but biologically consistent effects.

4.2.3. South African Heart Disease Dataset

The South African Heart Disease (SAHeart) dataset originates from a South African study on risk factors associated with coronary heart disease (CHD). It includes demographic, clinical, and lifestyle-related variables commonly linked to cardiovascular outcomes. The dataset combines biochemical measures (e.g., LDL cholesterol) with behavioral indicators (e.g., tobacco and alcohol use) and psychosocial factors (Type-A behavior). The binary outcome variable indicates the presence (1) or absence (0) of CHD. This dataset is widely used in statistical learning and medical data analysis because it provides a comprehensive mix of physiological, behavioral, and hereditary risk factors.
An overview of the predictor variables in the SAHeart dataset is provided in Table 12.
Table 13 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), L1-regularized logistic regression (Logistic_L1), standard logistic regression (Logistic), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).
Across discrimination metrics, Bayes_FW achieved the best overall performance (AUC = 0.7903; PRAUC = 0.6933), with competitive F1-score (0.5494) and accuracy (0.7400). Logistic_EN and Logistic_L1 followed closely in AUC and PRAUC. Tree-based methods (Random Forest, GradBoost) showed lower discrimination, consistent with potential overfitting or noise sensitivity in smaller biomedical datasets.
Table 14 summarizes calibration and loss metrics: LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each reported with mean and SD across repeated runs.
Bayes_FW achieved the lowest LogLoss (0.5245) and Brier score (0.1645), indicating strong probabilistic accuracy. Although its ECE was higher than some logistic baselines, Bayes_FW maintained competitive calibration overall, while GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).
Table 15 and Figure 7 present posterior mean feature weights and 90% credible intervals estimated by Bayes_FW.
According to the results, age, family history (famhist), and tobacco exhibit the largest posterior mean weights, indicating the strongest association with CHD risk in this cohort. Biochemical and physiological indicators such as ldl and bp show moderate influence, while lifestyle/anthropometric variables (obesity, adiposity, alcohol) contribute more weakly, with wider credible intervals reflecting greater uncertainty.

4.2.4. Heart Disease (Cleveland) Dataset

The Heart Disease (Cleveland) dataset is a widely used benchmark in cardiovascular research and machine learning. It contains 303 patient records collected at the Cleveland Clinic Foundation; after preprocessing (removing missing values and encoding categorical variables), 297 samples remain. The outcome variable is binary: presence of heart disease (1) versus absence (0).
The dataset includes clinical, demographic, and exercise-related attributes (e.g., age, blood pressure, cholesterol, thalassemia test results, ECG findings). Categorical features (e.g., cp, thal, slope, restecg, ca, exang) were expanded to one-hot indicators for compatibility with the Bayesian feature weighting framework.
An overview of the predictor variables is provided in Table 16.
Table 17 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).
Across discrimination metrics, Bayes_FW achieved the strongest overall performance (AUC = 0.9079; PRAUC = 0.9076; F1 = 0.8264; ACC = 0.8440). Penalized logistic baselines (Logistic_EN, Logistic_L1) were competitive, while tree-based methods (Random Forest, GradBoost) trailed on average, consistent with smaller sample sizes and mixed continuous/categorical predictors.
Table 18 summarizes calibration and loss metrics—LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE)—each reported with mean and SD across repeated runs.
Bayes_FW achieved the lowest LogLoss (0.3365) and Brier score (0.1103), indicating strong probabilistic calibration. Logistic baselines were competitive but less precise, whereas GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).
Table 19 and Figure 8 present the posterior mean feature weights and their 90% credible intervals estimated using Bayes_FW.
According to the results, cp_3, ca_1, and thal_3 exhibit the largest posterior mean weights, indicating the strongest association with heart disease risk in this cohort. Moderately important features include oldpeak, slope_1, sex_1, and thalach. Variables such as fbs_1, chol, and age receive smaller weights after accounting for correlation among predictors. Figure 8 visually confirms these findings by showing posterior means with 90% credible intervals.

5. Results and Discussion

The proposed Bayesian Feature Weighting (Bayes_FW) model was comprehensively evaluated against six benchmark classifiers (standard Logistic Regression, L1- and Elastic-Net-regularized logistic models, Random Forest, Gradient Boosting, and a balanced SGD baseline) across four real-world biomedical datasets. Figure 9 and Figure 10 summarize the comparative outcomes in terms of calibration and discrimination, respectively, while Tables 5–19 report dataset-specific numerical results and posterior feature weight analyses. Figure 9 and Figure 10 each comprise four panels: (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease datasets.
Across all datasets, Bayes_FW achieved superior or comparable predictive accuracy while demonstrating the most stable probability calibration. In Figure 10, which presents normalized discrimination metrics (AUC, PRAUC, F1, and Accuracy), Bayes_FW consistently attains the highest normalized scores across panels (a–d), confirming its strong and robust ranking. This advantage is most pronounced for the Breast Cancer Wisconsin and Cleveland Heart Disease datasets, where AUC and PRAUC exceed 0.90 and F1-scores remain near or above 0.80. Regularized logistic models (Logistic_EN and Logistic_L1) follow closely, whereas tree-based ensembles show greater variability, particularly under smaller sample sizes.
Figure 9 reports normalized calibration metrics (LogLoss, Brier, ECE, and MCE). Bayes_FW yields the lowest average losses across all four datasets, indicating more reliable uncertainty estimation and reduced overconfidence. In the Breast Cancer and Pima Indians Diabetes datasets (Figure 9a,b), Bayes_FW achieves the smallest LogLoss and Brier scores, confirming accurate probability estimation. Even under noisier or more heterogeneous settings (South African Heart Disease and Cleveland datasets; Figure 9c,d), calibration remains stable with minimal degradation, unlike Gradient Boosting, which exhibits larger ECE/MCE values.
Numerically, Bayes_FW shows consistent gains across datasets. For the Breast Cancer Wisconsin dataset, Bayes_FW achieves the highest AUC (0.9975) and F1 (0.9587), together with the lowest LogLoss (0.0912) and Brier score (0.0243), reflecting near-perfect discrimination and excellent calibration. For the Pima Indians Diabetes dataset, Bayes_FW attains the lowest LogLoss (0.340) and the smallest ECE (0.070), improving calibration over the best logistic baseline. For the South African Heart Disease dataset, Bayes_FW maintains balanced discrimination (AUC = 0.790) and competitive calibration (LogLoss = 0.525), outperforming both regularized logistic models and ensemble learners on average. Finally, for the Cleveland Heart Disease dataset, Bayes_FW attains the highest AUC (0.908) and the smallest LogLoss (0.337), demonstrating reliable predictive power and robust uncertainty quantification in a mixed categorical–continuous setting.
Posterior feature weight analyses (Table 7, Table 11, Table 15 and Table 19) reveal coherent, domain-consistent patterns. Dominant predictors such as Bare nuclei (breast cancer), glucose and BMI (diabetes), age and family history (cardiovascular risk), and chest-pain/angiographic indicators (Cleveland dataset) receive the highest posterior means with credible intervals that remain well separated from less informative variables. The credible interval visualizations (Figure 5, Figure 6, Figure 7 and Figure 8) further confirm that Bayes_FW not only ranks key features effectively but also provides uncertainty bounds that quantify their relative stability.
Overall, Figure 9 and Figure 10 demonstrate that Bayes_FW achieves a strong balance of high discrimination and superior calibration across diverse biomedical domains. It produces well-calibrated probability estimates, stable performance under noise and imbalance, and interpretable uncertainty-aware feature importance, supporting its use as a robust and generalizable probabilistic learning framework.
To further strengthen the interpretability of the proposed Bayesian feature weighting framework, the estimated feature importance results are examined in conjunction with established clinical domain knowledge. In the case of the Pima Indians Diabetes dataset, the model consistently assigns higher posterior weights to variables such as plasma glucose concentration, body mass index (BMI), and age. These variables are well-documented in the medical literature as primary risk factors for the development of type 2 diabetes, thereby providing strong external validation for the model’s findings.
More specifically, elevated plasma glucose levels are directly indicative of impaired glucose metabolism, which is a defining characteristic of diabetes. Similarly, higher BMI values are associated with obesity-related insulin resistance, while increasing age is known to correlate with a higher prevalence of metabolic disorders. The alignment of these clinically meaningful variables with high posterior feature weights suggests that the proposed model is not only statistically effective but also capable of capturing medically relevant patterns in the data.
In addition, features such as insulin levels, skin thickness, and blood pressure receive moderate importance scores. Although their individual effects may be less pronounced or more variable across patients, these variables are still recognized as contributing factors in the broader pathophysiology of diabetes. Their inclusion among the relevant predictors further supports the model’s ability to reflect complex, multifactorial relationships inherent in medical data.
Importantly, the simplex-constrained weighting structure enables a clear and interpretable ranking of features, while the incorporation of shrinkage priors mitigates the influence of noisy or less informative variables. This combination allows the model to balance sparsity and flexibility, leading to feature importance estimates that are both stable and clinically meaningful.
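As an illustration of how a simplex constraint and a shrinkage prior interact, the sketch below draws one set of parameters from a generic prior of this kind. It assumes a symmetric Dirichlet prior on the weights and a horseshoe-type prior (half-Cauchy local scales with a global scale τ) on the coefficients; these specific choices, the function name `draw_prior`, and the hyperparameter values are assumptions for illustration and may differ from the model specification given earlier in the paper.

```python
import numpy as np


def draw_prior(n_features, alpha=1.0, tau=1.0, seed=0):
    """Draw simplex weights and shrunken coefficients from an assumed prior."""
    rng = np.random.default_rng(seed)
    # Symmetric Dirichlet: weights are non-negative and sum to one.
    w = rng.dirichlet(np.full(n_features, alpha))
    # Half-Cauchy local scales (assumed horseshoe-type shrinkage).
    lam = np.abs(rng.standard_cauchy(n_features))
    # Coefficients are pulled towards zero unless the local scale is large.
    beta = rng.normal(0.0, tau * lam)
    return w, beta


w, beta = draw_prior(n_features=8, alpha=0.5, tau=0.1)
print("weights:", np.round(w, 3), "sum =", round(float(w.sum()), 6))
print("coefficients:", np.round(beta, 3))
```

A concentration below one (here `alpha=0.5`) favors draws in which a few features carry most of the weight, while a small global scale keeps most coefficients close to zero.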
Overall, the strong agreement between the model-derived feature importance and established medical evidence enhances the credibility of the proposed approach and highlights its potential utility in real-world healthcare applications, where interpretability and domain consistency are essential.
In addition to the empirical results, it is important to position the proposed model within the broader context of state-of-the-art Bayesian robust modeling approaches. Compared to existing methods such as Bayesian logistic regression with heavy-tailed priors and Dirichlet process–based models, the proposed framework provides a unified mechanism that simultaneously achieves feature weighting, robustness to contamination, and interpretability through the simplex constraint. Whereas heavy-tailed priors primarily address outliers and nonparametric Bayesian approaches focus on distributional flexibility, our method integrates these aspects with an explicit feature importance structure.
Despite these advantages, several limitations should be acknowledged. First, the assumption of a symmetric contamination mechanism may not fully capture class-dependent noise patterns commonly observed in medical datasets. Second, the computational cost associated with HMC-based inference can become significant in high-dimensional settings. Third, the simplex constraint, while improving interpretability, may introduce additional geometric complexity in posterior sampling.
These limitations suggest several promising directions for future research. Extending the model to asymmetric or class-dependent noise structures would improve its applicability in real-world clinical settings. Additionally, scalable inference techniques such as variational approximations or stochastic gradient-based methods could be explored to enhance computational efficiency. Finally, integrating the proposed framework with nonparametric priors or deep learning architectures may further improve flexibility and predictive performance.
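The distinction between symmetric and class-dependent label noise can be made concrete with a standard label-flip mixture. The sketch below is an illustration under stated assumptions, not the likelihood of the proposed model: `eps_01` and `eps_10` are hypothetical flip rates for true negatives and true positives, and the symmetric case corresponds to setting them equal.

```python
import numpy as np


def noisy_label_prob(p_clean, eps_01, eps_10):
    """P(observed y = 1) under label flipping.

    eps_01: probability that a true 0 is recorded as 1 (assumed parameter).
    eps_10: probability that a true 1 is recorded as 0 (assumed parameter).
    """
    p_clean = np.asarray(p_clean, dtype=float)
    return (1.0 - eps_10) * p_clean + eps_01 * (1.0 - p_clean)


p = np.array([0.05, 0.50, 0.95])   # clean class-1 probabilities
symmetric = noisy_label_prob(p, eps_01=0.10, eps_10=0.10)
asymmetric = noisy_label_prob(p, eps_01=0.02, eps_10=0.20)

print("symmetric: ", symmetric)
print("asymmetric:", asymmetric)
```

Under symmetric noise, the observed probabilities are compressed equally towards 0.5 from both ends. Under class-dependent noise, the compression is one-sided, which is the situation the symmetric assumption cannot represent.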

6. Conclusions

This study introduced a novel Bayesian Feature Weighting (Bayes FW) framework that fundamentally redefines how feature importance is modeled and interpreted in the presence of noisy medical data. Unlike conventional deterministic feature weighting techniques and existing Bayesian approaches that treat robustness, shrinkage, and uncertainty in isolation, the proposed model unifies simplex-constrained global feature weighting, hierarchical shrinkage priors, and contamination-aware noise modeling within a single coherent probabilistic framework. This integration constitutes a key methodological contribution, enabling globally normalized and interpretable feature weights while explicitly controlling the influence of mislabeled observations and aberrant feature values.
Through extensive empirical evaluation on four benchmark medical datasets (Breast Cancer Wisconsin, Pima Indians Diabetes, South African Heart Disease, and Cleveland Heart Disease), the proposed framework consistently demonstrated superior performance over classical logistic regression variants and ensemble-based learners. Importantly, the gains were not limited to discrimination metrics such as AUC, F1, and accuracy, but extended to probabilistic calibration measures, including LogLoss, Brier score, and Expected Calibration Error (ECE). These results underscore the ability of the proposed Bayesian formulation to deliver reliable probability estimates, which are essential for risk-sensitive medical decision making.
Beyond predictive performance, a central advantage of the Bayes FW framework lies in its ability to produce uncertainty-aware and globally interpretable feature-importance estimates. By constraining feature weights to the probability simplex and modeling them probabilistically, the proposed approach provides a principled representation of global feature relevance that remains stable under data contamination. The posterior distributions of feature weights enable transparent quantification of uncertainty, offering clinically meaningful insights into variable importance rather than relying on fragile point estimates.
The feature importance results obtained from the proposed model are consistent with established clinical evidence. For instance, in the Pima Indians Diabetes dataset, glucose concentration emerges as the most influential predictor, which aligns with its central role in diabetes diagnosis and progression. Body mass index (BMI) and age are also identified as key contributors, reflecting their well-documented association with metabolic risk. This agreement between the model outputs and domain knowledge provides additional validation of the proposed framework and highlights its potential for interpretable and clinically relevant machine learning applications.
Collectively, these findings demonstrate that incorporating Bayesian inference into feature weighting, when combined with simplex-based normalization and contamination-aware priors, yields a robust, interpretable, and uncertainty-aware modeling paradigm for medical classification. The proposed framework bridges a critical methodological gap between robust Bayesian learning and interpretable feature weighting, offering a scalable and theoretically grounded solution for noisy biomedical data. Future research directions include extending the proposed model to multiclass and longitudinal outcomes, incorporating structured or group-wise priors to capture hierarchical biomedical relationships, and integrating the framework with Bayesian deep learning architectures. Such extensions would further enhance the applicability of the proposed approach to complex, high-dimensional clinical data environments where robustness, interpretability, and uncertainty quantification are simultaneously required.

Author Contributions

Conceptualization, M.A.C.; methodology, M.A.C. and Z.Ö.; software, M.A.C.; validation, M.A.C. and A.A.; formal analysis, M.A.C.; investigation, M.A.C. and Z.Ö.; resources, M.A.C.; data curation, M.A.C.; writing—original draft preparation, M.A.C. and A.A.; writing—review and editing, M.A.C., Z.Ö. and A.A.; visualization, A.A.; supervision, M.A.C.; project administration, M.A.C.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2601).

Institutional Review Board Statement

Ethical approval was waived because this study is based entirely on the analysis of a publicly available dataset. The authors did not collect any primary data from human participants, nor did they have any direct contact with patients or their medical records.

Informed Consent Statement

Informed consent was waived due to the retrospective nature of the study.

Data Availability Statement

The datasets used and analyzed during the current study are publicly available on Zenodo at DOI: https://doi.org/10.5281/zenodo.17559308.

Acknowledgments

The authors confirm that ChatGPT 5.2 (OpenAI) was used exclusively to assist with English language editing. The tool was not involved in creating original scientific content, analyzing data, or drawing conclusions. All research findings and interpretations are entirely the work of the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Simulation Study: Additional Performance Tables and Figures

Table A1. Discrimination metrics by model and scenario: Mean and standard deviation (SD) for AUC, PRAUC, F1, and ACC across eight scenarios and all models.
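| Scenario | Model | AUC Mean | AUC SD | PRAUC Mean | PRAUC SD | F1 Mean | F1 SD | ACC Mean | ACC SD |
|---|---|---|---|---|---|---|---|---|---|
| S1 | GradBoost | 0.736 | 0.051 | 0.731 | 0.065 | 0.647 | 0.067 | 0.668 | 0.039 |
| S1 | Logistic | 0.797 | 0.041 | 0.786 | 0.061 | 0.718 | 0.068 | 0.727 | 0.048 |
| S1 | Logistic_EN | 0.799 | 0.037 | 0.790 | 0.059 | 0.717 | 0.069 | 0.725 | 0.046 |
| S1 | Logistic_L1 | 0.801 | 0.035 | 0.793 | 0.055 | 0.719 | 0.069 | 0.728 | 0.048 |
| S1 | RandomForest | 0.768 | 0.041 | 0.758 | 0.058 | 0.676 | 0.080 | 0.695 | 0.060 |
| S1 | SGD_Balanced | 0.753 | 0.035 | 0.741 | 0.075 | 0.650 | 0.047 | 0.658 | 0.031 |
| S1 | Bayes_FW | 0.820 | 0.052 | 0.831 | 0.042 | 0.725 | 0.062 | 0.738 | 0.061 |
| S2 | GradBoost | 0.636 | 0.103 | 0.637 | 0.098 | 0.594 | 0.086 | 0.613 | 0.068 |
| S2 | Logistic | 0.673 | 0.106 | 0.671 | 0.102 | 0.587 | 0.117 | 0.612 | 0.100 |
| S2 | Logistic_EN | 0.679 | 0.107 | 0.676 | 0.102 | 0.592 | 0.140 | 0.615 | 0.114 |
| S2 | Logistic_L1 | 0.679 | 0.107 | 0.678 | 0.106 | 0.601 | 0.137 | 0.620 | 0.115 |
| S2 | RandomForest | 0.657 | 0.084 | 0.647 | 0.083 | 0.602 | 0.088 | 0.625 | 0.073 |
| S2 | SGD_Balanced | 0.590 | 0.064 | 0.572 | 0.070 | 0.573 | 0.047 | 0.578 | 0.049 |
| S2 | Bayes_FW | 0.861 | 0.043 | 0.873 | 0.038 | 0.786 | 0.049 | 0.787 | 0.050 |
| S3 | GradBoost | 0.585 | 0.083 | 0.603 | 0.086 | 0.536 | 0.068 | 0.558 | 0.065 |
| S3 | Logistic | 0.593 | 0.057 | 0.609 | 0.062 | 0.592 | 0.058 | 0.577 | 0.045 |
| S3 | Logistic_EN | 0.608 | 0.064 | 0.628 | 0.060 | 0.587 | 0.072 | 0.573 | 0.066 |
| S3 | Logistic_L1 | 0.622 | 0.070 | 0.647 | 0.067 | 0.597 | 0.067 | 0.595 | 0.063 |
| S3 | RandomForest | 0.579 | 0.078 | 0.607 | 0.082 | 0.541 | 0.070 | 0.542 | 0.046 |
| S3 | SGD_Balanced | 0.566 | 0.040 | 0.545 | 0.034 | 0.576 | 0.056 | 0.567 | 0.037 |
| S3 | Bayes_FW | 0.709 | 0.088 | 0.711 | 0.089 | 0.649 | 0.073 | 0.657 | 0.069 |
| S4 | GradBoost | 0.572 | 0.095 | 0.583 | 0.076 | 0.570 | 0.088 | 0.565 | 0.070 |
| S4 | Logistic | 0.620 | 0.072 | 0.615 | 0.046 | 0.607 | 0.059 | 0.593 | 0.064 |
| S4 | Logistic_EN | 0.621 | 0.072 | 0.614 | 0.045 | 0.602 | 0.061 | 0.595 | 0.061 |
| S4 | Logistic_L1 | 0.621 | 0.072 | 0.611 | 0.043 | 0.608 | 0.058 | 0.600 | 0.057 |
| S4 | RandomForest | 0.567 | 0.088 | 0.578 | 0.050 | 0.573 | 0.082 | 0.570 | 0.064 |
| S4 | SGD_Balanced | 0.598 | 0.058 | 0.583 | 0.074 | 0.604 | 0.044 | 0.570 | 0.044 |
| S4 | Bayes_FW | 0.677 | 0.072 | 0.699 | 0.059 | 0.611 | 0.075 | 0.631 | 0.048 |
| S5 | GradBoost | 0.643 | 0.061 | 0.634 | 0.069 | 0.602 | 0.082 | 0.607 | 0.053 |
| S5 | Logistic | 0.736 | 0.088 | 0.739 | 0.104 | 0.647 | 0.092 | 0.655 | 0.095 |
| S5 | Logistic_EN | 0.740 | 0.090 | 0.745 | 0.104 | 0.653 | 0.095 | 0.660 | 0.096 |
| S5 | Logistic_L1 | 0.743 | 0.091 | 0.748 | 0.105 | 0.658 | 0.093 | 0.663 | 0.095 |
| S5 | RandomForest | 0.636 | 0.061 | 0.615 | 0.072 | 0.589 | 0.066 | 0.595 | 0.047 |
| S5 | SGD_Balanced | 0.684 | 0.057 | 0.656 | 0.082 | 0.629 | 0.060 | 0.640 | 0.057 |
| S5 | Bayes_FW | 0.767 | 0.066 | 0.803 | 0.057 | 0.711 | 0.046 | 0.677 | 0.071 |
| S6 | GradBoost | 0.512 | 0.096 | 0.395 | 0.084 | 0.254 | 0.099 | 0.575 | 0.064 |
| S6 | Logistic | 0.596 | 0.081 | 0.435 | 0.064 | 0.416 | 0.117 | 0.592 | 0.074 |
| S6 | Logistic_EN | 0.595 | 0.075 | 0.440 | 0.054 | 0.431 | 0.119 | 0.608 | 0.060 |
| S6 | Logistic_L1 | 0.596 | 0.079 | 0.441 | 0.051 | 0.429 | 0.105 | 0.607 | 0.053 |
| S6 | RandomForest | 0.543 | 0.068 | 0.406 | 0.063 | 0.086 | 0.100 | 0.650 | 0.054 |
| S6 | SGD_Balanced | 0.572 | 0.090 | 0.392 | 0.078 | 0.448 | 0.116 | 0.598 | 0.086 |
| S6 | Bayes_FW | 0.652 | 0.073 | 0.527 | 0.124 | 0.524 | 0.227 | 0.689 | 0.056 |
| S7 | GradBoost | 0.625 | 0.088 | 0.285 | 0.076 | 0.125 | 0.056 | 0.798 | 0.029 |
| S7 | Logistic | 0.647 | 0.067 | 0.331 | 0.085 | 0.316 | 0.082 | 0.779 | 0.028 |
| S7 | Logistic_EN | 0.643 | 0.069 | 0.336 | 0.090 | 0.329 | 0.064 | 0.785 | 0.025 |
| S7 | Logistic_L1 | 0.637 | 0.070 | 0.338 | 0.091 | 0.320 | 0.080 | 0.789 | 0.033 |
| S7 | RandomForest | 0.612 | 0.074 | 0.274 | 0.070 | 0.000 | 0.000 | 0.782 | 0.008 |
| S7 | SGD_Balanced | 0.618 | 0.052 | 0.253 | 0.043 | 0.346 | 0.060 | 0.715 | 0.040 |
| S7 | Bayes_FW | 0.656 | 0.079 | 0.344 | 0.094 | 0.479 | 0.018 | 0.810 | 0.014 |
| S8 | GradBoost | 0.615 | 0.042 | 0.535 | 0.052 | 0.473 | 0.067 | 0.609 | 0.031 |
| S8 | Logistic | 0.518 | 0.047 | 0.446 | 0.038 | 0.459 | 0.050 | 0.527 | 0.038 |
| S8 | Logistic_EN | 0.518 | 0.051 | 0.448 | 0.043 | 0.453 | 0.055 | 0.523 | 0.048 |
| S8 | Logistic_L1 | 0.522 | 0.051 | 0.452 | 0.044 | 0.451 | 0.064 | 0.517 | 0.052 |
| S8 | RandomForest | 0.648 | 0.059 | 0.570 | 0.050 | 0.384 | 0.103 | 0.634 | 0.025 |
| S8 | SGD_Balanced | 0.516 | 0.027 | 0.431 | 0.023 | 0.460 | 0.044 | 0.528 | 0.020 |
| S8 | Bayes_FW | 0.694 | 0.049 | 0.649 | 0.066 | 0.575 | 0.027 | 0.679 | 0.044 |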
Table A2. Calibration metrics by model and scenario: Mean and standard deviation (SD) for LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Gap (Max CalibGap) across eight experimental scenarios and all evaluated models.
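| Scenario | Model | LogLoss Mean | LogLoss SD | Brier Mean | Brier SD | ECE Mean | ECE SD | Max CalibGap Mean | Max CalibGap SD |
|---|---|---|---|---|---|---|---|---|---|
| S1 | GradBoost | 0.777 | 0.083 | 0.241 | 0.026 | 0.223 | 0.042 | 0.651 | 0.147 |
| S1 | Logistic | 0.594 | 0.081 | 0.197 | 0.028 | 0.165 | 0.042 | 0.444 | 0.108 |
| S1 | Logistic_EN | 0.585 | 0.076 | 0.194 | 0.027 | 0.155 | 0.028 | 0.489 | 0.132 |
| S1 | Logistic_L1 | 0.577 | 0.071 | 0.192 | 0.025 | 0.143 | 0.034 | 0.471 | 0.160 |
| S1 | RandomForest | 0.589 | 0.028 | 0.201 | 0.013 | 0.120 | 0.033 | 0.307 | 0.188 |
| S1 | SGD_Balanced | 5.729 | 1.102 | 0.331 | 0.030 | 0.337 | 0.029 | 0.724 | 0.183 |
| S1 | Bayes_FW | 0.525 | 0.060 | 0.175 | 0.025 | 0.093 | 0.022 | 0.341 | 0.170 |
| S2 | GradBoost | 0.828 | 0.161 | 0.272 | 0.051 | 0.220 | 0.065 | 0.532 | 0.135 |
| S2 | Logistic | 0.666 | 0.097 | 0.234 | 0.044 | 0.179 | 0.048 | 0.546 | 0.256 |
| S2 | Logistic_EN | 0.658 | 0.095 | 0.231 | 0.043 | 0.172 | 0.048 | 0.550 | 0.240 |
| S2 | Logistic_L1 | 0.652 | 0.092 | 0.230 | 0.041 | 0.178 | 0.048 | 0.504 | 0.233 |
| S2 | RandomForest | 0.654 | 0.058 | 0.231 | 0.026 | 0.122 | 0.052 | 0.295 | 0.059 |
| S2 | SGD_Balanced | 7.794 | 1.621 | 0.410 | 0.046 | 0.416 | 0.045 | 0.707 | 0.200 |
| S2 | Bayes_FW | 0.483 | 0.053 | 0.155 | 0.022 | 0.133 | 0.033 | 0.359 | 0.113 |
| S3 | GradBoost | 0.909 | 0.157 | 0.303 | 0.048 | 0.265 | 0.069 | 0.611 | 0.177 |
| S3 | Logistic | 1.436 | 0.259 | 0.353 | 0.044 | 0.347 | 0.048 | 0.607 | 0.128 |
| S3 | Logistic_EN | 1.257 | 0.255 | 0.333 | 0.050 | 0.326 | 0.062 | 0.623 | 0.148 |
| S3 | Logistic_L1 | 1.161 | 0.257 | 0.319 | 0.052 | 0.301 | 0.070 | 0.605 | 0.162 |
| S3 | RandomForest | 0.683 | 0.022 | 0.245 | 0.011 | 0.069 | 0.026 | 0.295 | 0.099 |
| S3 | SGD_Balanced | 14.321 | 1.312 | 0.433 | 0.036 | 0.434 | 0.035 | 0.451 | 0.033 |
| S3 | Bayes_FW | 0.627 | 0.065 | 0.218 | 0.029 | 0.127 | 0.038 | 0.286 | 0.154 |
| S4 | GradBoost | 0.867 | 0.130 | 0.296 | 0.044 | 0.242 | 0.051 | 0.592 | 0.118 |
| S4 | Logistic | 0.726 | 0.064 | 0.256 | 0.023 | 0.186 | 0.039 | 0.471 | 0.143 |
| S4 | Logistic_EN | 0.716 | 0.060 | 0.253 | 0.022 | 0.174 | 0.038 | 0.446 | 0.180 |
| S4 | Logistic_L1 | 0.707 | 0.055 | 0.251 | 0.021 | 0.154 | 0.033 | 0.419 | 0.078 |
| S4 | RandomForest | 0.692 | 0.028 | 0.249 | 0.013 | 0.117 | 0.035 | 0.411 | 0.208 |
| S4 | SGD_Balanced | 7.482 | 1.754 | 0.416 | 0.039 | 0.421 | 0.039 | 0.687 | 0.191 |
| S4 | Bayes_FW | 0.653 | 0.043 | 0.230 | 0.020 | 0.103 | 0.048 | 0.279 | 0.149 |
| S5 | GradBoost | 0.863 | 0.119 | 0.279 | 0.035 | 0.238 | 0.065 | 0.525 | 0.129 |
| S5 | Logistic | 0.623 | 0.122 | 0.215 | 0.046 | 0.174 | 0.056 | 0.400 | 0.107 |
| S5 | Logistic_EN | 0.613 | 0.116 | 0.212 | 0.045 | 0.145 | 0.056 | 0.473 | 0.197 |
| S5 | Logistic_L1 | 0.605 | 0.111 | 0.210 | 0.043 | 0.150 | 0.052 | 0.445 | 0.186 |
| S5 | RandomForest | 0.672 | 0.030 | 0.238 | 0.014 | 0.115 | 0.053 | 0.527 | 0.283 |
| S5 | SGD_Balanced | 6.173 | 1.781 | 0.348 | 0.062 | 0.355 | 0.061 | 0.641 | 0.191 |
| S5 | Bayes_FW | 0.592 | 0.055 | 0.203 | 0.025 | 0.109 | 0.029 | 0.306 | 0.133 |
| S6 | GradBoost | 0.915 | 0.152 | 0.298 | 0.043 | 0.260 | 0.049 | 0.733 | 0.142 |
| S6 | Logistic | 1.302 | 0.226 | 0.321 | 0.054 | 0.319 | 0.068 | 0.722 | 0.080 |
| S6 | Logistic_EN | 1.174 | 0.176 | 0.310 | 0.047 | 0.294 | 0.062 | 0.700 | 0.137 |
| S6 | Logistic_L1 | 1.060 | 0.165 | 0.300 | 0.042 | 0.283 | 0.061 | 0.697 | 0.097 |
| S6 | RandomForest | 0.642 | 0.032 | 0.225 | 0.015 | 0.092 | 0.038 | 0.363 | 0.223 |
| S6 | SGD_Balanced | 13.047 | 2.792 | 0.400 | 0.087 | 0.401 | 0.087 | 0.644 | 0.124 |
| S6 | Bayes_FW | 0.615 | 0.036 | 0.212 | 0.017 | 0.057 | 0.044 | 0.220 | 0.180 |
| S7 | GradBoost | 0.518 | 0.066 | 0.158 | 0.021 | 0.114 | 0.033 | 0.707 | 0.177 |
| S7 | Logistic | 0.716 | 0.129 | 0.181 | 0.025 | 0.164 | 0.034 | 0.733 | 0.151 |
| S7 | Logistic_EN | 0.651 | 0.112 | 0.175 | 0.024 | 0.155 | 0.039 | 0.760 | 0.169 |
| S7 | Logistic_L1 | 0.587 | 0.093 | 0.167 | 0.023 | 0.143 | 0.035 | 0.735 | 0.177 |
| S7 | RandomForest | 0.459 | 0.018 | 0.143 | 0.006 | 0.046 | 0.018 | 0.182 | 0.100 |
| S7 | SGD_Balanced | 8.647 | 1.466 | 0.281 | 0.039 | 0.282 | 0.039 | 0.731 | 0.068 |
| S7 | Bayes_FW | 0.446 | 0.031 | 0.137 | 0.011 | 0.036 | 0.033 | 0.117 | 0.219 |
| S8 | GradBoost | 0.726 | 0.052 | 0.254 | 0.016 | 0.148 | 0.026 | 0.525 | 0.251 |
| S8 | Logistic | 1.819 | 0.200 | 0.394 | 0.034 | 0.375 | 0.042 | 0.626 | 0.072 |
| S8 | Logistic_EN | 1.601 | 0.165 | 0.385 | 0.035 | 0.365 | 0.047 | 0.562 | 0.102 |
| S8 | Logistic_L1 | 1.439 | 0.139 | 0.374 | 0.034 | 0.351 | 0.039 | 0.570 | 0.085 |
| S8 | RandomForest | 0.652 | 0.022 | 0.230 | 0.010 | 0.072 | 0.021 | 0.273 | 0.144 |
| S8 | SGD_Balanced | 15.336 | 0.775 | 0.471 | 0.020 | 0.473 | 0.020 | 0.601 | 0.108 |
| S8 | Bayes_FW | 0.624 | 0.023 | 0.216 | 0.011 | 0.044 | 0.029 | 0.205 | 0.100 |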
Figure A1. Breast Cancer Wisconsin Features: Joint posterior samples of the feature weights w_j and regression coefficients β_j for selected predictors in the Breast Cancer Wisconsin dataset. Each point represents one posterior draw obtained from the HMC–NUTS sampler. The absence of ridge-like structures indicates that the parameters are not affected by multiplicative non-identifiability.
Figure A2. Posterior distributions of key global parameters (α, τ, ϵ) for the Pima Indians Diabetes dataset. The x-axis represents the parameter values, while the y-axis shows the corresponding posterior density estimated from the MCMC samples. The shaded blue regions depict these densities, illustrating the uncertainty associated with each parameter and the range of plausible values given the observed data.
Figure A3. Effective sample size ratios (N_eff/N) for all model parameters, indicating sampling efficiency. The x-axis represents the effective sample size ratio N_eff/N, which quantifies the sampling efficiency of the MCMC algorithm. The y-axis corresponds to the ordered model parameters, sorted according to their N_eff/N values, and is used for visualization purposes only.
Figure A4. Pairwise joint posterior distributions of global parameters (α, τ, ϵ), illustrating dependency structure.
Figure A5. Posterior predictive check comparing observed and model-predicted outcome distributions.
Figure A6. Potential scale reduction factors (R̂) for all parameters, demonstrating MCMC convergence. The x-axis displays the potential scale reduction factor (R̂), a convergence diagnostic that compares within-chain and between-chain variability. The y-axis denotes the ordered indices of the model parameters after sorting them by their R̂ values, providing a visual summary of convergence across all parameters. The solid vertical line at R̂ = 1 indicates the ideal case of perfect convergence, where within-chain and between-chain variances are equal. The vertical dashed line marks a commonly used convergence threshold; values below this threshold suggest satisfactory convergence of the MCMC chains, whereas values above it may indicate lack of convergence or insufficient mixing.
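As a rough illustration of the diagnostic plotted in Figure A6, the following sketch computes the classic (non-split) potential scale reduction factor for a single parameter from several chains. It is a textbook version under the assumption of equal-length chains; samplers such as Stan report a rank-normalized split variant, so the values need not coincide with those in the figure. The function name is illustrative.

```python
import numpy as np


def potential_scale_reduction(chains):
    """Classic R-hat for one parameter; `chains` has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()    # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)         # B: between-chain variance
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))


rng = np.random.default_rng(1)
well_mixed = rng.normal(size=(4, 1000))                 # chains share one target
stuck = well_mixed + 3.0 * np.arange(4)[:, None]        # chains centred apart

print("well mixed:", potential_scale_reduction(well_mixed))
print("stuck:     ", potential_scale_reduction(stuck))
```

Chains that explore the same distribution give a value close to one, whereas chains centred at different locations inflate the between-chain term and push the value well above one.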
Figure A7. Trace plots of selected regression coefficients (β) across MCMC chains.
Figure A8. Trace plots of global parameters (α, τ, ϵ) showing mixing behavior across chains.
Figure A9. Trace plots of feature weights (w) across MCMC chains.

References

  1. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186. [Google Scholar] [CrossRef]
  2. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422. [Google Scholar] [CrossRef]
  3. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  4. Lange, K.; Little, R.J.A.; Taylor, J.M.G. Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 1989, 84, 881–896. [Google Scholar] [CrossRef]
  5. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  6. Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with noisy labels. Adv. Neural Inf. Process. Syst. 2013, 26, 1196–1204. [Google Scholar]
  7. Carvalho, C.M.; Polson, N.G.; Scott, J.G. The horseshoe estimator for sparse signals. Biometrika 2010, 97, 465–480. [Google Scholar] [CrossRef]
  8. Mitchell, T.J.; Beauchamp, J.J. Bayesian variable selection in linear regression. J. Am. Stat. Assoc. 1988, 83, 1023–1032. [Google Scholar] [CrossRef]
  9. Raykar, V.C.; Yu, S.; Zhao, L.H.; Valadez, G.H.; Florin, C.; Bogoni, L.; Moy, L. Learning from crowds. J. Mach. Learn. Res. 2010, 11, 1297–1322. [Google Scholar]
  10. Polson, N.G.; Scott, J.G. On the half-Cauchy prior for a global scale parameter. Bayesian Anal. 2012, 7, 887–902. [Google Scholar] [CrossRef]
  11. Hjort, N.L.; Holmes, C.; Müller, P.; Walker, S.G. (Eds.) Bayesian Nonparametrics; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar] [CrossRef]
  12. Tran, L.; Yin, X.; Liu, X. Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1283–1292. [Google Scholar]
  13. Sun, Z.; Wu, J.; Li, X.; Yang, W.; Xue, J.H. Amortized Bayesian prototype meta-learning: A new probabilistic meta-learning approach to few-shot image classification. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Virtual, 13–15 April 2021. [Google Scholar]
  14. Kabán, A. On Bayesian classification with Laplace priors. Pattern Recognit. Lett. 2007, 28, 1271–1282. [Google Scholar] [CrossRef]
  15. Foroughi pour, A.; Dalton, L.A. Optimal Bayesian feature filtering. J. Mach. Learn. Res. 2015, 16, 2869–2923. [Google Scholar]
  16. Zhao, J.; Zhang, X.; Yan, S. Learning to optimize domain specific normalization for domain generalization. In Proceedings of the European Conference on Computer Vision, Montreal, QC, Canada, 10 November 2021; pp. 68–85. [Google Scholar]
  17. Huang, J.; Qu, L.; Jia, R.; Zhao, B. O2U-Net: A simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, New Orleans, LA, USA, 18–24 June 2022; pp. 3326–3334. [Google Scholar]
  18. Fuchs, T.; Kalinke, F. Robust partial-label learning by leveraging class activation values. Mach. Learn. 2025, 114, 193. [Google Scholar] [CrossRef]
  19. Ni no-Adan, I.; Manjarres, D.; Landa-Torres, I.; Portillo, E. Feature weighting methods: A review. Expert Syst. Appl. 2021, 184, 115424. [Google Scholar] [CrossRef]
  20. Cappozzo, A.; Greselin, F.; Murphy, T.B. Anomaly and novelty detection for robust semi-supervised learning. Stat. Comput. 2020, 30, 1545–1571. [Google Scholar] [CrossRef]
  21. Zhuo, J.; Wang, S.; Zhang, W.; Huang, Q. Deep unsupervised convolutional anomaly detection. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 1058–1066. [Google Scholar]
  22. Dalton, L.A. Optimal Bayesian feature selection. IEEE Trans. Inf. Theory 2013, 59, 7336–7347. [Google Scholar]
  23. Neal, R.M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; Chapman & Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162. [Google Scholar]
  24. Hoffman, M.D.; Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014, 15, 1593–1623. [Google Scholar]
  25. Carpenter, B.; Gelman, A.; Hoffman, M.D.; Lee, D.; Goodrich, B.; Betancourt, M.; Riddell, A. Stan: A probabilistic programming language. J. Stat. Softw. 2017, 76, 1–32. [Google Scholar] [CrossRef]
  26. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 1953, 21, 1087–1092. [Google Scholar] [CrossRef]
  27. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109. [Google Scholar] [CrossRef]
  28. Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv 2017, arXiv:1701.02434. [Google Scholar]
  29. Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
  30. Murphy, A.H. A new vector partition of the probability score. J. Appl. Meteorol. 1973, 12, 595–600. [Google Scholar] [CrossRef]
  31. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), Bonn, Germany, 7–11 August 2005; pp. 625–632. [Google Scholar]
  32. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  33. Kumar, A.; Liang, P.S.; Ma, T. Verified uncertainty calibration. Adv. Neural Inf. Process. Syst. 2019, 32, 3792–3803. [Google Scholar]
  34. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2013. [Google Scholar]
  35. Vehtari, A.; Gelman, A.; Simpson, D.; Carpenter, B.; Bürkner, P.C. Rank-normalization, folding, and localization: An improved R^ for assessing convergence of MCMC (with discussion). Bayesian Anal. 2021, 16, 667–718. [Google Scholar] [CrossRef]
  36. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  37. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  38. Saito, T.; Rehmsmeier, M. The precision–recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  39. Naeini, M.P.; Cooper, G.F.; Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), Austin, TX, USA, 25–30 January 2015; pp. 2901–2907. [Google Scholar]
  40. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef]
  41. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Figure 1. Discrimination trend plots of model performance across eight experimental scenarios (S1–S8).
Figure 2. Calibration trend plots of model performance across eight experimental scenarios (S1–S8).
Figure 3. Comparative performance across models and data scenarios: Heat maps show mean performance metrics by model (rows) and scenario (columns) for area under the ROC curve (AUC), area under the precision–recall curve (PRAUC), F1-score, and accuracy. Cell color indicates standardized scores; cell numbers represent mean metric values. For more information, see Table A1.
Figure 4. Calibration performance across scenarios: Heat maps show mean performance metrics by model (rows) and scenario (columns) for LogLoss, Brier score, expected calibration error (ECE), and maximum calibration gap. Cell color indicates standardized scores; cell numbers represent mean metric values. For more information, see Table A2.
Figure 5. Feature importance estimates for the Breast Cancer Wisconsin dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.
Figure 6. Posterior mean weights and 90% credible intervals for feature importance: Feature-wise posterior means and 90% credible intervals estimated by the Bayesian Feature Weighting model for the Pima Indians Diabetes dataset.
Figure 7. Feature importance estimates for the SAHeart dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.
Figure 8. Feature importance estimates for the Cleveland dataset obtained from the Bayesian Feature Weighting model. Posterior mean weights are shown as dots, and 90% credible intervals are shown as horizontal lines based on the 5th and 95th posterior quantiles.
Figure 9. Calibration performance comparison (LogLoss, Brier, ECE, and MCE) of the Bayes_FW model against benchmark classifiers across four datasets: (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease.
Figure 10. Discrimination performance comparison (AUC, PRAUC, F1, and Accuracy) of the Bayes_FW model and benchmark classifiers across four datasets: (a) Breast Cancer Wisconsin, (b) Pima Indians Diabetes, (c) South African Heart Disease, and (d) Cleveland Heart Disease.
Table 1. Key notation for the Bayesian feature weighting model. Summary of symbols, domains, and interpretations used in the proposed model.
| Symbol | Support | Interpretation |
|---|---|---|
| $x_i$ | $\mathbb{R}^P$ | Covariate vector for observation $i$ |
| $y_i$ | $\{0, 1\}$ | Binary label for observation $i$ |
| $\eta_i$ | $\mathbb{R}$ | Linear predictor $\alpha_0 + (w \odot \beta)^\top x_i$ |
| $s_i$ | $(0, 1)$ | Baseline logistic probability $\sigma(\eta_i)$ |
| $p_i$ | $(0, 1)$ | Noise-adjusted success probability in Equation (2) |
| $\alpha_0$ | $\mathbb{R}$ | Global intercept |
| $\beta$ | $\mathbb{R}^P$ | Regression coefficients |
| $w$ | $\Delta^{P-1}$ | Simplex-constrained feature weights; global relevance of predictors |
| $\tau$ | $(0, \infty)$ | Global horseshoe scale parameter |
| $\lambda_j$ | $(0, \infty)$ | Local horseshoe scale for coefficient $\beta_j$ |
| $\varepsilon$ | $[0, 1]$ | Label-noise rate (probability of label flip) |
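As a rough illustration of the notation in Table 1, the sketch below evaluates the linear predictor, the baseline logistic probability, and a noise-adjusted probability for toy inputs. Equation (2) is not reproduced in this part of the article, so the symmetric label-flip form $p_i = (1 - \varepsilon) s_i + \varepsilon (1 - s_i)$ used here is an assumption, as is reading the linear predictor as an elementwise product of $w$ and $\beta$. The function name and all numeric values are hypothetical.

```python
import numpy as np


def noise_adjusted_probability(X, alpha0, w, beta, eps):
    """Return (eta, s, p) for each row of X under the assumed model."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("w must lie on the probability simplex")
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    eta = alpha0 + X @ (w * beta)          # linear predictor eta_i
    s = 1.0 / (1.0 + np.exp(-eta))         # baseline probability sigma(eta_i)
    p = (1.0 - eps) * s + eps * (1.0 - s)  # ASSUMED symmetric label-flip form
    return eta, s, p


# Hypothetical toy inputs: two observations, three features.
X = np.array([[1.0, -0.5, 2.0], [0.0, 0.0, 0.0]])
w = np.array([0.5, 0.3, 0.2])
beta = np.array([2.0, -1.0, 0.5])
eta, s, p = noise_adjusted_probability(X, alpha0=0.1, w=w, beta=beta, eps=0.1)
```

Under this assumed form and for $\varepsilon < 0.5$, every $p_i$ stays between $\varepsilon$ and $1 - \varepsilon$, so a single mislabeled observation contributes a bounded amount to the log-likelihood.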
Table 2. Simulation scenarios: Summary of the eight simulated data scenarios (S1–S8) representing diverse sources of classification difficulty.
| Scenario | Description | Key Characteristics |
|---|---|---|
| S1_basic | Baseline clean data | Moderate $n$, $m$; no noise/outliers |
| S2_corr | Correlated features | Strong within-block correlation ($\rho = 0.8$) |
| S3_highdim | High-dimensional | $m \gg n$ with sparsity |
| S4_lblnoise | Label noise | Label corruption ($\varepsilon = 0.1$) |
| S5_outlierX | Covariate outliers | Feature contamination ($\gamma = 0.1$) |
| S6_both | Noise and outliers | Joint label and covariate corruption |
| S7_imbalance | Class imbalance | Rare positive class ($\pi = 0.15$) |
| S8_hardmix | Hard mixed effects | Nonlinearity, heavy tails, and imbalance |
Table 3. Posterior diagnostics by scenario for the Bayes_FW simulations: Columns show the global intercept $\alpha_0$ and label-noise rate $\varepsilon$ (posterior means/SDs), convergence summaries for the Gelman–Rubin statistic $\hat{R}$ (mean, max, and percentage of parameters with $\hat{R} > 1.01$), and effective sample size $n_{\mathrm{eff}}$ (min/median).

| Scenario | $\alpha_0$ Mean | $\alpha_0$ SD | $\varepsilon$ Mean | $\varepsilon$ SD | $\hat{R}$ Mean | $\hat{R}$ Max | $\hat{R} > 1.01$ | $n_{\mathrm{eff}}$ Min | $n_{\mathrm{eff}}$ Med. |
|---|---|---|---|---|---|---|---|---|---|
| S1 | 0.12 | 0.21 | 0.03 | 0.02 | 1.00 | 1.01 | 2% | 1450 | 3900 |
| S2 | 0.08 | 0.25 | 0.02 | 0.02 | 1.00 | 1.01 | 3% | 1300 | 3600 |
| S3 | −0.04 | 0.29 | 0.01 | 0.01 | 1.01 | 1.02 | 6% | 780 | 2450 |
| S4 | 0.15 | 0.24 | 0.09 | 0.03 | 1.00 | 1.01 | 3% | 1250 | 3300 |
| S5 | 0.10 | 0.26 | 0.02 | 0.02 | 1.00 | 1.01 | 3% | 1180 | 3050 |
| S6 | 0.22 | 0.27 | 0.10 | 0.03 | 1.01 | 1.03 | 9% | 620 | 1980 |
| S7 | −0.18 | 0.23 | 0.05 | 0.02 | 1.00 | 1.02 | 4% | 1040 | 2900 |
| S8 | 0.05 | 0.31 | 0.12 | 0.04 | 1.01 | 1.03 | 8% | 540 | 1750 |
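To make the $\hat{R}$ columns of Table 3 concrete, the following sketch computes a classic (non-split) Gelman–Rubin statistic for one parameter and then the mean, maximum, and percentage above 1.01 over a set of parameters. This part of the article does not say which $\hat{R}$ variant or software was used, so the formula choice, the function names, and the simulated chains are assumptions for illustration only.

```python
import numpy as np


def gelman_rubin(chains):
    """Classic (non-split) R-hat for one parameter; chains has shape (m, n)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return float(np.sqrt(var_plus / W))


def summarize_rhat(rhats, threshold=1.01):
    """Mean, max, and percentage of parameters whose R-hat exceeds threshold."""
    rhats = np.asarray(rhats, dtype=float)
    return {
        "mean": float(rhats.mean()),
        "max": float(rhats.max()),
        "pct_above": float(100.0 * np.mean(rhats > threshold)),
    }


# Hypothetical chains: four well-mixed chains, then the same chains with one shifted.
rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))
stuck = mixed + np.array([[0.0], [0.0], [0.0], [2.0]])
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
summary = summarize_rhat([rhat_mixed, rhat_stuck])
```

Here the well-mixed chains give a value close to 1, while the shifted chain inflates the between-chain variance and pushes the statistic well above the 1.01 threshold.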
Table 4. Predictor variables in the Breast Cancer Wisconsin dataset: List of diagnostic cytological features used for tumor classification.
| Feature | Description (Biological Meaning) |
|---|---|
| Cl.thickness | Clump thickness (uniformity of cell thickness) |
| Cell.size | Uniformity of cell size |
| Cell.shape | Uniformity of cell shape |
| Marg.adhesion | Marginal adhesion of cells |
| Epith.c.size | Single epithelial cell size |
| Bare.nuclei | Presence of bare nuclei |
| Bl.cromatin | Bland chromatin (nuclear texture) |
| Normal.nucleoli | Normal nucleoli count |
| Mitoses | Number of mitoses (cell divisions) |
| Class (target) | Tumor diagnosis: malignant (1)/benign (0) |
Table 5. Discrimination and classification performance on the Breast Cancer Wisconsin dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.
| Model | AUC Mean | AUC SD | PRAUC Mean | PRAUC SD | F1 Mean | F1 SD | ACC Mean | ACC SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.9960 | 0.0032 | 0.9916 | 0.0077 | 0.9452 | 0.0098 | 0.9619 | 0.0063 |
| Logistic_L1 | 0.9957 | 0.0034 | 0.9909 | 0.0085 | 0.9470 | 0.0143 | 0.9634 | 0.0089 |
| Logistic | 0.9954 | 0.0038 | 0.9901 | 0.0100 | 0.9470 | 0.0143 | 0.9634 | 0.0089 |
| RandomForest | 0.9949 | 0.0030 | 0.9897 | 0.0065 | 0.9455 | 0.0090 | 0.9619 | 0.0063 |
| Bayes_FW | 0.9975 | 0.0026 | 0.9931 | 0.0043 | 0.9587 | 0.0103 | 0.9699 | 0.0051 |
| GradBoost | 0.9935 | 0.0035 | 0.9868 | 0.0088 | 0.9541 | 0.0160 | 0.9678 | 0.0112 |
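For readers who want to see how three of the discrimination metrics in Table 5 can be computed for a single run, here is a minimal NumPy sketch. It is not the authors' evaluation code: the 0.5 decision threshold for F1 and accuracy is an assumption, PRAUC is omitted, and the function names and toy data are hypothetical. Both classes must be present for the AUC to be defined.

```python
import numpy as np


def auc_score(y_true, y_prob):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def f1_and_accuracy(y_true, y_prob, threshold=0.5):
    """F1-score and accuracy after thresholding (threshold is an assumption)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    acc = float(np.mean(y_pred == y_true))
    return f1, acc


# Hypothetical toy labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, y_prob)
f1, acc = f1_and_accuracy(y_true, y_prob)
```

The pairwise AUC is quadratic in the sample size, which is acceptable for datasets of the size used here but would be replaced by a rank-based formula for large samples.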
Table 6. Calibration and loss performance on the Breast Cancer Wisconsin dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.
| Model | LogLoss Mean | LogLoss SD | Brier Mean | Brier SD | ECE Mean | ECE SD | MCE Mean | MCE SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.0948 | 0.0191 | 0.0255 | 0.0032 | 0.0368 | 0.0075 | 0.7282 | 0.0914 |
| Logistic_L1 | 0.0978 | 0.0247 | 0.0260 | 0.0036 | 0.0349 | 0.0085 | 0.6531 | 0.1419 |
| Logistic | 0.1030 | 0.0351 | 0.0268 | 0.0043 | 0.0344 | 0.0067 | 0.7672 | 0.0844 |
| RandomForest | 0.0922 | 0.0170 | 0.0265 | 0.0060 | 0.0365 | 0.0052 | 0.4977 | 0.1754 |
| Bayes_FW | 0.0912 | 0.0126 | 0.0243 | 0.0028 | 0.0321 | 0.0046 | 0.4177 | 0.1038 |
| GradBoost | 0.1178 | 0.0346 | 0.0292 | 0.0083 | 0.0348 | 0.0084 | 0.6767 | 0.2187 |
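The four calibration quantities reported in Table 6 can be sketched as follows. The binning scheme is not stated in this part of the article, so the use of 10 equal-width probability bins is an assumption, as are the clipping constant, the function name, and the toy data.

```python
import numpy as np


def calibration_report(y_true, y_prob, n_bins=10, clip=1e-15):
    """Log-loss, Brier score, ECE, and MCE with equal-width bins (assumed scheme)."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    p_safe = np.clip(p, clip, 1.0 - clip)  # avoid log(0)
    logloss = float(-np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe)))
    brier = float(np.mean((p - y) ** 2))
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(p, inner_edges)  # integer bin index in 0 .. n_bins - 1
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        gap = abs(y[in_bin].mean() - p[in_bin].mean())  # |observed rate - mean prediction|
        ece += in_bin.mean() * gap                       # weight by share of observations
        mce = max(mce, gap)                              # worst bin
    return {"logloss": logloss, "brier": brier, "ece": float(ece), "mce": float(mce)}


# Hypothetical toy data: the first observation is confidently misclassified.
report = calibration_report([1, 0, 1, 1], [0.05, 0.05, 0.95, 0.95])
```

Because MCE takes the maximum over bins, a sparsely populated bin can dominate it, whereas ECE weights each bin by its share of observations.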
Table 7. Posterior mean feature weights and 90% credible intervals estimated by the Bayes_FW model: Results are based on the Breast Cancer Wisconsin dataset. The 5% and 95% quantiles define the lower and upper credible interval bounds.
| Feature | $w$ Mean | 5% | 95% |
|---|---|---|---|
| Bare.nuclei | 0.1467 | 0.0297 | 0.3338 |
| Cl.thickness | 0.1339 | 0.0222 | 0.3233 |
| Mitoses | 0.1253 | 0.0073 | 0.3341 |
| Normal.nucleoli | 0.1076 | 0.0115 | 0.2676 |
| Cell.size | 0.1033 | 0.0086 | 0.2864 |
| Bl.cromatin | 0.1018 | 0.0114 | 0.2728 |
| Marg.adhesion | 0.0992 | 0.0092 | 0.2722 |
| Cell.shape | 0.0973 | 0.0073 | 0.2691 |
| Epith.c.size | 0.0849 | 0.0051 | 0.2323 |
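A table of this kind can be produced from posterior draws of the weight vector by taking column-wise means and the 5th and 95th percentiles. The sketch below uses Dirichlet random draws as a stand-in for MCMC output; the concentration parameters, feature names, and function name are hypothetical and are not taken from the fitted model.

```python
import numpy as np


def summarize_weights(draws, names):
    """Posterior mean and 90% interval per feature, sorted by decreasing mean."""
    draws = np.asarray(draws, dtype=float)             # shape (n_draws, n_features)
    mean = draws.mean(axis=0)
    lo, hi = np.quantile(draws, [0.05, 0.95], axis=0)  # 5th and 95th percentiles
    order = np.argsort(-mean)
    return [(names[j], float(mean[j]), float(lo[j]), float(hi[j])) for j in order]


# Stand-in for MCMC draws of w: each row lies on the probability simplex.
rng = np.random.default_rng(1)
names = ["f1", "f2", "f3"]
draws = rng.dirichlet([4.0, 2.0, 1.0], size=4000)
table = summarize_weights(draws, names)
```

Since every draw sums to one, the posterior means also sum to one, which matches the Mean column above (the nine reported values add up to 1.0000). The interval bounds carry no such constraint.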
Table 8. Predictor variables in the Pima Indians Diabetes dataset: List of diagnostic and physiological variables used for diabetes classification.
| Feature | Description (Biological Meaning) |
|---|---|
| Pregnant | Number of times pregnant |
| Glucose | Plasma glucose concentration (2-h oral glucose tolerance test) |
| Pressure | Diastolic blood pressure (mm Hg) |
| Triceps | Triceps skin fold thickness (mm) |
| Insulin | 2-h serum insulin (μU/mL) |
| Mass (BMI) | Body mass index (weight in kg/(height in m)²) |
| Pedigree | Diabetes pedigree function (genetic risk measure) |
| Age | Age in years |
| Outcome | Diabetes diagnosis: positive (1)/negative (0) |
Table 9. Discrimination and classification performance on the Pima Indians Diabetes dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.
| Model | AUC Mean | AUC SD | PRAUC Mean | PRAUC SD | F1 Mean | F1 SD | ACC Mean | ACC SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.8323 | 0.0415 | 0.7196 | 0.0537 | 0.6378 | 0.0686 | 0.7787 | 0.0315 |
| Logistic_L1 | 0.8338 | 0.0411 | 0.7220 | 0.0541 | 0.6266 | 0.0676 | 0.7722 | 0.0310 |
| Logistic | 0.8324 | 0.0405 | 0.7189 | 0.0532 | 0.6453 | 0.0664 | 0.7813 | 0.0324 |
| RandomForest | 0.8324 | 0.0408 | 0.7133 | 0.0478 | 0.6462 | 0.0789 | 0.7696 | 0.0516 |
| Bayes_FW | 0.8426 | 0.0325 | 0.7851 | 0.0425 | 0.6881 | 0.0577 | 0.7919 | 0.0311 |
| GradBoost | 0.8039 | 0.0471 | 0.6644 | 0.0685 | 0.6040 | 0.0922 | 0.7344 | 0.0567 |
Table 10. Calibration and loss performance on the Pima Indians Diabetes dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.
| Model | LogLoss Mean | LogLoss SD | Brier Mean | Brier SD | ECE Mean | ECE SD | MCE Mean | MCE SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.4804 | 0.0492 | 0.1562 | 0.0179 | 0.0805 | 0.0214 | 0.2880 | 0.1222 |
| Logistic_L1 | 0.4805 | 0.0499 | 0.1562 | 0.0181 | 0.0742 | 0.0183 | 0.3058 | 0.1496 |
| Logistic | 0.4808 | 0.0523 | 0.1562 | 0.0185 | 0.0844 | 0.0207 | 0.3012 | 0.1357 |
| RandomForest | 0.4782 | 0.0499 | 0.1586 | 0.0203 | 0.0821 | 0.0197 | 0.3767 | 0.3019 |
| Bayes_FW | 0.3404 | 0.0411 | 0.1468 | 0.0169 | 0.0700 | 0.0168 | 0.2783 | 0.1101 |
| GradBoost | 0.5980 | 0.1223 | 0.1861 | 0.0367 | 0.1322 | 0.0556 | 0.3670 | 0.1782 |
Table 11. Posterior mean feature weights and 90% credible intervals estimated by the Bayes_FW model: Results are based on the Pima Indians Diabetes dataset. The 5% and 95% quantiles define the lower and upper credible interval bounds.
| Feature | $w$ Mean | 5% | 95% |
|---|---|---|---|
| glucose | 0.1878 | 0.0392 | 0.4044 |
| mass | 0.1654 | 0.0322 | 0.3658 |
| pregnant | 0.1471 | 0.0255 | 0.3459 |
| pedigree | 0.1409 | 0.0236 | 0.3348 |
| pressure | 0.1041 | 0.0091 | 0.2819 |
| age | 0.0881 | 0.0046 | 0.2542 |
| insulin | 0.0859 | 0.0047 | 0.2539 |
| triceps | 0.0806 | 0.0036 | 0.2491 |
Table 12. Predictor variables in the South African Heart Disease (SAHeart) dataset: List of demographic, behavioral, and clinical variables used for CHD classification.
| Feature | Description (Biological/Clinical Meaning) |
|---|---|
| age | Age of the patient (years) |
| famhist | Family history of heart disease (Present/Absent) |
| tobacco | Cumulative tobacco consumption (kg) |
| ldl | Low-density lipoprotein cholesterol |
| typea | Type-A behavior score (psychosocial risk factor) |
| sbp | Systolic blood pressure (mm Hg) |
| obesity | Obesity index |
| adiposity | Adiposity (body fat measure) |
| alcohol | Alcohol consumption (liters per day) |
| chd | Coronary heart disease status (1 = present, 0 = absent) |
Table 13. Discrimination and classification performance on the SAHeart dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.
| Model | AUC Mean | AUC SD | PRAUC Mean | PRAUC SD | F1 Mean | F1 SD | ACC Mean | ACC SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.7853 | 0.0655 | 0.6578 | 0.0891 | 0.5470 | 0.0283 | 0.7229 | 0.0481 |
| Logistic_L1 | 0.7844 | 0.0627 | 0.6564 | 0.0792 | 0.5350 | 0.0550 | 0.7186 | 0.0468 |
| Logistic | 0.7798 | 0.0674 | 0.6665 | 0.0984 | 0.5514 | 0.0389 | 0.7164 | 0.0553 |
| Bayes_FW | 0.7903 | 0.0598 | 0.6933 | 0.0799 | 0.5494 | 0.0215 | 0.7400 | 0.0371 |
| RandomForest | 0.7246 | 0.0742 | 0.5744 | 0.0912 | 0.4963 | 0.0240 | 0.6817 | 0.0540 |
| GradBoost | 0.7004 | 0.0753 | 0.5343 | 0.0869 | 0.4911 | 0.0658 | 0.6687 | 0.0616 |
Table 14. Calibration and loss performance on the SAHeart dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.
| Model | LogLoss Mean | LogLoss SD | Brier Mean | Brier SD | ECE Mean | ECE SD | MCE Mean | MCE SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.5322 | 0.0509 | 0.1790 | 0.0202 | 0.1087 | 0.0326 | 0.2935 | 0.1241 |
| Logistic_L1 | 0.5321 | 0.0479 | 0.1788 | 0.0188 | 0.0906 | 0.0490 | 0.2646 | 0.0844 |
| Logistic | 0.5347 | 0.0641 | 0.1798 | 0.0252 | 0.1057 | 0.0410 | 0.2720 | 0.0937 |
| Bayes_FW | 0.5245 | 0.0413 | 0.1645 | 0.0183 | 0.1428 | 0.0310 | 0.2612 | 0.0875 |
| RandomForest | 0.5738 | 0.0635 | 0.1974 | 0.0226 | 0.0976 | 0.0070 | 0.2656 | 0.1662 |
| GradBoost | 0.7423 | 0.1107 | 0.2379 | 0.0381 | 0.1951 | 0.0407 | 0.5504 | 0.2081 |
Table 15. Posterior feature weights and 90% credible intervals estimated by Bayes_FW: Results are based on the SAHeart dataset; 5% and 95% columns give the credible interval bounds.
| Feature | $w$ Mean | 5% | 95% |
|---|---|---|---|
| age | 0.1472 | 0.0281 | 0.3293 |
| famhist | 0.1357 | 0.0254 | 0.3127 |
| tobacco | 0.1204 | 0.0166 | 0.2926 |
| ldl | 0.1181 | 0.0153 | 0.2888 |
| typea | 0.1057 | 0.0111 | 0.2707 |
| sbp | 0.0790 | 0.0048 | 0.2281 |
| obesity | 0.0776 | 0.0041 | 0.2285 |
| adiposity | 0.0732 | 0.0043 | 0.2176 |
| alcohol | 0.0694 | 0.0034 | 0.2114 |
Table 16. Predictor variables in the Heart Disease (Cleveland) dataset: List of demographic, clinical, and exercise-related variables used for heart disease classification.
| Feature | Description (Units/Encoding) |
|---|---|
| age | Age in years (continuous) |
| sex | Sex (1 = male, 0 = female; binary) |
| cp | Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal; 4 = asymptomatic). Encoded as cp_1, cp_2, cp_3. |
| trestbps | Resting blood pressure (mm Hg; continuous) |
| chol | Serum cholesterol (mg/dL; continuous) |
| fbs | Fasting blood sugar > 120 mg/dL (1/0; binary) |
| restecg | Resting ECG (0 = normal; 1 = ST–T abnormality; 2 = LV hypertrophy). Encoded as restecg_1, restecg_2. |
| thalach | Maximum heart rate achieved (continuous) |
| exang | Exercise-induced angina (1/0; binary; encoded as exang_1) |
| oldpeak | ST depression induced by exercise relative to rest (continuous) |
| slope | Slope of peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping). Encoded as slope_1, slope_2. |
| ca | Number of major vessels (0–3) colored by fluoroscopy. Encoded as ca_1, ca_2, ca_3. |
| thal | Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect). Encoded as thal_3, thal_6, thal_7. |
| target | Heart disease status (1 = present, 0 = absent; binary) |
Table 17. Discrimination and classification performance on the Cleveland dataset: Mean and standard deviation (SD) of AUC, PRAUC, F1-score, and accuracy across repeated runs for each model.
| Model | AUC Mean | AUC SD | PRAUC Mean | PRAUC SD | F1 Mean | F1 SD | ACC Mean | ACC SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.9034 | 0.0353 | 0.8986 | 0.0427 | 0.8053 | 0.0482 | 0.8284 | 0.0398 |
| Logistic | 0.9004 | 0.0384 | 0.8859 | 0.0561 | 0.8116 | 0.0403 | 0.8350 | 0.0331 |
| Logistic_L1 | 0.9001 | 0.0342 | 0.8986 | 0.0367 | 0.7993 | 0.0418 | 0.8217 | 0.0340 |
| Bayes_FW | 0.9079 | 0.0313 | 0.9076 | 0.0269 | 0.8264 | 0.0353 | 0.8440 | 0.0294 |
| RandomForest | 0.8854 | 0.0375 | 0.8775 | 0.0360 | 0.7599 | 0.0374 | 0.7921 | 0.0302 |
| GradBoost | 0.8641 | 0.0438 | 0.8613 | 0.0336 | 0.7648 | 0.0610 | 0.7890 | 0.0547 |
Table 18. Calibration and loss performance on the Cleveland dataset: Mean and standard deviation (SD) of log-loss, Brier score, expected calibration error (ECE), and maximum calibration error (MCE) across repeated runs for each model.
| Model | LogLoss Mean | LogLoss SD | Brier Mean | Brier SD | ECE Mean | ECE SD | MCE Mean | MCE SD |
|---|---|---|---|---|---|---|---|---|
| Logistic_EN | 0.3892 | 0.0650 | 0.1227 | 0.0222 | 0.0912 | 0.0178 | 0.4049 | 0.2840 |
| Logistic | 0.4344 | 0.1296 | 0.1241 | 0.0242 | 0.1064 | 0.0167 | 0.5342 | 0.2285 |
| Logistic_L1 | 0.3941 | 0.0681 | 0.1241 | 0.0213 | 0.0916 | 0.0185 | 0.5894 | 0.1796 |
| Bayes_FW | 0.3365 | 0.0527 | 0.1103 | 0.0185 | 0.0931 | 0.0181 | 0.3769 | 0.1647 |
| RandomForest | 0.4284 | 0.0543 | 0.1383 | 0.0199 | 0.0955 | 0.0335 | 0.3882 | 0.1667 |
| GradBoost | 0.5821 | 0.1541 | 0.1684 | 0.0398 | 0.1666 | 0.0329 | 0.6160 | 0.1846 |
Table 19. Posterior feature weights and 90% credible intervals estimated by Bayes_FW: Results are based on the Cleveland dataset; 5% and 95% columns give the credible interval bounds.
| Feature | $w$ Mean | 5% | 95% |
|---|---|---|---|
| cp_3 | 0.0734 | 0.0131 | 0.1701 |
| ca_1 | 0.0725 | 0.0132 | 0.1733 |
| thal_3 | 0.0703 | 0.0113 | 0.1668 |
| ca_2 | 0.0681 | 0.0111 | 0.1623 |
| slope_1 | 0.0613 | 0.0069 | 0.1546 |
| oldpeak | 0.0601 | 0.0072 | 0.1525 |
| ca_3 | 0.0565 | 0.0064 | 0.1474 |
| sex_1 | 0.0561 | 0.0055 | 0.1469 |
| trestbps | 0.0473 | 0.0035 | 0.1332 |
| exang_1 | 0.0456 | 0.0033 | 0.1264 |
| thalach | 0.0445 | 0.0030 | 0.1283 |
| cp_1 | 0.0435 | 0.0027 | 0.1261 |
| cp_2 | 0.0413 | 0.0022 | 0.1233 |
| restecg_2 | 0.0392 | 0.0023 | 0.1156 |
| slope_2 | 0.0386 | 0.0020 | 0.1144 |
| restecg_1 | 0.0378 | 0.0020 | 0.1142 |
| chol | 0.0369 | 0.0019 | 0.1134 |
| age | 0.0361 | 0.0019 | 0.1097 |
| thal_2 | 0.0361 | 0.0017 | 0.1114 |
| fbs_1 | 0.0349 | 0.0015 | 0.1086 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cengiz, M.A.; Öztürk, Z.; Alharthi, A. A Bayesian Feature Weighting Model with Simplex-Constrained Dirichlet and Contamination-Aware Priors for Noisy Medical Data. Mathematics 2026, 14, 1243. https://doi.org/10.3390/math14081243

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.