4.1. Simulation
To evaluate the proposed Bayesian robust feature weighting framework systematically, we conducted an extensive simulation study across eight scenarios (S1–S8), each designed to reflect a distinct source of difficulty in supervised classification. The scenarios manipulate data characteristics such as dimensionality, correlation, class imbalance, outliers, and label noise, thereby enabling a comprehensive assessment of robustness and generalization.
Table 2 summarizes these scenarios.
To ensure reproducibility, the sample size, number of features, and data-generating mechanism are explicitly defined for every scenario. The covariates are generated from standard distributions, and the true regression coefficients are constructed to reflect varying levels of sparsity and signal strength. Noise contamination is introduced through a controlled mechanism with a predefined contamination rate and flipping probability.
Specifically, the scenarios differ in the proportion of relevant features, the magnitude of the regression coefficients, and the level of noise contamination. The random seed is fixed across all experiments, and all simulations share the same preprocessing steps, including feature standardization.
Data generation: Predictors were drawn from multivariate Gaussian blocks with a correlation parameter, yielding both independent and correlated structures. In some scenarios, heavy-tailed covariates were introduced by replacing the first block with t-distributed samples (with a fixed number of degrees of freedom), simulating covariate outliers. A sparse linear signal was imposed via a coefficient vector with only s variables contributing to the decision boundary; scenario S8 additionally included nonlinear interactions (see Table 2).
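Class labels: Latent scores were computed as the linear predictor (an intercept plus the sparse linear signal), with the intercept calibrated such that the marginal probability of a positive label matched the target prevalence. Observed labels were then drawn from a Bernoulli distribution whose success probability was the logistic transform of the latent score, and were independently flipped with a fixed probability to simulate misclassification.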
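The generating mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `simulate_labels`, the equicorrelated block structure, the unit-valued nonzero coefficients, the logistic link, the bisection-based intercept calibration, and all default values are hypothetical choices made for the example.

```python
import numpy as np

def simulate_labels(n=500, p=20, s=5, rho=0.5, block_size=5,
                    prevalence=0.3, flip_prob=0.05, seed=0):
    """Sketch of one simulated dataset (all defaults are illustrative)."""
    rng = np.random.default_rng(seed)

    # Block-diagonal covariance: correlation rho within a block, 0 across blocks.
    cov = np.zeros((p, p))
    for start in range(0, p, block_size):
        stop = min(start + block_size, p)
        cov[start:stop, start:stop] = rho
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)

    # Sparse linear signal: only the first s coefficients are nonzero (assumed).
    beta = np.zeros(p)
    beta[:s] = 1.0
    score = X @ beta

    # Calibrate the intercept by bisection so that the mean class-1
    # probability matches the target prevalence.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(1.0 / (1.0 + np.exp(-(score + mid)))) < prevalence:
            lo = mid
        else:
            hi = mid
    intercept = 0.5 * (lo + hi)

    # Draw clean labels, then flip each one independently with flip_prob.
    prob = 1.0 / (1.0 + np.exp(-(score + intercept)))
    y_clean = rng.binomial(1, prob)
    flips = rng.random(n) < flip_prob
    y_obs = np.where(flips, 1 - y_clean, y_clean)
    return X, y_obs, y_clean, prob, intercept
```

Calibrating the intercept numerically, rather than in closed form, keeps the sketch valid for any covariate distribution, including the heavy-tailed variants.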
Outlier injection: In scenarios involving feature contamination, a fraction of the rows of the predictor matrix was shifted by multiples of the marginal standard deviation in randomly chosen dimensions, producing covariate outliers (cf. S5 in Table 2).
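A minimal sketch of this kind of contamination is given below. The function name `inject_outliers`, the contaminated fraction, the number of shifted dimensions per row, the shift multiple, and the random sign of the shift are all assumptions made for illustration; the values used in the study are those given in Table 2.

```python
import numpy as np

def inject_outliers(X, fraction=0.05, n_dims=2, shift=6.0, seed=0):
    """Shift a random subset of rows by multiples of the marginal SD."""
    rng = np.random.default_rng(seed)
    X_out = X.copy()
    n, p = X.shape

    # Choose which rows to contaminate (without replacement).
    n_rows = int(round(fraction * n))
    rows = rng.choice(n, size=n_rows, replace=False)

    # Marginal standard deviation of each feature, from the clean data.
    sd = X.std(axis=0)

    for i in rows:
        # Pick the dimensions to perturb and a random direction for each.
        dims = rng.choice(p, size=n_dims, replace=False)
        signs = rng.choice([-1.0, 1.0], size=n_dims)
        X_out[i, dims] += signs * shift * sd[dims]
    return X_out, rows
```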
Evaluation metrics: Ten replications were performed for each simulation scenario. The training and testing splits were stratified to ensure both classes were proportionally represented. Model performance was evaluated across three complementary dimensions: discrimination, calibration, and accuracy.
Model discrimination refers to the ability to correctly distinguish between positive and negative instances. It was quantified using the area under the receiver-operating-characteristic curve (AUC), the area under the precision–recall curve (PRAUC), and the F1-score [36, 37, 38]. AUC measures overall ranking ability, PRAUC provides a more informative assessment under class imbalance, and the F1-score balances precision and recall.
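As an illustration, the three discrimination measures can be computed with scikit-learn as sketched below. The helper name `discrimination_metrics`, the 0.5 decision threshold for the F1-score, and the use of average precision as the PRAUC estimate are assumptions of this example; the paper does not state which estimator or threshold was used.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def discrimination_metrics(y_true, y_prob, threshold=0.5):
    """AUC, PRAUC (average precision), and F1 at a fixed threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # F1 needs hard labels; the threshold is an assumed value.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "PRAUC": average_precision_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
    }

# Toy example with four observations.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
metrics = discrimination_metrics(y_true, y_prob)
```

Note that AUC and PRAUC depend only on the ranking of the predicted probabilities, whereas the F1-score changes with the chosen threshold.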
Calibration assesses agreement between predicted probabilities and observed frequencies. We employed log-loss (cross-entropy loss), the Brier score, the expected calibration error (ECE), and the maximum calibration error (MCE) [31, 39, 40]. The Brier score is a proper scoring rule capturing both calibration and refinement, whereas the ECE summarizes the average deviation between predicted and empirical probabilities.
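A binned estimate of ECE and MCE can be sketched as follows. This is one common variant, assumed here for illustration: it uses equal-width bins on the predicted positive-class probability, and the function name `calibration_errors` and the choice of ten bins are hypothetical rather than taken from the paper.

```python
import numpy as np

def calibration_errors(y_true, y_prob, n_bins=10):
    """Binned ECE and MCE on the positive-class probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Digitizing against the interior edges yields bin indices 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between observed frequency and mean predicted probability.
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap   # weight by the share of samples in the bin
        mce = max(mce, gap)        # worst-case bin
    return ece, mce
```

For example, predictions of 0.1 for two negatives and 0.9 for two positives leave a gap of 0.1 in each occupied bin, so both ECE and MCE equal 0.1.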
Overall classification accuracy, defined as the proportion of correctly classified observations, was reported for completeness [41], but it was interpreted alongside the discrimination and calibration measures because it can be misleading under class imbalance.
Comparative models: In addition to the proposed Bayesian framework, benchmarks included logistic regression (with and without L1/elastic-net penalties), random forests, gradient boosting, and class-balanced stochastic-gradient descent.
Figure 1 and Figure 2 show the comparative performance of all models across the eight experimental scenarios (S1–S8). Figure 1 reports the discrimination and accuracy metrics (AUC, PRAUC, F1-score, and accuracy), for which higher values indicate better performance; Figure 2 reports the calibration metrics (log-loss, Brier score, ECE, and MCE), for which lower values are desirable. The Bayesian model (Bayes_FW) is compared against classical machine learning methods, including logistic regression (and its L1 and elastic-net variants), gradient boosting, random forest, and SGD with balanced class weights.
Overall, Figure 1 and Figure 2 show that the Bayesian model maintains competitive or superior performance in the probabilistic metrics (LogLoss, Brier score, ECE, and MCE), underscoring its strength in uncertainty quantification and calibration. In terms of discriminative performance (AUC, PRAUC, accuracy, and F1), the logistic regression variants achieve the highest AUC and PRAUC on the clean and balanced dataset (S1), although the Bayesian model still provides strong results. Under more challenging conditions, such as the imbalance setting (S7) and the hard mixture scenario (S8), the Bayesian approach demonstrates robustness, preserving relatively stable AUC and F1 compared to other methods that exhibit sharper drops.
With respect to calibration, the Bayesian model consistently attains the lowest (and thus best) values, particularly in S1, S2, and S4, reflecting not only accurate predictions but also reliable probability estimates, a property of particular importance in risk-sensitive applications. In contrast, models such as SGD and Random Forest show higher variability, with occasional large calibration errors. Furthermore, in high-dimensional (S3) and correlated (S2) scenarios, performance differences among models tend to narrow, yet the Bayesian method continues to retain its calibration advantage. Similarly, in the presence of label noise (S4) and outliers (S5), the uncertainty-aware structure of the Bayesian model prevents the extreme degradation observed in tree-based models.
Taken together, these findings highlight that while traditional models can occasionally surpass Bayesian approaches in terms of pure discriminative accuracy (as in S1), Bayesian modeling provides more reliable probability calibration and demonstrates greater robustness across diverse and adverse data conditions.
Table 3 summarizes the convergence diagnostics and key hyperparameter estimates across the eight simulation scenarios, each designed to test different data complexities and contamination settings for the proposed Bayesian Feature Weighting (Bayes_FW) model. The intercept parameter is the global intercept of the linear predictor, capturing the baseline log-odds of the outcome. The mean values vary moderately across scenarios (from the S7_imbalance value to 0.22 in S6_both; see Table 3), reflecting adaptive shrinkage behavior under different data conditions. The global shrinkage behavior is governed by the global scale parameter of the hierarchical horseshoe prior, while the local scale parameters control shrinkage at the feature level. The label-noise parameter denotes the estimated label-noise proportion, which increases notably in noise-heavy settings such as S4_lblnoise (0.09), S6_both (0.10), and S8_hardmix (0.12), indicating that the model captured and quantified the contamination in the data.
Convergence diagnostics based on the Gelman–Rubin statistic (R-hat, summarized by its mean and maximum over parameters) demonstrated excellent mixing and stability of the Markov chains across all scenarios. The proportion of parameters whose R-hat exceeded the convergence threshold remained below 10% in every case, further supporting reliable convergence. Effective sample sizes (ESS) were generally high (median values between 1750 and 3900), ensuring that posterior estimates were based on sufficiently independent draws.
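For reference, the classical Gelman–Rubin statistic for a single scalar parameter can be computed as in the sketch below. This is the textbook between/within-chain variance ratio, shown under the assumption that it approximates the diagnostic used here; the paper does not specify the exact variant (modern samplers often report a rank-normalized split version), and the function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Classical R-hat for one parameter; chains has shape (m, n)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape

    # Between-chain variance: n times the variance of the chain means.
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)

    # Within-chain variance: average of the per-chain sample variances.
    W = chains.var(axis=1, ddof=1).mean()

    # Pooled estimate of the posterior variance, then the ratio.
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)
```

Values close to 1 indicate that the chains agree; chains stuck in different regions inflate the between-chain term and push the statistic well above 1.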
Table 3 summarizes the posterior diagnostics for all simulation settings.
Overall, Table 3 indicates that the proposed Bayesian model achieved stable convergence and consistent inference across varying levels of correlation, label noise, imbalance, and dimensionality. The results validate the robustness and computational reliability of the MCMC implementation, even under challenging conditions such as high-dimensional noise mixtures (S8) and concurrent outlier–label-noise contamination (S6).
Figure 3 presents a comparative evaluation of the proposed Bayesian Feature Weighting (Bayes_FW) model against a diverse set of benchmark classifiers across the eight data scenarios: clean data (S1), correlated predictors (S2), high-dimensional features (S3), label noise (S4), feature outliers (S5), simultaneous label and feature noise (S6), class imbalance (S7), and a hard mixture setting that combines several of these difficulties (S8). The benchmark models encompass logistic regression and its regularized variants (L1 and elastic net), as well as ensemble-based methods such as random forest and gradient boosting. In addition, stochastic gradient descent with balanced class weights (SGD-balanced) is included as a linear baseline that accounts for class imbalance. This setup enables a comprehensive assessment of predictive robustness and generalization across diverse data conditions.
The proposed Bayes_FW model consistently outperforms or matches the benchmarks in challenging settings, particularly under label noise (S4), feature outliers (S5), and compound corruption (S6, S8). Simpler models such as logistic regression perform competitively in clean scenarios (e.g., S1) but degrade in the presence of noise. In contrast, Bayes_FW achieves the best F1 and PRAUC scores in most corrupted settings, demonstrating superior robustness and predictive reliability.
Figure 4 reports metrics related to probabilistic calibration and uncertainty estimation for Bayes_FW and the benchmark models, using four complementary metrics. LogLoss measures the negative log-likelihood of the predicted class probabilities, penalizing confident but incorrect predictions most heavily. The Brier score captures the mean squared error between predicted probabilities and actual class labels, providing a direct measure of overall probabilistic accuracy. The expected calibration error (ECE) quantifies the average deviation between predicted confidence and observed accuracy across confidence bins, reflecting the alignment between model confidence and correctness. Lastly, the maximum calibration error (MCE) reports the largest observed discrepancy between confidence and accuracy, indicating the worst-case calibration error. Together, these metrics offer a comprehensive evaluation of both average and extreme calibration behavior.
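The two proper scoring rules among these metrics can be written directly from their definitions, as in the following sketch for binary labels. The function name `log_loss_and_brier` and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions of this example.

```python
import numpy as np

def log_loss_and_brier(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood and mean squared probability error."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    logloss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    brier = np.mean((p - y) ** 2)
    return logloss, brier

# An uninformative prediction of 0.5 for one positive and one negative case.
logloss_half, brier_half = log_loss_and_brier([1, 0], [0.5, 0.5])
```

Because the log-loss is unbounded, a single confident mistake (for instance, a predicted probability of 0.01 for a positive case) is penalized far more strongly than under the Brier score, which is bounded by 1.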
The proposed Bayes_FW model shows consistently better or competitive performance across calibration metrics, particularly under noisy or imbalanced conditions (S2, S4, S6, S8). Although some benchmark models achieve low classification error, they often exhibit poor calibration (e.g., Random Forest, SGD-Balanced). Bayes_FW provides both strong predictive performance and principled uncertainty estimates, making it well suited for applications where reliability and trust in model output are critical, such as healthcare.
4.2. Real Medical Dataset Applications
In this study, we evaluate the performance of the proposed model using four benchmark medical datasets, namely the Breast Cancer Wisconsin, Pima Indians Diabetes, South African Heart Disease, and Cleveland Heart Disease datasets. For each dataset, we report key statistical characteristics, including the number of samples, number of features, and class distribution, to ensure transparency and reproducibility.
The Pima Indians Diabetes dataset consists of 768 samples with 8 clinical features. The target variable indicates whether a patient is diagnosed with diabetes. Approximately 35% of the samples belong to the positive class, while 65% correspond to the negative class, indicating a moderately imbalanced classification problem. All features are standardized to have zero mean and unit variance prior to model training. To ensure a reliable evaluation, we employ a stratified 5-fold cross-validation scheme, preserving class proportions across training and test splits.
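The evaluation protocol described above can be sketched with scikit-learn as follows. The example uses synthetic data of the same size and approximate class balance as a stand-in for the real dataset, and a plain logistic regression as a placeholder model; fitting the scaler inside each training fold (via a pipeline) is an assumption of this sketch, chosen to avoid information leaking from the test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 samples, 8 features, roughly 35% positives.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], random_state=0)

# Standardization is refit on the training part of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 5-fold split preserves the class proportions in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_per_fold = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC: {auc_per_fold.mean():.3f} (SD {auc_per_fold.std(ddof=1):.3f})")
```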
The Cleveland Heart Disease dataset consists of 297 samples with 13 clinical features. The target variable represents the presence or absence of heart disease. Approximately 54% of the samples correspond to the positive class, indicating a relatively balanced dataset. Similar to the Pima dataset, all features are standardized prior to analysis, and a stratified 5-fold cross-validation procedure is used to divide the dataset into training and test sets, ensuring consistency and comparability across experiments.
The Breast Cancer Wisconsin dataset and the South African Heart Disease dataset are also included in the experimental evaluation to provide a comprehensive assessment across datasets with varying levels of class imbalance, feature characteristics, and noise sensitivity.
This standardized evaluation protocol allows for a fair and consistent comparison between the proposed method and competing approaches across diverse medical data settings.
4.2.1. Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin (Original) dataset was obtained from the UCI Machine Learning Repository. It consists of nine predictor variables describing cellular characteristics, along with a binary class label indicating whether a tumor is malignant (1) or benign (0). The dataset originally contains 699 observations; after removing instances with missing values, 683 cases remained for analysis. The class distribution is moderately imbalanced, with approximately 65% benign and 35% malignant samples. Prior to model training, all predictors were standardized to have zero mean and unit variance, and the ID attribute, which carries no predictive information, was discarded.
An overview of the predictor variables in the Breast Cancer Wisconsin dataset is provided in Table 4. These features describe morphological and nuclear characteristics of the sampled cells, each graded on an ordinal 1–10 scale, and form the basis for malignancy classification.
Table 5 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).
Across all metrics, Bayes_FW consistently achieved the strongest overall performance. It obtained the highest mean AUC (0.9975) and PRAUC (0.9931), along with superior F1-score (0.9587) and accuracy (0.9699), while maintaining low variability (SD < 0.006). These results highlight the robustness and discriminative strength of the Bayesian approach in identifying malignant cases. Among the frequentist baselines, Logistic_EN and Logistic_L1 also performed competitively, with AUC values around 0.996 and balanced F1-scores near 0.947, suggesting that regularization contributes to slight gains over the standard logistic model. Random Forest and GradBoost delivered marginally lower performance, reflecting the limited benefit of nonlinear tree-based methods for this dataset, which primarily consists of moderately correlated numeric predictors.
The Bayesian Feature Weighting model outperformed all other approaches, demonstrating superior predictive accuracy and stability, and confirming its effectiveness for robust and interpretable tumor classification.
Table 6 summarizes the calibration and reliability metrics for the six classification models evaluated on the Breast Cancer Wisconsin dataset. The reported measures include the LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with their mean and standard deviation (SD) across repeated cross-validation runs.
Among all compared models, the Bayesian Feature Weighting (Bayes_FW) model achieved the most reliable probabilistic predictions, as evidenced by the lowest LogLoss (0.0912) and lowest Brier score (0.0243). These results indicate that the Bayesian approach produces well-calibrated probability estimates with minimal deviation from true outcome probabilities. The model also yielded the smallest ECE (0.0321), suggesting excellent overall calibration across probability bins.
In comparison, Random Forest exhibited slightly higher LogLoss (0.0922) but maintained strong calibration (ECE = 0.0365). Traditional logistic models—Logistic_EN, Logistic_L1, and Logistic—performed reasonably well but showed modestly higher LogLoss and Brier scores, indicating slightly less accurate probability estimation. The Gradient Boosting (GradBoost) model performed the weakest in terms of calibration, with the highest LogLoss (0.1178) and Brier score (0.0292), reflecting some degree of overconfidence in its probability predictions.
The Bayes_FW model outperformed all alternatives across all four calibration metrics, confirming its superiority not only in predictive discrimination (as shown in Table 5) but also in probability reliability and uncertainty quantification, a key advantage of the Bayesian framework.
Table 7 and Figure 5 present the posterior mean feature weights and their 90% credible intervals estimated using the Bayesian Feature Weighting (Bayes_FW) model. These results quantify the relative contribution of each cellular attribute to malignancy prediction while incorporating model uncertainty through the Bayesian posterior distribution.
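Summaries of this kind are typically obtained from the MCMC draws, as in the sketch below. The function name `summarize_posterior`, the use of equal-tailed (quantile-based) intervals, and the synthetic Dirichlet draws standing in for real posterior samples are all assumptions of this example; the paper does not state how its intervals were constructed.

```python
import numpy as np

def summarize_posterior(draws, level=0.90):
    """Posterior mean and equal-tailed credible interval per feature."""
    draws = np.asarray(draws, dtype=float)  # shape (n_draws, n_features)
    alpha = (1.0 - level) / 2.0
    mean = draws.mean(axis=0)
    # For level=0.90 these are the 5% and 95% posterior quantiles.
    lower, upper = np.quantile(draws, [alpha, 1.0 - alpha], axis=0)
    return mean, lower, upper

# Placeholder draws for nine feature weights (synthetic, not model output).
rng = np.random.default_rng(0)
draws = rng.dirichlet(np.ones(9), size=4000)
mean, lower, upper = summarize_posterior(draws)
```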
It is important to note that the feature weights represent the relative relevance of predictors within the weighting structure. The actual contribution of a predictor to the linear predictor depends on the product of its feature weight and its regression coefficient. Since the hierarchical horseshoe prior strongly shrinks irrelevant coefficients toward zero, predictors with near-zero coefficients have negligible predictive influence even if their corresponding weights are moderately large.
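A small numerical illustration of this point follows. The arrays `weights` and `coefs` hold made-up values for three hypothetical features, and the multiplicative form of the contribution is assumed from the description above rather than taken from the model code.

```python
import numpy as np

# Hypothetical feature weights and shrunken regression coefficients.
weights = np.array([0.30, 0.25, 0.45])
coefs = np.array([1.20, 0.001, 0.80])

# Effective per-feature multiplier entering the linear predictor.
effective = weights * coefs

# Contribution of each feature for one standardized observation.
x = np.array([1.0, 1.0, 1.0])
contribution = effective * x
print(contribution)
```

The second feature carries a moderate weight of 0.25, yet its contribution is only 0.00025 because its coefficient has been shrunk almost to zero.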
According to the results, Bare nuclei, Clump thickness, and Mitoses emerge as the most influential predictors, showing the highest posterior mean weights (0.1467, 0.1339, and 0.1253, respectively). Their wider yet consistently positive credible intervals indicate both strong and stable associations with malignancy likelihood. Intermediate importance is observed for Normal nucleoli, Cell size, and Bland chromatin, which also contribute meaningfully but with slightly lower average weights. Finally, Marginal adhesion, Cell shape, and Epithelial cell size have the smallest posterior means, suggesting relatively weaker influence in distinguishing malignant from benign tumors.
Figure 5 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The clear separation of higher-weighted features at the top highlights the discriminative power of nuclear irregularities and cellular cohesion, which are biologically consistent with pathological observations in breast cancer diagnosis.
4.2.2. Pima Indians Diabetes Dataset
The Pima Indians Diabetes dataset is a widely used benchmark in medical machine learning, originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains clinical and physiological measurements from female patients of Pima Indian heritage aged 21 years or older. The dataset includes eight predictor variables—such as glucose concentration, body mass index (BMI), and number of pregnancies—that are important risk factors for type 2 diabetes. The binary outcome variable indicates whether an individual shows signs of diabetes (1) or not (0), based on established diagnostic criteria. The dataset is frequently used to evaluate predictive models in healthcare, as it combines demographic, genetic, and lifestyle-related risk indicators with measurable biomedical parameters.
An overview of the predictor variables in the Pima Indians Diabetes dataset is provided in Table 8. These features represent demographic, physiological, and biochemical risk factors commonly associated with type 2 diabetes.
To further validate the proposed Bayesian framework, comprehensive posterior diagnostics are provided in Appendix A. The marginal posterior distributions of key global parameters (Figure A2) indicate well-defined and unimodal behavior. In particular, the intercept is tightly concentrated, while the global shrinkage parameter exhibits a right-skewed distribution, reflecting the adaptive sparsity induced by the horseshoe prior. The label-noise parameter is centered around low values, suggesting limited but non-negligible noise in the dataset.
Sampling diagnostics confirm the reliability of inference. As shown in Figure A3, the effective sample size ratios are consistently high, indicating efficient exploration of the posterior space. Similarly, the R-hat statistics (Figure A3) are tightly concentrated around 1, providing strong evidence of convergence across all chains.
Trace plots for both global parameters and regression coefficients (Figure A5, Figure A6 and Figure A7) demonstrate good mixing behavior with no visible trends or chain separation, further supporting stable MCMC performance. Although occasional spikes are observed in the coefficient traces due to the heavy-tailed prior, these do not indicate pathological sampling behavior.
The joint posterior structure (Figure A8) reveals mild dependencies among parameters, which is expected in hierarchical shrinkage models. Importantly, no pathological correlations or funnel-shaped geometries are observed.
Finally, the posterior predictive check (Figure A9) shows strong agreement between observed and model-generated distributions, indicating that the proposed model successfully captures the underlying data-generating process. Minor deviations at extreme probability regions suggest slight calibration imperfections but do not materially affect predictive performance.
Overall, these diagnostics confirm that the proposed Bayesian feature weighting model achieves stable convergence, efficient sampling, and reliable uncertainty quantification on the Pima Indians Diabetes dataset.
Table 9 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), elastic-net logistic regression (Logistic_EN), Random Forest, Gradient Boosting (GradBoost), and the proposed Bayesian Feature Weighting model (Bayes_FW).
Across all discrimination metrics, the Bayes_FW method achieved the highest overall performance, with an AUC mean of 0.8426, PRAUC mean of 0.7851, and F1-score mean of 0.6881, outperforming both conventional logistic regression variants and ensemble-based methods. These results highlight the model’s ability to capture uncertainty in feature contributions while maintaining high discriminative power. Logistic_L1 and standard Logistic regression followed closely, exhibiting comparable AUC values (0.8338 and 0.8324, respectively) but slightly lower precision–recall and F1-scores. Ensemble models such as Random Forest and Gradient Boosting demonstrated lower AUC and PRAUC values, suggesting less stable performance for this moderately imbalanced dataset. The higher SD observed for Gradient Boosting indicates greater variability across runs, potentially due to hyperparameter sensitivity or overfitting in smaller training subsets.
Table 10 reports the calibration and reliability metrics for the same models. The measures include LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each presented with mean and SD across repeated cross-validation runs.
Among the compared methods, the Bayesian Feature Weighting (Bayes_FW) approach achieved the lowest LogLoss (0.3404) and Brier score (0.1468), indicating superior probabilistic accuracy and overall calibration. It also produced the smallest ECE (0.0700) and MCE (0.2783), suggesting that the Bayesian model provides well-calibrated probability estimates that closely align with observed outcomes. In contrast, the Gradient Boosting (GradBoost) model showed the weakest calibration, with the highest LogLoss and ECE values, implying overconfident predictions and larger deviations from true event frequencies.
Table 11 and Figure 6 present the posterior mean feature weights and their 90% credible intervals estimated by the Bayesian Feature Weighting model. These results quantify the relative contribution of each clinical variable to diabetes prediction while incorporating model uncertainty through the Bayesian posterior distribution.
According to the results, glucose, body mass index, and number of pregnancies are the most influential predictors, exhibiting the highest posterior mean weights (0.1878, 0.1654, and 0.1471, respectively). These features show strong and stable associations with diabetes risk, as indicated by their positive and moderately wide credible intervals. Pedigree function, reflecting genetic predisposition, also ranks among the top predictors. Lower but meaningful contributions are observed for blood pressure, insulin, and triceps skinfold thickness, suggesting secondary influence in the model’s classification process.
Figure 6 visually confirms this ranking pattern, where the dots represent posterior means and the horizontal bars denote 90% credible intervals. The dominance of glucose concentration and body mass index underscores their well-established roles as primary determinants of type 2 diabetes, while the remaining features capture secondary but biologically consistent effects.
4.2.3. South African Heart Disease Dataset
The South African Heart Disease (SAHeart) dataset originates from a South African study on risk factors associated with coronary heart disease (CHD). It includes demographic, clinical, and lifestyle-related variables commonly linked to cardiovascular outcomes. The dataset combines biochemical measures (e.g., LDL cholesterol) with behavioral indicators (e.g., tobacco and alcohol use) and psychosocial factors (Type-A behavior). The binary outcome variable indicates the presence (1) or absence (0) of CHD. This dataset is widely used in statistical learning and medical data analysis because it provides a comprehensive mix of physiological, behavioral, and hereditary risk factors.
An overview of the predictor variables in the SAHeart dataset is provided in Table 12.
Table 13 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), L1-regularized logistic regression (Logistic_L1), standard logistic regression (Logistic), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).
Across discrimination metrics, Bayes_FW achieved the best overall performance (AUC = 0.7903; PRAUC = 0.6933), with competitive F1-score (0.5494) and accuracy (0.7400). Logistic_EN and Logistic_L1 followed closely in AUC and PRAUC. Tree-based methods (Random Forest, GradBoost) showed lower discrimination, consistent with potential overfitting or noise sensitivity in smaller biomedical datasets.
Table 14 summarizes calibration and loss metrics: LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE), each reported with mean and SD across repeated runs.
Bayes_FW achieved the lowest LogLoss (0.5245) and Brier score (0.1645), indicating strong probabilistic accuracy. Although its ECE was higher than that of some logistic baselines, Bayes_FW maintained competitive calibration overall, whereas GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).
Table 15 and Figure 7 present posterior mean feature weights and 90% credible intervals estimated by Bayes_FW.
According to the results, age, family history (famhist), and tobacco exhibit the largest posterior mean weights, indicating the strongest association with CHD risk in this cohort. Biochemical and physiological indicators such as ldl and bp show moderate influence, while lifestyle/anthropometric variables (obesity, adiposity, alcohol) contribute more weakly, with wider credible intervals reflecting greater uncertainty.
4.2.4. Heart Disease (Cleveland) Dataset
The Heart Disease (Cleveland) dataset is a widely used benchmark in cardiovascular research and machine learning. It contains 303 patient records collected at the Cleveland Clinic Foundation; after preprocessing (removing records with missing values and encoding categorical variables), 297 samples remain. The outcome variable is binary: presence of heart disease (1) versus absence (0).
The dataset includes clinical, demographic, and exercise-related attributes (e.g., age, blood pressure, cholesterol, thalassemia test results, ECG findings). Categorical features (e.g., cp, thal, slope, restecg, ca, exang) were expanded to one-hot indicators for compatibility with the Bayesian feature weighting framework.
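The expansion of categorical features into indicator columns can be sketched with pandas as shown below. The three-row data frame and its values are illustrative only, and whether a reference level was dropped for each category in the study is not stated, so this sketch keeps all levels.

```python
import pandas as pd

# Tiny illustrative frame: one numeric and two categorical columns.
df = pd.DataFrame({
    "age": [63, 41, 57],
    "cp": [3, 1, 2],      # chest pain type (categorical code)
    "thal": [6, 3, 7],    # thalassemia test result (categorical code)
})

# Each categorical column becomes one 0/1 indicator per observed level,
# named <column>_<level>; numeric columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["cp", "thal"], dtype=int)
print(encoded.columns.tolist())
```

The resulting column names (for example `cp_3` and `thal_3`) follow the same pattern as the feature labels reported for this dataset.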
An overview of the predictor variables is provided in Table 16.
Table 17 presents the mean and standard deviation (SD) of four performance metrics—AUC, PRAUC, F1-score, and accuracy—computed from repeated cross-validation for six competing models: elastic-net logistic regression (Logistic_EN), standard logistic regression (Logistic), L1-regularized logistic regression (Logistic_L1), Bayesian Feature Weighting (Bayes_FW), Random Forest, and Gradient Boosting (GradBoost).
Across discrimination metrics, Bayes_FW achieved the strongest overall performance (AUC = 0.9079; PRAUC = 0.9076; F1 = 0.8264; ACC = 0.8440). Penalized logistic baselines (Logistic_EN, Logistic_L1) were competitive, while tree-based methods (Random Forest, GradBoost) trailed on average, consistent with smaller sample sizes and mixed continuous/categorical predictors.
Table 18 summarizes calibration and loss metrics—LogLoss, Brier score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE)—each reported with mean and SD across repeated runs.
Bayes_FW achieved the lowest LogLoss (0.3365) and Brier score (0.1103), indicating strong probabilistic calibration. Logistic baselines were competitive but less precise, whereas GradBoost showed the weakest calibration (highest LogLoss, ECE, and MCE).
Table 19 and Figure 8 present the posterior mean feature weights and their 90% credible intervals estimated using Bayes_FW.
According to the results, cp_3, ca_1, and thal_3 exhibit the largest posterior mean weights, indicating the strongest association with heart disease risk in this cohort. Moderately important features include oldpeak, slope_1, sex_1, and thalach. Variables such as fbs_1, chol, and age receive smaller weights after accounting for correlation among predictors.
Figure 8 visually confirms these findings by showing posterior means with 90% credible intervals.