1. Introduction
Intracranial Hemorrhage (ICH), defined as acute bleeding within the cranial vault, is a critical neurological emergency that requires rapid diagnosis and intervention [1]. The most common causes include hypertension, head trauma, vascular malformations, and anticoagulant therapy [2]. The condition carries a high fatality rate because irreversible brain injury occurs quickly after onset. According to the World Health Organization (WHO), hemorrhagic strokes account for about 11% of all strokes yet contribute disproportionately to global mortality, with stroke ranking as the second leading cause of death worldwide [3]. Although less common than ischemic stroke, hemorrhagic stroke is responsible for 30–40% of stroke-related fatalities [4]. Reports from the American Heart Association (AHA) indicate that the 30-day mortality rate for ICH ranges from 35% to 52%, with nearly half of these deaths occurring within the first 24 h [5]. These statistics highlight the urgent need for diagnostic methods that are accurate, fast, and reliable in acute clinical settings. ICH manifests in five clinically recognized subtypes: Epidural Hemorrhage (EDH), Intraparenchymal Hemorrhage (IPH), Subdural Hemorrhage (SDH), Subarachnoid Hemorrhage (SAH), and Intraventricular Hemorrhage (IVH). Each subtype differs in etiology, prognosis, and treatment strategy, making subtype-level interpretability essential for clinical trustworthiness. Deep learning, particularly Convolutional Neural Networks (CNNs), has achieved impressive success in medical image analysis in recent years [6,7]. Pretrained architectures such as VGG16, ResNet50, and DenseNet121 have been widely applied for pathology detection in head Computed Tomography (CT) scans [8]. Despite their accuracy, most CNN-based systems lack transparency and confidence quantification. These "black-box" models provide predictions without explaining their reasoning or indicating their level of certainty [9]. This limitation significantly reduces clinical trust, especially in life-threatening scenarios such as ICH, where both false positives and false negatives may have serious consequences [10].
Recent research has shifted towards building reliable AI systems that combine predictive accuracy with interpretability and confidence estimation [11]. Such systems are designed to support, rather than replace, clinicians in high-risk diagnostic workflows. Methods that employ Grad-CAM or SHAP improve interpretability but are rarely combined with predictive uncertainty, an essential factor for decision support in emergency radiology. Although existing deep learning methods for ICH detection achieve high accuracy, most focus on isolated objectives such as classification, basic ensemble fusion, or visualization. Clinical deployment, however, demands more than accuracy alone, requiring careful consideration of confidence calibration and decision transparency in safety-critical settings. In this context, recent medical artificial intelligence literature has increasingly highlighted reliability, interpretability, and deployment readiness as ongoing research challenges, particularly in radiology and emergency diagnostics. While these aspects continue to be actively discussed beyond the scope of any single study, the present work focuses on the technical design and empirical evaluation of a framework that aligns with these directions by integrating ensemble learning, uncertainty estimation, and explainability. Further investigation, including prospective validation and integration into clinical workflows, remains an important direction for future research.
To address these gaps, this work proposes X-HEM, a unified framework that integrates ensemble modelling, Bayesian uncertainty quantification, and dual explainability for ICH diagnosis on non-contrast head CT scans. X-HEM integrates multiple CNN backbones, Monte Carlo Dropout-based uncertainty quantification, and complementary Grad-CAM++ and SHAP explainability methods to produce robust, confidence-aware, and clinically interpretable predictions. We validate the framework on both the large-scale RSNA ICH dataset and the external CQ500 dataset and conduct ablation studies to isolate the impact of each component. The main contributions are summarized as follows:
We propose X-HEM, an ensemble of VGG16, ResNet50, and DenseNet121 optimized for slice-wise ICH classification in CT images.
We integrate Bayesian uncertainty estimation using Monte Carlo Dropout to produce confidence-aware predictions and enhance clinical reliability.
We implement a dual-mode interpretability framework combining Grad-CAM++ (localization) and SHAP (global attribution) for complementary explanations.
We validate X-HEM on both RSNA and CQ500 datasets, demonstrating strong generalization across diverse clinical imaging conditions.
The remainder of this paper is structured as follows. Section 2 reviews recent advances in deep learning for ICH detection, ensemble learning, uncertainty estimation, and explainable AI, and shows how the limitations of prior approaches motivate the design of X-HEM. Section 3 presents the datasets, preprocessing steps, and the proposed framework. Section 4 details the experimental setup, evaluation metrics, and results. Finally, Section 5 concludes with a discussion of findings, limitations, clinical implications, and promising directions for future research.
2. Related Work
Recent research on AI for intracranial hemorrhage (ICH) and related neuroimaging tasks spans four main areas: deep learning-based ICH classification, ensemble learning in medical imaging, uncertainty quantification, and explainable AI (XAI) for clinical diagnostics.
2.1. ICH Diagnosis Using Deep Learning
Deep learning has been widely applied for detecting ICH on CT scans, with CNN-based architectures such as VGG16, ResNet50, and DenseNet121 achieving strong performance on datasets like the RSNA ICH Challenge. Most of these studies, however, focus solely on binary hemorrhage classification and do not provide integrated interpretability or uncertainty estimates, both of which are key for clinical trust.
Patil et al. [12] developed a hybrid method combining image processing with Inception-ResNet V2, achieving 91% external accuracy while providing bleed localization. Yalcin et al. [13] used EfficientNet-B0 to predict hematoma expansion and achieved 84% accuracy and an 82% F1-score, though the model lacked external validation. Linli et al. [14] applied a spectral-normalized Gaussian process to brain age estimation, showing how uncertainty can improve interpretability, but their work was not focused on ICH. Qiao et al. [15] proposed DeepSAP, integrating CNN and Vision Transformer models to predict stroke-associated pneumonia in ICH patients, achieving an AUC of 0.93 but without uncertainty modelling. Malik et al. [16] benchmarked CNNs, including EfficientNet-B3, achieving 93.29% accuracy but with no explainability integration. Jie et al. [17] combined Random Forest, CatBoost, and Extra Trees to predict early neurological deterioration, employing SHAP for feature-level interpretation but relying solely on tabular data rather than imaging.
Overall, deep learning approaches have achieved impressive accuracy but usually treat classification, interpretability, and uncertainty as separate goals rather than as a unified design.
2.2. Ensemble Learning in Medical Imaging
Ensemble learning is commonly used in medical imaging to enhance robustness and generalization by combining predictions from multiple models. Strategies such as stacked generalization, hard voting, and soft voting aggregate outputs from diverse CNN architectures and often yield higher accuracy than single models.
Mogensen et al. [18] used a backward ensemble search strategy combining ResNet34 and DenseNet169 for gait disorder classification. Hazarika et al. [19] proposed an explainable ensemble with handcrafted CT features, achieving 96.91% accuracy. Zhu et al. [20] designed MEEDNets, a bio-inspired ensemble framework that achieved up to 99.43% accuracy across multiple datasets, and Sreelakshmi et al. [21] developed M-Net for brain MRI segmentation with accuracies up to 99%.
Despite these strong results, most ensemble methods either omit uncertainty quantification or provide only limited explainability, which restricts their use in safety-critical tasks such as ICH triage.
2.3. Uncertainty Estimation in AI Models
Uncertainty estimation methods have been developed to address the “black-box” nature of deep neural networks. One widely used Bayesian approximation technique is Monte Carlo (MC) Dropout, which estimates predictive confidence by performing multiple stochastic forward passes. Other approaches include Bayesian neural networks, ensemble-based calibration, and confidence adjustment strategies such as Brier scoring and temperature scaling. These techniques improve model transparency and clinical trust when they are properly integrated into the prediction pipeline.
Buddenkotte et al. [22] introduced a calibrated ensemble for medical image segmentation, improving reliability and enabling applications such as active learning. Wang et al. [23] developed a CNN–GRU framework for slice-wise ICH classification that achieved an AUC of 0.988 and ranked first in the RSNA challenge, but it did not incorporate Bayesian uncertainty estimation. Overall, while significant progress has been made on uncertainty estimation, its combined use with ensemble frameworks and explicit interpretability is still limited.
2.4. Explainable AI (XAI) in ICH Healthcare
Explainable AI has gained increasing attention due to its critical role in enabling the clinical adoption of AI systems. Visualization methods such as Grad-CAM and Grad-CAM++ are widely used to highlight image regions that contribute to model predictions. In parallel, feature attribution techniques such as SHAP and LIME provide global and instance-level explanations of decision-making processes. However, these methods are often applied in isolation and are seldom evaluated together with uncertainty measures or against expert annotations.
Beer et al. [24] demonstrated SHAP's use in identifying genomic biomarkers for Alzheimer's disease. Du et al. [25] employed SHAP to interpret an XGBoost model by visualizing each feature's contribution to sarcopenia risk assessment in older adults, enhancing transparency and clinical interpretability. Mirzaei et al. [26] benchmarked CNNs on PhysioNet ICH data but lacked explainable components. Yang et al. [27] applied a modified ResNet to distinguish cerebral venous sinus thrombosis-related ICH from spontaneous ICH with an AUC of 0.95, using Grad-CAM for localization but no uncertainty quantification. These works illustrate the potential of XAI but do not yet fully connect explainability with confidence estimation or ensemble modelling.
2.5. Research Gap
Recent studies [12,13,14,15,16,17,18,19,20,21,22,23,24,26,27] demonstrate substantial progress in deep learning, ensemble modelling, and explainable methods for ICH and related tasks. However, most approaches optimize one dimension at a time: classification accuracy, basic ensembling, or interpretability, without jointly addressing calibrated uncertainty, clinically validated explanations, and subtype-aware evaluation within a single framework.
In contrast, X-HEM is designed as an integrated pipeline that (i) ensembles complementary CNN backbones, (ii) quantifies predictive uncertainty via Monte Carlo Dropout with calibration analysis, and (iii) combines Grad-CAM++ and SHAP with ROI-based evaluation. This unified design targets binary ICH detection while still leveraging subtype labels for analysis, aiming to provide predictions that are not only accurate but also confidence-aware and clinically interpretable.
3. Methods
The overall X-HEM workflow, including ensemble fusion, uncertainty estimation, and interpretability modules, is shown in Figure 1. Monte Carlo (MC) Dropout is used during inference to generate predictive uncertainty distributions, while Grad-CAM++ provides spatial localization and SHAP quantifies feature-level attributions. Mean SHAP impact is computed to measure global interpretability and to support ablation analyses.
3.1. Datasets Used
Two publicly available datasets, RSNA Intracranial Hemorrhage Detection and CQ500, were used to develop and evaluate the proposed X-HEM framework. Both contain non-contrast head CT scans labelled for ICH and its subtypes. The RSNA dataset [28], released for the 2019 RSNA ICH Detection Challenge on Kaggle, includes over 750,000 axial slices from 25,000 CT studies. Each slice is annotated for the five ICH subtypes (EDH, IPH, SDH, SAH, IVH) and for overall hemorrhage presence. For this work, labels were mapped into a binary format (hemorrhage vs. non-hemorrhage) for classification, while subtype metadata was retained for visualization and qualitative analysis.
The CQ500 dataset [29] contains 491 CT studies from multiple hospitals in India. Each study was independently reviewed by three radiologists, with consensus labels assigned for ICH presence and subtype. Unlike RSNA, CQ500 provides a diverse clinical context and imaging variability, making it an ideal benchmark for external validation.
Figure 2 shows representative slices from RSNA (left) and CQ500 (right). The RSNA scan shows a classic example of hemorrhage localization, while the CQ500 scan highlights intraparenchymal bleeding with a clear midline shift. The RSNA dataset was split into 70% for training, 10% for validation, and 20% for testing to prevent data leakage. CQ500 was used exclusively for external validation. Each CQ500 study contains approximately 30–60 slices (about 25,000 slices in total).
3.1.1. Data Pre-Processing
A structured preprocessing pipeline was applied to the RSNA and CQ500 datasets to enable effective training of the X-HEM framework. This study adopted a slice-wise classification strategy, treating each axial CT slice as an individual instance. This aligns with the RSNA annotation format and allows learning of fine-grained hemorrhagic features without the cost of volumetric modelling. Processing slices independently expanded the training set to over 750,000 labelled slices.
Binary mapping was performed where slices labelled with any of the five ICH subtypes (EDH, IPH, SDH, SAH, IVH) were assigned the hemorrhage label (1), while all others were assigned the non-hemorrhage label (0). This preserved subtype information for interpretability analyses while simplifying the core classification task. Each raw DICOM image was resized to $224 \times 224$ pixels to match the input resolution of VGG16, ResNet50, and DenseNet121, and normalized to the [0, 1] range using min–max normalization as shown in Equation (1):

$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$   (1)
All slices were processed under a brain window (WL = 40 HU, WW = 80 HU) to ensure consistent contrast for hemorrhage regions and suppress irrelevant tissue signals. To align with ImageNet-pretrained CNN backbones, each grayscale CT slice was replicated across three channels to form a pseudo-RGB image. This ensured compatibility with VGG16, ResNet50, and DenseNet121 pretrained weights while maintaining the original grayscale intensity distribution.
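To make the preprocessing steps concrete, the sketch below shows one way to implement the HU conversion, brain windowing (WL = 40, WW = 80), min–max normalization, resizing, and pseudo-RGB replication described above. It is a minimal illustration assuming pydicom and OpenCV; the function name and structure are ours, not the authors' released code.

```python
# Sketch of slice preprocessing: HU conversion, brain windowing, normalization,
# resizing, and pseudo-RGB replication (illustrative, not the authors' code).
import numpy as np
import pydicom
import cv2

def preprocess_slice(dicom_path, size=224):
    ds = pydicom.dcmread(dicom_path)
    # Convert stored pixel values to Hounsfield Units.
    hu = ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope) \
         + float(ds.RescaleIntercept)
    # Brain window WL=40, WW=80 -> clip intensities to [0, 80] HU.
    lo, hi = 40 - 80 / 2, 40 + 80 / 2
    hu = np.clip(hu, lo, hi)
    # Min-max normalization to [0, 1] (Equation (1)).
    img = (hu - lo) / (hi - lo)
    # Resize to the backbone input resolution.
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
    # Replicate the grayscale channel to form a pseudo-RGB image (H, W, 3).
    return np.stack([img] * 3, axis=-1)
```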
Data augmentation was applied only to the training set to improve robustness and reduce overfitting. Transformations included random flips, rotations, translations (≤10%), and zoom scaling (0.8–1.2). No augmentation was used for the RSNA validation/test sets or CQ500 to ensure fair evaluation.
Figure 3 shows the preprocessing pipeline, where each DICOM slice is resized and normalized before augmentation. After preprocessing, RSNA contained approximately 750,000 standardized slices, and CQ500 contained approximately 25,000 slices (30–60 per study). CQ500 was reserved exclusively for external validation.
3.1.2. Data Splitting
All dataset splitting was done at the study level using the StudyInstanceUID identifier to prevent data leakage. The RSNA dataset was split into 70% for training, 10% for validation, and 20% for testing, corresponding to approximately 17,500, 2,500, and 5,000 studies, respectively. A stratified GroupKFold strategy preserved class balance and ensured no patient appeared in more than one subset. Inclusion was limited to non-contrast CT head studies with valid metadata; incomplete or corrupted scans were excluded. For training, a cosine annealing learning rate scheduler (initial LR = 0.001, annealed to a small minimum over 50 epochs) was used to stabilize convergence. The CQ500 dataset (491 studies, ∼25,000 slices) was used solely for external validation, with no re-splitting or augmentation applied, ensuring genuine out-of-distribution testing.
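A minimal sketch of leakage-free, study-level splitting in this spirit is shown below, using scikit-learn's StratifiedGroupKFold with StudyInstanceUID as the grouping key. Mapping ten folds onto the 70/10/20 split is our illustrative assumption.

```python
# Sketch of study-level, stratified splitting (no study spans two subsets).
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def study_level_split(labels, study_ids, seed=42):
    """Approximate 70/10/20 train/val/test split, grouped by StudyInstanceUID."""
    sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(sgkf.split(np.zeros(len(labels)), labels, groups=study_ids))
    test_idx = np.concatenate([folds[i][1] for i in range(2)])       # 2 folds ~ 20%
    val_idx = folds[2][1]                                            # 1 fold  ~ 10%
    train_idx = np.concatenate([folds[i][1] for i in range(3, 10)])  # 7 folds ~ 70%
    return train_idx, val_idx, test_idx
```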
3.1.3. Implementation Details
All experiments were implemented in Python using PyTorch 1.13.1 with CUDA 11.6. Training and evaluation were performed on a workstation equipped with a single NVIDIA RTX-class GPU (24 GB VRAM), an 8-core CPU, and 64 GB of system memory. All three backbones (VGG16, ResNet50, DenseNet121) were optimized using the Adam optimizer with a base learning rate of 0.001, a cosine-annealing learning-rate schedule, and weight decay regularization. A dropout rate of 0.5 was used in the classifier layers and kept active during inference for Monte Carlo Dropout, with T = 30 stochastic forward passes per slice. Models were trained with a batch size of 32 for up to 50 epochs, using early stopping with a patience of 10 epochs based on validation loss. On the RSNA training split, training a single backbone required approximately 6–8 h, while training all three backbones, constructing the soft-voting ensemble, and configuring MC-Dropout inference completed in roughly two days of wall-clock time. To support reproducibility, all random seeds for data shuffling, weight initialization, and data augmentation were fixed to 42.
3.2. Proposed X-HEM Model
3.2.1. Ensemble Architecture
The deep ensemble forms the core of the X-HEM architecture by combining DenseNet121, VGG16, and ResNet50. Each CNN backbone captures distinct representations of hemorrhagic patterns in CT slices, improving model generalization and robustness. VGG16 is a deep sequential network with small 3 × 3 kernels, ensuring stable feature extraction. ResNet50 introduces residual connections to ease gradient flow and enable learning of complex representations. DenseNet121 improves feature reuse and efficiency by connecting each layer to all preceding ones, reducing overfitting.
Let $p_{D}(x)$, $p_{V}(x)$, and $p_{R}(x)$ represent the softmax probabilities for a CT slice $x$ obtained from DenseNet121, VGG16, and ResNet50, respectively. The final ensemble prediction $\hat{p}(x)$ is given by:

$\hat{p}(x) = \dfrac{1}{3}\left[\, p_{D}(x) + p_{V}(x) + p_{R}(x) \,\right]$   (2)

All base models were initialized with ImageNet-pretrained weights. MC Dropout layers were configured with a 0.5 rate during training and inference. The Adam optimizer with weight decay and a cosine annealing schedule were used, as detailed in Section 3.1.3. Early stopping (patience = 10 epochs) prevented overfitting, and all experiments used a fixed random seed (42) for reproducibility. Each model was trained separately on the preprocessed dataset, and predictions were averaged at the probability level to generate the final output. This ensemble setup stabilizes predictions and improves generalization.
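For illustration, a minimal PyTorch sketch of the soft-voting ensemble follows. The head-replacement details (two-class output, dropout rate 0.5) are assumptions consistent with the described setup, not the authors' exact code.

```python
# Sketch of the three-backbone soft-voting ensemble (Equation (2)).
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(name, num_classes=2, p_drop=0.5):
    """Replace each ImageNet classifier head with Dropout + Linear (2 classes)."""
    if name == "vgg16":
        m = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        m.classifier[-1] = nn.Sequential(nn.Dropout(p_drop), nn.Linear(4096, num_classes))
    elif name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        m.fc = nn.Sequential(nn.Dropout(p_drop), nn.Linear(2048, num_classes))
    else:  # densenet121
        m = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        m.classifier = nn.Sequential(nn.Dropout(p_drop), nn.Linear(1024, num_classes))
    return m

class SoftVotingEnsemble(nn.Module):
    def __init__(self, backbones):
        super().__init__()
        self.models = nn.ModuleList(backbones)

    def forward(self, x):
        # Average class probabilities (soft voting), not logits.
        probs = [torch.softmax(m(x), dim=1) for m in self.models]
        return torch.stack(probs).mean(dim=0)
```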
3.2.2. Inference Algorithm
The X-HEM inference pipeline combines (i) ensemble-based classification, (ii) Bayesian uncertainty estimation, and (iii) post hoc interpretability. Each model performs multiple forward passes with dropout enabled, producing probability distributions used to estimate both the prediction mean and variance. Ensemble averaging determines the class label, while the variance captures uncertainty. Grad-CAM++ and SHAP generate visual and feature-level explanations. At the study level, predictions were aggregated using a top-$k$ pooling strategy ($k = 3$), averaging the three most confident slices. This reduced false positives and maintained a balance between sensitivity and specificity. Algorithm 1 summarizes the inference process.
Algorithm 1: X-HEM Inference with Bayesian Uncertainty and Explainability

Require: Pre-processed CT slice $x$
Ensure: Final prediction label $\hat{y}$, uncertainty score $U$, visual explanations (Grad-CAM++, SHAP)
1: Define base models $\{f_1, f_2, f_3\}$ (DenseNet121, VGG16, ResNet50), each with dropout enabled
2: Set the number of Monte Carlo forward passes $T = 30$
3: Initialize prediction matrix $P$
# Monte Carlo Dropout Sampling
4: for each model $f_m$ do
5:  for $t = 1$ to $T$ do
6:   $z_{m,t} \leftarrow f_m(x)$ (stochastic forward pass with dropout active)
7:   $p_{m,t} \leftarrow \mathrm{softmax}(z_{m,t})$
8:   $P[m, t] \leftarrow p_{m,t}$
9:  end for
10: end for
# Compute Model-wise Statistics
11: for each model $f_m$ do
12:  $\mu_m \leftarrow \frac{1}{T} \sum_{t=1}^{T} p_{m,t}$
13:  $\sigma_m^2 \leftarrow \frac{1}{T} \sum_{t=1}^{T} (p_{m,t} - \mu_m)^2$
14: end for
# Aggregate Ensemble Prediction
15: $\bar{p} \leftarrow \frac{1}{3} \sum_{m=1}^{3} \mu_m$
16: $\hat{y} \leftarrow \arg\max_c \bar{p}_c$
# Aggregate Uncertainty
17: $U \leftarrow \frac{1}{3} \sum_{m=1}^{3} \sigma_m^2$
# Generate Explainability Outputs
18: Generate Grad-CAM++ saliency map for class $\hat{y}$
19: Compute SHAP feature attribution scores
20: return $\hat{y}$, $\bar{p}$, $U$, Grad-CAM++ and SHAP explanations
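The following sketch illustrates Algorithm 1 in PyTorch: dropout layers are kept stochastic at test time, $T = 30$ passes are collected per model, and the ensemble mean, label, and aggregated uncertainty are computed. How the per-class variances are reduced to a single scalar $U$ is not specified in the text, so the sum over classes below is an assumption.

```python
# Sketch of MC-Dropout ensemble inference following Algorithm 1.
import torch

def enable_mc_dropout(model):
    """Keep dropout stochastic at test time; batch-norm stays in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(models, x, T=30):
    """Return ensemble labels, mean probabilities, and uncertainty for a batch."""
    for m in models:
        enable_mc_dropout(m)
    # P has shape (num_models, T, batch, num_classes).
    P = torch.stack([
        torch.stack([torch.softmax(m(x), dim=1) for _ in range(T)])
        for m in models
    ])
    mu = P.mean(dim=1)               # per-model predictive mean
    var = P.var(dim=1)               # per-model predictive variance
    p_bar = mu.mean(dim=0)           # ensemble mean probability
    y_hat = p_bar.argmax(dim=1)      # final class label
    U = var.mean(dim=0).sum(dim=1)   # scalar uncertainty per slice (assumption)
    return y_hat, p_bar, U
```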
3.2.3. Handling Class Imbalance
The RSNA dataset exhibits class imbalance, with non-hemorrhage slices dominating. To address this, focal loss was used to emphasize hard or minority-class examples:

$\mathrm{FL}(p_t) = -\alpha \, (1 - p_t)^{\gamma} \, \log(p_t)$   (3)

Here, $p_t$ is the predicted probability of the true class, $\alpha$ adjusts class weighting, and $\gamma$ controls the focusing strength. Stratified sampling preserved class balance across mini-batches. No oversampling was applied, to prevent patient-level duplication. During evaluation, a slice was classified as hemorrhagic if its ensemble probability exceeded the decision threshold, and study-level predictions were aggregated using the top-$k$ approach.
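A minimal sketch of the focal loss as defined in Equation (3) follows. The α and γ values shown are common defaults and are assumptions, since the paper does not report its settings here.

```python
# Sketch of binary focal loss; alpha and gamma defaults are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
```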
3.3. Bayesian Uncertainty Estimation
In high-stakes clinical environments such as ICH diagnosis, an AI system’s ability to quantify confidence is as critical as predictive accuracy. For this purpose, X-HEM adopts MC Dropout as a practical Bayesian approximation technique. MC Dropout provides a lightweight and scalable way to estimate predictive uncertainty without retraining or modifying the base CNN architectures. Compared with more complex Bayesian neural networks, it is computationally efficient and easily integrated into an ensemble, making it suitable for real-time clinical use.
During inference, dropout layers remain active and the model generates distributions of class probabilities across multiple stochastic forward passes. This yields predictive means for final classification and predictive variances as uncertainty estimates. MC Dropout simulates Bayesian behaviour by activating dropout layers at test time, thereby sampling from an approximate posterior distribution. For each input CT slice, the model performs $T = 30$ stochastic forward passes, generating a set of class probability vectors $\{p_1, p_2, \ldots, p_T\}$. From these outputs, two key statistics are computed: the predictive mean ($\mu$) and the predictive variance ($\sigma^2$):

$\mu = \dfrac{1}{T} \sum_{t=1}^{T} p_t$   (4)

$\sigma^2 = \dfrac{1}{T} \sum_{t=1}^{T} (p_t - \mu)^2$   (5)

A lower variance indicates consistent predictions and higher confidence, whereas a higher variance signals uncertainty, often in ambiguous or underrepresented cases. X-HEM computes these variances for each base model and aggregates them to estimate the overall ensemble uncertainty ($U$), reflecting epistemic uncertainty arising from model and data limitations.
Table 1 summarizes the uncertainty scores across key diagnostic categories. Incorrect predictions exhibit substantially higher uncertainty (0.089) than correct predictions (0.021), indicating that the model can signal when it is unsure. Similarly, hemorrhage-positive slices exhibit greater uncertainty (0.033) than non-hemorrhage slices (0.015). This can be attributed to the greater anatomical variability, irregular bleed morphology, and subtle radiological signatures associated with hemorrhagic cases. In contrast, non-hemorrhage slices are relatively homogeneous, which results in lower variance and greater confidence.
These findings confirm that the predictive variance computed by X-HEM is a reliable indicator of epistemic uncertainty. The model not only delivers accurate predictions but also signals when clinical review is necessary. To visualize and validate the behaviour of the uncertainty estimation, Figure 4, Figure 5 and Figure 6 provide complementary views: calibration, variance distribution, and study-level selective prediction. These visualizations demonstrate the reliability and interpretability of the uncertainty outputs and are integral to the model validation process. Together, these analyses confirm that X-HEM delivers high classification accuracy alongside well-calibrated, interpretable confidence estimates, critical attributes for safe diagnostic decision support.
3.3.1. Calibration Analysis
To assess the calibration of model predictions, we evaluated three complementary metrics: Brier Score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE). The Brier Score measures the mean squared difference between predicted probabilities and true outcomes, with lower values indicating better calibration:

$\mathrm{Brier} = \dfrac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$   (6)

Here, $\hat{p}_i$ is the predicted probability for sample $i$, and $y_i$ is the ground-truth label. The ECE was computed by partitioning the predicted probabilities into $M$ bins and calculating the weighted average absolute difference between accuracy and confidence:

$\mathrm{ECE} = \sum_{m=1}^{M} \dfrac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$   (7)

where $B_m$ denotes the set of indices falling into bin $m$, $\mathrm{acc}(B_m)$ is the accuracy within bin $m$, and $\mathrm{conf}(B_m)$ is the mean confidence of samples in that bin. The MCE reflects the worst-case miscalibration by reporting the maximum gap across all bins:

$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$   (8)
To quantify uncertainty in these calibration metrics, we computed 95% confidence intervals using nonparametric bootstrapping at the study level. For each bootstrap sample, calibration metrics were recalculated, and percentile-based confidence intervals were reported. Furthermore, we performed a selective prediction (risk–coverage) analysis to evaluate the model’s behaviour when discarding high-uncertainty cases. Predictive variance from MC Dropout was used as a rejection criterion, and we reported coverage at different variance thresholds (0.2, 0.3, 0.35, 0.4). This analysis provides insight into the trade-off between reliability and case retention, a crucial factor for clinical decision support. No additional recalibration methods, such as temperature scaling or isotonic regression, were applied. Instead, calibration was assessed on the native model outputs, allowing a fair comparison between single CNNs and the ensemble.
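For reference, the sketch below computes the Brier Score, ECE, and MCE exactly as defined in Equations (6)–(8), together with the coverage quantity used in the risk–coverage analysis. The number of bins (M = 15) is an illustrative assumption.

```python
# Sketch of calibration metrics and selective-prediction coverage.
import numpy as np

def brier_score(p, y):
    """Equation (6): mean squared error of predicted probabilities."""
    return np.mean((p - y) ** 2)

def ece_mce(p, y, M=15):
    """Equations (7)-(8) with equal-width confidence bins; returns (ECE, MCE)."""
    conf = np.where(p >= 0.5, p, 1 - p)       # confidence of the predicted class
    pred = (p >= 0.5).astype(int)
    bins = np.linspace(0.0, 1.0, M + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs((pred[mask] == y[mask]).mean() - conf[mask].mean())
            ece += mask.mean() * gap          # weighted by |B_m| / N
            mce = max(mce, gap)
    return ece, mce

def coverage_at_threshold(variance, tau):
    """Fraction of cases retained when rejecting predictions with variance > tau."""
    return float((variance <= tau).mean())
```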
3.3.2. ROI Annotation
To quantitatively validate the reliability of the Grad-CAM++ visualizations, we generated region-of-interest (ROI) masks for a subset of the dataset. Since pixel-level annotations are not provided in either the RSNA or CQ500 datasets, we selected 500 representative CT slices, balanced across hemorrhage subtypes and normal cases, and had them independently annotated by certified radiologists. Inter-rater agreement before consensus was high, with a Cohen's $\kappa$ of 0.82, indicating strong reliability of the annotations. Grad-CAM++ heatmaps were normalized to the range [0, 1] and binarized at a fixed threshold of 0.5 to produce predicted activation masks. These masks were then compared with the radiologist ROIs to compute two quantitative explainability metrics: Intersection over Union (IoU) and Hit Rate. This ensured that the explainability metrics were grounded in clinically validated references rather than subjective model activations.
3.3.3. Explainability Evaluation Protocol
Explainability evaluation was conducted at the slice level, in accordance with the slice-wise design of the X-HEM framework and the annotation format of the RSNA dataset. For each CT slice, a Grad-CAM++ activation map corresponding to the predicted class was generated and compared only with the radiologist-annotated region for the same slice. No exam-level aggregation was used for the computation of explainability metrics.
Grad-CAM++ heatmaps were normalized to the range [0, 1] and binarized using a fixed threshold of 0.5, chosen to balance sensitivity and spatial specificity. Minor threshold variations did not significantly affect IoU or Hit Rate, indicating that the evaluation is robust to small threshold changes.
Two metrics were computed per slice: IoU, measuring spatial overlap with the annotated region, and Hit Rate, indicating whether the maximum activation fell within the hemorrhage area. Final results report averages over 500 annotated slices.
For feature-level explainability, SHAP values were extracted from the penultimate dense layer. Mean SHAP Impact scores were computed across all slices to provide stable feature-attribution estimates complementary to Grad-CAM++.
3.4. Explainability Modules
While achieving high predictive accuracy is crucial, clinical adoption of AI systems also depends on explainability, the model’s ability to justify its decisions in an interpretable manner. In medical imaging, this is essential, as clinicians must understand whether the model’s reasoning aligns with human diagnostic logic. To ensure transparency and promote clinical trust, the X-HEM framework integrates two complementary explainability techniques: Grad-CAM/Grad-CAM++ for spatial localization and SHAP for global feature attribution.
3.4.1. Grad-CAM and Grad-CAM++ for Spatial Attention
In X-HEM, spatial explainability is achieved using Gradient-weighted Class Activation Mapping (Grad-CAM) and its refined version, Grad-CAM++. These methods generate heatmaps that visually highlight the regions of a CT slice influencing the model's classification output. Such spatial insight is critical for diagnosing Intracranial Hemorrhage (ICH), as identifying the precise bleed location directly supports clinical validation. Grad-CAM computes the gradient of the target class score with respect to the feature maps of the final convolutional layer. These gradients are average-pooled to obtain weights that emphasize important regions. However, Grad-CAM often struggles with small or diffuse hemorrhagic regions, leading to coarse, imprecise activations. To overcome this, Grad-CAM++ refines localization by using higher-order gradients and assigning pixel-wise weights to feature maps, thereby improving focus even when multiple regions contribute to the decision. The Grad-CAM++ class activation map for a target class $c$ is given by:

$L^{c}_{\text{Grad-CAM++}} = \mathrm{ReLU}\!\left( \sum_{k} w_{k}^{c} \, A^{k} \right)$   (9)

where $A^{k}$ denotes the $k$th feature map from the final convolutional layer and $w_{k}^{c}$ represents its weight for class $c$. In this work, Grad-CAM++ was applied to each CNN backbone (VGG16, ResNet50, DenseNet121) within the ensemble, and the resulting heatmaps were used to interpret the ensemble's soft-voted output. The Grad-CAM++ overlays consistently highlighted hemorrhagic regions, in agreement with expert radiologist annotations, for both the RSNA and CQ500 datasets. Compared to Grad-CAM, Grad-CAM++ produced sharper, anatomically aligned activations, making it better suited for clinical use.
Figure 7 illustrates a qualitative comparison between Grad-CAM and Grad-CAM++ overlays on representative CT slices across five major ICH subtypes: Epidural (EDH), Intraparenchymal (IPH), Subdural (SDH), Subarachnoid (SAH) and Intraventricular (IVH). Grad-CAM overlays highlight broader regions of interest but often extend into surrounding non-hemorrhagic structures, limiting precision. In contrast, Grad-CAM++ produces sharper, more localized activations that closely align with radiologically visible hyperdense hemorrhage regions. For example, in epidural and intraventricular hemorrhage, Grad-CAM++ sharply delineates the high-density lesions, while in subarachnoid and subdural hemorrhage, it avoids unnecessary activations outside the sulci and cortical boundaries. These results demonstrate that Grad-CAM++ improves localization over Grad-CAM and also provides subtype-specific interpretability.
3.4.2. SHAP for Global Feature Attribution
While Grad-CAM++ provides spatial attention, it does not explain why the model made a decision based on its internal representations. To provide semantic-level understanding, X-HEM integrates SHAP (SHapley Additive exPlanations), a model-agnostic, game-theoretic approach that quantifies each feature's contribution to the model's output. SHAP was applied to the penultimate dense layer of each CNN (VGG16, ResNet50, DenseNet121) within the ensemble. This layer encodes high-level abstract features such as texture, contrast, and structural asymmetry. For each feature $i$, the SHAP value $\phi_i$ is defined as:

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \dfrac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$   (10)

where $S$ is a subset of all features $F$ excluding $i$, and $f(S)$ denotes the model output using the features in $S$. Aggregating SHAP values across all test samples identified the most influential features driving hemorrhage classification. These results are visualized in Figure 8, which presents a SHAP summary plot highlighting the five most impactful neurons.
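One possible realization of this step with the shap library is sketched below: the network is split into an assumed feature extractor and classifier head, the head is explained with KernelExplainer over penultimate activations, and the Mean SHAP Impact used in Section 3.4.3 is computed. All helper names and background-sample sizes are illustrative assumptions.

```python
# Sketch of SHAP attribution on penultimate-layer features (illustrative).
import numpy as np
import shap
import torch

def shap_mean_impact(feature_extractor, head, background_x, test_x, n_background=100):
    """Mean |SHAP| per penultimate feature (Mean SHAP Impact)."""
    with torch.no_grad():
        bg = feature_extractor(background_x[:n_background]).cpu().numpy()
        zs = feature_extractor(test_x).cpu().numpy()

    def head_fn(z):
        # Wrap the torch head as a NumPy function returning P(hemorrhage).
        with torch.no_grad():
            out = head(torch.as_tensor(z, dtype=torch.float32))
            return torch.softmax(out, dim=1)[:, 1].numpy()

    explainer = shap.KernelExplainer(head_fn, shap.sample(bg, 50))
    shap_values = explainer.shap_values(zs)       # (n_samples, n_features)
    return np.abs(shap_values).mean(axis=0)       # Mean SHAP Impact per feature
```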
3.4.3. Quantitative Evaluation
To evaluate the dual explainability modules in X-HEM, a comprehensive quantitative assessment was performed, focusing on accuracy, interpretability, and clinical relevance. Grad-CAM and Grad-CAM++ were assessed using Intersection over Union (IoU), Hit Rate, and Visual Clarity, while SHAP was evaluated using Mean SHAP Impact scores. The Hit Rate is clinically important because successfully highlighting the pathological region, even partially, helps radiologists interpret the prediction. Visual Clarity ensures that explanations are clear and understandable to human observers, which is essential for clinical use. The IoU measures the overlap between the model-generated attention map and the radiologist's annotated region of interest (ROI):

$\mathrm{IoU} = \dfrac{|A \cap R|}{|A \cup R|}$   (11)

where $A$ is the binary activation map from Grad-CAM++ and $R$ is the annotated ROI. The Hit Rate quantifies whether the most activated pixel lies within the annotated ROI:

$\mathrm{HitRate} = \dfrac{1}{N} \sum_{n=1}^{N} \mathbb{1}\!\left[ \arg\max(A_n) \in R_n \right]$   (12)

Finally, the Mean SHAP Impact for feature $i$ is computed as:

$\overline{|\phi_i|} = \dfrac{1}{N} \sum_{n=1}^{N} \left| \phi_i^{(n)} \right|$   (13)
Table 2 shows that Grad-CAM++ significantly improves spatial alignment with clinical annotations, achieving an IoU of 86.3% and a Hit Rate of 91.4%. Its higher clarity score (4.6/5) demonstrates its practical diagnostic utility. SHAP identified stable, meaningful feature contributions across both the RSNA and CQ500 datasets, further confirming the interpretability and reliability of X-HEM. Together, Grad-CAM++ and SHAP provide complementary spatial and semantic insights, making X-HEM both interpretable and clinically actionable.
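The two spatial metrics can be computed per slice as in the following sketch, using the fixed 0.5 binarization threshold described earlier; array names are illustrative.

```python
# Sketch of the explainability metrics (Equations (11)-(12)).
import numpy as np

def iou(cam, roi, threshold=0.5):
    """cam: Grad-CAM++ map in [0, 1]; roi: binary radiologist mask."""
    a = cam >= threshold
    r = roi.astype(bool)
    union = np.logical_or(a, r).sum()
    return np.logical_and(a, r).sum() / union if union else 0.0

def hit(cam, roi):
    """1.0 if the most-activated pixel falls inside the annotated ROI."""
    y, x = np.unravel_index(np.argmax(cam), cam.shape)
    return float(roi[y, x] > 0)
```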
3.5. Performance Evaluation Metrics
To ensure a comprehensive evaluation of the X-HEM framework, a combination of classification, calibration, discriminative, and explainability metrics was employed. This multidimensional assessment captures not only accuracy but also interpretability and reliability, key requirements for clinical deployment. For classification, standard measures such as Accuracy, Precision, Recall, Specificity, and F1-Score were used to assess the model's ability to correctly identify hemorrhagic slices while minimizing false alarms. In addition, the Negative Predictive Value (NPV) was included to evaluate the proportion of predicted negatives that were truly negative, an essential metric in high-stakes clinical tasks where missed hemorrhages can have severe consequences. Discriminative ability was quantified using the Area Under the Receiver Operating Characteristic Curve (ROC-AUC), which measures how well the model distinguishes between hemorrhage and non-hemorrhage cases across thresholds. Confidence calibration was assessed using the Brier score (Equation (6)), which measures the alignment between predicted probabilities and actual outcomes. Explainability alignment was measured by the Grad-CAM Hit Rate (Equation (12)), which quantifies how often model attention maps overlap with radiologist-annotated hemorrhage regions. All performance metrics and their corresponding formulas are summarized in Table 3.
This comprehensive metric set ensures a well-rounded evaluation of the X-HEM framework, covering accuracy, discriminative ability, confidence calibration, and interpretability. In addition to slice-level analysis, study-level performance was evaluated, since clinical decisions are typically made at the scan level. For each scan, slice probabilities were aggregated using two strategies: (i) the maximum probability across all slices and (ii) the mean of the top-$k$ highest-scoring slices. These aggregated scores were used to compute study-level ROC-AUC, F1-Score, and confusion matrices. For IoU and Hit Rate, Grad-CAM++ heatmaps were thresholded at 0.5 and compared directly with the radiologist-annotated ROIs described in Section 3.3.2. This procedure ensured consistency between the quantitative explainability metrics and expert-defined ground truth.
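A minimal sketch of the two study-level aggregation strategies (maximum pooling and top-k mean pooling with k = 3) follows; it assumes slice probabilities have already been produced by the ensemble.

```python
# Sketch of scan-level aggregation of slice probabilities.
import numpy as np

def study_score(slice_probs, method="topk", k=3):
    """Aggregate slice-level hemorrhage probabilities into one scan-level score."""
    p = np.asarray(slice_probs)
    if method == "max":
        return p.max()
    # Mean of the k most confident slices (top-k pooling, k = 3 in this work).
    return np.sort(p)[-k:].mean()
```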
4. Results
4.1. Performance Comparison
To evaluate the classification performance of the X-HEM framework, we conducted a comparative analysis between the three base CNN models, VGG16, ResNet50, and DenseNet121, and the final ensemble. All models were evaluated on the same test set derived from the RSNA dataset to ensure consistency in benchmarking. The ensemble model, constructed via soft voting over the probabilistic outputs of the base classifiers, consistently outperformed the individual models across all standard evaluation metrics. As summarized in Table 4, the ensemble achieved the highest accuracy of 94%, precision of 93%, recall of 95%, and F1-score of 0.94, compared to the base CNNs, whose accuracy ranged from 87% to 91%. The ensemble also recorded the highest AUC of 0.96, showing greater discriminative capability and overall reliability.
Table 4 presents the detailed performance of individual CNN models and the proposed X-HEM ensemble on the RSNA test set. The ensemble consistently outperformed the individual models across all metrics, demonstrating the strength of soft voting in improving diagnostic accuracy for Intracranial Hemorrhage (ICH) detection. All reported metrics include 95% confidence intervals (CIs), estimated using bootstrap resampling with 1000 iterations.
The external validation results on the CQ500 dataset are summarized in Table 5. As expected, all models experienced a modest performance drop compared to the RSNA test set, reflecting the natural variability of real-world clinical data. Nonetheless, the ensemble retained superior accuracy (91%) and AUC (0.94), confirming its robustness and strong generalization across institutions and imaging conditions.
Beyond accuracy, the X-HEM framework demonstrated strong calibration, low uncertainty variance, and high alignment with explainability. Predictive variance was systematically higher for misclassified and hemorrhage-positive slices, showing the model's ability to recognize uncertainty. Grad-CAM++ and SHAP analyses further confirmed that the ensemble's decisions align with radiologically meaningful regions and feature patterns. These results collectively validate that X-HEM delivers both high accuracy and reliable interpretability, making it suitable for integration into clinical diagnostic workflows. In addition to the base CNNs, we compared X-HEM with more advanced architectures, including EfficientNet-B0, ConvNeXt-T, and ViT-B/16, trained under the same preprocessing and evaluation pipeline. As shown in Table 6, these models achieved competitive AUCs between 0.93 and 0.955 on RSNA and between 0.928 and 0.935 on CQ500. However, X-HEM consistently outperformed them, reaching 0.96 on RSNA and 0.94 on CQ500. Beyond accuracy, X-HEM demonstrated better calibration, with the lowest Brier score and Expected Calibration Error (ECE), while maintaining transparent predictions through Grad-CAM++ and SHAP analyses.
To further align with clinical workflows, we evaluated X-HEM at the study level by aggregating slice predictions into scan-level scores. Two aggregation strategies were applied: maximum probability pooling and top-$k$ mean pooling. The ensemble achieved nearly identical results under both methods, maintaining high sensitivity and specificity. Table 7 reports the scan-level metrics on the RSNA and CQ500 datasets. The ensemble achieved an AUC of 0.96 on RSNA and 0.94 on CQ500, closely aligned with the slice-level estimates. The consistently low false-negative rate, particularly on CQ500, where sensitivity reached 100%, highlights the clinical reliability of the framework.
Overall, the X-HEM ensemble demonstrated consistent superiority across both internal and external datasets, outperforming single CNNs and newer transformer-based models. Its strong calibration, awareness of uncertainty, and interpretable predictions position it as a reliable diagnostic assistant for intracranial hemorrhage detection in real-world clinical settings.
4.2. Uncertainty Analysis
Predictive confidence is as essential as classification accuracy in clinical AI. The X-HEM framework incorporates Bayesian inference via Monte Carlo (MC) Dropout, performing 30 stochastic forward passes per test slice to capture predictive uncertainty. Calibration was evaluated using the Brier Score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE). As summarized in Table 8, the ensemble achieved the lowest Brier Score (0.029) and the fewest calibration errors, confirming reliable probability estimates. Incorrect predictions showed substantially higher predictive variance (0.089) than correct predictions (0.021), indicating that X-HEM effectively recognizes when it is uncertain. Hemorrhage-positive slices exhibited slightly greater variance (0.033) than non-hemorrhage slices (0.015), reflecting the higher complexity of bleed regions. Such calibrated uncertainty supports safer clinical deployment by flagging ambiguous cases for radiologist review.
Table 9 further summarizes the calibration results of the individual CNNs and the ensemble. The X-HEM ensemble outperformed all baselines, achieving the lowest Brier Score, ECE, and MCE across variance thresholds. At the selected variance threshold, it maintained 91% coverage while discarding highly uncertain slices, demonstrating effective uncertainty-based risk control.
The ensemble also achieved strong calibration on the CQ500 dataset, with a Brier Score of 0.034 and ECE of 0.019, confirming generalization across institutions.
4.3. Subtype-Wise Performance and Class Imbalance
Subtype-level evaluation was performed across five categories of intracranial hemorrhage: Epidural (EDH), Intraparenchymal (IPH), Subdural (SDH), Subarachnoid (SAH), and Intraventricular (IVH). Table 10 reports sensitivity, specificity, precision, and F1-score for each subtype. The model achieved strong detection for IPH and SDH, while the lower sensitivity for EDH reflects the class imbalance typical of real-world datasets. Overall, X-HEM maintained consistent and clinically reliable performance across all subtypes.
These results confirm that X-HEM maintains high reliability in all ICH subtypes while providing calibrated confidence estimates and strong interpretability, essential for real-world clinical decision support.
4.4. Explainability Results
In addition to strong predictive performance, the X-HEM framework demonstrates high transparency through its dual explainability modules. Grad-CAM++ provides spatial localization of hemorrhage regions, while SHAP quantifies global feature importance. Together, they enhance model interpretability and clinician trust. Table 11 summarizes the mean SHAP impact values for selected neurons/features within the ensemble. Neurons with consistently high SHAP values contributed strongly to hemorrhage classification, while those with low values had minimal influence.
4.5. Ablation Study
To quantify the contribution of each core component of X-HEM, an ablation study was conducted. Four progressively enhanced model variants were evaluated on the RSNA and CQ500 datasets: (A) baseline VGG16, (B) ensemble (VGG16, ResNet50, DenseNet121), (C) ensemble + MC Dropout, and (D) full X-HEM (ensemble + MC Dropout + explainability). See Table 12.
Each addition to the framework improved both predictive and calibration performance. The full X-HEM model (Variant D) achieved the highest accuracy, lowest Brier Score, and strongest interpretability, confirming the synergistic effect of ensemble learning, uncertainty modelling, and explainability.
4.6. Statistical Significance of ROC–AUC Improvements
To test whether the AUC improvements were statistically significant, DeLong's test was applied at the study level. As shown in Table 13, the X-HEM ensemble achieved significantly higher AUC than all single backbones on both datasets (p < 0.05 for all comparisons). The results demonstrate that X-HEM achieves statistically significant performance gains, strong calibration, and interpretable decision reasoning, making it a reliable and clinically applicable framework for intracranial hemorrhage detection.
5. Discussion
The X-HEM framework shows that combining ensemble deep learning, Bayesian uncertainty, and dual explainability can significantly improve both reliability and clinical interpretability in Intracranial Hemorrhage (ICH) detection. Across the RSNA and CQ500 datasets, X-HEM achieved AUCs of 0.96 and 0.94, confirming strong generalization. More importantly, it provided calibrated confidence and radiologically meaningful explanations rather than relying solely on accuracy. Uncertainty estimation proved highly effective, with predictive variance consistently higher for misclassified or hemorrhagic slices, allowing the model to signal low-confidence cases. This capability adds a self-awareness layer, ensuring that ambiguous scans can be flagged for radiologist review rather than accepted blindly. The dual explainability approach further reinforces trust: Grad-CAM++ localized hemorrhagic regions in alignment with radiologist annotations, while SHAP revealed the feature-level reasoning behind predictions. Compared to baseline CNNs and transformer-based models such as EfficientNet and ViT, X-HEM matched or exceeded their accuracy while delivering better calibration and interpretability. The ablation study confirmed that each module (ensemble learning, MC Dropout, and explainability) played a measurable role in improving both prediction quality and reliability.
From a practical standpoint, X-HEM directly supports clinical decision-making. In emergency triage, rapid yet reliable ICH detection is critical. By combining confidence-calibrated predictions with interpretable visual outputs, the framework helps radiologists focus on high-likelihood hemorrhage cases while identifying uncertain results that may need secondary review. This workflow reduces diagnostic risk and optimizes time-sensitive resource allocation. X-HEM can be readily integrated into PACS systems, providing both slice-level alerts and study-level summaries. Despite its strengths, limitations remain. The model analyzes 2D slices independently, without leveraging 3D contextual or temporal information. It currently performs binary ICH detection; achieving balanced multi-class performance across EDH, IPH, SDH, SAH, and IVH subtypes remains a future goal. In addition, the reliance on a fixed HU window (WL = 40, WW = 80) may affect adaptability to varied acquisition protocols. Finally, the system does not yet incorporate out-of-distribution (OOD) detection, an essential step for ensuring robustness across different scanners and institutions.
We used MC Dropout for Bayesian uncertainty estimation because it offers a good balance of accuracy and efficiency, making it suitable for real-time medical imaging. While methods like deep ensembles provide more robust estimates, they are computationally expensive. Post-hoc calibration (e.g., temperature scaling) improves confidence calibration but does not capture all uncertainty types. Evidential deep learning offers principled modeling but can be unstable or hyperparameter sensitive. A detailed comparison of these uncertainty methods, especially for the detection of intracerebral hemorrhage (ICH), remains a key area for future research.
Although traditional machine learning models like XGBoost can perform well when using handcrafted radiomic features or deep features extracted from CNNs, this study specifically focuses on end-to-end, image-based deep learning approaches. We have concentrated on evaluating architectures that learn spatial and contextual representations directly from CT images. As a result, a systematic comparison with models like XGBoost using radiomic features or deep features will be addressed in future work.
Kelly et al. [30] emphasize that for AI systems to be safely utilized in critical clinical settings, they must provide reliable confidence estimates and clear decision outputs. They point out that uncertainty awareness and explainability are essential for building clinician trust, increasing awareness of errors, and ensuring the responsible use of medical AI. Placing X-HEM within this context connects the framework to the wider discussion on trustworthy and deployable AI for healthcare without requiring additional experiments or methodological changes.
6. Conclusions and Future Scope
This study presented X-HEM, an explainable deep learning framework for detecting Intracranial Hemorrhage (ICH) in non-contrast CT scans. Unlike prior studies that focus on accuracy, uncertainty, or interpretability alone, X-HEM unifies all three through ensemble learning, Bayesian uncertainty estimation using Monte Carlo Dropout, and dual explainability with Grad-CAM++ and SHAP. The framework delivers accurate, confidence-calibrated, and interpretable predictions aligned with radiological reasoning. By quantifying predictive variance, it identifies low-confidence cases for further review, while the explainability modules provide both spatial and global insights that enhance trust in automated diagnosis. Evaluations on RSNA and CQ500 datasets confirmed strong generalization and clinically meaningful interpretability, with ablation results verifying the individual contributions of ensemble learning, uncertainty estimation, and explainability. A key outcome of this study is the demonstration of subtype-level interpretability, with Grad-CAM++ overlays aligning with hemorrhagic regions across EDH, IPH, SDH, SAH, and IVH subtypes. Building on this, a key direction for future work is to extend the framework to full multi-class subtype classification, which would provide finer diagnostic granularity, as treatment strategies and prognoses differ across hemorrhage types. This is technically challenging due to class imbalance, subtle anatomical variations, and inter-class similarity, but can be addressed using advanced sampling strategies, tailored loss functions, and hierarchical modelling. Future directions include adopting 3D or sequence-aware architectures for contextual modelling, incorporating radiologist feedback via active learning, and adapting the framework for deployment in low-resource clinical environments.