1. Introduction
Intracranial Hemorrhage (ICH), defined as acute bleeding within the cranial vault, is a critical neurological emergency that requires rapid diagnosis and intervention [1]. The most common causes include hypertension, head trauma, vascular malformations, and anticoagulant therapy [2]. The condition carries a high fatality rate because irreversible brain injury occurs quickly after onset. According to the World Health Organization (WHO), hemorrhagic strokes account for about 11% of all strokes yet contribute disproportionately to global mortality, with stroke ranking as the second leading cause of death worldwide [3]. Although less common than ischemic stroke, hemorrhagic stroke is responsible for 30–40% of stroke-related fatalities [4]. Reports from the American Heart Association (AHA) indicate that the 30-day mortality rate for ICH ranges from 35% to 52%, with nearly half of these deaths occurring within the first 24 h [5]. These statistics highlight the urgent need for diagnostic methods that are accurate, fast, and reliable in acute clinical settings. ICH manifests in five clinically recognized subtypes: Epidural Hemorrhage (EDH), Intraparenchymal Hemorrhage (IPH), Subdural Hemorrhage (SDH), Subarachnoid Hemorrhage (SAH), and Intraventricular Hemorrhage (IVH). Each subtype differs in etiology, prognosis, and treatment strategy, making subtype-level interpretability essential for clinical trustworthiness. Deep learning, particularly Convolutional Neural Networks (CNNs), has achieved impressive success in medical image analysis in recent years [6,7]. Pretrained architectures such as VGG16, ResNet50, and DenseNet121 have been widely applied for pathology detection in head Computed Tomography (CT) scans [8]. Despite their accuracy, most CNN-based systems lack transparency and confidence quantification. These "black-box" models provide predictions without explaining their reasoning or indicating their level of certainty [9]. This limitation significantly reduces clinical trust, especially in life-threatening scenarios such as ICH, where both false positives and false negatives may have serious consequences [10].
Recent research has shifted towards building reliable AI systems that combine predictive accuracy with interpretability and confidence estimation [11]. Such systems are designed to support, rather than replace, clinicians in high-risk diagnostic workflows. Methods that employ Grad-CAM or SHAP improve interpretability but are rarely combined with predictive uncertainty, an essential factor for decision support in emergency radiology. Although existing deep learning methods for ICH detection achieve high accuracy, most focus on isolated objectives such as classification, basic ensemble fusion, or visualization. Clinical deployment, however, demands more than accuracy alone, requiring careful consideration of confidence calibration and decision transparency in safety-critical settings. In this context, recent medical artificial intelligence literature has increasingly highlighted reliability, interpretability, and deployment readiness as ongoing research challenges, particularly in radiology and emergency diagnostics. While these aspects continue to be actively discussed beyond the scope of any single study, the present work focuses on the technical design and empirical evaluation of a framework that aligns with these directions by integrating ensemble learning, uncertainty estimation, and explainability. Further investigation, including prospective validation and integration into clinical workflows, remains an important direction for future research.
To address these gaps, this work proposes X-HEM, a unified framework that integrates ensemble modelling, Bayesian uncertainty quantification, and dual explainability for ICH diagnosis on non-contrast head CT scans. X-HEM integrates multiple CNN backbones, Monte Carlo Dropout-based uncertainty quantification, and complementary Grad-CAM++ and SHAP explainability methods to produce robust, confidence-aware, and clinically interpretable predictions. We validate the framework on both the large-scale RSNA ICH dataset and the external CQ500 dataset and conduct ablation studies to isolate the impact of each component. The main contributions are summarized as follows:
We propose X-HEM, an ensemble of VGG16, ResNet50, and DenseNet121 optimized for slice-wise ICH classification in CT images.
We integrate Bayesian uncertainty estimation using Monte Carlo Dropout to produce confidence-aware predictions and enhance clinical reliability.
We implement a dual-mode interpretability framework combining Grad-CAM++ (localization) and SHAP (global attribution) for complementary explanations.
We validate X-HEM on both RSNA and CQ500 datasets, demonstrating strong generalization across diverse clinical imaging conditions.
The remainder of this paper is structured as follows. Section 2 reviews recent advances in deep learning for ICH detection, ensemble learning, uncertainty estimation, and explainable AI, and shows how the limitations of prior approaches motivate the design of X-HEM. Section 3 presents the datasets, preprocessing steps, and the proposed framework. Section 4 details the experimental setup, evaluation metrics, and results. Finally, Section 5 concludes with a discussion of findings, limitations, clinical implications, and promising directions for future research.
2. Related Work
Recent research on AI for intracranial hemorrhage (ICH) and related neuroimaging tasks spans four main areas: deep learning-based ICH classification, ensemble learning in medical imaging, uncertainty quantification, and explainable AI (XAI) for clinical diagnostics.
2.1. ICH Diagnosis Using Deep Learning
Deep learning has been widely applied for detecting ICH on CT scans, with CNN-based architectures such as VGG16, ResNet50, and DenseNet121 achieving strong performance on datasets like the RSNA ICH Challenge. Most of these studies, however, focus solely on binary hemorrhage classification and do not provide integrated interpretability or uncertainty estimates, both of which are key for clinical trust.
Patil et al. [12] developed a hybrid method combining image processing with Inception-ResNet V2, achieving 91% external accuracy while providing bleed localization. Yalcin et al. [13] used EfficientNet-B0 to predict hematoma expansion and achieved 84% accuracy and an 82% F1-score, though the model lacked external validation. Linli et al. [14] applied a spectral-normalized Gaussian process to brain age estimation, showing how uncertainty can improve interpretability, but their work was not focused on ICH. Qiao et al. [15] proposed DeepSAP, integrating CNN and Vision Transformer models to predict stroke-associated pneumonia in ICH patients, achieving an AUC of 0.93 but without uncertainty modelling. Malik et al. [16] benchmarked CNNs, including EfficientNet-B3, achieving 93.29% accuracy but with no explainability integration. Jie et al. [17] combined Random Forest, CatBoost, and Extra Trees to predict early neurological deterioration, employing SHAP for feature-level interpretation but relying solely on tabular data rather than imaging.
Overall, deep learning approaches have achieved impressive accuracy but usually treat classification, interpretability, and uncertainty as separate goals rather than as a unified design.
2.2. Ensemble Learning in Medical Imaging
Ensemble learning is commonly used in medical imaging to enhance robustness and generalization by combining predictions from multiple models. Strategies such as stacked generalization, hard voting, and soft voting aggregate outputs from diverse CNN architectures and often yield higher accuracy than single models.
Mogensen et al. [18] used a backward ensemble search strategy combining ResNet34 and DenseNet169 for gait disorder classification. Hazarika et al. [19] proposed an explainable ensemble with handcrafted CT features, achieving 96.91% accuracy. Zhu et al. [20] designed MEEDNets, a bio-inspired ensemble framework that achieved up to 99.43% accuracy across multiple datasets, and Sreelakshmi et al. [21] developed M-Net for brain MRI segmentation with accuracies up to 99%.
Despite these strong results, most ensemble methods either omit uncertainty quantification or provide only limited explainability, which restricts their use in safety-critical tasks such as ICH triage.
2.3. Uncertainty Estimation in AI Models
Uncertainty estimation methods have been developed to address the “black-box” nature of deep neural networks. One widely used Bayesian approximation technique is Monte Carlo (MC) Dropout, which estimates predictive confidence by performing multiple stochastic forward passes. Other approaches include Bayesian neural networks, ensemble-based calibration, and confidence adjustment strategies such as Brier scoring and temperature scaling. These techniques improve model transparency and clinical trust when they are properly integrated into the prediction pipeline.
Buddenkotte et al. [22] introduced a calibrated ensemble for medical image segmentation, improving reliability and enabling applications such as active learning. Wang et al. [23] developed a CNN–GRU framework for slice-wise ICH classification that achieved an AUC of 0.988 and ranked first in the RSNA challenge, but it did not incorporate Bayesian uncertainty estimation. Overall, while significant progress has been made on uncertainty estimation, its combined use with ensemble frameworks and explicit interpretability is still limited.
2.4. Explainable AI (XAI) in ICH Healthcare
Explainable AI has gained increasing attention due to its critical role in enabling the clinical adoption of AI systems. Visualization methods such as Grad-CAM and Grad-CAM++ are widely used to highlight image regions that contribute to model predictions. In parallel, feature attribution techniques such as SHAP and LIME provide global and instance-level explanations of decision-making processes. However, these methods are often applied in isolation and are seldom evaluated together with uncertainty measures or against expert annotations.
Beer et al. [24] demonstrated SHAP's use in identifying genomic biomarkers for Alzheimer's disease. Du et al. [25] employed SHAP to interpret an XGBoost model by visualizing each feature's contribution to sarcopenia risk assessment in older adults, enhancing transparency and clinical interpretability. Mirzaei et al. [26] benchmarked CNNs on PhysioNet ICH data but lacked explainable components. Yang et al. [27] applied a modified ResNet to distinguish cerebral venous sinus thrombosis-related ICH from spontaneous ICH with an AUC of 0.95, using Grad-CAM for localization but no uncertainty quantification. These works illustrate the potential of XAI but do not yet fully connect explainability with confidence estimation or ensemble modelling.
2.5. Research Gap
Recent studies [12,13,14,15,16,17,18,19,20,21,22,23,24,26,27] demonstrate substantial progress in deep learning, ensemble modelling, and explainable methods for ICH and related tasks. However, most approaches optimize one dimension at a time: classification accuracy, basic ensembling, or interpretability, without jointly addressing calibrated uncertainty, clinically validated explanations, and subtype-aware evaluation within a single framework.
In contrast, X-HEM is designed as an integrated pipeline that (i) ensembles complementary CNN backbones, (ii) quantifies predictive uncertainty via Monte Carlo Dropout with calibration analysis, and (iii) combines Grad-CAM++ and SHAP with ROI-based evaluation. This unified design targets binary ICH detection while still leveraging subtype labels for analysis, aiming to provide predictions that are not only accurate but also confidence-aware and clinically interpretable.
3. Methods
The overall X-HEM workflow, including ensemble fusion, uncertainty estimation, and interpretability modules, is shown in Figure 1. Monte Carlo (MC) Dropout is used during inference to generate predictive uncertainty distributions, while Grad-CAM++ provides spatial localization and SHAP quantifies feature-level attributions. Mean SHAP impact is computed to measure global interpretability and to support ablation analyses.
3.1. Datasets Used
Two publicly available datasets, RSNA Intracranial Hemorrhage Detection and CQ500, were used to develop and evaluate the proposed X-HEM framework. Both contain non-contrast head CT scans labelled for ICH and its subtypes. The RSNA dataset [28], released for the 2019 RSNA ICH Detection Challenge on Kaggle, includes over 750,000 axial slices from 25,000 CT studies. Each slice is annotated for the five ICH subtypes (EDH, IPH, SDH, SAH, IVH) and for overall hemorrhage presence. For this work, labels were mapped into a binary format (hemorrhage vs. non-hemorrhage) for classification, while subtype metadata was retained for visualization and qualitative analysis.
The CQ500 dataset [29] contains 491 CT studies from multiple hospitals in India. Each study was independently reviewed by three radiologists, with consensus labels assigned for ICH presence and subtype. Unlike RSNA, CQ500 provides a diverse clinical context and imaging variability, making it an ideal benchmark for external validation.
Figure 2 shows representative slices from RSNA (left) and CQ500 (right). The RSNA scan shows a classic example of hemorrhage localization, while the CQ500 scan highlights intraparenchymal bleeding with a clear midline shift. The RSNA dataset was split into 70% for training, 10% for validation, and 20% for testing to prevent data leakage. CQ500 was used exclusively for external validation. Each CQ500 study contains approximately 30–60 slices (about 25,000 slices in total).
3.1.1. Data Pre-Processing
A structured preprocessing pipeline was applied to the RSNA and CQ500 datasets to enable effective training of the X-HEM framework. This study adopted a slice-wise classification strategy, treating each axial CT slice as an individual instance. This aligns with the RSNA annotation format and allows learning of fine-grained hemorrhagic features without the cost of volumetric modelling. Processing slices independently expanded the training set to over 750,000 labelled slices.
Binary mapping was performed where slices labelled with any of the five ICH subtypes (EDH, IPH, SDH, SAH, IVH) were assigned the hemorrhage label (1), while all others were assigned the non-hemorrhage label (0). This preserved subtype information for interpretability analyses while simplifying the core classification task. Each raw DICOM image was resized to $224 \times 224$ pixels to match the input resolution of VGG16, ResNet50, and DenseNet121, and normalized to the [0, 1] range using min–max normalization as shown in Equation (1):

$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$   (1)
All slices were processed under a brain window (WL = 40 HU, WW = 80 HU) to ensure consistent contrast for hemorrhage regions and suppress irrelevant tissue signals. To align with ImageNet-pretrained CNN backbones, each grayscale CT slice was replicated across three channels to form a pseudo-RGB image. This ensured compatibility with VGG16, ResNet50, and DenseNet121 pretrained weights while maintaining the original grayscale intensity distribution.
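To make the preprocessing steps concrete, the sketch below shows one way to implement the HU conversion, brain windowing (WL = 40, WW = 80), min–max normalization, resizing, and pseudo-RGB replication described above. It is a minimal illustration assuming pydicom and OpenCV; the function name and structure are ours, not the authors' released code.

```python
# Sketch of slice preprocessing: HU conversion, brain windowing, normalization,
# resizing, and pseudo-RGB replication (illustrative, not the authors' code).
import numpy as np
import pydicom
import cv2

def preprocess_slice(dicom_path, size=224):
    ds = pydicom.dcmread(dicom_path)
    # Convert stored pixel values to Hounsfield Units.
    hu = ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope) \
         + float(ds.RescaleIntercept)
    # Brain window WL=40, WW=80 -> clip intensities to [0, 80] HU.
    lo, hi = 40 - 80 / 2, 40 + 80 / 2
    hu = np.clip(hu, lo, hi)
    # Min-max normalization to [0, 1] (Equation (1)).
    img = (hu - lo) / (hi - lo)
    # Resize to the backbone input resolution.
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)
    # Replicate the grayscale channel to form a pseudo-RGB image (H, W, 3).
    return np.stack([img] * 3, axis=-1)
```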
Data augmentation was applied only to the training set to improve robustness and reduce overfitting. Transformations included random flips, rotations, translations (≤10%), and zoom scaling (0.8–1.2). No augmentation was used for the RSNA validation/test sets or CQ500 to ensure fair evaluation.
Figure 3 shows the preprocessing pipeline, where each DICOM slice is resized and normalized before augmentation. After preprocessing, RSNA contained approximately 750,000 standardized slices, and CQ500 contained approximately 25,000 slices (30–60 per study). CQ500 was reserved exclusively for external validation.
3.1.2. Data Splitting
All dataset splitting was done at the study level using the StudyInstanceUID identifier to prevent data leakage. The RSNA dataset was split into 70% for training, 10% for validation, and 20% for testing, corresponding to approximately 17,500, 2,500, and 5,000 studies, respectively. A stratified GroupKFold strategy preserved class balance and ensured no patient appeared in more than one subset. Inclusion was limited to non-contrast CT head studies with valid metadata; incomplete or corrupted scans were excluded. For training, a cosine annealing learning rate scheduler (initial LR = 0.001, annealed to a small minimum over 50 epochs) was used to stabilize convergence. The CQ500 dataset (491 studies, ∼25,000 slices) was used solely for external validation, with no re-splitting or augmentation applied, ensuring genuine out-of-distribution testing.
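A minimal sketch of leakage-free, study-level splitting in this spirit is shown below, using scikit-learn's StratifiedGroupKFold with StudyInstanceUID as the grouping key. Mapping ten folds onto the 70/10/20 split is our illustrative assumption.

```python
# Sketch of study-level, stratified splitting (no study spans two subsets).
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def study_level_split(labels, study_ids, seed=42):
    """Approximate 70/10/20 train/val/test split, grouped by StudyInstanceUID."""
    sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(sgkf.split(np.zeros(len(labels)), labels, groups=study_ids))
    test_idx = np.concatenate([folds[i][1] for i in range(2)])       # 2 folds ~ 20%
    val_idx = folds[2][1]                                            # 1 fold  ~ 10%
    train_idx = np.concatenate([folds[i][1] for i in range(3, 10)])  # 7 folds ~ 70%
    return train_idx, val_idx, test_idx
```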
3.1.3. Implementation Details
All experiments were implemented in Python using PyTorch 1.13.1 with CUDA 11.6. Training and evaluation were performed on a workstation equipped with a single NVIDIA RTX-class GPU (24 GB VRAM), an 8-core CPU, and 64 GB of system memory. All three backbones (VGG16, ResNet50, DenseNet121) were optimized using the Adam optimizer with a base learning rate of 0.001, a cosine-annealing learning-rate schedule, and weight decay regularization. A dropout rate of 0.5 was used in the classifier layers and kept active during inference for Monte Carlo Dropout, with T = 30 stochastic forward passes per slice. Models were trained with a batch size of 32 for up to 50 epochs, using early stopping with a patience of 10 epochs based on validation loss. On the RSNA training split, training a single backbone required approximately 6–8 h, while training all three backbones, constructing the soft-voting ensemble, and configuring MC-Dropout inference completed in roughly two days of wall-clock time. To support reproducibility, all random seeds for data shuffling, weight initialization, and data augmentation were fixed to 42.
3.2. Proposed X-HEM Model
3.2.1. Ensemble Architecture
The deep ensemble forms the core of the X-HEM architecture by combining DenseNet121, VGG16, and ResNet50. Each CNN backbone captures distinct representations of hemorrhagic patterns in CT slices, improving model generalization and robustness. VGG16 is a deep sequential network with small 3 × 3 kernels, ensuring stable feature extraction. ResNet50 introduces residual connections to ease gradient flow and enable learning of complex representations. DenseNet121 improves feature reuse and efficiency by connecting each layer to all preceding ones, reducing overfitting.
Let $p_{D}(x)$, $p_{V}(x)$, and $p_{R}(x)$ represent the softmax probabilities for a CT slice $x$ obtained from DenseNet121, VGG16, and ResNet50, respectively. The final ensemble prediction $\hat{p}(x)$ is given by:

$\hat{p}(x) = \dfrac{1}{3}\left[\, p_{D}(x) + p_{V}(x) + p_{R}(x) \,\right]$   (2)

All base models were initialized with ImageNet-pretrained weights. MC Dropout layers were configured with a 0.5 rate during training and inference. The Adam optimizer with weight decay and a cosine annealing schedule were used, as detailed in Section 3.1.3. Early stopping (patience = 10 epochs) prevented overfitting, and all experiments used a fixed random seed (42) for reproducibility. Each model was trained separately on the preprocessed dataset, and predictions were averaged at the probability level to generate the final output. This ensemble setup stabilizes predictions and improves generalization.
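For illustration, a minimal PyTorch sketch of the soft-voting ensemble follows. The head-replacement details (two-class output, dropout rate 0.5) are assumptions consistent with the described setup, not the authors' exact code.

```python
# Sketch of the three-backbone soft-voting ensemble (Equation (2)).
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(name, num_classes=2, p_drop=0.5):
    """Replace each ImageNet classifier head with Dropout + Linear (2 classes)."""
    if name == "vgg16":
        m = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        m.classifier[-1] = nn.Sequential(nn.Dropout(p_drop), nn.Linear(4096, num_classes))
    elif name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        m.fc = nn.Sequential(nn.Dropout(p_drop), nn.Linear(2048, num_classes))
    else:  # densenet121
        m = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        m.classifier = nn.Sequential(nn.Dropout(p_drop), nn.Linear(1024, num_classes))
    return m

class SoftVotingEnsemble(nn.Module):
    def __init__(self, backbones):
        super().__init__()
        self.models = nn.ModuleList(backbones)

    def forward(self, x):
        # Average class probabilities (soft voting), not logits.
        probs = [torch.softmax(m(x), dim=1) for m in self.models]
        return torch.stack(probs).mean(dim=0)
```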
3.2.2. Inference Algorithm
The X-HEM inference pipeline combines (i) ensemble-based classification, (ii) Bayesian uncertainty estimation, and (iii) post hoc interpretability. Each model performs multiple forward passes with dropout enabled, producing probability distributions used to estimate both the prediction mean and variance. Ensemble averaging determines the class label, while the variance captures uncertainty. Grad-CAM++ and SHAP generate visual and feature-level explanations. At the study level, predictions were aggregated using a top-$k$ pooling strategy ($k = 3$), averaging the three most confident slices. This reduced false positives and maintained a balance between sensitivity and specificity. Algorithm 1 summarizes the inference process.
Algorithm 1: X-HEM Inference with Bayesian Uncertainty and Explainability

Require: Pre-processed CT slice $x$
Ensure: Final prediction label $\hat{y}$, uncertainty score $U$, visual explanations (Grad-CAM++, SHAP)
1: Define base models $\{f_1, f_2, f_3\}$ (DenseNet121, VGG16, ResNet50), each with dropout enabled
2: Set the number of Monte Carlo forward passes $T = 30$
3: Initialize prediction matrix $P$
# Monte Carlo Dropout Sampling
4: for each model $f_m$ do
5:  for $t = 1$ to $T$ do
6:   $z_{m,t} \leftarrow f_m(x)$ (stochastic forward pass with dropout active)
7:   $p_{m,t} \leftarrow \mathrm{softmax}(z_{m,t})$
8:   $P[m, t] \leftarrow p_{m,t}$
9:  end for
10: end for
# Compute Model-wise Statistics
11: for each model $f_m$ do
12:  $\mu_m \leftarrow \frac{1}{T} \sum_{t=1}^{T} p_{m,t}$
13:  $\sigma_m^2 \leftarrow \frac{1}{T} \sum_{t=1}^{T} (p_{m,t} - \mu_m)^2$
14: end for
# Aggregate Ensemble Prediction
15: $\bar{p} \leftarrow \frac{1}{3} \sum_{m=1}^{3} \mu_m$
16: $\hat{y} \leftarrow \arg\max_c \bar{p}_c$
# Aggregate Uncertainty
17: $U \leftarrow \frac{1}{3} \sum_{m=1}^{3} \sigma_m^2$
# Generate Explainability Outputs
18: Generate Grad-CAM++ saliency map for class $\hat{y}$
19: Compute SHAP feature attribution scores
20: return $\hat{y}$, $\bar{p}$, $U$, Grad-CAM++ and SHAP explanations
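The following sketch illustrates Algorithm 1 in PyTorch: dropout layers are kept stochastic at test time, $T = 30$ passes are collected per model, and the ensemble mean, label, and aggregated uncertainty are computed. How the per-class variances are reduced to a single scalar $U$ is not specified in the text, so the sum over classes below is an assumption.

```python
# Sketch of MC-Dropout ensemble inference following Algorithm 1.
import torch

def enable_mc_dropout(model):
    """Keep dropout stochastic at test time; batch-norm stays in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(models, x, T=30):
    """Return ensemble labels, mean probabilities, and uncertainty for a batch."""
    for m in models:
        enable_mc_dropout(m)
    # P has shape (num_models, T, batch, num_classes).
    P = torch.stack([
        torch.stack([torch.softmax(m(x), dim=1) for _ in range(T)])
        for m in models
    ])
    mu = P.mean(dim=1)               # per-model predictive mean
    var = P.var(dim=1)               # per-model predictive variance
    p_bar = mu.mean(dim=0)           # ensemble mean probability
    y_hat = p_bar.argmax(dim=1)      # final class label
    U = var.mean(dim=0).sum(dim=1)   # scalar uncertainty per slice (assumption)
    return y_hat, p_bar, U
```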
3.2.3. Handling Class Imbalance
The RSNA dataset exhibits class imbalance, with non-hemorrhage slices dominating. To address this, focal loss was used to emphasize hard or minority-class examples:

$\mathrm{FL}(p_t) = -\alpha \, (1 - p_t)^{\gamma} \, \log(p_t)$   (3)

Here, $p_t$ is the predicted probability of the true class, $\alpha$ adjusts class weighting, and $\gamma$ controls the focusing strength. Stratified sampling preserved class balance across mini-batches. No oversampling was applied, to prevent patient-level duplication. During evaluation, a slice was classified as hemorrhagic if its ensemble probability exceeded the decision threshold, and study-level predictions were aggregated using the top-$k$ approach.
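A minimal sketch of the focal loss as defined in Equation (3) follows. The α and γ values shown are common defaults and are assumptions, since the paper does not report its settings here.

```python
# Sketch of binary focal loss; alpha and gamma defaults are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
```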
3.3. Bayesian Uncertainty Estimation
In high-stakes clinical environments such as ICH diagnosis, an AI system’s ability to quantify confidence is as critical as predictive accuracy. For this purpose, X-HEM adopts MC Dropout as a practical Bayesian approximation technique. MC Dropout provides a lightweight and scalable way to estimate predictive uncertainty without retraining or modifying the base CNN architectures. Compared with more complex Bayesian neural networks, it is computationally efficient and easily integrated into an ensemble, making it suitable for real-time clinical use.
During inference, dropout layers remain active and the model generates distributions of class probabilities across multiple stochastic forward passes. This yields predictive means for final classification and predictive variances as uncertainty estimates. MC Dropout simulates Bayesian behaviour by activating dropout layers at test time, thereby sampling from an approximate posterior distribution. For each input CT slice, the model performs $T = 30$ stochastic forward passes, generating a set of class probability vectors $\{p_1, p_2, \ldots, p_T\}$. From these outputs, two key statistics are computed: the predictive mean ($\mu$) and the predictive variance ($\sigma^2$):

$\mu = \dfrac{1}{T} \sum_{t=1}^{T} p_t$   (4)

$\sigma^2 = \dfrac{1}{T} \sum_{t=1}^{T} (p_t - \mu)^2$   (5)

A lower variance indicates consistent predictions and higher confidence, whereas a higher variance signals uncertainty, often in ambiguous or underrepresented cases. X-HEM computes these variances for each base model and aggregates them to estimate the overall ensemble uncertainty ($U$), reflecting epistemic uncertainty arising from model and data limitations.
Table 1 summarizes the uncertainty scores across key diagnostic categories. Incorrect predictions exhibit substantially higher uncertainty (0.089) than correct predictions (0.021), indicating that the model can signal when it is unsure. Similarly, hemorrhage-positive slices exhibit greater uncertainty (0.033) than non-hemorrhage slices (0.015). This can be attributed to the greater anatomical variability, irregular bleed morphology, and subtle radiological signatures associated with hemorrhagic cases. In contrast, non-hemorrhage slices are relatively homogeneous, which results in lower variance and greater confidence.
These findings confirm that the predictive variance computed by X-HEM is a reliable indicator of epistemic uncertainty. The model not only delivers accurate predictions but also signals when clinical review is necessary. To visualize and validate the behaviour of the uncertainty estimation, Figure 4, Figure 5 and Figure 6 provide complementary views: calibration, variance distribution, and study-level selective prediction. These visualizations demonstrate the reliability and interpretability of the uncertainty outputs and are integral to the model validation process. Together, these analyses confirm that X-HEM delivers high classification accuracy alongside well-calibrated, interpretable confidence estimates, critical attributes for safe diagnostic decision support.
3.3.1. Calibration Analysis
To assess the calibration of model predictions, we evaluated three complementary metrics: Brier Score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE). The Brier Score measures the mean squared difference between predicted probabilities and true outcomes, with lower values indicating better calibration:

$\mathrm{Brier} = \dfrac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$   (6)

Here, $\hat{p}_i$ is the predicted probability for sample $i$, and $y_i$ is the ground-truth label. The ECE was computed by partitioning the predicted probabilities into $M$ bins and calculating the weighted average absolute difference between accuracy and confidence:

$\mathrm{ECE} = \sum_{m=1}^{M} \dfrac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$   (7)

where $B_m$ denotes the set of indices falling into bin $m$, $\mathrm{acc}(B_m)$ is the accuracy within bin $m$, and $\mathrm{conf}(B_m)$ is the mean confidence of samples in that bin. The MCE reflects the worst-case miscalibration by reporting the maximum gap across all bins:

$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$   (8)
To quantify uncertainty in these calibration metrics, we computed 95% confidence intervals using nonparametric bootstrapping at the study level. For each bootstrap sample, calibration metrics were recalculated, and percentile-based confidence intervals were reported. Furthermore, we performed a selective prediction (risk–coverage) analysis to evaluate the model’s behaviour when discarding high-uncertainty cases. Predictive variance from MC Dropout was used as a rejection criterion, and we reported coverage at different variance thresholds (0.2, 0.3, 0.35, 0.4). This analysis provides insight into the trade-off between reliability and case retention, a crucial factor for clinical decision support. No additional recalibration methods, such as temperature scaling or isotonic regression, were applied. Instead, calibration was assessed on the native model outputs, allowing a fair comparison between single CNNs and the ensemble.
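For reference, the sketch below computes the Brier Score, ECE, and MCE exactly as defined in Equations (6)–(8), together with the coverage quantity used in the risk–coverage analysis. The number of bins (M = 15) is an illustrative assumption.

```python
# Sketch of calibration metrics and selective-prediction coverage.
import numpy as np

def brier_score(p, y):
    """Equation (6): mean squared error of predicted probabilities."""
    return np.mean((p - y) ** 2)

def ece_mce(p, y, M=15):
    """Equations (7)-(8) with equal-width confidence bins; returns (ECE, MCE)."""
    conf = np.where(p >= 0.5, p, 1 - p)       # confidence of the predicted class
    pred = (p >= 0.5).astype(int)
    bins = np.linspace(0.0, 1.0, M + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs((pred[mask] == y[mask]).mean() - conf[mask].mean())
            ece += mask.mean() * gap          # weighted by |B_m| / N
            mce = max(mce, gap)
    return ece, mce

def coverage_at_threshold(variance, tau):
    """Fraction of cases retained when rejecting predictions with variance > tau."""
    return float((variance <= tau).mean())
```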
3.3.2. ROI Annotation
To quantitatively validate the reliability of the Grad-CAM++ visualizations, we generated region-of-interest (ROI) masks for a subset of the dataset. Since pixel-level annotations are not provided in either the RSNA or CQ500 datasets, we selected 500 representative CT slices, balanced across hemorrhage subtypes and normal cases, and had them independently annotated by certified radiologists. Inter-rater agreement before consensus was high, with a Cohen's $\kappa$ of 0.82, indicating strong reliability of the annotations. Grad-CAM++ heatmaps were normalized to the range [0, 1] and binarized at a fixed threshold of 0.5 to produce predicted activation masks. These masks were then compared with the radiologist ROIs to compute two quantitative explainability metrics: Intersection over Union (IoU) and Hit Rate. This ensured that the explainability metrics were grounded in clinically validated references rather than subjective model activations.
3.3.3. Explainability Evaluation Protocol
Explainability evaluation was conducted at the slice level, in accordance with the slice-wise design of the X-HEM framework and the annotation format of the RSNA dataset. For each CT slice, a Grad-CAM++ activation map corresponding to the predicted class was generated and compared only with the radiologist-annotated region for the same slice. No exam-level aggregation was used for the computation of explainability metrics.
Grad-CAM++ heatmaps were normalized to the range [0, 1] and binarized using a fixed threshold of 0.5, chosen to balance sensitivity and spatial specificity. Minor threshold variations did not significantly affect IoU or Hit Rate, indicating that the evaluation is robust to small threshold changes.
Two metrics were computed per slice: IoU, measuring spatial overlap with the annotated region, and Hit Rate, indicating whether the maximum activation fell within the hemorrhage area. Final results report averages over 500 annotated slices.
For feature-level explainability, SHAP values were extracted from the penultimate dense layer. Mean SHAP Impact scores were computed across all slices to provide stable feature-attribution estimates complementary to Grad-CAM++.
3.4. Explainability Modules
While achieving high predictive accuracy is crucial, clinical adoption of AI systems also depends on explainability, the model’s ability to justify its decisions in an interpretable manner. In medical imaging, this is essential, as clinicians must understand whether the model’s reasoning aligns with human diagnostic logic. To ensure transparency and promote clinical trust, the X-HEM framework integrates two complementary explainability techniques: Grad-CAM/Grad-CAM++ for spatial localization and SHAP for global feature attribution.
3.4.1. Grad-CAM and Grad-CAM++ for Spatial Attention
In X-HEM, spatial explainability is achieved using Gradient-weighted Class Activation Mapping (Grad-CAM) and its refined version, Grad-CAM++. These methods generate heatmaps that visually highlight the regions of a CT slice influencing the model's classification output. Such spatial insight is critical for diagnosing Intracranial Hemorrhage (ICH), as identifying the precise bleed location directly supports clinical validation. Grad-CAM computes the gradient of the target class score with respect to the feature maps of the final convolutional layer. These gradients are average-pooled to obtain weights that emphasize important regions. However, Grad-CAM often struggles with small or diffuse hemorrhagic regions, leading to coarse, imprecise activations. To overcome this, Grad-CAM++ refines localization by using higher-order gradients and assigning pixel-wise weights to feature maps, thereby improving focus even when multiple regions contribute to the decision. The Grad-CAM++ class activation map for a target class $c$ is given by:

$L^{c}_{\text{Grad-CAM++}} = \mathrm{ReLU}\!\left( \sum_{k} w_{k}^{c} \, A^{k} \right)$   (9)

where $A^{k}$ denotes the $k$th feature map from the final convolutional layer and $w_{k}^{c}$ represents its weight for class $c$. In this work, Grad-CAM++ was applied to each CNN backbone (VGG16, ResNet50, DenseNet121) within the ensemble, and the resulting heatmaps were used to interpret the ensemble's soft-voted output. The Grad-CAM++ overlays consistently highlighted hemorrhagic regions, in agreement with expert radiologist annotations, for both the RSNA and CQ500 datasets. Compared to Grad-CAM, Grad-CAM++ produced sharper, anatomically aligned activations, making it better suited for clinical use.
Figure 7 illustrates a qualitative comparison between Grad-CAM and Grad-CAM++ overlays on representative CT slices across five major ICH subtypes: Epidural (EDH), Intraparenchymal (IPH), Subdural (SDH), Subarachnoid (SAH) and Intraventricular (IVH). Grad-CAM overlays highlight broader regions of interest but often extend into surrounding non-hemorrhagic structures, limiting precision. In contrast, Grad-CAM++ produces sharper, more localized activations that closely align with radiologically visible hyperdense hemorrhage regions. For example, in epidural and intraventricular hemorrhage, Grad-CAM++ sharply delineates the high-density lesions, while in subarachnoid and subdural hemorrhage, it avoids unnecessary activations outside the sulci and cortical boundaries. These results demonstrate that Grad-CAM++ improves localization over Grad-CAM and also provides subtype-specific interpretability.
3.4.2. SHAP for Global Feature Attribution
While Grad-CAM++ provides spatial attention, it does not explain why the model made a decision based on its internal representations. To provide semantic-level understanding, X-HEM integrates SHAP (SHapley Additive exPlanations), a model-agnostic, game-theoretic approach that quantifies each feature's contribution to the model's output. SHAP was applied to the penultimate dense layer of each CNN (VGG16, ResNet50, DenseNet121) within the ensemble. This layer encodes high-level abstract features such as texture, contrast, and structural asymmetry. For each feature $i$, the SHAP value $\phi_i$ is defined as:

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \dfrac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right]$   (10)

where $S$ is a subset of all features $F$ excluding $i$, and $f(S)$ denotes the model output using the features in $S$. Aggregating SHAP values across all test samples identified the most influential features driving hemorrhage classification. These results are visualized in Figure 8, which presents a SHAP summary plot highlighting the five most impactful neurons.
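One possible realization of this step with the shap library is sketched below: the network is split into an assumed feature extractor and classifier head, the head is explained with KernelExplainer over penultimate activations, and the Mean SHAP Impact used in Section 3.4.3 is computed. All helper names and background-sample sizes are illustrative assumptions.

```python
# Sketch of SHAP attribution on penultimate-layer features (illustrative).
import numpy as np
import shap
import torch

def shap_mean_impact(feature_extractor, head, background_x, test_x, n_background=100):
    """Mean |SHAP| per penultimate feature (Mean SHAP Impact)."""
    with torch.no_grad():
        bg = feature_extractor(background_x[:n_background]).cpu().numpy()
        zs = feature_extractor(test_x).cpu().numpy()

    def head_fn(z):
        # Wrap the torch head as a NumPy function returning P(hemorrhage).
        with torch.no_grad():
            out = head(torch.as_tensor(z, dtype=torch.float32))
            return torch.softmax(out, dim=1)[:, 1].numpy()

    explainer = shap.KernelExplainer(head_fn, shap.sample(bg, 50))
    shap_values = explainer.shap_values(zs)       # (n_samples, n_features)
    return np.abs(shap_values).mean(axis=0)       # Mean SHAP Impact per feature
```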
3.4.3. Quantitative Evaluation
To evaluate the dual explainability modules in X-HEM, a comprehensive quantitative assessment was performed, focusing on accuracy, interpretability, and clinical relevance. Grad-CAM and Grad-CAM++ were assessed using Intersection over Union (IoU), Hit Rate, and Visual Clarity, while SHAP was evaluated using Mean SHAP Impact scores. The Hit Rate is clinically important because successfully highlighting the pathological region, even partially, helps radiologists interpret the prediction. Visual Clarity ensures that explanations are clear and understandable to human observers, which is essential for clinical use. The IoU measures the overlap between the model-generated attention map and the radiologist's annotated region of interest (ROI):

$\mathrm{IoU} = \dfrac{|A \cap R|}{|A \cup R|}$   (11)

where $A$ is the binary activation map from Grad-CAM++ and $R$ is the annotated ROI. The Hit Rate quantifies whether the most activated pixel lies within the annotated ROI:

$\mathrm{HitRate} = \dfrac{1}{N} \sum_{n=1}^{N} \mathbb{1}\!\left[ \arg\max(A_n) \in R_n \right]$   (12)

Finally, the Mean SHAP Impact for feature $i$ is computed as:

$\overline{|\phi_i|} = \dfrac{1}{N} \sum_{n=1}^{N} \left| \phi_i^{(n)} \right|$   (13)
Table 2 shows that Grad-CAM++ significantly improves spatial alignment with clinical annotations, achieving an IoU of 86.3% and a Hit Rate of 91.4%. Its higher clarity score (4.6/5) demonstrates its practical diagnostic utility. SHAP identified stable, meaningful feature contributions across both the RSNA and CQ500 datasets, further confirming the interpretability and reliability of X-HEM. Together, Grad-CAM++ and SHAP provide complementary spatial and semantic insights, making X-HEM both interpretable and clinically actionable.
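The two spatial metrics can be computed per slice as in the following sketch, using the fixed 0.5 binarization threshold described earlier; array names are illustrative.

```python
# Sketch of the explainability metrics (Equations (11)-(12)).
import numpy as np

def iou(cam, roi, threshold=0.5):
    """cam: Grad-CAM++ map in [0, 1]; roi: binary radiologist mask."""
    a = cam >= threshold
    r = roi.astype(bool)
    union = np.logical_or(a, r).sum()
    return np.logical_and(a, r).sum() / union if union else 0.0

def hit(cam, roi):
    """1.0 if the most-activated pixel falls inside the annotated ROI."""
    y, x = np.unravel_index(np.argmax(cam), cam.shape)
    return float(roi[y, x] > 0)
```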
3.5. Performance Evaluation Metrics
To ensure a comprehensive evaluation of the X-HEM framework, a combination of classification, calibration, discriminative, and explainability metrics was employed. This multidimensional assessment captures not only accuracy but also interpretability and reliability, key requirements for clinical deployment. For classification, standard measures such as Accuracy, Precision, Recall, Specificity, and F1-Score were used to assess the model's ability to correctly identify hemorrhagic slices while minimizing false alarms. In addition, the Negative Predictive Value (NPV) was included to evaluate the proportion of predicted negatives that were truly negative, an essential metric in high-stakes clinical tasks where missed hemorrhages can have severe consequences. Discriminative ability was quantified using the Area Under the Receiver Operating Characteristic Curve (ROC-AUC), which measures how well the model distinguishes between hemorrhage and non-hemorrhage cases across thresholds. Confidence calibration was assessed using the Brier score (Equation (6)), which measures the alignment between predicted probabilities and actual outcomes. Explainability alignment was measured by the Grad-CAM Hit Rate (Equation (12)), which quantifies how often model attention maps overlap with radiologist-annotated hemorrhage regions. All performance metrics and their corresponding formulas are summarized in Table 3.
This comprehensive metric set ensures a well-rounded evaluation of the X-HEM framework, covering accuracy, discriminative ability, confidence calibration, and interpretability. In addition to slice-level analysis, study-level performance was evaluated, since clinical decisions are typically made at the scan level. For each scan, slice probabilities were aggregated using two strategies: (i) the maximum probability across all slices and (ii) the mean of the top-$k$ highest-scoring slices. These aggregated scores were used to compute study-level ROC-AUC, F1-Score, and confusion matrices. For IoU and Hit Rate, Grad-CAM++ heatmaps were thresholded at 0.5 and compared directly with the radiologist-annotated ROIs described in Section 3.3.2. This procedure ensured consistency between the quantitative explainability metrics and expert-defined ground truth.
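A minimal sketch of the two study-level aggregation strategies (maximum pooling and top-k mean pooling with k = 3) follows; it assumes slice probabilities have already been produced by the ensemble.

```python
# Sketch of scan-level aggregation of slice probabilities.
import numpy as np

def study_score(slice_probs, method="topk", k=3):
    """Aggregate slice-level hemorrhage probabilities into one scan-level score."""
    p = np.asarray(slice_probs)
    if method == "max":
        return p.max()
    # Mean of the k most confident slices (top-k pooling, k = 3 in this work).
    return np.sort(p)[-k:].mean()
```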
4. Results
4.1. Performance Comparison
To evaluate the classification performance of the X-HEM framework, we conducted a comparative analysis between the three base CNN models, VGG16, ResNet50, and DenseNet121, and the final ensemble. All models were evaluated on the same test set derived from the RSNA dataset to ensure consistency in benchmarking. The ensemble model, constructed via soft voting over the probabilistic outputs of the base classifiers, consistently outperformed the individual models across all standard evaluation metrics. As summarized in Table 4, the ensemble achieved the highest accuracy of 94%, precision of 93%, recall of 95%, and F1-score of 0.94, compared to the base CNNs, whose accuracy ranged from 87% to 91%. The ensemble also recorded the highest AUC of 0.96, showing greater discriminative capability and overall reliability.
Table 4 presents the detailed performance of individual CNN models and the proposed X-HEM ensemble on the RSNA test set. The ensemble consistently outperformed the individual models across all metrics, demonstrating the strength of soft voting in improving diagnostic accuracy for Intracranial Hemorrhage (ICH) detection. All reported metrics include 95% confidence intervals (CIs), estimated using bootstrap resampling with 1000 iterations.
The external validation results on the CQ500 dataset are summarized in Table 5. As expected, all models experienced a modest performance drop compared to the RSNA test set, reflecting the natural variability of real-world clinical data. Nonetheless, the ensemble retained superior accuracy (91%) and AUC (0.94), confirming its robustness and strong generalization across institutions and imaging conditions.
Beyond accuracy, the X-HEM framework demonstrated strong calibration, low uncertainty variance, and high alignment with explainability. Predictive variance was systematically higher for misclassified and hemorrhage-positive slices, showing the model's ability to recognize uncertainty. Grad-CAM++ and SHAP analyses further confirmed that the ensemble's decisions align with radiologically meaningful regions and feature patterns. These results collectively validate that X-HEM delivers both high accuracy and reliable interpretability, making it suitable for integration into clinical diagnostic workflows. In addition to the base CNNs, we compared X-HEM with more advanced architectures, including EfficientNet-B0, ConvNeXt-T, and ViT-B/16, trained under the same preprocessing and evaluation pipeline. As shown in Table 6, these models achieved competitive AUCs between 0.93 and 0.955 on RSNA and between 0.928 and 0.935 on CQ500. However, X-HEM consistently outperformed them, reaching 0.96 on RSNA and 0.94 on CQ500. Beyond accuracy, X-HEM demonstrated better calibration, with the lowest Brier score and Expected Calibration Error (ECE), while maintaining transparent predictions through Grad-CAM++ and SHAP analyses.
To further align with clinical workflows, we evaluated X-HEM at the study level by aggregating slice predictions into scan-level scores. Two aggregation strategies were applied: maximum probability pooling and top-$k$ mean pooling. The ensemble achieved nearly identical results under both methods, maintaining high sensitivity and specificity. Table 7 reports the scan-level metrics on the RSNA and CQ500 datasets. The ensemble achieved an AUC of 0.96 on RSNA and 0.94 on CQ500, closely aligned with the slice-level estimates. The consistently low false-negative rate, particularly on CQ500, where sensitivity reached 100%, highlights the clinical reliability of the framework.
Overall, the X-HEM ensemble demonstrated consistent superiority across both internal and external datasets, outperforming single CNNs and newer transformer-based models. Its strong calibration, awareness of uncertainty, and interpretable predictions position it as a reliable diagnostic assistant for intracranial hemorrhage detection in real-world clinical settings.
4.2. Uncertainty Analysis
Predictive confidence is as essential as classification accuracy in clinical AI. The X-HEM framework incorporates Bayesian inference via Monte Carlo (MC) Dropout, performing 30 stochastic forward passes per test slice to capture predictive uncertainty. Calibration was evaluated using the Brier Score, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE). As summarized in Table 8, the ensemble achieved the lowest Brier Score (0.029) and the fewest calibration errors, confirming reliable probability estimates. Incorrect predictions showed substantially higher predictive variance (0.089) than correct predictions (0.021), indicating that X-HEM effectively recognizes when it is uncertain. Hemorrhage-positive slices exhibited slightly greater variance (0.033) than non-hemorrhage slices (0.015), reflecting the higher complexity of bleed regions. Such calibrated uncertainty supports safer clinical deployment by flagging ambiguous cases for radiologist review.
Table 9 further summarizes the calibration results of the individual CNNs and the ensemble. The X-HEM ensemble outperformed all baselines, achieving the lowest Brier Score, ECE, and MCE across variance thresholds. At the selected variance threshold, it maintained 91% coverage while discarding highly uncertain slices, demonstrating effective uncertainty-based risk control.
The ensemble also achieved strong calibration on the CQ500 dataset, with a Brier Score of 0.034 and ECE of 0.019, confirming generalization across institutions.
4.3. Subtype-Wise Performance and Class Imbalance
Subtype-level evaluation was performed across five categories of intracranial hemorrhage: Epidural (EDH), Intraparenchymal (IPH), Subdural (SDH), Subarachnoid (SAH), and Intraventricular (IVH). Table 10 reports sensitivity, specificity, precision, and F1-score for each subtype. The model achieved strong detection for IPH and SDH, while the lower sensitivity for EDH reflects the class imbalance typical of real-world datasets. Overall, X-HEM maintained consistent and clinically reliable performance across all subtypes.
These results confirm that X-HEM maintains high reliability in all ICH subtypes while providing calibrated confidence estimates and strong interpretability, essential for real-world clinical decision support.
4.4. Explainability Results
In addition to strong predictive performance, the X-HEM framework demonstrates high transparency through its dual explainability modules. Grad-CAM++ provides spatial localization of hemorrhage regions, while SHAP quantifies global feature importance. Together, they enhance model interpretability and clinician trust. Table 11 summarizes the mean SHAP impact values for selected neurons/features within the ensemble. Neurons with consistently high SHAP values contributed strongly to hemorrhage classification, while those with low values had minimal influence.
4.5. Ablation Study
To quantify the contribution of each core component of X-HEM, an ablation study was conducted. Four progressively enhanced model variants were evaluated on the RSNA and CQ500 datasets: (A) baseline VGG16, (B) ensemble (VGG16, ResNet50, DenseNet121), (C) ensemble + MC Dropout, and (D) full X-HEM (ensemble + MC Dropout + explainability). See Table 12.
Each addition to the framework improved both predictive and calibration performance. The full X-HEM model (Variant D) achieved the highest accuracy, lowest Brier Score, and strongest interpretability, confirming the synergistic effect of ensemble learning, uncertainty modelling, and explainability.
4.6. Statistical Significance of ROC–AUC Improvements
To test whether the AUC improvements were statistically significant, DeLong's test was applied at the study level. As shown in Table 13, the X-HEM ensemble achieved significantly higher AUC than all single backbones on both datasets (p < 0.05 for all comparisons). The results demonstrate that X-HEM achieves statistically significant performance gains, strong calibration, and interpretable decision reasoning, making it a reliable and clinically applicable framework for intracranial hemorrhage detection.
5. Discussion
The X-HEM framework shows that combining ensemble deep learning, Bayesian uncertainty, and dual explainability can significantly improve both reliability and clinical interpretability in Intracranial Hemorrhage (ICH) detection. Across the RSNA and CQ500 datasets, X-HEM achieved AUCs of 0.96 and 0.94, confirming strong generalization. More importantly, it provided calibrated confidence and radiologically meaningful explanations rather than relying solely on accuracy. Uncertainty estimation proved highly effective, with predictive variance consistently higher for misclassified or hemorrhagic slices, allowing the model to signal low-confidence cases. This capability adds a self-awareness layer, ensuring that ambiguous scans can be flagged for radiologist review rather than accepted blindly. The dual explainability approach further reinforces trust: Grad-CAM++ localized hemorrhagic regions in alignment with radiologist annotations, while SHAP revealed the feature-level reasoning behind predictions. Compared to baseline CNNs and transformer-based models such as EfficientNet and ViT, X-HEM matched or exceeded their accuracy while delivering better calibration and interpretability. The ablation study confirmed that each module (ensemble learning, MC Dropout, and explainability) played a measurable role in improving both prediction quality and reliability.
From a practical standpoint, X-HEM directly supports clinical decision-making. In emergency triage, rapid yet reliable ICH detection is critical. By combining confidence-calibrated predictions with interpretable visual outputs, the framework helps radiologists focus on high-likelihood hemorrhage cases while identifying uncertain results that may need secondary review. This workflow reduces diagnostic risk and optimizes time-sensitive resource allocation. X-HEM can be readily integrated into PACS systems, providing both slice-level alerts and study-level summaries. Despite its strengths, limitations remain. The model analyzes 2D slices independently, without leveraging 3D contextual or temporal information. It currently performs binary ICH detection; achieving balanced multi-class performance across EDH, IPH, SDH, SAH, and IVH subtypes remains a future goal. In addition, the reliance on a fixed HU window (WL = 40, WW = 80) may affect adaptability to varied acquisition protocols. Finally, the system does not yet incorporate out-of-distribution (OOD) detection, an essential step for ensuring robustness across different scanners and institutions.
We used MC Dropout for Bayesian uncertainty estimation because it offers a good balance of accuracy and efficiency, making it suitable for real-time medical imaging. While methods like deep ensembles provide more robust estimates, they are computationally expensive. Post-hoc calibration (e.g., temperature scaling) improves confidence calibration but does not capture all uncertainty types. Evidential deep learning offers principled modeling but can be unstable or hyperparameter sensitive. A detailed comparison of these uncertainty methods, especially for the detection of intracerebral hemorrhage (ICH), remains a key area for future research.
Although traditional machine learning models like XGBoost can perform well when using handcrafted radiomic features or deep features extracted from CNNs, this study specifically focuses on end-to-end, image-based deep learning approaches. We have concentrated on evaluating architectures that learn spatial and contextual representations directly from CT images. As a result, a systematic comparison with models like XGBoost using radiomic features or deep features will be addressed in future work.
Kelly et al. [30] emphasize that for AI systems to be safely utilized in critical clinical settings, they must provide reliable confidence estimates and clear decision outputs. They point out that uncertainty awareness and explainability are essential for building clinician trust, increasing awareness of errors, and ensuring the responsible use of medical AI. Placing X-HEM within this context connects the framework to the wider discussion on trustworthy and deployable AI for healthcare without requiring additional experiments or methodological changes.
6. Conclusions and Future Scope
This study presented X-HEM, an explainable deep learning framework for detecting Intracranial Hemorrhage (ICH) in non-contrast CT scans. Unlike prior studies that focus on accuracy, uncertainty, or interpretability alone, X-HEM unifies all three through ensemble learning, Bayesian uncertainty estimation using Monte Carlo Dropout, and dual explainability with Grad-CAM++ and SHAP. The framework delivers accurate, confidence-calibrated, and interpretable predictions aligned with radiological reasoning. By quantifying predictive variance, it identifies low-confidence cases for further review, while the explainability modules provide both spatial and global insights that enhance trust in automated diagnosis. Evaluations on RSNA and CQ500 datasets confirmed strong generalization and clinically meaningful interpretability, with ablation results verifying the individual contributions of ensemble learning, uncertainty estimation, and explainability. A key outcome of this study is the demonstration of subtype-level interpretability, with Grad-CAM++ overlays aligning with hemorrhagic regions across EDH, IPH, SDH, SAH, and IVH subtypes. Building on this, a key direction for future work is to extend the framework to full multi-class subtype classification, which would provide finer diagnostic granularity, as treatment strategies and prognoses differ across hemorrhage types. This is technically challenging due to class imbalance, subtle anatomical variations, and inter-class similarity, but can be addressed using advanced sampling strategies, tailored loss functions, and hierarchical modelling. Future directions include adopting 3D or sequence-aware architectures for contextual modelling, incorporating radiologist feedback via active learning, and adapting the framework for deployment in low-resource clinical environments.