Article

Explainable MRI-Based Ensemble Learnable Architecture for Alzheimer’s Disease Detection

by Opeyemi Taiwo Adeniran 1, Blessing Ojeme 1, Temitope Ezekiel Ajibola 1, Ojonugwa Oluwafemi Ejiga Peter 1, Abiola Olayinka Ajala 1, Md Mahmudur Rahman 1 and Fahmi Khalifa 2,*
1 Department of Computer Science, School of Computer, Mathematical and Natural Sciences, Morgan State University, Baltimore, MD 21251, USA
2 Electrical and Computer Engineering Department, School of Engineering, Morgan State University, Baltimore, MD 21251, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(3), 163; https://doi.org/10.3390/a18030163
Submission received: 17 January 2025 / Revised: 6 March 2025 / Accepted: 7 March 2025 / Published: 13 March 2025
(This article belongs to the Special Issue Algorithms for Computer Aided Diagnosis: 2nd Edition)

Abstract

With the advancements in deep learning methods, AI systems now match or exceed human performance on many complex real-world problems. The data and algorithmic opacity of deep learning models, however, makes it challenging to understand the input data, the model itself, and the model's decisions. This lack of transparency is both a practical and an ethical issue. For the present study, it is a major obstacle to deploying deep learning methods tasked with detecting patterns and prognosticating Alzheimer's disease. Many approaches presented in the AI and medical literature for overcoming this critical weakness sacrifice accuracy for interpretability. This study attempts to address this challenge and foster transparency and reliability in AI-driven healthcare solutions. It explores commonly used perturbation-based (LIME) and gradient-based (Saliency and Grad-CAM) interpretability approaches for visualizing and explaining the dataset, models, and decisions of MRI image-based Alzheimer's disease identification, using the diagnostic and predictive strengths of an ensemble framework comprising Convolutional Neural Network (CNN) architectures (a custom multi-classifier CNN, VGG-19, ResNet, MobileNet, EfficientNet, and DenseNet) and a Vision Transformer (ViT). The experimental results show the stacking ensemble achieving a remarkable accuracy of 98.0%, while the hard voting ensemble reached 97.0%. The findings present a valuable contribution to the growing field of explainable artificial intelligence (XAI) in medical imaging, helping end users and researchers gain a deeper understanding of medical image datasets and of the decisions deep learning models make.

1. Introduction

Alzheimer’s disease (AD) is a global health problem: it is the most common cause of dementia and the fifth leading cause of death among people aged 65 and older [1]. An estimated 6.9 million Americans are living with Alzheimer’s dementia, and this number is projected to reach a staggering 13.8 million by 2060 [1]. Because AD is a complex, progressive neurodegenerative disorder characterized by cognitive decline, memory loss, and functional impairment, early and accurate detection is crucial for effective intervention and management.
Several traditional computational models, including linear regression [2] and non-linear regression analyses, as well as linear discriminant analysis [3,4], have been constructed for analyzing complex datasets in critical domains such as healthcare. These techniques have produced results with acceptable accuracy. Moreover, evidence in the artificial intelligence (AI) literature indicates that the adoption of machine learning over the last decade has enhanced the capability to analyze and interpret vast and complex medical image datasets, identifying patterns for disease diagnosis, prediction, and treatment that elude traditional computational methods.
In Alzheimer’s disease, for instance, machine learning and deep learning models are increasingly being proposed for deployment in its detection and prediction [5]. However, the interpretability of these models remains a significant concern, as clinicians require clear and understandable explanations of the decision-making processes to trust and adopt AI-driven diagnostic tools in practice [6]. The successful adoption of predictive models in these settings heavily depends on how well decision-makers can understand, interpret, and justify the model’s decisions, thereby building trust in their functionality.
The need to utilize XAI to address these concerns has become more pronounced, especially considering the complex nature of Alzheimer’s disease [7]. This is the motivation and driving force for this study. The principal contributions of this work are as follows:
  • This study integrates perturbation-based (LIME) and gradient-based (Saliency and Grad-CAM) interpretability approaches to visualize and explain an ensemble framework consisting of stacking and hard voting techniques, enhancing Alzheimer’s disease diagnosis through magnetic resonance imaging (MRI).
  • A comprehensive ensemble architecture is developed that integrates multi-classifier Convolutional Neural Network architectures: VGG-19 for deep feature extraction, ResNet to address vanishing gradients, MobileNet and EfficientNet for efficient processing, DenseNet for feature reuse, a custom multi-classifier CNN, and a Vision Transformer for capturing complex spatial relationships.
  • The proposed framework overcomes the limitations of single-model approaches, achieving high diagnostic accuracy and robustness while providing transparent decision-making processes.
  • The implemented visualization techniques highlight critical MRI regions, enhancing clinicians’ understanding of the model and making the decision-making process more transparent and trustworthy.
  • Experimental results demonstrate that both stacking and hard voting provide a refined focus on diagnostically relevant areas, proving particularly sensitive to early-stage cognitive changes in MCI cases.
This nuanced focus and superior interpretability position the proposed framework, which incorporates both stacking and hard voting ensemble methods, as a valuable approach in clinical settings where detailed, actionable insights are crucial. While the dataset is not publicly available, it can be accessed upon request to ADNI, subject to approval and adherence to their data use agreement. The integration of CNNs and ViTs within an explainable ensemble architecture offers a robust approach for AI-driven healthcare applications, setting the foundation for future advancements in precision diagnostics and model interpretability in medical imaging.

2. Related Work

Based on the methods employed to generate explanations, the two main types of explainability techniques identified in the AI literature are perturbation-based and gradient-based methods [8,9]. Perturbation-based methods such as LIME [10], SHAP [11], deconvolution, and occlusion leverage perturbations of individual instances to construct interpretable local approximations (e.g., linear models), which in turn serve as explanations of individual predictions of black-box models. Gradient-based methods (Vanilla Gradient, guided backpropagation, Integrated Gradients, Guided Integrated Gradients, SmoothGrad, Grad-CAM, and Guided Grad-CAM), in contrast, leverage gradients computed at individual instances to explain the predictions of complex models.
In this review, we describe relevant studies showing how the risks of building complex DL-driven medical disease models grow when end users’ understanding and control of the models are not taken into account. Specifically, we review research describing local methods for addressing the challenges of visualizing and explaining data, models, and outcomes in the medical disease domain, namely (1) perturbation-based methods for explaining medical disease diagnostic and predictive models; and (2) gradient-based methods for explaining medical disease diagnostic and predictive models.

2.1. Perturbation-Based Methods for Explaining Medical Disease Diagnostic and Predictive Models

LIME [10] is among the most widely utilized local explanation algorithms, offering a means to interpret complex model predictions in an understandable and reliable manner. This is achieved by constructing an interpretable surrogate model locally around a specific prediction. Similarly, SHAP [11], a model-agnostic approach grounded in Shapley values, provides feature-level attributions to explain individual predictions comprehensively. Both LIME and SHAP are instrumental in elucidating decision-making processes, especially in ensemble learning models, and contribute to enhanced transparency and informed decision-making in medical contexts [11].
In the domain of image-based medical disease detection and prediction, techniques such as LIME, SHAP, deconvolution, and occlusion have significantly improved the interpretability of deep learning models. LIME and SHAP, in particular, have been effectively employed in heart disease prediction to illuminate model rationale and facilitate trust among clinicians [11]. For medical image classification, LIME and SHAP, combined with Grad-CAM, have demonstrated considerable utility in explaining the predictions of CNN-based models for diseases such as pneumonia and Alzheimer’s [12]. Notably, Grad-CAM has shown superior capability in highlighting critical features that influence model decisions, thereby enhancing model transparency [13].
The importance of XAI techniques, including LIME and SHAP, is underscored in numerous studies for their roles in fostering trust and transparency in machine learning models applied to lung disease classification and other medical imaging tasks [14]. These methods address the inherent opacity of deep learning models, ensuring that AI systems in healthcare are not only accurate but also interpretable and reliable [15].
Moreover, the integration of these interpretability techniques into precision healthcare analytics emphasizes their pivotal role in automating image interpretation and enhancing diagnostic accuracy across diverse medical imaging modalities [16]. However, perturbation-based methods like LIME and SHAP present their own challenges. These approaches can generate explanations inconsistent with the original dataset due to the introduction of random noise, leading to unreliable interpretations [8,9]. While these methods aim to provide intuitive insights, the explanations can sometimes be too complex for non-expert users, hindering their practical application in clinical environments [17].

2.2. Gradient-Based Methods for Explaining Medical Disease Diagnostic and Predictive Models

The literature extensively demonstrates the use of heatmaps for explaining deep learning models, particularly in medical imaging [18]. Explanation algorithms such as Vanilla Gradient, Guided Backpropagation, Integrated Gradients, Guided Integrated Gradients, SmoothGrad, and Grad-CAM play a vital role in enhancing the interpretability of complex models used for image-based medical disease detection and prediction. By addressing the “black box” nature of deep learning models, these methods offer visual explanations of features captured by Convolutional Neural Networks, aiding in the understanding, verification, and justification of model predictions. This is particularly critical for fostering clinical acceptance and trust in AI systems deployed in high-stakes fields such as medical diagnostics [19,20,21].
Grad-CAM, for example, highlights the importance of various regions in an image by leveraging activation maps. However, its initial implementation often underestimated the significance of certain features, a limitation subsequently addressed by Integrated Grad-CAM. The latter employs a path integral of gradient-based terms to enhance object localization and improve model interpretation [13]. Similarly, Geometrically Guided Integrated Gradients refine traditional integrated gradient methods by incorporating the local geometry of the model parameter space, leading to more accurate attributions and explanations of predictions [22].
Despite these advances, gradient-based methods often produce explanations with excessive high-frequency noise that can obscure meaningful insights in medical imaging analysis. This noise arises from operations in convolutional neural networks and can mislead interpretations of model decisions [23]. Additionally, the transformation of gradient outputs into interpretable images frequently results in considerable noise, complicating the extraction of clinically useful information [24]. These limitations are particularly problematic in medical contexts where clear and reliable interpretations are essential for clinical decision making and building trust in AI-assisted diagnosis.
These advancements not only bolster the interpretability of AI models but also enhance their overall accuracy by ensuring that models focus on clinically relevant features. Consequently, these methods significantly improve the reliability and utility of medical image classification systems, paving the way for more robust integration of AI in healthcare settings.

3. Methodology

The step-by-step XAI workflow for AD detection is illustrated in Algorithm 1. The workflow begins with data acquisition, followed by the pre-processing steps applied to the dataset, which are described in detail below.
The dataset used for the study was acquired from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [25]. The dataset comprises 21,976 DICOM files, divided into 17,580 training images, 2198 validation images, and 2198 testing images, each annotated with key demographic and diagnostic information. Participants are categorized into three diagnostic groups: cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD). Longitudinal data collected over multiple visits (baseline, year 1, and year 2) enables the analysis of disease progression and structural brain changes over time. This longitudinal aspect supports the study of transitions from MCI to AD, a critical area of research for developing early intervention strategies.
Algorithm 1 A step-by-step XAI workflow for Alzheimer’s Disease detection
1. Data Acquisition:
  • Access DICOM files from GCP (Google Cloud)
  • Extract subject information from filenames
  • Match image files with diagnostic labels from metadata file
  • Filter invalid or unmatched data entries
  • Split data into training (64%), validation (16%), and test (20%) sets
2. Data Preprocessing:
  • Convert DICOM files to readable format using PyDICOM
  • Convert grayscale images to RGB using CV2
  • Normalize pixel values to range [0, 1]
  • Resize all images to consistent dimensions (224 × 224)
  • Create optimized TensorFlow datasets with prefetching
  • Batch images for efficient processing (BATCH_SIZE = 128)
3. Model Development: develop three parallel model architectures:
  • Multi-Classifier CNN:
    - Initialize CNN architecture with multiple convolutional blocks
    - Add dropout layers for regularization
    - Configure for multi-class classification (CN, MCI, AD)
  • Transfer Learning Models:
    - Category 1 (Fully Frozen): MobileNetV2, VGG19
    - Category 2 (Partially Unfrozen): ResNet50, EfficientNetB0, DenseNet201
    - Add custom classification layers with regularization
    - Fine-tune with appropriate learning rates
  • Vision Transformer:
    - Use pre-trained ViT with fine-tuning of later layers
    - Add custom classification head with BatchNormalization
    - Optimize with AdamW and learning rate scheduler
4. Ensemble Techniques: combine predictions using two methods:
  • Hard Voting:
    - For each test image, collect class predictions from all models
    - Select final class based on majority vote
  • Stacking:
    - Train meta-model using base model predictions as features
    - Use validation set to train meta-model
    - Generate final predictions using meta-model
5. Model Evaluation:
  • Calculate accuracy, precision, recall, F1-score
  • Generate confusion matrices
  • Compute ROC curves and AUC
  • Compare performance across individual models and ensemble
6. Model Interpretation: apply three explanation techniques:
  • LIME:
    - For each test image, generate perturbed samples
    - Get model predictions for perturbed samples
    - Fit local interpretable model (e.g., linear regression)
    - Visualize feature importance
  • Saliency Map:
    - Compute gradient of output with respect to input image
    - Visualize gradient magnitude to highlight important pixels
    - Generate heatmap overlay on original image
  • Grad-CAM:
    - Identify target convolutional layer
    - Compute gradients of target class with respect to feature maps
    - Weight feature maps by gradient importance
    - Generate class activation map
    - Resize and overlay on original image
The dataset features advanced MRI modalities and sequences tailored to investigate neurodegenerative changes. Key sequences include Sagittal 3D FLAIR for detecting white matter lesions, High-Resolution Hippocampus Imaging for assessing hippocampal atrophy, and Axial 3TE T2 STAR for identifying microhemorrhages and iron deposition. Perfusion-Weighted Imaging (PWI) and Axial 3D PASL measure cerebral blood flow dynamics, providing insights into hypoperfusion associated with AD progression. Other sequences, such as T2-FLAIR and T2-TSE with Fat Saturation, enhance tissue contrast for visualizing structural abnormalities, while MPRAGE delivers high-resolution images of cortical and subcortical regions. Standardized imaging protocols, including field mapping for correcting distortions and B1-calibration scans, ensure data consistency across participants and visits.
Demographic details, such as sex (male or female), age (ranging from the early 60s to over 90 years), and visit timepoints, further enhance the dataset’s utility. This diversity ensures the dataset’s applicability to real-world clinical scenarios and supports generalized research across diverse populations. Sample MRI images, shown in Figure 1, provide a visual representation of the dataset, illustrating distinct patterns and features for each diagnostic group: CN, MCI, and AD. These visualizations highlight structural differences in the brain, which are critical for training ML models.
The dataset preprocessing was specifically designed to support custom multi-classifier-CNN, ViT, and transfer learning models. In this study, all available MRI sequences from the ADNI dataset (3-plane_localizer, Axial_PD_T2_FSE, Axial_T2-FLAIR, Axial_T2-Star, B1-Calibration_Body, B1-Calibration_PA, Field_Mapping, MP-RAGE_REPEAT, MP-RAGE, MPRAGE_SENSE2, MPRAGE, Double_TSE, MPRAGE_SENS, SURVEY) were processed as input to the models. Instead of selecting specific sequences, each DICOM image was treated as an independent input sample, regardless of which sequence it belongs to. The grayscale DICOM images were transformed into RGB format, standardized to a pixel value range of 0–255, and resized to a fixed resolution of 224 × 224 pixels using area-based interpolation. Subsequently, pixel values were normalized to the range [0, 1] by dividing by 255.0, a crucial step to ensure efficient training and enhanced convergence of deep learning models.
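For illustration, a minimal sketch of this preprocessing pipeline is shown below. The function name and the min–max intensity rescaling step are assumptions; the study’s own code is not published.

```python
import cv2
import numpy as np
import pydicom

def preprocess_dicom(path, size=(224, 224)):
    """Convert one DICOM slice into a normalized RGB array, as described above."""
    ds = pydicom.dcmread(path)
    img = ds.pixel_array.astype(np.float32)
    # Rescale raw scanner intensities to 0-255 (assumed min-max scaling)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)                # grayscale -> RGB
    img = cv2.resize(img, size, interpolation=cv2.INTER_AREA)  # area-based interpolation
    return img / 255.0                                         # normalize to [0, 1]
```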

3.1. Model Development

Transitioning from the data preprocessing stage, the next phase centers on the development of deep learning models, incorporating a custom-built multiclassifier CNN, transfer learning models, and the Vision Transformer. This stage focuses on leveraging advanced architectures to extract meaningful features and improve classification accuracy, forming the core of the Alzheimer’s disease detection framework.
First, a custom-built CNN was designed for multiclass classification tasks such as AD detection and classification; the model structure is detailed in Algorithm 2. Second, the transfer learning architecture utilizes pre-trained models initialized with ImageNet weights to leverage their robust feature extraction capabilities for the target classification task, as detailed in Algorithm 3. A Vision Transformer was also included in the ensemble module to enhance overall performance. In particular, the ViT-B16 architecture introduces a transformative approach to image classification by leveraging the self-attention mechanism of Transformers, contrasting with traditional CNNs, as described in Algorithm 4.
Algorithm 2 CNN architecture for MRI classification
1. Convolutional Layers: apply a sequence of Conv2D layers with increasing filter counts:
  • Conv2D(filters = 32, kernel_size = 3 × 3, activation = ‘relu’)
  • Conv2D(filters = 64, kernel_size = 3 × 3, activation = ‘relu’)
  • Conv2D(filters = 128, kernel_size = 3 × 3, activation = ‘relu’)
  • Conv2D(filters = 256, kernel_size = 3 × 3, activation = ‘relu’)
  • Conv2D(filters = 512, kernel_size = 3 × 3, activation = ‘relu’)
  • Conv2D(filters = 1024, kernel_size = 3 × 3, activation = ‘relu’)
2. MaxPooling Layers: after each convolutional layer:
  • Apply MaxPooling2D(pool_size = 2 × 2), six times in total
  • This reduces spatial dimensions while preserving important features
3. Flatten Operation:
  • Flatten the output of the last MaxPooling layer
  • Convert 3D feature maps to a 1D feature vector
4. Dense Layers with ReLU: apply four fully connected layers with decreasing neuron counts:
  • Dense(units = 512, activation = ‘relu’)
  • Dense(units = 512, activation = ‘relu’)
  • Dense(units = 256, activation = ‘relu’)
  • Dense(units = 128, activation = ‘relu’)
5. Dropout:
  • Apply Dropout(rate = 0.5)
  • This prevents overfitting by randomly deactivating 50% of neurons during training
6. Softmax Layer:
  • Dense(units = num_classes, activation = ‘softmax’)
  • Outputs probability distribution over classes
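A hedged Keras sketch of Algorithm 2 follows; the padding mode and input shape are assumptions, as the original implementation is not published.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(num_classes=3, input_shape=(224, 224, 3)):
    """One possible realization of the multi-classifier CNN in Algorithm 2."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    # Six convolutional blocks with increasing filters, each followed by 2x2 pooling
    for filters in (32, 64, 128, 256, 512, 1024):
        x = layers.Conv2D(filters, 3, activation='relu', padding='same')(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    # Four fully connected layers with decreasing widths
    for units in (512, 512, 256, 128):
        x = layers.Dense(units, activation='relu')(x)
    x = layers.Dropout(0.5)(x)                 # deactivate 50% of neurons in training
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)
```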
Algorithm 3 Comprehensive transfer learning architecture for MRI classification
1: Base Model Selection: choose from two categories of pretrained models
2: Category 1: Fully Frozen Models
  • MobileNetV2: initialize with ImageNet weights, include_top = False
  • VGG19: initialize with ImageNet weights, include_top = False
  • Freeze all layers in these models (layer.trainable = False)
3: Category 2: Partially Unfrozen Models
  • ResNet50: initialize with ImageNet weights, include_top = False
  • EfficientNetB0: initialize with ImageNet weights, include_top = False
  • DenseNet201: initialize with ImageNet weights, include_top = False
  • Freeze all layers except the last 20 (layer.trainable = False for layer in base_model.layers[:-20])
4: Feature Extraction:
  • Apply Global Average Pooling 2D to the base model output
  • x = GlobalAveragePooling2D()(base_model.output)
5: Model-Specific Dense Layers:
  • For VGG19:
    Dense(4096, activation = ‘relu’)
    Dense(2048, activation = ‘relu’)
    Dense(1024, activation = ‘relu’)
    Dense(512, activation = ‘relu’)
    Dense(256, activation = ‘relu’)
  • For MobileNetV2, ResNet50, EfficientNetB0, DenseNet201:
    Dense(1024, activation = ‘relu’)
    Dense(512, activation = ‘relu’)
    Dense(256, activation = ‘relu’)
6: Regularization Techniques:
  • For ResNet50, MobileNetV2:
    Apply Dropout after each Dense layer (rates: 0.5, 0.5, 0.3)
  • For EfficientNetB0, DenseNet201:
    Apply BatchNormalization + Dropout after each Dense layer (Dropout rates: 0.5, 0.5, 0.3)
  • For VGG19:
    Apply BatchNormalization + Dropout after each Dense layer (Dropout rates: 0.5 for the first four layers, 0.3 for the last layer)
7: Classification Layer:
  • Final Dense layer with softmax activation: Dense(num_classes, activation = ‘softmax’)
8: Optimization:
  • Adam optimizer with learning rate 1 × 10⁻⁴ (for ResNet50, EfficientNetB0, DenseNet201)
  • Default Adam optimizer (for MobileNetV2, VGG19)
  • Loss function: sparse_categorical_crossentropy
  • Metric: accuracy
9: Training Strategy:
  • Multi-GPU training with tf.distribute.MirroredStrategy()
  • Early stopping (patience = 3) monitoring validation metrics
  • Model checkpoint to save best weights
  • Maximum 1000 epochs with validation after each epoch
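As a concrete illustration, the following sketch instantiates the “partially unfrozen” branch of Algorithm 3 for ResNet50. The function name is illustrative; the other base models follow the same pattern with their respective heads.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_resnet50_classifier(num_classes=3):
    """Partially unfrozen ResNet50 with the custom head from Algorithm 3."""
    base = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(224, 224, 3))
    for layer in base.layers[:-20]:      # freeze all but the last 20 layers
        layer.trainable = False
    x = layers.GlobalAveragePooling2D()(base.output)
    # Model-specific dense layers with dropout regularization (rates 0.5, 0.5, 0.3)
    for units, rate in ((1024, 0.5), (512, 0.5), (256, 0.3)):
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(rate)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(base.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
```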
Algorithm 4 Vision Transformer (ViT) for MRI classification
1: Pre-trained ViT Model:
  • Use ViT_B16 pre-trained on ImageNet (pretrained = True)
  • Configure for input size of 224 × 224 pixels
  • Remove classification head (include_top = False)
2: Fine-tuning Strategy:
  • Freeze all layers except the last 10 layers for fine-tuning
  • layer.trainable = False for layer in vit_model.layers[:-10]
3: Transformer Architecture (embedded in pre-trained model):
  • Patch Embedding: divide input into fixed-size patches
  • Position Embedding: add learnable position encodings
  • Transformer Encoder: multiple self-attention blocks
    Multi-Head Self-Attention (MSA)
    MLP blocks with residual connections
    Layer Normalization
  • [CLS] Token: special token for classification
4: Custom Classification Head:
  • Layer Normalization: LayerNorm(epsilon = 1 × 10⁻⁶)
  • Dense Layer 1: Dense(1024, activation = ‘relu’, kernel_regularizer = l2(0.001))
  • Batch Normalization: BatchNormalization()
  • Dropout: Dropout(0.5)
  • Dense Layer 2: Dense(512, activation = ‘relu’)
  • Batch Normalization: BatchNormalization()
  • Dropout: Dropout(0.5)
  • Output Layer: Dense(num_classes, activation = ‘softmax’)
5: Optimization:
  • Optimizer: AdamW with initial_learning_rate = 1 × 10⁻⁴
  • Loss Function: sparse_categorical_crossentropy
  • Metrics: accuracy
6: Learning Rate Strategy:
  • Reduce learning rate when validation loss plateaus
  • ReduceLROnPlateau(monitor = ‘val_loss’, factor = 0.5, patience = 5)
7: Training Strategy:
  • Multi-GPU training with tf.distribute.MirroredStrategy()
  • Early stopping with patience = 3, monitoring validation accuracy
  • Model checkpoint to save best weights
  • Maximum 1000 epochs with validation after each epoch
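Algorithms 3 and 4 share the same distributed training strategy; a hedged TensorFlow sketch is given below. Here, train_ds and val_ds are assumed to be the batched tf.data datasets from the preprocessing step, and the checkpoint path is a placeholder.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()      # multi-GPU data parallelism
with strategy.scope():
    model = build_resnet50_classifier()          # any of the models sketched above

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3),
    tf.keras.callbacks.ModelCheckpoint('best_weights.keras', save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
]
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=1000, callbacks=callbacks)
```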

3.2. Ensemble Classifier

In this study, two widely adopted ensemble approaches, stacking and hard voting, are utilized to maximize the predictive power of the framework. Both methods capitalize on the diverse feature extraction capabilities of multiple deep learning architectures, including Multiclassifier-CNN, DenseNet, MobileNetV2, VGG19, EfficientNet B0, ResNet, and ViT, to deliver a comprehensive and balanced classification output. These ensemble strategies ensure that the final predictions are not overly reliant on any single model’s performance but instead represent a collective decision-making process, enhancing both accuracy and robustness in Alzheimer’s disease classification.
Stacking is an ensemble method that enhances performance by combining predictions from multiple base models through a higher-level meta-model, which can be mathematically expressed as follows [26]:
$y_{\text{meta}} = g(f_1(x), f_2(x), \ldots, f_n(x))$
where:
  • $f_1, f_2, \ldots, f_n$ are the predictions from the base models for an input $x$;
  • $g$ denotes the meta-model that combines these predictions to generate the final output $y_{\text{meta}}$.
As shown in Figure 2, the stack framework leverages the predictions of multiple pre-trained models as input features for a meta-model, integrating outputs from diverse architectures, including Multiclassifier-CNN, DenseNet, MobileNetV2, VGG19, EfficientNet B0, ResNet, and ViT. Each base model generates predictions for every input sample, capturing complementary features that contribute to a more robust final prediction.
The predictions from the base models are reshaped into meta-features and combined in the second stage of the stack framework. These meta-features are passed to a logistic regression meta-model, which maps the combined predictions to the final output classes. The meta-model uses the aggregated information from all base models to enhance classification accuracy and generalization.
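A minimal sketch of this two-stage procedure is given below; the function name and the use of concatenated class probabilities as meta-features are assumptions consistent with the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacking_predict(base_models, X_val, y_val, X_test):
    """Train a logistic regression meta-model on base-model outputs."""
    # Each base model emits class probabilities; concatenate them as meta-features
    meta_train = np.hstack([m.predict(X_val) for m in base_models])
    meta_test = np.hstack([m.predict(X_test) for m in base_models])
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(meta_train, y_val)        # meta-model fit on the validation set
    return meta_model.predict(meta_test)
```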
Hard voting, a widely used ensemble learning method, aggregates the predictions of multiple classifiers: each classifier independently produces a prediction, and the ensemble’s final output is determined by the majority vote, mathematically expressed as follows [27]:
$\hat{y} = \operatorname{mode}\{y_1, y_2, \ldots, y_M\}$
where:
  • $\hat{y}$ is the final ensemble prediction;
  • $y_1, y_2, \ldots, y_M$ are the predictions made by the $M$ classifiers;
  • mode denotes the majority label among the predictions.
In Figure 3, the process begins with the initialization of the ensemble and the collection of predictions from the seven models. The ensemble includes architectures such as Multiclassifier-CNN, DenseNet, MobileNetV2, VGG19, EfficientNet B0, ResNet, and ViT, each providing independent predictions for the input data. Each model contributes its prediction, and the final decision is based on the majority vote (argmax) across the models. This voting mechanism ensures that the ensemble captures the consensus among the individual models, reducing the likelihood of misclassification caused by the biases or errors of any single model.
This method leverages the diversity of the included architectures, combining their unique strengths to enhance generalization and accuracy. By relying on majority consensus, the hard voting ensemble is particularly effective in handling challenging multi-class classification problems where individual models may have varying levels of confidence. The output of the ensemble represents the most agreed-upon prediction, improving the robustness of the classification system.
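A compact sketch of the majority vote follows; the helper name is illustrative.

```python
import numpy as np

def hard_vote(base_models, X):
    """Final class = most frequent argmax prediction across the M base models."""
    votes = np.stack([np.argmax(m.predict(X), axis=1) for m in base_models])
    # For each sample (column), take the mode of the M class votes
    return np.array([np.bincount(col).argmax() for col in votes.T])
```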

3.3. Model Interpretation

This study employs three prominent explainable artificial intelligence (XAI) techniques, Grad-CAM, LIME, and Saliency Maps, to enhance the interpretability of the proposed ensemble framework. These methods provide distinct yet complementary insights into the model’s decision-making process, thereby improving transparency and fostering trust in the predictions. Grad-CAM generates class activation maps by identifying regions of input images that significantly influence predictions, focusing on feature localization in convolutional layers. LIME constructs a model-agnostic surrogate model, typically linear, to approximate the contributions of specific features to the prediction, offering interpretability irrespective of the underlying architecture. Saliency Maps utilize gradient information to directly compute the sensitivity of the prediction with respect to input features, enabling a visualization of critical input regions influencing the output.
The distinction among these techniques lies in their methodological focus and application scope. Grad-CAM emphasizes the spatial importance of features within convolutional neural networks, making it particularly effective for visualizing the decision-making process of CNNs. In contrast, LIME operates independently of the model architecture, explaining predictions through local approximations and feature attributions. Saliency Maps provide a more direct gradient-based analysis, highlighting input regions of high predictive relevance without relying on intermediate feature representations. By integrating these methods, this study ensures a comprehensive and interpretable understanding of the ensemble framework, aligning with the principles of transparency and accountability in AI systems.
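As one example of these techniques, a minimal saliency-map sketch in TensorFlow is shown below; the normalization for display is an assumption.

```python
import tensorflow as tf

def saliency_map(model, image):
    """Gradient magnitude of the top-class score w.r.t. the input pixels."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)
        top_class = preds[0, int(tf.argmax(preds[0]))]   # score of predicted class
    grads = tape.gradient(top_class, x)[0]
    sal = tf.reduce_max(tf.abs(grads), axis=-1)          # max |gradient| over channels
    return (sal / tf.reduce_max(sal)).numpy()            # scale to [0, 1] for overlay
```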

3.4. Model Evaluation

To comprehensively evaluate the performance of the models, several key metrics were employed. First, a full classification report includes metrics such as accuracy, precision, recall, and F1-score, providing a holistic view of the model’s performance for each category. Particular emphasis was placed on the F1-score, as it effectively balances precision and recall, making it especially valuable for assessing model performance on imbalanced datasets [28]. By combining these metrics, the classification report ensures a nuanced understanding of the model’s strengths and weaknesses in handling different classes. Secondly, the confusion matrix breaks down the predictions into true positives, true negatives, false positives, and false negatives, enabling the identification of specific areas where the model may misclassify. This granular view is essential for diagnosing issues such as class imbalance or systematic errors in predictions [29]. For example, high false positives in certain categories might indicate over-sensitivity, while low true positives might suggest insufficient sensitivity to certain features. The metrics used are defined as follows:
  • True Positive (TP): Cases correctly identified as positive (correctly diagnosed with Alzheimer’s);
  • True Negative (TN): Cases correctly identified as negative (correctly diagnosed as healthy);
  • False Positive (FP): Cases incorrectly identified as positive (healthy subjects misdiagnosed with Alzheimer’s);
  • False Negative (FN): Cases incorrectly identified as negative (Alzheimer’s subjects misdiagnosed as healthy).
From these basic quantities, the following performance metrics were calculated:
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall (Sensitivity or True Positive Rate)} = \dfrac{TP}{TP + FN}$
$\text{Specificity (True Negative Rate)} = \dfrac{TN}{TN + FP}$
$F_1 = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
Finally, the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) were used exclusively for evaluating the ensemble techniques, specifically hard voting and stacking. The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1-Specificity) across various threshold settings. An ideal classifier achieves a curve that closely approaches the top-left corner of the plot, corresponding to a True Positive Rate of 1.0 and a False Positive Rate of 0.0. The AUC serves as a single-value metric summarizing the overall performance of the ensemble models [30]. AUC values range from 0.5, indicating no discrimination (equivalent to random guessing), to 1.0, representing perfect classification. High AUC values indicate strong class separation capabilities, reflecting the robustness of the ensemble methods compared to individual classifiers.
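In practice, all of these metrics can be obtained with a few scikit-learn calls; y_true, y_pred, and y_prob below are placeholders for the test labels, ensemble class predictions, and class probabilities, respectively.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Accuracy, precision, recall, and F1 per diagnostic class
print(classification_report(y_true, y_pred, target_names=['CN', 'MCI', 'AD']))
# TP/TN/FP/FN breakdown per class
print(confusion_matrix(y_true, y_pred))
# One-vs-rest multi-class AUC, as used for the ensemble ROC analysis
print(roc_auc_score(y_true, y_prob, multi_class='ovr'))
```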

4. Results and Discussion

Table 1 shows the classification performance of the base models, namely, VGG-19, ResNet, MobileNet, EfficientNet, DenseNet, a custom-built multi-classifier CNN, and ViT, for Alzheimer’s disease classification.
The ViT model achieved the highest accuracy of 98.0%, with precision and F1-scores also at 98.0% and a recall of 97.0%. Its Transformer-based architecture, known for modeling long-range dependencies and spatial hierarchies in image data, likely contributed to its ability to effectively capture complex patterns in the dataset. The custom-built multi-classification CNN demonstrated an accuracy of 94.0%, with precision, recall, and F1-scores all at 94.0%, indicating consistent performance across these metrics. Despite a lack of advanced feature extraction mechanisms, the CNN-based model remains competitive in Alzheimer’s disease classification.
DenseNet achieved an accuracy of 88.0% with a precision of 89.0%, recall of 87.0%, and F1-score of 88.0%. Its densely connected architecture likely facilitated feature reuse, which may explain its relatively strong performance. MobileNet, with an accuracy of 87.0%, demonstrated precision, recall, and F1-scores of 87.0%, showing comparable effectiveness to DenseNet in this application. The VGG-19 model achieved an accuracy of 83.0%, with precision and F1-score values of 84.0% and a recall of 83.0%. While its performance is satisfactory, it was outperformed by models such as DenseNet and the custom CNN, likely due to its simpler architecture. ResNet demonstrated an accuracy of 65.0%, precision of 78.0%, recall of 65.0%, and F1-score of 66.0%. This relatively low performance indicates difficulties in capturing relevant features for this classification task. Adjustments to its architecture or further optimization may improve its performance. EfficientNet exhibited the lowest performance, with an accuracy of 41.0%, precision of 16.0%, recall of 41.0%, and an F1-score of 23.0%. This result suggests a potential mismatch between the model’s scale and complexity and the dataset characteristics, leading to suboptimal outcomes in Alzheimer’s disease classification.
Figure 4 reveals distinct performance patterns across the neural network architectures, with ViT and standard CNNs demonstrating exceptional performance, consistently approaching the 95% reference threshold across all metrics. Mid-tier models, namely, VGG-19, MobileNet, and DenseNet, maintain stable 80–90% performance with minimal variance. ResNet presents an interesting case, with precision significantly outperforming its other metrics by approximately 10 percentage points. Most notably, EfficientNet exhibits substantial limitations despite its design focus, particularly in F1-Score, where it achieves only about 23%, suggesting significant precision–recall tradeoffs. The tight confidence intervals observed for high-performing models indicate statistical reliability and consistency across evaluation instances while revealing an apparent inverse relationship between model efficiency and classification performance in this specific application context.
The performance of two ensemble techniques, stacking and hard voting, for Alzheimer’s disease classification is summarized in Table 2. Both techniques were evaluated using key metrics (accuracy, precision, recall, and F1-score) to assess their effectiveness in classifying the stages of Alzheimer’s disease. As seen in the table, the stacking ensemble method achieved the highest accuracy of 98.0%, with a precision of 98.0%, recall of 97.0%, and an F1-score of 97.0%. This high level of performance demonstrates the effectiveness of stacking in leveraging the strengths of multiple base classifiers, resulting in a robust model capable of capturing complex patterns within the dataset. The elevated precision and recall metrics indicate that stacking excels in distinguishing between different stages of Alzheimer’s disease, minimizing misclassification errors and enhancing classification reliability.
The hard voting ensemble also exhibited strong performance, achieving an accuracy of 97.0%, with precision, recall, and F1-scores all at 97.0%. Although showing slightly lower performance than the stacking method, the hard voting technique still delivers high classification accuracy and consistency across all evaluation metrics. This performance suggests that hard voting, by aggregating the predictions of multiple classifiers, provides a balanced and effective approach that closely approximates the stacking method in terms of overall classification capability.
Figure 5 shows the comparative analysis of the stacking and hard voting ensemble methods, revealing virtually identical high performance across all metrics (accuracy, precision, recall, and F1-Score). Both methods consistently reach or exceed the 95% reference threshold and exhibit minimal variance, as evidenced by the tight confidence intervals. This suggests interchangeability between these ensemble techniques for high-performance classification tasks.

4.1. Performance Analysis Using the Confusion Matrix

The color intensity in the confusion matrix visualizes the frequency of classifications, with darker colors representing higher counts (primarily along the diagonal for correct predictions) and lighter colors representing fewer occurrences (for misclassifications).
The confusion matrix in Figure 6a illustrates the performance of the hard voting ensemble technique in classifying the stages of Alzheimer’s disease across three categories: CN, MCI, and AD. For the CN class, 858 instances were correctly classified, with 16 misclassified as MCI and 2 as AD. The MCI class showed 880 correctly classified instances, with 7 misclassified as CN and 5 as AD. For the AD class, 404 instances were correctly classified, while 11 were misclassified as CN and 15 as MCI. This evaluation highlights the effectiveness of the hard voting ensemble technique, demonstrating strong classification accuracy with minimal errors across the three categories.
The confusion matrix in Figure 6b illustrates the classification performance of the stacking ensemble technique for Alzheimer’s disease across three categories: CN, MCI, and AD. For the CN class, 863 instances were correctly classified, with 14 misclassified as MCI and 2 as AD. The MCI class showed 897 correctly classified instances, with 8 misclassified as CN and 5 as AD. For the AD class, 387 instances were correctly classified, while 3 were misclassified as CN and 19 as MCI. This analysis demonstrates the model’s strong classification accuracy, with relatively few misclassifications observed across all categories.

4.1.1. Analysis of Misclassifications and Hard Voting Performance

In reference to Table 3, several critical observations warrant discussion regarding the classification performance metrics of the evaluated deep learning architectures. The ViT architecture demonstrates superior performance with remarkably low misclassification rates (0.17–0.65%) across all diagnostic transition boundaries, suggesting enhanced feature representation capabilities for neurological status differentiation. A substantial performance gradient is evident between the highest-performing models (ViT, CNN) and the remaining architectures. Specifically, the error rates increase by approximately 3-fold from CNN_model3 (1.52–1.59%) to the next-tier models DenseNet3 and MobileNet3 (4.18–5.62%), suggesting significant differences in their ability to capture relevant diagnostic features.
The ResNet implementation exhibits notable asymmetry in its error distribution pattern, with acceptable performance at the cognitively normal to mild cognitive impairment (CN–MCI) boundary (6.92%), but significantly compromised classification at the MCI to Alzheimer’s disease (MCI–AD) boundary (22.70%). This suggests differential sensitivity to feature characteristics across the cognitive decline spectrum. The EfficientNet implementation demonstrates concerning performance metrics with 50% error rates universally, indicating potential implementation issues or sub-optimal hyperparameter selection. Further investigation into training protocols and architectural modifications is warranted. Importantly, the symmetrical distribution of false positive and false negative rates within each model at specific diagnostic boundaries suggests appropriately calibrated classification thresholds, a critical factor for potential clinical deployment considerations.
The hard voting ensemble (0.967 accuracy) effectively mitigated these misclassifications across all categories, as demonstrated in the confusion matrix (Figure 6). For the CN class, hard voting achieved 97.9% accuracy, with 16 instances misclassified as MCI and 2 as AD. In the MCI class, the accuracy reached 98.7%, with 7 samples misclassified as CN and 5 as AD. The AD class showed a slightly lower accuracy of 93.9%, where 11 cases were misclassified as CN and 15 as MCI.
Despite including lower-performing models, hard voting neutralized extreme misclassification patterns by leveraging complementary strengths of different architectures, while not achieving the highest absolute accuracy (Figure 7), this approach provided more balanced and clinically reliable classification, substantially reducing systematic misclassification risks.

4.1.2. Analysis of Misclassifications and Stacking Performance

Table 4 provides a comprehensive overview of all misclassification patterns. The most frequent error type is misclassifying class 2 (severe abnormalities) as class 0 (normal), which accounts for 44.44% of all errors. This represents the highest clinical risk as it could lead to missed diagnoses of severe conditions. ResNet demonstrated superior performance on these critical errors with 75% accuracy, while most other models struggled. The table categorizes each error by type and indicates its clinical impact.
Table 5 breaks down how each individual model performed across different error categories. DenseNet, ResNet, and VGG19 tied for the highest overall accuracy (44.44%) on misclassified samples, but with different strengths. ResNet excelled at correctly classifying critical 2→0 errors (75% accuracy), while DenseNet performed well on other error types but failed on 2→0 errors. The VIT model failed to correctly classify any misclassified samples, suggesting limited value in the ensemble for difficult cases.
Table 6 examines each misclassified sample individually, showing which models correctly classified each instance and the level of model agreement. The highest agreement (57.14%) was for instance 1550. For critical 2→0 errors, model agreement was notably low (averaging 17.86%), highlighting the difficulty in correctly identifying these cases. No instance was correctly classified by all models, indicating that even the best ensemble cannot fully resolve certain challenging examples.
Table 7 focuses specifically on the most concerning 2→0 error type where severe abnormalities are misclassified as normal. ResNet’s superior performance is highlighted, correctly identifying three out of four instances (75% accuracy), while most other models failed completely. This analysis suggests that greater weight should be given to ResNet’s predictions for potential class 2 cases in future ensemble configurations.
Figure 8 depicts accuracy comparison across seven deep learning models in our stacking ensemble. ViT and CNN demonstrate the highest accuracy (99.3% and 97.8%), while EfficientNet shows the lowest (41.4%). Despite this variation, our ensemble achieves a 99.59% overall accuracy, exceeding any individual model. The performance diversity is beneficial—lower-performing models contribute unique classification patterns that resolve specific error cases where higher-performing models falter. This confirms our ensemble effectively balances misclassification patterns across architectures, resulting in a more robust classification system.

4.2. Performance Evaluation Using ROC and AUC Metrics

The performance of the ensemble classifiers was evaluated using ROC curves and AUC metrics for the classes CN, MCI, and AD. Figure 9a presents the ROC curves for the hard voting ensemble classifier, illustrating its performance in distinguishing between the three classes: CN, MCI, and AD. The AUC values are exceptionally high, with CN and MCI both achieving an AUC of 0.98, while AD achieves an AUC of 0.97. These results highlight the strong predictive accuracy of the hard voting ensemble, particularly for CN and MCI. Although the AUC for AD is marginally lower, it still demonstrates excellent classification performance.
Figure 10 presents the performance metrics of the individual base models that comprise the ensemble classifiers. The base models demonstrate varied performance across accuracy and ROC–AUC metrics. ViT emerges as the top performer with 0.995 accuracy and a perfect 1.000 ROC–AUC score, followed closely by CNN with 0.978 accuracy and 0.999 ROC–AUC. DenseNet and MobileNet3 show comparable performance with accuracy values of approximately 0.92 and ROC–AUC scores around 0.99. VGG19 exhibits a notable gap between its moderate accuracy (0.860) and strong ROC–AUC (0.969). ResNet achieves moderate results with 0.649 accuracy but a relatively high 0.887 ROC–AUC. EfficientNet significantly underperforms compared to all other models, with only 0.414 accuracy and 0.522 ROC–AUC.
When comparing individual model performance to the ensemble results, it is evident that while the hard voting ensemble does not surpass the top individual performers, it successfully mitigates the impact of weaker models, delivering consistently high performance across all classes. Moreover, the AUC results for CN, MCI, and AD do not strongly indicate overfitting for several reasons:
  • Reasonable progression of difficulty: the slight performance differences between classes (CN being easiest and AD being hardest to classify in hard voting) align with clinical expectations, as cognitively normal cases are typically more distinct than disease states.
  • Consistent performance across ensembles: if severe overfitting were present, we might see unrealistically perfect performance across all classes or unexpected patterns (such as AD being easier to classify than CN).
  • Ensemble techniques inherently reduce overfitting: the very purpose of ensemble methods is to improve generalization by combining diverse models, which helps mitigate any overfitting present in the individual models.
  • Performance ceiling effect: the near-perfect AUC values in the stacking ensemble could represent a genuine ceiling effect rather than overfitting, especially if the classification task is relatively straightforward with well-separated classes.

4.3. Comparative Performance Analysis of CNN and Transformer Architectures for Alzheimer’s Disease Progression Detection

Table 8 presents the core classification metrics for MCI detection across all seven architectures. While ViT and EfficientNet demonstrate exceptional accuracy (99.78% and 100%, respectively), their confidence and uncertainty metrics reveal important differences. ViT combines high accuracy with high confidence (0.998) and low uncertainty (0.004), whereas EfficientNet, despite achieving perfect accuracy, exhibits low confidence (0.399) and high uncertainty (1.533), suggesting potential overfitting. CNNmodel offers a balanced performance profile with 98.46% accuracy, high confidence (0.979), and low uncertainty (0.054), making it particularly robust for clinical applications.
Table 9 focuses specifically on early detection capabilities at the critical CN-to-MCI transition. The overlap coefficient measures sensitivity to early biomarkers, the confusion rate indicates specificity, and the confidence gap reflects the model’s certainty. ViT demonstrates an optimal balance between sensitivity (0.9423 overlap) and specificity (0.0017 confusion), while EfficientNet shows high sensitivity (0.9500) but poor specificity (0.5000), indicating limited discriminative ability. ResNet, despite moderate overall performance, exhibits the highest confidence gap (0.0143), suggesting that when it identifies biomarkers, it does so with greater certainty.
Table 10 addresses model interpretability through confidence and uncertainty metrics. Higher confidence with lower standard deviation and lower entropy values generally indicate more interpretable models. ViT leads in interpretability metrics with the highest mean confidence (0.998), low standard deviation (0.035), and lowest entropy (0.004), suggesting consistent, reliable predictions. The uniform difficult cases count (91) across all models highlights a consistent set of challenging cases that could benefit from targeted improvement strategies. These metrics help quantify the “black box” nature of different architectures, with Transformer-based models surprisingly offering more consistent interpretations than conventional CNNs.
Table 11 examines the later disease transition from MCI to AD, providing context for comparing early vs. late detection capabilities. ViT maintains strong performance across disease stages (0.9286 overlap, 0.0065 confusion), indicating consistent reliability throughout the disease continuum. EfficientNet shows a dramatic performance drop at this boundary (0.0000 overlap), revealing its specificity for early-stage detection only. ResNet demonstrates a significantly higher confusion rate (0.2270) at this boundary, explaining its tendency to misclassify MCI as AD (42.09%). These metrics highlight the varying abilities of architectures to maintain diagnostic consistency across disease progression.
Table 12 reveals an inverse relationship between classification performance and interpretability. Traditional CNNs (CNNmodel) offer high interpretability with strong biomarker alignment and region specificity for hippocampal and entorhinal structures. In contrast, ViT, despite superior accuracy (99.78%), provides only medium interpretability with less localized focus. All models support visualization through GradCAM, LIME, and Saliency Map techniques, each with specific limitations. These findings suggest that optimal clinical deployment may require ensemble approaches balancing the detection capabilities of Transformers with the enhanced interpretability of CNNs, addressing a critical consideration for AD diagnostic tools where both accuracy and explainability are essential.

4.4. Computational Efficiency Analysis

Figure 11 shows the computational cost in comparison to the performance of individual models. VGG19 demonstrates exceptional efficiency (∼0.011), achieving the best ratio of accuracy to computational cost despite being a larger model. This unexpected result challenges the conventional wisdom about VGG’s inefficiency. EfficientNet ranks second in efficiency (∼0.0065), living up to its name as designed for efficiency through compound scaling of network dimensions. ResNet follows with moderate efficiency (∼0.0048). CNN and ViT show similar mid-range efficiency scores (∼0.0022 and ∼0.0026, respectively), despite representing completely different architectural approaches (traditional convolutional vs. Transformer-based). DenseNet and MobileNet surprisingly demonstrate the lowest efficiency scores (∼0.0014 and ∼0.0012), with MobileNet being particularly disappointing given its design intent for mobile/resource-constrained environments.

4.4.1. Ensemble Framework Robustness

Figure 12 shows each model’s robustness to noise. ViT and EfficientNet maintain nearly perfect performance (0.99 and 1.00, respectively) even at a 0.05 noise level and show the smallest degradation at higher noise levels (0.82 and 1.00 at 0.2 noise). ResNet and CNN demonstrate good robustness, maintaining above 0.88 performance even at a 0.2 noise level. MobileNet shows the most dramatic performance degradation as noise increases (0.73→0.47), making it the least suitable for noisy data environments. VGG19’s performance drops significantly from 0.85 to 0.49 at a 0.2 noise level, revealing a weakness despite its high efficiency score.

4.4.2. Ablation Study Insights

Figure 13 presents an ablation study evaluating the ensemble’s performance when one or more components are removed.
  • Critical components: removing ViT from the ensemble causes a 1.4% accuracy drop, followed by CNN (0.3% drop), confirming these are the most valuable contributors.
  • Redundant components: removing DenseNet or VGG19 actually improves ensemble performance slightly, suggesting they introduce noise or redundancy.
  • Optimization opportunity: the ablation study suggests a more efficient ensemble could be constructed by eliminating DenseNet and potentially VGG19, reducing computational overhead while maintaining or improving accuracy.

4.5. Model Trade-Offs

The computational complexity analysis of the ViT and base neural network models is summarized in Table 13. The main architecture-specific trade-offs are as follows:
  • CNN: fastest inference (0.43 ms) but highest memory usage (1027.84 MB), likely due to layer-wise activation storage.
  • ViT: highest accuracy but also the most computationally expensive (5.15 ms inference, 332.34 MB).
  • VGG19: despite a large parameter count, it maintains relatively low memory usage (36.14 MB), making it suitable for constrained environments.
  • DenseNet: deepest model (718 layers) with 87% non-trainable parameters, reducing fine-tuning flexibility.
Parameter Utilization: models with higher proportions of trainable parameters (CNN, ViT) generally contribute more to ensemble performance than those with many non-trainable parameters (DenseNet, VGG19).
Memory vs. Size: there is no clear correlation between model size and memory usage during inference (e.g., CNN uses 37× more memory than VGG19 despite being 4.6× smaller).
Ensemble Design Implications: the data suggest a more optimal ensemble could be created using only ViT, CNN, and ResNet, potentially with EfficientNet if robustness to noise is critical. This would reduce computational overhead while maintaining or improving accuracy.
Deployment Strategy: for resource-constrained environments where a single model must be chosen, the optimal selection would depend on specific priorities: for raw accuracy: ViT; for noise robustness: EfficientNet; for inference speed: CNN; for efficiency (accuracy/cost): VGG19; and for balanced performance: ResNet.
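For reference, measurements of the Table 13 kind can be approximated with a benchmarking loop of the following form. The procedure (warm-up passes, averaged forward-pass latency, trainable/frozen parameter split) is our assumption of a standard approach, and the network below is a placeholder rather than one of the study's trained models.

```python
# Illustrative benchmarking sketch for Table 13-style measurements
# (assumed procedure; the network is a stand-in, not a study model).
import time
import torch
import torchvision.models as tvm

model = tvm.mobilenet_v3_small(weights=None).eval()   # placeholder network
x = torch.randn(1, 3, 224, 224)                       # stand-in MRI slice

with torch.no_grad():
    for _ in range(10):                               # warm-up passes
        model(x)
    t0 = time.perf_counter()
    for _ in range(100):                              # timed passes
        model(x)
    latency_ms = (time.perf_counter() - t0) / 100 * 1e3

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"avg inference: {latency_ms:.2f} ms; params: "
      f"{trainable / 1e6:.2f} M trainable / {(total - trainable) / 1e6:.2f} M frozen")
```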

4.6. Comparative Analysis of Ensemble Explainability Techniques

First, the Grad-CAM visualizations in Figure 14 provide valuable insights into the spatial attention of the stacking and hard voting ensemble techniques (a minimal implementation sketch follows the list).
  • Color Intensity and Diagnostic Relevance: Grad-CAM visualizations use a color spectrum from blue (very low attention, 0.0–0.33) to red (high attention, 0.66–1.0), revealing where models focus during classification. While both ensemble methods highlight similar regions, subtle but important differences exist in attention distribution. In AD cases, the stacking ensemble exhibits slightly more concentrated red regions in medial temporal areas, particularly the hippocampus, indicating more precise focus on established pathological regions. Hard voting shows slightly more diffuse yellow-to-red patterns, suggesting less targeted attention. These subtle differences in attention concentration correlate with stacking's marginally superior accuracy (98.0%) compared to hard voting (97.0%) in Table 2.
  • MCI Attention Patterns: For MCI cases, the primary distinction lies in gradient differentiation: stacking displays more nuanced attention gradients, with better yellow-to-orange transitions (medium-to-high attention, 0.5–0.8) distributed across temporal-parietal regions, capturing the subtle, widespread changes characteristic of early neurodegeneration. Hard voting shows slightly less intensity differentiation, potentially missing the graduated nature of early pathological changes. These subtle gradient differences may contribute to stacking's enhanced ability to detect borderline MCI cases, as evidenced by the model performance metrics in Table 8.
  • Color Distribution in Misclassifications: Analysis of misclassified cases reveals diagnostic patterns consistent across both ensemble methods. Correctly classified AD cases display concentrated red regions in hippocampal areas, while misclassified cases (particularly AD classified as CN) show inappropriate green-to-blue coloration (low attention, 0.0–0.4) in pathologically significant regions. This color distribution anomaly provides a visual explanation for the classification errors documented in Table 7, where even the best model (ResNet) achieved only 75% accuracy on these critical cases.
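The Grad-CAM maps discussed above can be generated along the following lines. This is a minimal sketch assuming the open-source pytorch-grad-cam package and a randomly initialized ResNet-18 stand-in; it is not the exact implementation used to produce Figure 14.

```python
# Minimal Grad-CAM sketch (assumes the pytorch-grad-cam package and a
# ResNet-18 stand-in; in the study, each base model would be visualized).
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet18(weights=None).eval()     # placeholder for a base model
target_layers = [model.layer4[-1]]        # last convolutional block
x = torch.randn(1, 3, 224, 224)           # stand-in for an MRI slice

cam = GradCAM(model=model, target_layers=target_layers)
# Class index 2 = AD under the CN/MCI/AD encoding assumed here.
heatmap = cam(input_tensor=x, targets=[ClassifierOutputTarget(2)])[0]
print(heatmap.shape, heatmap.min(), heatmap.max())   # values scaled to [0, 1]
```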

4.6.1. LIME Results Comparison

Secondly, the LIME visualizations in Figure 15 provide additional insight into the ensemble techniques' ability to highlight areas linked to brain atrophy (a minimal LIME sketch follows the list).
  • Binary Feature Attribution (Green vs. Red): Unlike Grad-CAM’s continuous spectrum, LIME provides binary classification of regions as either positively contributing (green) or negatively contributing (red) to the diagnosis. In AD cases, both ensembles show green regions in areas associated with known pathology, but stacking demonstrates more precise positive contribution boundaries, particularly in medial temporal structures. This binary delineation offers clearer interpretability of which specific regions influence diagnostic decisions, supporting stacking’s higher precision (98.0%) reported in Table 2.
  • Color Distribution in CN vs. AD: LIME visualizations reveal a diagnostic color inversion between CN and AD cases. CN cases predominantly display red coloration in medial temporal regions, indicating these areas negatively contribute to an AD diagnosis when healthy. Conversely, AD cases show green coloration in these same regions, reflecting positive contribution to diagnosis. This color-based contrastive explanation provides intuitive interpretability that aligns with the clinical understanding of Alzheimer’s progression.
  • MCI Contribution Patterns: MCI cases reveal a distinctive mixed pattern of green and red regions. Stacking demonstrates more distributed green areas across cortical regions compared to hard voting, suggesting better sensitivity to the subtle, widespread changes in early neurodegeneration. This distribution difference correlates with the overlap coefficient metrics in Table 9, where higher values indicate better sensitivity to early biomarkers. The balanced color distribution in stacking’s LIME visualizations provides a potential explanation for its superior early detection capabilities.
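A minimal LIME sketch in the same spirit is given below, assuming the lime package. The stand-in predict_fn returns random class probabilities purely so the example runs; in practice it would wrap the ensemble's probability output over CN/MCI/AD.

```python
# Hedged LIME sketch matching Figure 15's green/red attributions
# (assumes the `lime` package; predict_fn is a placeholder classifier).
import numpy as np
from lime import lime_image

def predict_fn(images):                    # stand-in for the ensemble
    rng = np.random.default_rng(0)
    p = rng.random((len(images), 3))       # pseudo CN/MCI/AD probabilities
    return p / p.sum(axis=1, keepdims=True)

image = np.random.default_rng(1).random((224, 224, 3))   # stand-in slice
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, predict_fn,
                                         top_labels=3, num_samples=200)
# positive_only=False keeps both green (positive) and red (negative) regions.
img, mask = explanation.get_image_and_mask(explanation.top_labels[0],
                                           positive_only=False,
                                           num_features=5, hide_rest=False)
```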

4.6.2. Saliency Map Results Comparison

Finally, the saliency maps in Figure 16 provide an additional perspective on feature importance in our ensemble classification approaches (a vanilla-gradient sketch follows the list).
  • Brightness Intensity and Diagnostic Focus: Saliency maps visualize feature importance through brightness intensity: bright red indicates high importance, while darker areas represent low importance. Both ensembles show similar patterns, though stacking exhibits marginally more concentrated bright spots in the hippocampal and temporal regions, correlating with its slightly superior precision (98.0%).
  • Diagnostic-Specific Brightness Patterns: Each diagnostic category shows characteristic brightness distributions across both methods: AD cases with concentrated bright spots in temporal structures, MCI with distributed patterns across broader regions, and CN with fewer, less intense bright spots, validating neuroanatomical progression patterns.
  • Early Detection Enhancement: For early-stage MCI, stacking shows slightly more nuanced brightness variations in the cortical and subcortical regions, potentially capturing subtle structural changes indicative of early neurodegeneration, contributing to enhanced sensitivity to early biomarkers (Table 9).
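Vanilla gradient saliency of the kind shown in Figure 16 reduces to a single backward pass, as in the sketch below (a randomly initialized ResNet-18 stands in for a trained base model; this is an illustration, not the study's exact pipeline).

```python
# Vanilla saliency sketch for Figure 16: gradient of the top-class score
# with respect to the input, visualized by magnitude (assumed procedure).
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()        # placeholder base model
x = torch.randn(1, 3, 224, 224, requires_grad=True)

score = model(x)[0].max()                    # top-class logit
score.backward()                             # populate x.grad

saliency = x.grad.abs().max(dim=1)[0]        # (1, 224, 224) importance map
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())
print(saliency.shape)                        # normalized to [0, 1] for display
```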

4.7. Comparative Evaluation

Table 14 provides a comparative analysis of ensemble techniques for AD classification. While [31] reports marginally higher accuracy (98.27% vs. our 98.00%), our stacking framework offers several significant advantages: (1) our approach implements three complementary explainability techniques (Grad-CAM, LIME, and saliency maps) versus the more limited interpretability methods of [31]; (2) our integration of a Vision Transformer introduces state-of-the-art self-attention mechanisms absent from [31]; (3) our framework demonstrates particular strength in MCI classification, with 897 correctly classified instances and only 13 misclassifications (see Figure 6); (4) our ensemble maintains high performance even under increased noise conditions (Figure 12), a critical advantage for clinical deployment not addressed in [31]; and (5) as shown in Figure 6, our model maintains consistently high performance across all three diagnostic categories (CN, MCI, AD), whereas [31] showed higher variability in classification accuracy between stages. Overall, the stacking framework leverages diverse base models to achieve an optimal balance of diagnostic precision and interpretability while maintaining state-of-the-art classification performance.
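For completeness, the two ensembling schemes compared throughout can be expressed compactly as follows. The sketch assumes precomputed base-model class probabilities and uses logistic regression as the meta-learner, one common choice for Wolpert-style stacking [26]; it is illustrative rather than our exact training setup.

```python
# Hedged sketch of hard voting vs. stacking over 7 base models and 3 classes
# (CN/MCI/AD). Base probabilities here are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_models, n_classes = 1000, 7, 3
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n))  # (7, n, 3)
y = rng.integers(0, n_classes, size=n)                         # placeholder labels

# Hard voting: majority over each model's argmax prediction.
votes = probs.argmax(axis=2)                                   # (7, n)
hard_pred = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=n_classes).argmax(), 0, votes)

# Stacking: concatenate base probabilities into meta-features for a
# logistic-regression meta-learner (one common choice of meta-model).
meta_X = probs.transpose(1, 0, 2).reshape(n, n_models * n_classes)
stack_pred = LogisticRegression(max_iter=1000).fit(meta_X, y).predict(meta_X)
```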

5. Conclusions and Future Work

This research proposed a comprehensive ensemble approach for AD diagnosis using the ADNI dataset. The proposed pipeline leveraged a diverse set of models, including CNN architectures and a ViT, to enhance the accuracy and robustness of the analysis for the three-way classification problem (CN, MCI, AD). Stacking emerged as the most effective ensemble technique, achieving a diagnostic accuracy of 98.0%, while hard voting closely followed at 97.0%. The stacking method demonstrated a refined focus on critical neurodegenerative regions such as the hippocampus and cortical areas, which are vital for detecting early cognitive impairments like MCI. The use of multiple visualization methods highlighted the spatial attention patterns and provided clinicians with actionable insights, making the diagnostic process both transparent and interpretable.
This work advances the field of XAI in medical imaging by demonstrating that accuracy and interpretability can coexist without compromise. The multi-method explainability approach using Grad-CAM, LIME, and Saliency Maps revealed critical insights into model decision-making processes, with each method providing complementary perspectives on the classification decisions. This comprehensive explainability framework addresses the limitations of individual methods, providing clinically relevant insights that enhance the trustworthiness of the classification results.
Future research should prioritize integrating multimodal data, such as PET scans, genetic markers, and cognitive test scores, which would enrich the diagnostic precision and offer a more comprehensive understanding of disease progression. Future work may implement domain-specific pretraining on large medical imaging datasets before fine-tuning on Alzheimer’s datasets. Additionally, incorporating anatomical atlases to precisely map visualization outputs to specific brain structures and conducting correlation analysis between model attention patterns and quantitative measurements of hippocampal and entorhinal cortex volumes would further enhance the approach.
Given the subtle changes associated with early-stage cognitive impairments, additional efforts should focus on improving sensitivity to these features, possibly by incorporating temporal modeling to analyze longitudinal MRI data and capture disease progression over time. The error analysis highlighted that the most challenging misclassification type was identifying severe cases (AD) incorrectly as normal (CN), suggesting that focused refinement of models for these critical error cases could further improve clinical utility. Finally, validation in larger, more diverse clinical datasets will be essential to ensure the generalizability of these findings across different patient populations and imaging protocols.

Author Contributions

Conceptualization was performed by O.T.A., B.O., T.E.A., O.O.E.P., A.O.A., M.M.R. and F.K.; methodology was developed by O.T.A.; software implementation was carried out by O.T.A.; validation was conducted by B.O., F.K. and M.M.R.; formal analysis and investigation were undertaken by O.T.A.; resources were provided by O.T.A. and M.M.R.; data curation was handled by O.T.A.; writing—original draft preparation was done by O.T.A., B.O., F.K. and M.M.R.; writing—review and editing was carried out by O.T.A., B.O., T.E.A., O.O.E.P., A.O.A., M.M.R. and F.K.; visualization was performed by O.T.A., B.O., T.E.A., O.O.E.P., A.O.A., F.K. and M.M.R.; supervision was carried out by B.O., F.K. and M.M.R.; project administration was managed by B.O., F.K. and M.M.R.; and funding acquisition was led by M.M.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Center for Equitable Artificial Intelligence and Machine Learning Systems (CEAMLS) at Morgan State University. Additionally, it is supported in part by the National Science Foundation (NSF) under Grant No. 2131307, “CISE-MSI: DP: IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards Developing an Effective Decision Support System”, and in part by the Office of the Director, National Institutes of Health (NIH) Common Fund under Award No. 1OT2OD032581-01. The content is solely the responsibility of the authors and does not necessarily represent the official views of AIM-AHEAD, the NIH, or any other funding agencies.

Informed Consent Statement

This work used publicly available data.

Data Availability Statement

The data presented in this study were obtained from the ADNI database following an approved access request. While these datasets are not publicly accessible, they can be obtained by submitting a request through the ADNI portal at https://adni.loni.usc.edu/ (accessed on 1 January 2025). As data were obtained from a controlled-access database and not collected directly from participants by the authors, the informed consent statement is not applicable (N/A). The source code and implementation details can be found in the GitHub repository at https://github.com/opeyemiTaiwo/Explainable-Framework-for-Early-Detection-of-Alzheimer-s-Disease (accessed on 1 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Razzak, I.; Naz, S.; Ashraf, A.; Khalifa, F.; Bouadjenek, M.R.; Mumtaz, S. Mutliresolutional Ensemble PartialNet for Alzheimer Detection using Magnetic Resonance Imaging Data. Int. J. Intell. Syst. 2022, 37, 3708–3821.
  2. Alzheimer’s Association Report. 2024 Alzheimer’s disease facts and figures. Alzheimer’s Dement. 2024, 20, 3708–3821.
  3. Lazarova, S.; Grigorova, D.; Petrova-Antonova, D. Detection of Alzheimer’s Disease Using Logistic Regression and Clock Drawing Errors. Brain Sci. 2023, 13, 1139.
  4. Golestani, R.; Gharbali, A.; Nazarbaghi, S. Assessment of Linear Discrimination and Nonlinear Discrimination Analysis in Diagnosis Alzheimer’s Disease in Early Stages. Adv. Alzheimer’s Dis. 2020, 9, 21–32.
  5. Popescu, S.G.; Whittington, A.; Gunn, R.N.; Matthews, P.M.; Glocker, B.; Sharp, D.J.; Cole, J.H.; Initiative, F.T.A.D.N. Nonlinear biomarker interactions in conversion from mild cognitive impairment to Alzheimer’s disease. Hum. Brain Mapp. 2020, 41, 4406–4418.
  6. Doshi-Velez, F.; Kim, B. Towards a Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.
  7. Razzak, I.; Naz, S.; Alinejad-Rokny, H.; Nguyen, T.N.; Khalifa, F. A Cascaded Mutliresolution Ensemble Deep Learning Framework for Large Scale Alzheimer’s Disease Detection using Brain MRIs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2024, 21, 573–583.
  8. Agarwal, S.; Jabbari, S.; Agarwal, C.; Upadhyay, S.; Wu, Z.S.; Lakkaraju, H. Towards the Unification and Robustness of Perturbation and Gradient Based Explanations. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139, Virtual, 18–24 July 2021. Available online: https://arxiv.org/abs/2102.10618 (accessed on 1 March 2024).
  9. Alami, A.; Boumhidi, J.; Chakir, L. Explainability in CNN-based Deep Learning models for medical image classification. In Proceedings of the International Symposium on Computer Vision, Fez, Morocco, 8–10 May 2024.
  10. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2016; pp. 1135–1144.
  11. Rezk, N.G.; Alshathri, S.; Sayed, A.; Hemdan, E.E.-D.; El-Behery, H. XAI-Augmented Voting Ensemble Models for Heart Disease Prediction: A SHAP and LIME-Based Approach. Bioengineering 2024, 11, 1016.
  12. Bloch, L.; Friedrich, C.M. Systematic comparison of 3D Deep learning and classical machine learning explanations for Alzheimer’s Disease detection. Comput. Biol. Med. 2024, 170, 108029.
  13. Sattarzadeh, S.; Sudhakar, M.; Plataniotis, K.N.; Jang, J.; Jeong, Y.; Kim, H. Integrated Grad-CAM: Sensitivity-Aware Visual Explanation of Deep Convolutional Networks via Integrated Gradient-Based Scoring. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1775–1779.
  14. Shah, S.T.H.; Khan, I.I.; Imran, A.; Shah, S.B.H.; Mehmood, A.; Qureshi, S.A.; Raza, M.; Di Terlizzi, A.; Cavagliá, M.; Deriu, M.A. Data-driven classification and explainable-AI in the field of lung imaging. Front. Big Data 2024, 7, 1393758.
  15. Salahuddin, Z.; Woodruff, H.C.; Chatterjee, A.; Lambin, P. Transparency of Deep Neural Networks for Medical Image Analysis: A Review of Interpretability Methods. Comput. Biol. Med. 2022, 140, 105111.
  16. Ijiga, A.C.; Igbede, M.A.; Ukaegbu, C.; Olatunde, T.I.; Olajide, F.I.; Enyejo, L.A. Precision healthcare analytics: Integrating ML for automated image interpretation, disease detection, and prognosis prediction. World J. Biol. Pharm. Health Sci. 2024, 18, 336–354.
  17. Shivhare, I.; Jogani, V.; Purohit, J.; Shrawne, S.C. Analysis of Explainable Artificial Intelligence Methods on Medical Image Classification. In Proceedings of the International Conference on Artificial Intelligence and Emerging Technologies, Bhilai, India, 5–6 January 2023.
  18. Rodrigues, C.M.; Boutry, N.; Najman, L. Transforming gradient-based techniques into interpretable methods. Pattern Recognit. Lett. 2024, 184, 66–73.
  19. Muzellec, S.; Andéol, L.; Fel, T.; VanRullen, R.; Serre, T. Gradient strikes back: How filtering out high frequencies improves explanations. arXiv 2023, arXiv:2307.09591.
  20. Pelka, O.; Friedrich, C.M.; Nensa, F.; Mönninghoff, C.; Bloch, L.; Jöckel, K.-H.; Schramm, S.; Hoffmann, S.S.; Winkler, A.; Weimar, C.; et al. Sociodemographic data and APOE-ε4 augmentation for MRI-based detection of amnestic mild cognitive impairment using deep learning systems. PLoS ONE 2020, 15, e0236868.
  21. Zeineldin, R.A.; Karar, M.E.; Elshaer, Z.; Coburger, J.; Wirtz, C.R.; Burgert, O.; Mathis-Ullrich, F. Explainability of deep neural networks for MRI analysis of brain tumors. Int. J. Comput. Assist. Radiol. Surg. 2022, 17, 1673–1683.
  22. Rahman, M.M.; Lewis, N.; Plis, S. Geometrically Guided Integrated Gradients. arXiv 2022, arXiv:2206.05903.
  23. Band, S.S.; Yarahmadi, A.; Hsu, C.-C.; Biyari, M.; Sookhak, M.; Ameri, R.; Dehzangi, I.; Chronopoulos, A.T.; Liang, H.-W. Application of explainable Artificial Intelligence in Medical Health: A Systematic Review of Interpretability Methods. Inform. Med. Unlocked 2023, 40, 101286.
  24. Qiu, L.; Yang, Y.; Cao, C.C.; Zheng, Y.; Ngai, H.; Hsiao, J.; Chen, L. Generating Perturbation-based Explanations with Robustness to Out-of-Distribution Data. In Proceedings of the ACM Web Conference 2022, New York, NY, USA, 25–29 April 2022.
  25. Alzheimer’s Disease Neuroimaging Initiative. ADNI Data and Samples. 2024. Available online: https://adni.loni.usc.edu/data-samples/adni-data/ (accessed on 1 January 2025).
  26. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
  27. Bonab, H.; Can, F. Less Is More: A Comprehensive Framework for the Number of Components of Ensemble Classifiers. arXiv 2017, arXiv:1709.02925.
  28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  29. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  30. Richardson, E.; Trevizani, R.; Greenbaum, J.A.; Carter, H.; Nielsen, M.; Peters, B. The Receiver Operating Characteristic Curve Accurately Assesses Imbalanced Datasets. Patterns 2024, 5, 100994.
  31. Adarsh, V.; Gangadharan, G.R.; Fiore, U.; Zanetti, P. Multimodal classification of Alzheimer’s disease and Mild Cognitive Impairment using Custom MKSCDDL Kernel over CNN with Transparent Decision-Making for Explainable Diagnosis. Sci. Rep. 2024, 14, 1774.
  32. Mahmud, T.; Barua, K.; Habiba, S.U.; Sharmen, N.; Hossain, M.S.; Andersson, K. An Explainable AI Paradigm for Alzheimer’s Diagnosis Using Deep Transfer Learning. Diagnostics 2024, 14, 345.
  33. Duamwan, L.M.; Bird, J.J. Explainable AI for Medical Image Processing: A Study on MRI in Alzheimer’s Disease. In Proceedings of the PETRA ’23: 16th International Conference on Pervasive Technologies Related to Assistive Environments, Corfu, Greece, 5–7 July 2023; pp. 480–484.
  34. El-Sappagh, S.; Alonso, J.M.; Islam, S.M.R.; Sultan, A.M.; Kwak, K.S. A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease. Sci. Rep. 2021, 11, 2660.
Figure 1. Sample MRI images labeled to show structural brain differences.
Figure 2. Stacking ensemble architecture for multi-model classification.
Figure 3. Hard voting ensemble architecture for multi-model classification.
Figure 4. The 95% confidence intervals of performance metrics for ViT and each of the base models.
Figure 5. The 95% confidence intervals of performance metrics for each of the ensemble architectures.
Figure 6. Confusion matrix for (a) hard voting and (b) stacking ensemble classifiers for Alzheimer’s disease classification.
Figure 7. Hard voting model accuracy compared with the individual models.
Figure 8. Stacking model accuracy compared with the individual models.
Figure 9. ROC curves for (a) hard voting and (b) stacking ensemble classifiers in Alzheimer’s disease classification.
Figure 10. Combined accuracy and AUC-ROC metrics for individual models in Alzheimer’s disease classification.
Figure 11. Models’ computational cost and performance analysis.
Figure 12. Model robustness to noise.
Figure 13. Ablation studies of the ViT and base models.
Figure 14. Grad-CAM visualization for (a) hard voting and (b) stacking ensemble classifiers.
Figure 15. LIME explanation for (a) hard voting and (b) stacking ensemble classifiers.
Figure 16. Saliency map for (a) hard voting and (b) stacking ensemble classifiers.
Table 1. Performance metrics of base models and ViT for Alzheimer’s disease classification. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Models | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| ViT | 98.0% | 98.0% | 97.0% | 98.0% |
| VGG-19 | 83.0% | 84.0% | 83.0% | 84.0% |
| ResNet | 65.0% | 78.0% | 65.0% | 66.0% |
| MobileNet | 87.0% | 87.0% | 87.0% | 87.0% |
| EfficientNet | 41.0% | 16.0% | 41.0% | 23.0% |
| DenseNet | 88.0% | 89.0% | 87.0% | 88.0% |
| CNN | 94.0% | 94.0% | 94.0% | 94.0% |
Table 2. Performance comparison of ensemble techniques in Alzheimer’s classification.

| Ensemble Techniques | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Stacking | 98.0% | 98.0% | 97.0% | 97.0% |
| Hard Voting | 97.0% | 97.0% | 97.0% | 97.0% |
Table 3. False positive and false negative rates by model.

| Model | CN→MCI (FP) | MCI→AD (FP) | MCI→CN (FN) | AD→MCI (FN) |
|---|---|---|---|---|
| ViT | 0.17% | 0.65% | 0.17% | 0.65% |
| CNN | 1.52% | 1.59% | 1.52% | 1.59% |
| DenseNet | 5.25% | 4.69% | 5.25% | 4.69% |
| MobileNet | 5.62% | 4.18% | 5.62% | 4.18% |
| VGG19 | 7.84% | 8.08% | 7.84% | 8.08% |
| ResNet | 6.92% | 22.70% | 6.92% | 22.70% |
| EfficientNet | 50.00% | 50.00% | 50.00% | 50.00% |
Table 4. Summary of misclassification patterns for the stacking classifier.

| Error Type | % of Errors | Error Category | Clinical Impact | Best Models | Best Accuracy | ViT Accuracy |
|---|---|---|---|---|---|---|
| 2→0 | 44.44% | False Negative | High | ResNet | 75.00% | 0.00% |
| 0→1 | 22.22% | False Positive | Medium | DenseNet, VGG19 | 100.00% | 0.00% |
| 1→2 | 11.11% | False Positive | Medium–High | Multiple | 100.00% | 0.00% |
| 2→1 | 11.11% | False Negative | High | DenseNet | 100.00% | 0.00% |
| 1→0 | 11.11% | False Negative | Medium–High | Multiple | 100.00% | 0.00% |
Table 5. Model performance on error types. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Model | 2→0 (4) | 0→1 (2) | 1→2 (1) | 2→1 (1) | 1→0 (1) | Overall |
|---|---|---|---|---|---|---|
| DenseNet | 0.00% | 100.00% | 100.00% | 100.00% | 0.00% | 44.44% |
| ResNet | 75.00% | 50.00% | 0.00% | 0.00% | 0.00% | 44.44% |
| VGG19 | 25.00% | 100.00% | 0.00% | 0.00% | 100.00% | 44.44% |
| CNN | 25.00% | 50.00% | 0.00% | 0.00% | 100.00% | 33.33% |
| EfficientNet | 0.00% | 0.00% | 100.00% | 0.00% | 100.00% | 22.22% |
| MobileNet | 0.00% | 50.00% | 100.00% | 0.00% | 0.00% | 22.22% |
| ViT | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
Table 6. Error instance analysis for the stacking architecture.

| Error Type | Instance Index | Correct Models | # Correct | % Agreement |
|---|---|---|---|---|
| 2→0 | 604 | CNN, ResNet | 2/7 | 28.57% |
| 2→0 | 1224 | ResNet | 1/7 | 14.29% |
| 2→0 | 2045 | VGG19 | 1/7 | 14.29% |
| 2→0 | 2116 | ResNet | 1/7 | 14.29% |
| 0→1 | 1353 | CNN, DenseNet, VGG19 | 3/7 | 42.86% |
| 0→1 | 1550 | Multiple | 4/7 | 57.14% |
| 1→2 | 487 | Multiple | 3/7 | 42.86% |
| 2→1 | 1326 | DenseNet | 1/7 | 14.29% |
| 1→0 | 1798 | Multiple | 3/7 | 42.86% |
Table 7. Critical error analysis (2→0 error type). Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Model | Correct 2→0 | % Correct | False Negative Rate |
|---|---|---|---|
| ResNet | 3/4 | 75.00% | 25.00% |
| CNN | 1/4 | 25.00% | 75.00% |
| VGG19 | 1/4 | 25.00% | 75.00% |
| DenseNet | 0/4 | 0.00% | 100.00% |
| EfficientNet | 0/4 | 0.00% | 100.00% |
| MobileNet | 0/4 | 0.00% | 100.00% |
| ViT | 0/4 | 0.00% | 100.00% |
Table 8. MCI classification performance across architectures. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Architecture | MCI Accuracy | Misclassified as CN | Misclassified as AD | Mean Confidence | Mean Uncertainty |
|---|---|---|---|---|---|
| ViT | 99.78% | 0.11% | 0.11% | 0.998 | 0.004 |
| EfficientNet | 100.00% | 0.00% | 0.00% | 0.399 | 1.533 |
| CNN | 98.46% | 1.43% | 0.11% | 0.979 | 0.054 |
| DenseNet | 94.07% | 4.84% | 1.10% | 0.890 | 0.350 |
| MobileNet | 93.19% | 6.26% | 0.55% | 0.877 | 0.369 |
| VGG19 | 89.78% | 5.16% | 5.05% | 0.806 | 0.561 |
| ResNet | 50.66% | 7.25% | 42.09% | 0.522 | 0.713 |
Table 9. Early biomarker detection at the CN to MCI boundary. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Architecture | Overlap Coefficient | Confusion Rate | Confidence Gap |
|---|---|---|---|
| EfficientNet | 0.9500 | 0.5000 | 0.0000 |
| ViT | 0.9423 | 0.0017 | 0.0017 |
| CNN | 0.9270 | 0.0152 | 0.0032 |
| MobileNet | 0.8980 | 0.0562 | 0.0062 |
| DenseNet | 0.8956 | 0.0525 | 0.0122 |
| VGG19 | 0.8668 | 0.0784 | 0.0097 |
| ResNet | 0.8055 | 0.0692 | 0.0143 |
Table 10. Model interpretability metrics. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Architecture | Confidence Mean | Confidence Std | Mean Entropy | Max Entropy | Difficult Cases |
|---|---|---|---|---|---|
| ViT | 0.998 | 0.035 | 0.004 | 1.024 | 91 |
| CNN | 0.979 | 0.108 | 0.054 | 1.545 | 91 |
| DenseNet | 0.890 | 0.191 | 0.350 | 1.582 | 91 |
| MobileNet | 0.877 | 0.211 | 0.369 | 1.578 | 91 |
| VGG19 | 0.806 | 0.241 | 0.561 | 1.580 | 91 |
| ResNet | 0.522 | 0.365 | 0.713 | 1.585 | 91 |
| EfficientNet | 0.399 | 0.001 | 1.533 | 1.538 | 91 |
Table 11. MCI to AD boundary metrics. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Architecture | Overlap Coefficient | Confusion Rate | Confidence Gap |
|---|---|---|---|
| ViT | 0.9286 | 0.0065 | 0.0038 |
| CNN | 0.8958 | 0.0159 | 0.0087 |
| VGG19 | 0.8222 | 0.0808 | 0.0616 |
| DenseNet | 0.8191 | 0.0469 | 0.0128 |
| MobileNet | 0.8323 | 0.0418 | 0.0281 |
| ResNet | 0.5673 | 0.2270 | 0.0312 |
| EfficientNet | 0.0000 | 0.5000 | 0.0000 |
Table 12. Model interpretability and clinical relevance. Here, ViT, VGG, ResNet, and CNN stand for Vision Transformer, Visual Geometry Group, Residual Network, and Convolutional Neural Network, respectively.

| Architecture | Interpretability Score | Visualization Approach | Biomarker Alignment | Clinical Explainability | Region Specificity |
|---|---|---|---|---|---|
| CNN | High | Gradient-based | High | High | High |
| ResNet | High | Gradient-based | High | High | High |
| VGG19 | High | Gradient-based | High | High | Medium |
| DenseNet | High | Gradient-based | High | High | Medium |
| EfficientNet | Medium–High | Efficient feature | High | Medium | Medium |
| MobileNet | Medium–High | Efficient feature | High | Medium | Medium |
| ViT | Medium | Attention-based | Medium | Medium | Medium |
Table 13. Computational complexity analysis of the neural network models.

| Model | Size (MB) | Parameters (M) | Layers | Avg. Inference Time (ms) | Memory Usage (MB) | Trainable/Non-Trainable |
|---|---|---|---|---|---|---|
| CNN | 27.62 | 7.24 | 20 | 0.43 | 1027.84 | 7.24 M/0 |
| DenseNet | 79.93 | 20.95 | 718 | 2.92 | 222.47 | 2.63 M/18.33 M |
| EfficientNet | 22.99 | 6.03 | 249 | 1.55 | 40.95 | 3.32 M/2.70 M |
| MobileNet | 16.12 | 4.23 | 162 | 0.97 | 766.00 | 1.97 M/2.26 M |
| ResNet | 100.49 | 26.34 | 183 | 1.64 | 83.51 | 11.69 M/14.66 M |
| VGG19 | 127.04 | 33.30 | 39 | 2.18 | 36.14 | 13.26 M/20.04 M |
| ViT | 332.34 | 87.12 | 27 | 5.15 | 71.95 | 87.12 M/0.003 M |
Table 14. Performance comparison of ensemble techniques in Alzheimer’s classification and related methods.

| Ref | Classifier | Best Accuracy Score | XAI Method | Dataset |
|---|---|---|---|---|
| Mahmud et al. [32] | DenseNet169 and DenseNet201 | 96.00% | Saliency Maps, Grad-CAM | MRI Scans OASIS |
| Duamwan et al. [33] | CNN | 94.96% | LIME | ADNI MRI |
| El-Sappagh et al. [34] | Random Forest | 93.95% | SHAP | ADNI Multimodal with 12 Features |
| Adarsh et al. [31] | CNN + MKSCDDL + Scandent Decision Trees | 98.27% | Scandent Decision Trees, Discriminative Dictionary Learning | ADNI MRI |
| This Study | Stacking Ensemble | 98.00% | Grad-CAM, LIME, Saliency Map | ADNI MRI |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
