Article

Deep Learning Approaches with Explainable AI for Differentiating Alzheimer’s Disease and Mild Cognitive Impairment

1 School of Mathematical and Natural Sciences, Arizona State University, Phoenix, AZ 85051, USA
2 Department of Mathematics, Florida Gulf Coast University, Fort Myers, FL 33928, USA
3 Department of Statistics, University of South Carolina, Columbia, SC 29225, USA
4 Department of Public Health, Julia Jones Matthews School of Population and Public Health, Texas Tech University Health Sciences Center, Lubbock, TX 79409, USA
* Author to whom correspondence should be addressed.
AppliedMath 2025, 5(4), 171; https://doi.org/10.3390/appliedmath5040171
Submission received: 9 October 2025 / Revised: 18 November 2025 / Accepted: 24 November 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Optimization and Machine Learning)

Abstract

Early and accurate diagnosis of Alzheimer’s disease is critical for effective clinical intervention, particularly in distinguishing it from mild cognitive impairment, a prodromal stage marked by subtle structural changes. In this study, we propose a hybrid deep learning ensemble framework for Alzheimer’s disease classification using structural magnetic resonance imaging. Gray and white matter slices are used as inputs to three pretrained convolutional neural networks: ResNet50, NASNet, and MobileNet, each fine-tuned through an end-to-end process. To further enhance performance, we incorporate a stacked ensemble learning strategy with a meta-learner and weighted averaging to optimally combine the base models. Evaluated on the Alzheimer’s Disease Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art accuracy of 99.21% for Alzheimer’s disease vs. mild cognitive impairment and 91.02% for mild cognitive impairment vs. normal controls, outperforming conventional transfer learning and baseline ensemble methods. To improve interpretability in image-based diagnostics, we integrate Explainable AI techniques via Gradient-weighted Class Activation Mapping (Grad-CAM), which generates heatmaps and attribution maps that highlight critical regions in gray and white matter slices, revealing structural biomarkers that influence model decisions. These results highlight the framework’s potential for robust and scalable clinical decision support in neurodegenerative disease diagnostics.

1. Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder and the most common cause of dementia worldwide, affecting millions of elderly individuals and placing immense socio-economic burdens on healthcare systems globally [1,2,3]. The disease gradually impairs memory, reasoning, and the ability to carry out everyday activities [4]. Distinct from normal aging, it is linked to the accumulation of abnormal amyloid plaques and tau tangles in the brain, which harm and eventually destroy nerve cells. Common signs include forgetfulness, disorientation, difficulty with problem-solving or planning, and shifts in mood or personality, all of which intensify as the illness progresses [5]. Although there is currently no cure, available treatments may ease symptoms, and certain lifestyle adjustments can provide additional support. With the aging global population, early detection of AD has become a crucial area of research to enable timely interventions and slow disease progression [6]. Despite significant advances in biomarker discovery, the clinical diagnosis of AD remains challenging due to its complex pathology and overlapping symptoms with other cognitive disorders [7,8,9].
Descriptive epidemiological data on AD indicate that both prevalence and incidence rise steadily with advancing age, reaching their peak among the elderly, with women affected more often than men [10]. From a demographic perspective, AD represents a growing public health concern as populations continue to age, contributing heavily to illness and death rates [11]. Although regional variations are observed, the disorder affects communities worldwide, creating a considerable strain on caregivers as well as healthcare infrastructure [12].
Traditionally, AD diagnosis has relied on cognitive tests such as the Mini-Mental State Examination (MMSE) and the Clinical Dementia Rating (CDR), as well as invasive tests like cerebrospinal fluid (CSF) biomarker assays and positron emission tomography (PET) scans [13,14,15]. While accurate, these procedures are often expensive, invasive, and unavailable in primary care settings. As a result, MRI-based analysis has gained traction due to its non-invasive nature and ability to reveal structural changes in the brain, such as hippocampal atrophy, that are indicative of AD [16,17,18].
Alzheimer’s disease can be identified through a combination of clinical, cognitive, and biological assessments [19,20]. Physicians typically begin with a detailed medical and family history alongside evaluations of symptom progression, focusing on memory decline and difficulties in daily functioning. Cognitive screening tools, such as the MMSE or the Montreal Cognitive Assessment (MoCA) [21], are frequently employed to measure impairments in memory, language, and reasoning. Neurological and physical examinations help exclude other potential causes of cognitive decline, while neuroimaging techniques like MRI or CT scans can reveal brain atrophy and rule out alternative conditions [22]. Advanced imaging, such as PET scans, may further detect amyloid plaques or tau tangles characteristic of AD [23,24]. In addition, biomarker analyses of cerebrospinal fluid or blood provide evidence of abnormal amyloid and tau protein levels, and in rare hereditary cases, genetic testing may be used to confirm early-onset disease. Together, these methods contribute to a comprehensive and reliable diagnosis of AD.
Over the past decade, deep learning (DL) has emerged as a transformative approach for medical image analysis. Convolutional neural networks, in particular, have demonstrated remarkable success in detecting patterns in MRI scans for AD classification [25,26,27]. Compared to traditional machine learning (ML) methods, DL techniques eliminate the need for handcrafted features, instead learning hierarchical representations directly from the raw imaging data [28,29]. These models can capture subtle structural changes associated with different stages of AD, including normal controls (NC), mild cognitive impairment (MCI), and AD itself [30,31,32].
Despite these promising developments, several challenges persist. Many CNN-based studies have limited generalizability due to small sample sizes or single-source datasets. Furthermore, standalone models often suffer from reduced accuracy when applied to real-world clinical settings with diverse imaging protocols and patient demographics [17,33]. To address these limitations, ensemble methods that combine multiple models have been explored, leveraging model diversity to improve robustness and performance [25,33].
This study aims to address these gaps by introducing a new framework as follows:
  • We propose a novel ensemble diagnostic pipeline for Alzheimer’s disease (AD) classification that integrates transfer learning, weighted averaging, and stacked generalization into a unified framework.
  • Our method leverages the complementary strengths of pretrained architectures, including ResNet50, NASNet, and MobileNet [34,35,36], to enhance feature representation and robustness.
  • A meta-learner is employed to fuse predictions from multiple base models, leading to improved accuracy and generalization across diverse patient populations.
  • We rigorously evaluate the proposed method on the ADNI dataset, ensuring clinical relevance and comparability with established diagnostic benchmarks.
  • Our approach achieves superior performance compared with existing baselines, particularly in distinguishing between early AD and MCI, a challenging diagnostic boundary in clinical practice.
  • We incorporate interpretability through Grad-CAM overlays (Figure 1), which demonstrate that the model consistently attends to clinically relevant neuroanatomical regions associated with AD progression.
  • By integrating multiple architectures with decision fusion strategies, the proposed pipeline provides a robust and scalable diagnostic tool with strong potential for real-world clinical deployment.
The rest of this paper is organized as follows. Section 2 reviews related work in deep learning for Alzheimer’s disease detection. Section 3 presents the data source and preprocessing steps, along with the proposed ensemble methodology. Section 4 discusses experimental results and comparative evaluations, including XAI analysis of predictions using Grad-CAM. Section 5 concludes this paper with future directions.

2. Related Works

Several studies and recent research have proposed hybrid and ensemble DL approaches to enhance classification accuracy and generalizability in AD and MCI. For instance, Mmadumbu et al. [33] developed a hybrid system that integrates ResNet50 and MobileNetV2 for MRI-based classification, achieving an accuracy of over 96%. Other approaches have combined CNNs with long short-term memory (LSTM) networks for multimodal analysis, incorporating both imaging and clinical data [37]. The introduction of stacked generalization, where outputs of base learners are fed into a meta-classifier, further refines decision boundaries and has demonstrated improved sensitivity in distinguishing between MCI and AD [25]. Bossa and Sahli [31] proposed a differential equation-based disease progression model that simulates individual biomarker trajectories and clinical outcomes. This system accounts for the heterogeneity observed in AD progression and shows promise for predicting MCI-to-AD conversion. Similarly, Cheung et al. [32] explored retinal imaging as an alternative to MRI, applying DL to retinal photographs and achieving classification accuracies exceeding 90%. These novel modalities may complement neuroimaging and offer less invasive alternatives for community-based screening [38,39]. In Junior et al.’s [40] study, ADNI dataset MRI scans were categorized into three groups: AD, MCI, and normal cognition. Their model achieved 85% accuracy, demonstrating the potential of XAI-based DL for transparent, clinically relevant AD diagnosis; Local Interpretable Model-agnostic Explanations (LIME) and Grad-CAM were used to highlight brain regions important to predictions, especially changes near the hippocampus in MCI. Anzum et al. [41] proposed a method that combines RNA text data with brain MRI images to improve diagnostic precision, with a focus on improving AD detection by combining transformer models and advanced computer vision algorithms.
This is in contrast to traditional imaging approaches that use MRI, CT, or PET, which achieve accuracies of 80–90%. The "black-box" nature of AI models impedes clinical application, according to a thorough review of AD detection research employing XAI by Viswan et al. [42]. Organizing methods along conceptual axes (post hoc versus ante hoc, model-agnostic versus model-specific, local versus global), the paper analyzes well-known frameworks, including LRP, Grad-CAM, SHAP, and LIME. The study’s conclusions address the limitations, difficulties, and prospects for improving XAI in reliable AD diagnosis. Alotaibi et al. [43] described the Enhancing Automated Detection and Classification of Dementia in Thinking-Incapable People Using Artificial Intelligence Techniques (EADCD-TIPAIT) framework for the early detection of dementia. To extract valuable biomarkers, the method applies preprocessing and z-score normalization to MRI data and performs feature selection with a BGGO algorithm. A WNN classifier, with hyperparameters adjusted by an ISSA, is then used to detect and categorize dementia. The proposed EADCD-TIPAIT obtained 95% accuracy on a dementia prediction dataset, indicating its potential for reliable and efficient diagnosis. The study by Vlontzou et al. [44] provides an interpretable ML framework for enhancing the diagnosis of AD and MCI using volumetric MRI and genetic data. Both attribution-based and counterfactual-based interpretability approaches were employed to evaluate the strength of explanations, utilizing a combined strategy that incorporates SHAP and counterfactuals. The top model achieved an F1 score of 90.8% and a balanced accuracy of 87.5%. Important volumetric and genetic characteristics were identified as key risk factors, underscoring the framework’s potential for clear and clinically appropriate MCI/AD identification. In Fathi et al.’s [45] article, a lightweight convolutional neural network, FiboNeXt, was developed to identify AD from MRI images.
Based on the ConvNeXt architecture, the model reduces trainable parameters and increases efficiency by incorporating concatenation layers, attention mechanisms, and a design inspired by the Fibonacci sequence. Training and evaluation were carried out on two publicly available MRI datasets: the original and enlarged versions. FiboNeXt achieved test accuracies of 99.66% and 99.63%, and validation accuracies of 95.40% and 95.93%. The results establish FiboNeXt as a competitive solution for computer vision tasks in medical imaging, demonstrating its strong performance and potential uses beyond AD diagnosis. Recent advances in medical imaging analysis have significantly improved the early detection of pancreatic tumors through deep learning and transfer learning. Liu et al. (2023) demonstrated that large-scale non-contrast CT screening combined with a convolutional neural network could detect pancreatic cancer with clinical-grade accuracy [46]. Qiu et al. (2024) developed a cascaded segmentation framework to improve boundary delineation in CT-based tumor analysis [47]. Ozawa et al. (2025) further integrated detection and indirect imaging indicators to achieve robust identification of small pancreatic ductal adenocarcinomas [48]. Complementarily, Alaca (2025) converted CT images into graph structures and applied Whale Optimization with transfer learning to classify pancreatic tumors [49]. Extending this work, Alaca (2025) employed DARTS-optimized MobileViT networks for enhanced diagnostic accuracy using graph-based deep representations [50]. Together, these studies underscore the rapid evolution of hybrid and optimization-based deep learning pipelines for pancreatic cancer diagnosis. While many models perform well on curated test sets, their reliability in cross-site or cross-population applications is limited. Future research must address these limitations through domain adaptation, federated learning, and interpretable AI frameworks [51,52,53].

3. Methods and Materials

In this section, we present a robust ensemble learning methodology for the early diagnosis of AD using structural MRI. Our framework is based on the integration of multiple deep ConvNets with two key ensemble strategies: weighted averaging and stacked generalization. This methodology enhances prediction reliability by combining the outputs of diverse models through a meta-learner, thereby surpassing the limitations of traditional majority-voting techniques.

3.1. Data Source and Preprocessing

The ADNI dataset was selected for this study due to its high quality and suitability for AD research. Collected from imaging centers worldwide and preprocessed by ADNI-funded MRI laboratories, the dataset ensures standardized and reliable neuroimaging inputs [54]. To further enhance consistency, all images were uniformly scaled to 224 × 224 pixels. Additionally, we employed a tailored data augmentation strategy, including cropping, flipping, scaling, and brightness/contrast adjustments. These augmentation steps increased dataset diversity and reduced overfitting risks, enabling more robust training of deep learning models while preserving the semantic integrity of neuroanatomical structures. A simple pathway is given in Figure 2.
The experimental data were refined from the ADNI dataset and supplemented with the publicly available Augmented Alzheimer’s MRI dataset from Kaggle [55]. The combined dataset provides structural MRI scans categorized into AD, MCI, and NC groups, enabling robust deep ensemble training and evaluation. The dataset was divided into three subsets following a 60–20–20 split to ensure balanced representation and robust model evaluation. The training set (60%, 20,400 images) was used for model training and included both original and augmented MRI slices to improve generalization and reduce the risk of overfitting. The validation set (20%, 6800 images) was employed for hyperparameter tuning and early stopping to optimize model performance. The test set (20%, 6800 images) consisted primarily of original, non-augmented MRI slices and was reserved for unbiased final evaluation. This stratified split was designed to maintain proportional class distribution across subjects, ensuring the reliability and reproducibility of the ensemble deep learning framework. In total, the dataset consisted of about 34,000 structural MRI images representing 32 horizontal brain slices across four diagnostic categories: Mild Demented (28 subjects), Moderate Demented (2 subjects), Non-Demented (100 subjects), and Very Mild Demented (70 subjects). For analysis, the Non-Demented and Very Mild Demented classes were combined and treated as the normal control (NC) group. This structure allowed the deep ensemble learning models to be trained and tested on data that balanced variety from augmentation with authenticity from original MRI scans, supporting reliable Alzheimer’s disease classification.
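The stratified 60–20–20 split described above can be sketched in a few lines; the file identifiers and labels below are hypothetical stand-ins for the actual MRI slice lists:

```python
import random
from collections import defaultdict

def stratified_split(samples, train=0.6, val=0.2, seed=42):
    """Split (item, label) pairs 60/20/20 per class so that each subset
    preserves the overall class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append((item, label))
    train_set, val_set, test_set = [], [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n = len(items)
        n_tr, n_va = int(n * train), int(n * val)
        train_set += items[:n_tr]
        val_set += items[n_tr:n_tr + n_va]
        test_set += items[n_tr + n_va:]
    return train_set, val_set, test_set

# Toy example: 100 'AD' and 100 'NC' slice identifiers
samples = [(f"slice_{i}", "AD" if i < 100 else "NC") for i in range(200)]
tr, va, te = stratified_split(samples)
print(len(tr), len(va), len(te))  # 120 40 40
```

Because the split is performed per class, each subset keeps the same AD/NC ratio as the full collection, which is what "proportional class distribution" requires.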

3.2. Deep Learning Framework of Hybrid Ensemble

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote a dataset of $N$ structural MRI samples, where $x_i$ is an input sample (a set of slices representing gray and white matter) and $y_i \in \{0, 1\}$ represents the ground-truth label for binary classification (e.g., AD vs. MCI). Our proposed architecture employs $K$ base deep neural network classifiers $\{f_k(\cdot\,; \theta_k)\}_{k=1}^{K}$, each trained on the same dataset but with possibly different architectures or training initializations.

3.2.1. Base Learners: Transfer and Fine-Tuning

Each base model $f_k$ is a deep ConvNet initialized using pretrained weights from ImageNet. We adopt three well-established architectures: ResNet50, NASNet, and MobileNet, denoted as $f_1$, $f_2$, and $f_3$, respectively, from recent developments (see [34,35,36]). To adapt these models for medical imaging, we apply transfer learning followed by fine-tuning in two stages.
Transfer learning has become an essential paradigm in medical imaging applications, where labeled data is typically scarce. In this study, we leverage transfer learning through two critical phases: feature freezing and fine-tuning. Each plays a vital role in adapting pretrained ConvNets, originally trained on large-scale natural image datasets such as ImageNet [56], to the domain of structural MRI scans for AD and MCI classification.
  • Feature Freezing: Feature freezing refers to the strategy of keeping the weights of the convolutional base layers fixed and training only the newly added fully connected (dense) layers. This technique is grounded in the idea that early convolutional layers in deep neural networks capture generic low-level features such as edges, textures, and patterns, which are broadly transferable across visual domains—even between natural images and medical images. By freezing these layers, we preserve the robust visual representations learned from large datasets while reducing the computational burden and preventing overfitting—particularly valuable in small datasets like ADNI. In our experiments, after importing pretrained models like ResNet50, NASNet, and MobileNet, we remove the original classification head and replace it with custom dense layers tailored for binary classification (AD vs. MCI). These new layers are randomly initialized and trained on MRI data while keeping the backbone unchanged. The goal during this stage is to quickly adapt the model to the new classification task by optimizing only a small subset of parameters. This approach yields surprisingly strong baseline performance and serves as a low-cost initialization for deeper adaptation.
  • Fine-Tuning: While feature freezing captures general features, it does not fully exploit the domain-specific structure inherent in MRI images. Therefore, in the fine-tuning stage, we unfreeze a selected portion of the top layers of the convolutional base and retrain the network end to end using a small learning rate. This strategy allows the network to adapt higher-level features, such as spatial patterns in gray and white matter, which may be uniquely informative for distinguishing between AD and MCI. Fine-tuning is especially beneficial when there is a domain shift between the source dataset (e.g., ImageNet) and the target domain (e.g., neuroimaging). By updating the weights of the later layers, the model learns hierarchical features that are more semantically aligned with the target task. However, this stage must be executed carefully. A learning rate that is too high can disrupt previously learned general features, while too low a rate may not provide meaningful adaptation. We use a small learning rate (e.g., $2 \times 10^{-5}$) with adaptive optimization (Adam) and apply dropout to reduce overfitting. Together, feature freezing and fine-tuning form a powerful two-step training strategy that balances generalization and specificity. This approach enables the reuse of high-quality pretrained models while allowing deep adaptation to the specific characteristics of MRI data for Alzheimer’s diagnosis. Each base model $f_k$ outputs a probability $p_k(x_i) \in [0, 1]$ for the input $x_i$, interpreted as the predicted confidence of the sample belonging to the AD class.
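The paper implements this two-phase strategy in Keras; as a framework-agnostic illustration of the idea, the toy NumPy model below trains only a new head while a "pretrained" backbone stays frozen (phase 1), then unfreezes the backbone under a much smaller learning rate (phase 2). All data, dimensions, and learning rates here are synthetic assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

W_backbone = rng.normal(size=(16, 8))   # stands in for pretrained conv layers
w_head = np.zeros(8)                    # newly added, randomly initialized head

def forward(X, Wb, wh):
    H = np.tanh(X @ Wb)                 # "backbone" features
    return 1.0 / (1.0 + np.exp(-(H @ wh))), H

def bce(p, y):
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Phase 1 -- feature freezing: update only the head; backbone weights fixed
for _ in range(200):
    p, H = forward(X, W_backbone, w_head)
    w_head -= 0.5 * H.T @ (p - y) / len(y)

loss_frozen = bce(forward(X, W_backbone, w_head)[0], y)

# Phase 2 -- fine-tuning: unfreeze the backbone, use a much smaller step size
for _ in range(200):
    p, H = forward(X, W_backbone, w_head)
    grad_h = H.T @ (p - y) / len(y)
    dH = np.outer(p - y, w_head) * (1 - H**2)   # backprop through tanh
    grad_b = X.T @ dH / len(y)
    w_head -= 0.05 * grad_h
    W_backbone -= 0.01 * grad_b                 # small lr protects pretrained features

loss_tuned = bce(forward(X, W_backbone, w_head)[0], y)
print(round(loss_frozen, 3), round(loss_tuned, 3))
```

The frozen phase already drives the loss well below chance; careful fine-tuning with a smaller learning rate then refines it further without destroying the backbone, mirroring the rationale given above.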

3.2.2. Weighted Averaging Ensemble

At inference time, the goal is to predict the class label $\hat{y} \in \{0, 1\}$ for a previously unseen sample $x$. To do this, the input $x$ is first passed through all $K$ trained base models $f_1, f_2, \dots, f_K$. Each model outputs a probability indicating the likelihood that $x$ belongs to the positive class (e.g., AD). Two ensemble strategies can be employed to compute the final prediction from a weighted averaging ensemble. In this approach, we compute a convex combination of the base model predictions using the learned weight vector $\alpha \in \mathbb{R}^K$, where $\|\alpha\|_1 = 1$ and $\alpha_k \ge 0$ for all $k$. The ensemble probability is given by
$\hat{p} = \alpha^{\top} p = \sum_{k=1}^{K} \alpha_k f_k(x).$
The final hard prediction is then obtained via thresholding:
$\hat{y} = \mathbb{I}(\hat{p} > \tau),$
where $\mathbb{I}(\cdot)$ denotes the indicator function and $\tau$ is a decision threshold, typically set to 0.5 for balanced binary classification. This method assumes that the optimal prediction lies along a weighted average of the base outputs and is best suited for base models that are diverse but linearly complementary.
To combine the predictive outputs of $K$ independently trained base classifiers, we implement a weighted averaging ensemble strategy. Each base classifier $f_k$ produces a scalar probability $p_k(x_i) \in [0, 1]$ indicating the predicted likelihood that sample $x_i$ belongs to the positive class (e.g., AD). Let the vector of predictions for input $x_i$ be denoted as
$p_i = [\, p_1(x_i), p_2(x_i), \dots, p_K(x_i) \,]^{\top} \in [0, 1]^K.$
To form the ensemble prediction, we assign each base model a non-negative weight $\alpha_k$ such that the weights sum to 1. The weight vector is defined as
$\alpha = [\alpha_1, \alpha_2, \dots, \alpha_K]^{\top}, \quad \text{where } \alpha_k \ge 0 \text{ and } \sum_{k=1}^{K} \alpha_k = 1.$
The ensemble prediction $\hat{p}_i$ for the input $x_i$ is then given by the weighted sum:
$\hat{p}_i = \alpha^{\top} p_i = \sum_{k=1}^{K} \alpha_k \, p_k(x_i),$
where $\hat{p}_i \in [0, 1]$ is the aggregated probability of the positive class. To learn the optimal weights $\alpha$, we minimize the empirical risk over a validation set $\mathcal{D}_{\mathrm{val}} = \{(p_i, y_i)\}_{i=1}^{N_{\mathrm{val}}}$ using the binary cross-entropy loss function:
$\mathcal{L}(\hat{p}_i, y_i) = -\left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right].$
Thus, the optimization problem is
$\min_{\alpha \in \Delta^{K-1}} \sum_{i=1}^{N_{\mathrm{val}}} \mathcal{L}(\alpha^{\top} p_i, y_i),$
where $\Delta^{K-1}$ denotes the $(K-1)$-dimensional probability simplex
$\Delta^{K-1} = \left\{ \alpha \in \mathbb{R}^K : \alpha_k \ge 0, \ \sum_{k=1}^{K} \alpha_k = 1 \right\}.$
This convex optimization ensures that the ensemble leverages the relative strength of each base classifier in a principled, data-driven manner. The learned weights $\alpha$ reflect the contribution of each model to minimizing predictive error on the validation set.
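One way to solve this simplex-constrained minimization (a sketch only; the paper does not specify its optimizer) is to reparametrize $\alpha$ through a softmax, so the non-negativity and sum-to-one constraints hold by construction, and run plain gradient descent on the validation BCE. All data below are synthetic:

```python
import numpy as np

def fit_ensemble_weights(P, y, steps=2000, lr=0.5):
    """Learn convex combination weights alpha by minimizing binary
    cross-entropy on a validation set. P: (N, K) base-model probabilities;
    y: (N,) binary labels. alpha = softmax(z) keeps us on the simplex."""
    N, K = P.shape
    z = np.zeros(K)                                   # unconstrained logits
    eps = 1e-9
    for _ in range(steps):
        alpha = np.exp(z - z.max()); alpha /= alpha.sum()
        p_hat = P @ alpha
        # d BCE / d p_hat = (p_hat - y) / (p_hat * (1 - p_hat))
        g_p = (p_hat - y) / np.clip(p_hat * (1 - p_hat), eps, None) / N
        g_alpha = P.T @ g_p
        # chain rule through the softmax Jacobian
        g_z = alpha * (g_alpha - alpha @ g_alpha)
        z -= lr * g_z
    alpha = np.exp(z - z.max()); alpha /= alpha.sum()
    return alpha

# Toy validation set: model 0 tracks the labels, models 1-2 are pure noise
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500).astype(float)
good = np.clip(y + rng.normal(0, 0.1, 500), 0.01, 0.99)
noisy1 = np.clip(rng.uniform(0, 1, 500), 0.01, 0.99)
noisy2 = np.clip(rng.uniform(0, 1, 500), 0.01, 0.99)
P = np.stack([good, noisy1, noisy2], axis=1)
alpha = fit_ensemble_weights(P, y)
print(alpha.argmax())  # the accurate base model should receive the largest weight
```

As expected, the learned weight vector concentrates on the base model whose probabilities actually track the validation labels, which is precisely the "data-driven contribution" interpretation given above.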

3.2.3. Stacked Generalization (Stacking)

While weighted averaging assumes linear importance, stacking introduces a meta-learner $g(\cdot\,; \phi)$ trained on the predictions of base models to learn a more flexible combination rule. For each sample $x_i$, we construct a meta-feature vector $p_i$, as before. The meta-learner is then trained to predict the true label:
$\hat{y}_i = g(p_i; \phi),$
where g can be any differentiable function, such as logistic regression [57], a shallow neural network [58], or a gradient boosting machine [59]. The training objective is again to minimize the cross-entropy loss:
$\min_{\phi} \sum_{i=1}^{N_{\mathrm{meta}}} \mathcal{L}(g(p_i; \phi), y_i).$
To prevent information leakage and overfitting, we adopt k-fold cross-validation to generate out-of-fold predictions for training the meta-learner.
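The out-of-fold scheme can be sketched as follows. Simple logistic models on different feature subsets stand in for the three ConvNets, and the meta-learner is the logistic-regression option mentioned above; the data, fold count, and learning rates are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, steps=2000, lr=0.5):
    """Logistic regression (with bias) fit by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(Xb @ w)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] - 0.5 * X[:, 2] > 0).astype(float)

# Base learners: each sees a different feature subset (stand-ins for the CNNs)
subsets = [[0], [1], [2, 3]]
K, k_folds = len(subsets), 5
folds = np.arange(len(X)) % k_folds

# Out-of-fold meta-features: sample i is predicted only by base models
# trained on folds that exclude i, so the meta-learner sees no leakage
oof = np.zeros((len(X), K))
for f in range(k_folds):
    tr, ho = folds != f, folds == f
    for j, cols in enumerate(subsets):
        w = fit_logreg(X[tr][:, cols], y[tr])
        oof[ho, j] = predict(w, X[ho][:, cols])

# Meta-learner trained on the OOF prediction vectors
w_meta = fit_logreg(oof, y)
acc = np.mean((predict(w_meta, oof) > 0.5) == y)
print(round(acc, 2))
```

Because each row of the meta-feature matrix comes from models that never saw that sample, the meta-learner's training signal reflects genuine generalization rather than memorized base-model outputs.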

3.3. Network Architecture

The architecture of the proposed ensemble learning framework is centered around three pretrained deep ConvNets: ResNet50, NASNet, and MobileNet. These models are widely recognized for their strong performance on large-scale image classification tasks such as ImageNet and serve as the backbone for extracting hierarchical features from structural MRI data. To adapt these models for the task of AD diagnosis using MRI, we employ a two-phase transfer learning strategy comprising feature freezing and fine-tuning, followed by ensemble integration through weighted averaging and stacking.
Each ConvNet is modified by removing its original classification head and appending a new custom classifier suited for binary classification (e.g., AD vs. MCI). The new head consists of fully connected layers with dropout regularization to prevent overfitting. During the initial training phase, known as feature freezing, the convolutional base of each model is held constant, and only the newly added dense layers are trained. This step allows the network to adapt its decision function to the domain-specific features of MRI data without altering the generic low-level feature representations already captured in the base. In the second phase of fine-tuning, we selectively unfreeze the upper convolutional layers of each network and retrain the model using a low learning rate. This enables the network to refine its mid- and high-level feature detectors based on structural patterns in gray and white matter regions, which are critical for distinguishing between stages of cognitive decline. This two-step training process balances the need for transferability and domain specificity, resulting in more accurate and generalizable models.
Once trained, the three ConvNets operate in parallel. Given an input sample $x$, each network $f_k$ produces a scalar output $p_k(x) \in [0, 1]$, representing the estimated probability that $x$ belongs to the positive class. These outputs are aggregated into a prediction vector $p = [f_1(x), f_2(x), \dots, f_K(x)]^{\top}$. The ensemble integration is performed in two stages. First, a weighted averaging scheme combines the base outputs using a learned vector $\alpha \in \mathbb{R}^K$ such that $\alpha_k \ge 0$ and $\sum_{k=1}^{K} \alpha_k = 1$. This yields a soft prediction $\hat{p} = \alpha^{\top} p$, which is then thresholded to obtain the final label. Second, to capture non-linear interactions among base model outputs, a meta-learner $g(\cdot\,; \phi)$ is trained on out-of-fold predictions via stacked generalization. This meta-learner takes $p$ as input and outputs $g(p; \phi) \in [0, 1]$, which is again thresholded to yield a binary prediction. This architecture allows for flexible, accurate, and interpretable predictions. By leveraging the diversity of the base ConvNets and the expressiveness of stacking, the model achieves strong performance in distinguishing between AD, MCI, and NC. Furthermore, its modular design permits future extension to include additional input modalities or classifiers, thereby maintaining adaptability as diagnostic technologies evolve.

3.4. Gradient-Weighted Class Activation Mapping

To improve both classification accuracy and interpretability in AD diagnosis from MRI data, we incorporate two complementary methods: an advanced ensemble learning strategy and a visualization technique using Grad-CAM [60,61]. First, we propose a new ensemble formulation that replaces the simple majority-voting approach used in the original work. Specifically, let $K$ denote the number of base deep neural networks, each yielding a class probability prediction $f_k(x)$ for input $x$. These outputs are aggregated using a weighted combination $\hat{p} = \sum_{k=1}^{K} \alpha_k f_k(x)$, where the weights $\alpha_k$ satisfy $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \ge 0$. Additionally, to capture more complex relationships between base model outputs, we introduce a meta-learner $g(\cdot)$ trained on the vector of base predictions $p = [f_1(x), \dots, f_K(x)]^{\top}$, such that the final prediction is $\hat{y} = g(p)$. This stacked generalization approach allows the ensemble to learn optimal decision boundaries. For interpretability, we employ Grad-CAM to visualize discriminative regions in MRI slices used by the ConvNet to make predictions. Let $A^k$ denote the $k$-th feature map in the last convolutional layer and $y^c$ the class score for class $c$. The importance of each feature map is calculated as $\alpha_k^c = \frac{1}{Z} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial y^c}{\partial A_{i,j}^k}$, and the class activation map is then obtained by $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A^k \right)$. This heatmap is upsampled and overlaid on the original MRI slice to highlight class-discriminative regions. In the context of AD diagnosis, we aim to highlight neuroanatomical areas such as the hippocampus or temporal lobe, offering both clinical insight and model transparency through Grad-CAM.

3.5. Implementation

The models were implemented using Python 3.14 and Keras with a TensorFlow backend. Training was conducted on an M4 Pro with a 14-core CPU and a 20-core GPU. The learning rate was initialized at $2 \times 10^{-5}$ with the Adam optimizer, and a batch size of 24 was used. Dropout regularization with a rate of 0.5 was applied to the fully connected layers to reduce overfitting. MRI slices were preprocessed using SPM for motion correction and segmented into gray matter (GM) and white matter (WM). From each volume, 20 representative slices were selected and resized to 224 × 224 for ConvNet input. The ADNI dataset was split into 60% training, 20% validation, and 20% testing sets.
The proposed ensemble framework offers several compelling advantages that make it highly suitable for the complex task of AD classification. First, robustness is achieved through the integration of multiple heterogeneous ConvNets, such as ResNet50, NASNet, and MobileNet. This architectural diversity reduces the likelihood of correlated errors and mitigates model-specific bias and variance, leading to more stable and generalizable predictions. Second, the framework exhibits strong adaptivity by employing stacked generalization. Rather than assuming a fixed linear weighting of base learners, stacking introduces a trainable meta-learner capable of capturing non-linear interactions and context-sensitive dependencies between model outputs. This allows the ensemble to dynamically learn which base predictions to trust under varying conditions. Third, the approach is highly scalable. New models, imaging modalities (such as PET or DTI), or clinical biomarkers can be seamlessly incorporated into the ensemble pipeline without requiring a complete redesign of the learning architecture. This plug-and-play extensibility ensures that the framework can evolve alongside advancements in medical imaging and domain knowledge. Furthermore, by using cross-validation to train the meta-learner on out-of-fold predictions, the framework maintains rigorous generalization performance and reduces overfitting, making it both a theoretically principled and practically effective tool for neuroimaging-based disease diagnosis.
The baseline ensemble (bEnsemble) represents the initial model configuration, in which the pretrained convolutional neural networks (ResNet50, NASNet, and MobileNet) were combined through weighted averaging without end-to-end fine-tuning. In contrast, the end-to-end ensemble (eEnsemble) introduced end-to-end fine-tuning of each network, allowing domain-specific adaptation to MRI-based neuroanatomical features. Finally, the proposed hybrid ensemble integrates both weighted averaging and stacked generalization through a meta-learner, and was compared against these configurations as well as other published models.

3.6. Gradient-Weighted Class Activation Mapping (Grad-CAM)

Grad-CAM (Gradient-weighted Class Activation Mapping) is an XAI technique that provides visual insight into the decision-making process of deep CNNs by highlighting the image regions most influential to the model's prediction. Mathematically, Grad-CAM computes the gradient of the target class score $y^c$ (before the softmax or sigmoid activation) with respect to the feature maps $A^k$ of a selected convolutional layer. The importance weight for each feature map is obtained as the global average of these gradients, given by
$$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},$$
where $Z$ denotes the total number of pixels in the feature map. The Grad-CAM heatmap is then computed as a weighted combination of the feature maps, followed by a ReLU activation to retain only the positive contributions:
$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right).$$
The resulting heatmap $L^c_{\text{Grad-CAM}}$ is upsampled and overlaid on the original MRI slice, highlighting the regions that most strongly influence the model's classification, such as hippocampal or cortical areas associated with Alzheimer's pathology. In the proposed hybrid deep ensemble framework, Grad-CAM enables model interpretability across multiple transfer learning backbones (e.g., ResNet50, NASNet, and MobileNetV2), allowing clinicians and researchers to verify that the model's attention aligns with known neuroanatomical biomarkers, thereby enhancing the transparency and reliability of automated Alzheimer's detection.
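The two equations above reduce to a few lines of array code. The NumPy sketch below assumes the per-class gradients have already been extracted from the network (in Keras this would typically be done with `tf.GradientTape`) and mirrors the average-then-weight-then-ReLU computation:

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from one convolutional layer.

    feature_maps: array of shape (H, W, K), the activations A^k.
    gradients:    array of shape (H, W, K), dy^c/dA^k for the target class.
    """
    # alpha_k^c: global-average-pool the gradients over the spatial axes,
    # i.e., (1/Z) * sum_ij dy^c/dA_ij^k.
    alpha = gradients.mean(axis=(0, 1))                      # shape (K,)
    # Weighted sum of the feature maps; ReLU keeps positive evidence only.
    cam = np.maximum((feature_maps * alpha).sum(axis=-1), 0.0)  # shape (H, W)
    # Normalize to [0, 1] before upsampling/overlaying on the MRI slice.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

Feature maps with uniformly negative gradients are zeroed out by the ReLU, which is exactly why the heatmaps show only regions that argue *for* the predicted class.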
In this study, Grad-CAM was selected over SHAP primarily because it is specifically designed for convolutional neural networks and provides spatially localized visual explanations, which are essential in medical imaging tasks. Grad-CAM generates class-discriminative heatmaps by backpropagating gradients from the target class to the final convolutional layers, allowing the model to highlight salient regions within MRI slices that influence predictions. These heatmaps align well with neuroanatomical structures (e.g., hippocampal or cortical areas), enabling clinicians to verify biologically meaningful features.
The above method was implemented using Python 3.14 with TensorFlow and Keras. Three pretrained convolutional neural networks (ResNet50, NASNet, and MobileNet) were fine-tuned on GM and WM MRI slices from the ADNI dataset. Their softmax outputs were combined using a weighted averaging scheme, with weights optimized on validation performance. The proposed hybrid ensemble framework employs a well-defined stacking meta-learner and a stratified data-partitioning strategy to enhance reproducibility and model generalization. The stacking mechanism uses a logistic regression meta-learner trained on out-of-fold predictions from the base CNNs. Each base model produces a probability vector $p_i$, which serves as the input feature vector for the meta-learner. The meta-learner is trained to predict the true class labels by minimizing the cross-entropy loss defined in Section 3.2.3. This configuration enables the ensemble to exploit interactions between the outputs of the base models, improving robustness and predictive performance. For data stratification, the dataset was partitioned into 60% training, 20% validation, and 20% testing subsets, ensuring balanced representation of the diagnostic classes. The training and validation sets were employed for transfer learning, fine-tuning, and ensemble weight optimization, while the test set, comprising primarily original, non-augmented MRI slices, was used for unbiased model evaluation. This stratified approach maintained statistical balance across diagnostic classes, minimized bias, and supported reproducibility in experimental outcomes. For model interpretability, Grad-CAM was applied to the final convolutional layer of each network to generate class-specific heatmaps, which were overlaid on the MRI slices to visualize the regions contributing most to AD, MCI, or NC predictions.
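The stacking step can be sketched as follows. The simulated probabilities are placeholder stand-ins for the out-of-fold outputs of the three base CNNs, and binary (AD vs. MCI) labels are assumed for simplicity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical out-of-fold probabilities from the three base CNNs
# (columns: P(AD) from ResNet50, NASNet, MobileNet); placeholder data.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
P = np.clip(y[:, None] * 0.6 + rng.normal(0.2, 0.15, size=(200, 3)), 0, 1)

# Stacking: a logistic-regression meta-learner fit on the base probabilities.
meta = LogisticRegression().fit(P, y)

# Cross-validated (out-of-fold) accuracy estimate of the stacked ensemble,
# so the meta-learner is never evaluated on predictions it was fit to.
oof_pred = cross_val_predict(LogisticRegression(), P, y, cv=5)
stacked_acc = (oof_pred == y).mean()
```

Fitting the meta-learner only on out-of-fold base predictions is what prevents the stack from simply memorizing base-model overconfidence on the training set.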
The overall implementation is given in Algorithm 1.
Algorithm 1 Hybrid Deep Ensemble with XAI for AD vs. MCI
Require: MRI slice dataset $D = \{(x_i, y_i)\}$; pretrained CNNs $\{M_k\}_{k=1}^{K}$
Ensure: Final prediction $\hat{y}$ and explanation heatmap
1: Split $D$ into training, validation, and test sets (60/20/20).
2: for each base model $M_k \in \{\text{ResNet50}, \text{NASNetMobile}, \text{MobileNetV2}\}$ do
3:     Initialize $M_k$ with ImageNet weights.
4:     Train a new classification head on the training data (backbone frozen).
5:     Fine-tune the top layers of $M_k$ with a small learning rate.
6:     Save the predicted probabilities $p_i^k$ for the validation/test sets.
7: end for
8: Learn ensemble weights $\alpha^{*} = \arg\min_{\alpha} \sum_i \ell\big(y_i, \sum_k \alpha_k p_i^k\big)$ subject to $\alpha_k \ge 0$ and $\sum_k \alpha_k = 1$. ▹ Weighted averaging
9: Train the meta-learner (logistic regression) on the out-of-fold predictions $\{p_i^k\}$. ▹ Stacking
10: For a test sample $x$:
11:     Compute the base model predictions $p_k(x)$.
12:     Obtain the weighted ensemble $\hat{p}_{\mathrm{wa}} = \sum_k \alpha_k^{*} p_k(x)$.
13:     Obtain the stacked ensemble $\hat{p}_{\mathrm{st}} = f_{\mathrm{meta}}(p_1(x), \ldots, p_K(x))$.
14:     Combine the predictions (e.g., by averaging) to form the final $\hat{y}$.
15:     Apply Grad-CAM to a chosen base model to generate the explanation heatmap.
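Line 8 of Algorithm 1 is a constrained optimization over the probability simplex. The paper does not specify the optimizer, so the sketch below uses a simple grid search over the simplex as one plausible way to solve it for a small number of base models; the step size and the binary cross-entropy loss are assumptions:

```python
import numpy as np
from itertools import product

def optimize_ensemble_weights(probs, y, step=0.05):
    """Grid-search convex weights alpha (alpha_k >= 0, sum alpha_k = 1)
    minimizing validation cross-entropy of the weighted-average ensemble.

    probs: list of K arrays, each (N,) = P(class 1) from one base model.
    y:     (N,) binary ground-truth labels.
    """
    P = np.stack(probs)                       # (K, N)
    K = P.shape[0]
    eps = 1e-12
    best_alpha, best_loss = None, np.inf
    grid = np.arange(0.0, 1.0 + step / 2, step)
    for w in product(grid, repeat=K - 1):
        last = 1.0 - sum(w)
        if last < -1e-9:
            continue                          # point lies outside the simplex
        alpha = np.array(list(w) + [max(last, 0.0)])
        p = alpha @ P                         # weighted-average probability
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss
```

For three base models and a 0.05 step this evaluates only a few hundred candidate weightings, which is cheap enough to run once per classification task on the validation set.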

4. Results and Discussion

This section analyzes the performance of the proposed ensemble strategy for the early diagnosis of AD, specifically focusing on classification tasks involving AD, MCI, and NC. The experimental results presented in Table 1, Table 2 and Table 3 clearly demonstrate that the improved ensemble approach—utilizing stacking and weighted averaging—offers considerable benefits over baseline models and even previous end-to-end ensemble methods. As shown in Table 1, all three fine-tuned individual models—eResNet50, eNASNet, and eMobileNet—achieved high classification accuracy (above 97%), indicating their strong capability of distinguishing patients with AD from those with MCI. The ensemble of these models in the original study (“eEnsemble”) yielded an accuracy of 97.65%, with perfect specificity and a near-perfect AUC of 1.00. However, the proposed improved ensemble surpassed this performance, achieving an accuracy of 99.21% and a sensitivity of 98.89%. The AUC remained at 1.00, indicating no degradation in the model’s ability to discriminate between classes despite the significant accuracy gain. The key advantage here lies in the meta-learning capability of stacking. By allowing a second-level learner to combine the outputs of base classifiers non-linearly, the stacking model learned to exploit complementary decision boundaries between the ConvNet architectures. Moreover, weighted averaging helped reduce the influence of models with relatively lower sensitivity or specificity, particularly eMobileNet, which, although robust, showed slightly lower sensitivity than eResNet50. The improvement in sensitivity—from 96.00% (eEnsemble) to 98.89% (improved ensemble)—is clinically significant, as it corresponds to better identification of AD in the early stages, potentially enabling earlier intervention and care planning. 
The specificity remained at 100.00%, indicating zero misclassification of MCI cases as AD, thereby maintaining diagnostic precision and avoiding unnecessary anxiety or treatment escalation.
Table 2 compares the performance of the models on the more challenging MCI vs. NC classification task. The complexity of this task stems from the subtle brain structure differences between normal aging and prodromal stages of AD. The bEnsemble and even the eEnsemble performed adequately, achieving accuracies of 80.23% and 88.37%, respectively. Nevertheless, the proposed improved ensemble increased this performance to 91.02%, setting a new benchmark within this context. This improvement can be attributed to the finer decision boundaries learned via the stacking architecture. Sensitivity improved from 80.56% to 86.11%, indicating a substantial gain in the ability to correctly identify subjects with MCI. This is crucial in real-world applications where early detection of MCI can delay or prevent progression to full-blown AD. Specificity remained high at 96.00%, maintaining the system’s reliability in distinguishing normal individuals from those with early cognitive impairment. The AUC also increased from 0.96 to 0.98, reflecting enhanced classifier confidence and a balance between true positive and false positive rates across multiple thresholds. This underscores the strength of our proposed ensemble design in handling subtle clinical phenotypes, where noise and overlap are frequent.
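The metrics compared above follow the standard binary confusion-matrix definitions. The small helper below (exercised with hypothetical counts, not the study's actual confusion matrices) makes the relationships among ACC, SEN, SPE, PPV, and F1 explicit:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """ACC, SEN, SPE, PPV, and F1 from binary confusion-matrix counts.

    tp/fn: diseased subjects correctly/incorrectly classified;
    tn/fp: non-diseased subjects correctly/incorrectly classified.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    sen = tp / (tp + fn)                    # sensitivity (recall)
    spe = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                    # positive predictive value
    f1 = 2 * ppv * sen / (ppv + sen)        # harmonic mean of PPV and SEN
    return acc, sen, spe, ppv, f1
```

Note that a model can trade sensitivity for specificity at a fixed accuracy, which is why the tables report all five quantities rather than accuracy alone.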
Table 3 compares our method with several previously published approaches in the literature. The improved ensemble method outperformed all referenced methods across all classification settings. In the AD vs. NC task, the improved ensemble achieved an accuracy of 99.10%, with a sensitivity of 98.80% and a specificity of 100.00%. This is superior to the results of Sarraf et al. [62], who reported an accuracy of 98.84%, and Billones et al. [63], who achieved 98.33% accuracy. While these results were strong, our method’s marginal gains underscore the power of modern ensemble methods and advanced transfer learning from pretrained networks. For the AD vs. MCI classification, the improved ensemble achieved 99.21% accuracy, a substantial gain over the 90.00% reported by Billones et al. and the 84.00% reported by Ortiz et al. [64]. More importantly, our sensitivity of 98.89% far exceeds that of prior works, highlighting the efficacy of deep fine-tuning and ensemble optimization. The specificity was again perfect at 100.00%, reinforcing the clinical reliability of our model. In the MCI vs. NC classification task—generally the most difficult due to less pronounced structural differences—the improved ensemble attained an accuracy of 91.02%, outperforming the 83.14% and 73.02% reported by Ortiz et al. and Suk et al. [26], respectively. Additionally, our sensitivity and specificity were balanced and strong (86.11% and 96.00%), showing that the model avoids both under- and over-diagnosis in critical borderline cases. The performance gains across all tasks confirm the potential of stacking and weighted ensemble learning for enhancing diagnostic models in neuroimaging. 
From a methodological standpoint, the results validate that (i) end-to-end training yields better features than static pretrained models, (ii) combining multiple architectures (ResNet, NASNet, and MobileNet) captures diverse representational features, and (iii) stacking further refines decision-making by learning meta-level fusion patterns.
Table 1. Performance comparison of various classification methods for AD vs. MCI, including additional evaluation metrics.
Models            ACC (%)   SEN (%)   SPE (%)   AUC    PPV (%)   F1 (%)
bResNet50         97.65     100.00    94.29     0.95   96.33     98.14
eResNet50         98.80     98.00     100.00    1.00   100.00    98.99
bNASNet           97.65     98.00     94.29     0.95   96.08     97.03
eNASNet           98.82     96.00     100.00    1.00   100.00    97.96
bMobileNet        97.64     98.00     94.28     0.95   96.07     97.03
eMobileNet        97.65     96.00     100.00    1.00   100.00    97.96
bEnsemble         98.82     98.00     100.00    0.95   100.00    98.99
eEnsemble         97.65     96.00     100.00    1.00   100.00    97.96
Hybrid Ensemble   99.21     98.89     100.00    1.00   100.00    99.44
Table 2. Performance comparison of various classification methods for MCI vs. NC, including additional evaluation metrics.
Models            ACC (%)   SEN (%)   SPE (%)   AUC    PPV (%)   F1 (%)
bResNet50         81.39     80.56     82.00     0.91   81.25     80.90
eResNet50         86.05     83.33     88.00     0.95   85.41     84.36
bNASNet           81.39     55.56     100.00    0.92   100.00    71.43
eNASNet           87.21     75.00     96.00     0.95   93.75     83.33
bMobileNet        81.39     55.56     100.00    0.92   100.00    71.43
eMobileNet        89.53     83.33     94.00     0.95   91.30     87.18
bEnsemble         80.23     58.33     96.00     0.91   87.50     70.00
eEnsemble         88.37     80.56     94.00     0.96   90.24     85.16
Hybrid Ensemble   91.02     86.11     96.00     0.98   93.47     89.69
Table 3. Comparison of classification results with previous studies.
Reference               AD vs. NC             AD vs. MCI            MCI vs. NC
Billones et al. [63]    98.33/98.89/97.78     90.00/91.67/97.78     91.67/92.22/91.11
Sarraf et al. [62]      98.84/–/–             –/–/–                 –/–/–
Ortiz et al. [64]       90.09/86.12/94.10     84.00/79.12/89.12     83.14/67.26/95.09
Lian et al. [65]        90.30/82.40/96.50     –/–/–                 –/–/–
Suk et al. [26]         91.02/92.72/89.94     –/–/–                 73.02/77.60/68.22
Ji et al. [25]          98.59/97.22/100.00    97.65/96.00/100.00    88.37/80.56/94.00
Hybrid Ensemble         99.10/98.80/100.00    99.21/98.89/100.00    91.02/86.11/96.00
Values are reported as ACC/SEN/SPE (%); "–" indicates not reported.
From a clinical perspective, higher sensitivity in detecting MCI and AD directly translates to earlier detection and improved patient management. The reliability demonstrated by high specificity means fewer false alarms, increasing trust in the deployment of such systems in screening scenarios. In addition, expanding the stacking layer to incorporate other modalities such as PET, CSF biomarkers, or genetic data could yield further improvements. Finally, the explainability of the model’s decision-making process remains crucial, and future work should integrate interpretable AI methods to ensure transparency. In conclusion, the results presented in Table 1, Table 2 and Table 3 establish that the proposed improved ensemble strategy provides significant improvements over traditional and recent DL-based methods for Alzheimer’s diagnosis. The gains in accuracy, sensitivity, and AUC particularly highlight the practical feasibility and clinical relevance of the approach for early-stage AD detection.
To further evaluate the classification performance of the proposed models, we analyzed the receiver operating characteristic (ROC) curves of the individual and ensemble classifiers. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), providing a comprehensive measure of diagnostic accuracy.
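These ROC quantities can be computed directly with scikit-learn. The scores below are simulated placeholders standing in for the ensemble's predicted probabilities, not the model's actual outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical ensemble scores against binary ground truth (placeholder data).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
scores = np.clip(y_true * 0.5 + rng.normal(0.25, 0.2, size=300), 0, 1)

# fpr is the false positive rate (1 - specificity); tpr is the sensitivity.
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)   # area under the ROC curve
```

Because `roc_curve` sweeps every observed score as a threshold, the resulting AUC summarizes the sensitivity/specificity trade-off over all operating points at once.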
As shown in Figure 3, the base models (eResNet50, eNASNet, and eMobileNet) achieved consistent ROC performance, with AUC values of approximately 0.95. The original ensemble method (eEnsemble), which combines the outputs of these base models using majority voting, slightly improved performance to an AUC of 0.96. In contrast, the proposed hybrid ensemble, which uses a combination of stacking and weighted averaging, outperformed all other models, with an AUC of 0.98. This indicates a superior trade-off between sensitivity and specificity. The hybrid model better leverages the complementary strengths of the base networks by learning optimal fusion weights and applying a meta-classifier that generalizes well across subject-level MRI slices. Notably, the hybrid ensemble demonstrates a sharper rise at the beginning of the curve, suggesting its improved capability of distinguishing early cases with minimal false positives. This feature is particularly important in clinical settings, where early diagnosis of Alzheimer’s disease can significantly influence patient treatment outcomes.
Figure 4 illustrates how the Grad-CAM technique was applied to understand the classification decisions made by the hybrid ensemble network on structural MRI scans. Grad-CAM, introduced by Selvaraju et al. [61], computes the gradients of a target class score with respect to the feature maps of the final convolutional layer, enabling the generation of heatmaps that localize the most influential regions in the input image. In this example, the pretrained hybrid ensemble model was used to classify subjects into NC (no AD), MCI, and AD categories. The bottom row of the figure displays the Grad-CAM overlays for each prediction. For the NC subject, the attention is centered on non-pathological regions with a diffuse spread. For the MCI case, the highlighted region includes parts of the medial temporal lobe, consistent with early-stage cognitive decline. In the AD case, the activation is clearly focused on the hippocampal and cortical atrophy zones, demonstrating that the network's attention aligns with known neuropathological markers. These interpretable visualizations reinforce the clinical relevance of the model and enhance trust in DL-based diagnostic systems.
Moreover, the hybrid ensemble proves to be a robust and highly discriminative approach for classifying AD stages, reinforcing the value of integrating DL with advanced ensemble strategies. Our proposed method integrates three key techniques: transfer learning, weighted averaging, and stacked generalization. Together, these components form a powerful diagnostic pipeline that significantly outperforms conventional ML approaches and standalone convolutional neural networks. The experimental results presented in Table 1, Table 2 and Table 3 highlight the effectiveness of our model across various classification tasks involving AD, MCI, and NC subjects. At the core of our approach is the utilization of diverse pretrained CNN architectures—ResNet50, NASNet, and MobileNet—which were fine-tuned end to end on brain MRI slices. These models capture different representational features of brain structure, providing the ensemble with both depth and breadth in learned features. The weighted averaging strategy ensures that more reliable base classifiers are emphasized during inference, while the stacked generalization mechanism learns higher-level patterns by combining base model predictions through a meta-learner. This two-level architecture enables the model to make more nuanced decisions, particularly in difficult cases such as distinguishing between MCI and NC.
The proposed hybrid deep ensemble model can be effectively integrated into clinical workflows as a supportive diagnostic tool for neurologists and radiologists. In a practical setting, the model would process MRI scans to generate probabilistic classifications distinguishing Alzheimer’s disease, mild cognitive impairment, and normal controls. The accompanying Grad-CAM heatmaps provide visual explanations highlighting critical brain regions, enabling physicians to verify that the model’s focus aligns with known pathological markers such as hippocampal atrophy. By combining quantitative predictions with interpretable visual outputs, the system enhances clinical confidence, aids early diagnosis, and supports data-driven decision-making in routine neuroimaging assessments. Although the ADNI dataset provides high-quality, standardized MRI data, the model’s generalizability across diverse clinical environments remains uncertain. Variations in scanner types, imaging protocols, and demographic distributions across hospitals could influence prediction accuracy. To ensure reliable deployment, future work should validate the model on independent datasets from multiple institutions and explore domain adaptation or federated learning techniques. These approaches would help mitigate site-specific biases and enhance the robustness of the hybrid ensemble model for real-world clinical applications.

5. Conclusions

Our improved ensemble strategy demonstrated superior performance across all evaluated tasks. Specifically, the model achieved 99.21% accuracy in classifying AD vs. MCI and 91.02% in MCI vs. NC—two of the most clinically significant tasks. Notably, the sensitivity for detecting early-stage conditions such as MCI was markedly improved, which is crucial for timely medical intervention. The area under the receiver operating characteristic curve remained consistently high (0.98–1.00), confirming the model’s reliability across different operating thresholds. Beyond raw performance, our methodology offers practical advantages for real-world clinical deployment. By leveraging transfer learning, the model requires less training data and computational resources while still achieving high accuracy. The ensemble structure provides robustness against overfitting and variability in imaging protocols. Additionally, our framework is modular and extensible—new base models or meta-learners can be added to the ensemble without altering the entire system. Despite its strengths, the proposed method is not without limitations. While the ADNI dataset provides a standardized benchmark, generalization to other clinical datasets has yet to be validated. MRI data is known to exhibit inter-center variability due to differences in scanner hardware, imaging protocols, and patient demographics. To address this challenge, future work will explore domain adaptation techniques to make the model robust across institutions. Furthermore, integrating federated learning can help build generalized models without compromising data privacy, especially when training across multiple hospitals or countries. In conclusion, our ensemble methodology presents a significant advancement in computer-aided diagnosis for neurodegenerative diseases. It combines model diversity with intelligent fusion strategies to deliver a system that is not only accurate and stable but also adaptable for clinical use. 
This work lays the foundation for future extensions that can include multi-modal data, interpretable AI components, and real-world deployment in early screening programs for Alzheimer’s disease.

Author Contributions

F.M.: Conceptualization; Data curation; Methodology; Machine learning analysis; Supervision; Writing—original draft; and Writing—review and editing. K.H.: Data curation; Machine learning analysis; Writing—original draft; and Writing—review and editing. D.D.: Data curation; Writing—original draft; and Writing—review and editing. H.K.: Supervision; Writing—original draft; and Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

This research was conducted on human subject data. Data were obtained from open sources.

Data Availability Statement

The MRI data used in this study are publicly available from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (accessed on 11 December 2024).

Acknowledgments

The authors gratefully acknowledge the Alzheimer’s Disease Neuroimaging Initiative (ADNI) for providing data support. The authors also thank the university writing centers for assistance with English language editing and the anonymous reviewers for their valuable feedback and constructive comments, which helped improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Code Availability 

For further implementation details, the Python 3.14 pseudocode is available at https://github.com/FahadMostafa91/Hybrid_Deep_Ensemble_Learning_AD.

Ethics Approval 

The authors utilized publicly available data from the open-access ADNI database; ethics approval was obtained by ADNI.

Clinical Trial Registration 

The authors did not use clinical trial data directly; all data used are publicly available and properly referenced in the text.

Generative AI Statement 

The authors declare that no generative AI was used in the creation of this manuscript.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation   Full Term
ML             Machine Learning
DL             Deep Learning
XAI            Explainable Artificial Intelligence
AD             Alzheimer’s Disease
MCI            Mild Cognitive Impairment
MRI            Magnetic Resonance Imaging
ADNI           Alzheimer’s Disease Neuroimaging Initiative
NC             Normal Controls
Grad-CAM       Gradient-weighted Class Activation Mapping
MMSE           Mini-Mental State Examination
CDR            Clinical Dementia Rating
CSF            Cerebrospinal Fluid
PET            Positron Emission Tomography
MoCA           Montreal Cognitive Assessment
LSTM           Long Short-Term Memory
LIME           Local Interpretable Model-Agnostic Explanations
LRP            Layer-wise Relevance Propagation
EADCD          Enhancing Automated Detection and Classification of Dementia
TIPAIT         Thinking Incapable People Using Artificial Intelligence Techniques
BGGO           Binary Greylag Goose Optimization
ISSA           Improved Salp Swarm Technique
WNN            Wavelet Neural Network
ConvNets       Convolutional Neural Networks
GM             Gray Matter
WM             White Matter
ROC            Receiver Operating Characteristic
AUC            Area Under the Curve

References

  1. Brookmeyer, R.; Johnson, E.; Ziegler-Graham, K.; Arrighi, H.M. Forecasting the global burden of Alzheimer’s disease. Alzheimer’s Dement. 2007, 3, 186–191. [Google Scholar] [CrossRef]
2. Prince, M.; Wimo, A.; Guerchet, M.; Ali, G.C.; Wu, Y.T.; Prina, M. World Alzheimer Report 2015. The Global Impact of Dementia: An Analysis of Prevalence, Incidence, Cost and Trends; Alzheimer’s Disease International: London, UK, 2015. [Google Scholar]
  3. Winblad, B.; Amouyel, P.; Andrieu, S.; Ballard, C.; Brayne, C.; Brodaty, H.; Cedazo-Minguez, A.; Dubois, B.; Edvardsson, D.; Feldman, H.; et al. Defeating Alzheimer’s disease and other dementias: A priority for European science and society. Lancet Neurol. 2016, 15, 455–532. [Google Scholar] [CrossRef]
  4. Knopman, D.S.; Amieva, H.; Petersen, R.C.; Chételat, G.; Holtzman, D.M.; Hyman, B.T.; Nixon, R.A.; Jones, D.T. Alzheimer disease. Nat. Rev. Dis. Prim. 2021, 7, 33. [Google Scholar] [CrossRef]
  5. Heisterman, A.A.T. Cognitive disorders. Psychiatric Mental Health Nursing: Evidence-Based Concepts, Skills, and Practices; Lippincott Williams & Wilkins: Philadelphia, PA, USA, 2012. [Google Scholar]
  6. Day, G.S. Diagnosing Alzheimer Disease. Contin. Lifelong Learn. Neurol. 2024, 30, 1584–1613. [Google Scholar] [CrossRef]
  7. Zhang, W.; Li, Y.; Ren, W.; Liu, B. Artificial intelligence technology in Alzheimer’s disease research. Intractable Rare Dis. Res. 2023, 12, 208–212. [Google Scholar] [CrossRef]
  8. Frisoni, G.B.; Fox, N.C.; Jack, C.R., Jr.; Scheltens, P.; Thompson, P.M. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol. 2010, 6, 67–77. [Google Scholar] [CrossRef] [PubMed]
  9. Jack, C.R., Jr.; Bennett, D.A.; Blennow, K.; Carrillo, M.C.; Dunn, B.; Haeberlein, S.B.; Holtzman, D.M.; Jagust, W.; Jessen, F.; Karlawish, J.; et al. NIA-AA research framework: Toward a biological definition of Alzheimer’s disease. Alzheimer’s Dement. 2018, 14, 535–562. [Google Scholar] [CrossRef] [PubMed]
  10. Reitz, C.; Brayne, C.; Mayeux, R. Epidemiology of Alzheimer disease. Nat. Rev. Neurol. 2011, 7, 137–152. [Google Scholar] [CrossRef] [PubMed]
  11. Henderson, A. The epidemiology of Alzheimer’s disease. Br. Med. Bull. 1986, 42, 3–10. [Google Scholar] [CrossRef]
12. Zhang, X.X.; Tian, Y.; Wang, Z.T.; Ma, Y.H.; Tan, L.; Yu, J.T. The epidemiology of Alzheimer’s disease modifiable risk factors and prevention. J. Prev. Alzheimers Dis. 2021, 8, 313–321. [Google Scholar] [CrossRef]
  13. Morris, J.C. The Clinical Dementia Rating (CDR) current version and scoring rules. Neurology 1993, 43, 2412. [Google Scholar] [CrossRef]
  14. Folstein, M.F.; Folstein, S.E.; McHugh, P.R. “Mini-mental state”: A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res. 1975, 12, 189–198. [Google Scholar] [CrossRef]
  15. Mantzavinos, V.; Alexiou, A. Biomarkers for Alzheimer’s disease diagnosis. Curr. Alzheimer Res. 2017, 14, 1149–1154. [Google Scholar] [CrossRef] [PubMed]
  16. Cuingnet, R.; Gerardin, E.; Tessieras, J.; Auzias, G.; Lehéricy, S.; Habert, M.O.; Chupin, M.; Benali, H.; Colliot, O.; Initiative, A.D.N.; et al. Automatic classification of patients with Alzheimer’s disease from structural MRI: A comparison of ten methods using the ADNI database. Neuroimage 2011, 56, 766–781. [Google Scholar] [CrossRef]
  17. Liu, S.; Masurkar, A.V.; Rusinek, H.; Chen, J.; Zhang, B.; Zhu, W.; Fernandez-Granda, C.; Razavian, N. Generalizable deep learning model for early Alzheimer’s disease detection from structural MRIs. Sci. Rep. 2022, 12, 17106. [Google Scholar] [CrossRef] [PubMed]
  18. Razavian, N.; Blecker, S.; Schmidt, A.M.; Smith-McLallen, A.; Nigam, S.; Sontag, D. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 2015, 3, 277–287. [Google Scholar] [CrossRef] [PubMed]
  19. Snyder, P.J.; Kahle-Wrobleski, K.; Brannan, S.; Miller, D.S.; Schindler, R.J.; DeSanti, S.; Ryan, J.M.; Morrison, G.; Grundman, M.; Chandler, J.; et al. Assessing cognition and function in Alzheimer’s disease clinical trials: Do we have the right tools? Alzheimer’s Dement. 2014, 10, 853–860. [Google Scholar] [CrossRef]
  20. Corey-Bloom, J.; Thal, L.; Galasko, D.; Folstein, M.; Drachman, D.; Raskind, M.; Lanska, D. Diagnosis and evaluation of dementia. Neurology 1995, 45, 211–218. [Google Scholar] [CrossRef]
  21. Larner, A. Screening utility of the Montreal Cognitive Assessment (MoCA): In place of–or as well as–the MMSE? Int. Psychogeriat. 2012, 24, 391–396. [Google Scholar] [CrossRef]
  22. Small, G.W.; Bookheimer, S.Y.; Thompson, P.M.; Cole, G.M.; Huang, S.; Kepe, V.; Barrio, J.R. Current and future uses of neuroimaging for cognitively impaired patients. Lancet Neurol. 2008, 7, 161–172. [Google Scholar] [CrossRef]
  23. Wolf, H.; Jelic, V.; Gertz, H.J.; Nordberg, A.; Julin, P.; Wahlund, L.O. A critical discussion of the role of neuroimaging in mild cognitive impairment. Acta Neurol. Scand. 2003, 107, 52–76. [Google Scholar] [CrossRef]
  24. Tigano, V.; Cascini, G.L.; Sanchez-Castañeda, C.; Péran, P.; Sabatini, U. Neuroimaging and neurolaw: Drawing the future of aging. Front. Endocrinol. 2019, 10, 217. [Google Scholar] [CrossRef]
  25. Ji, H.; Liu, Z.; Yan, W.Q.; Klette, R. Early diagnosis of Alzheimer’s disease using deep learning. In Proceedings of the 2nd International Conference on Control and Computer Vision, Jeju, Republic of Korea, 15–18 June 2019; pp. 87–91. [Google Scholar]
  26. Suk, H.I.; Lee, S.W.; Shen, D.; Alzheimer’s Disease Neuroimaging Initiative. Deep ensemble learning of sparse regression models for brain disease diagnosis. Med. Image Anal. 2017, 37, 101–113. [Google Scholar] [CrossRef]
  27. Shi, J.; Zheng, X.; Li, Y.; Zhang, Q.; Ying, S. Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of Alzheimer’s disease. IEEE J. Biomed. Health Inform. 2017, 22, 173–183. [Google Scholar] [CrossRef]
  28. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
  29. Thibeau-Sutre, E.; Collin, S.; Burgos, N.; Colliot, O. Interpretability of machine learning methods applied to neuroimaging. Mach. Learn. Brain Disord. 2023, 27, 655–704. [Google Scholar]
  30. Alsubaie, M.G.; Luo, S.; Shaukat, K. Alzheimer’s disease detection using deep learning on neuroimaging: A systematic review. Mach. Learn. Knowl. Extr. 2024, 6, 464–505. [Google Scholar] [CrossRef]
  31. Bossa, M.N.; Sahli, H. A multidimensional ODE-based model of Alzheimer’s disease progression. Sci. Rep. 2023, 13, 3162. [Google Scholar] [CrossRef]
  32. Cheung, C.Y.; Ran, A.R.; Wang, S.; Chan, V.T.; Sham, K.; Hilal, S.; Venketasubramanian, N.; Cheng, C.Y.; Sabanayagam, C.; Tham, Y.C.; et al. A deep learning model for detection of Alzheimer’s disease based on retinal photographs: A retrospective, multicentre case-control study. Lancet Digit. Health 2022, 4, e806–e815. [Google Scholar] [CrossRef]
  33. Mmadumbu, A.C.; Saeed, F.; Ghaleb, F.; Qasem, S.N. Early detection of Alzheimer’s disease using deep learning methods. Alzheimer’s Dement. 2025, 21, e70175. [Google Scholar] [CrossRef] [PubMed]
  34. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 63–72. [Google Scholar]
  35. Abbas, Q.; Gul, A. Detection and classification of malignant melanoma using deep features of NASNet. SN Comput. Sci. 2022, 4, 21. [Google Scholar] [CrossRef]
  36. Nan, Y.; Ju, J.; Hua, Q.; Zhang, H.; Wang, B. A-MobileNet: An approach of facial expression recognition. Alex. Eng. J. 2022, 61, 4435–4444. [Google Scholar] [CrossRef]
  37. Lei, B.; Chen, S.; Ni, D.; Wang, T. Discriminative learning for Alzheimer’s disease diagnosis via canonical correlation analysis and multimodal fusion. Front. Aging Neurosci. 2016, 8, 77. [Google Scholar] [CrossRef]
  38. Cheung, C.Y.; Mok, V.; Foster, P.J.; Trucco, E.; Chen, C.; Wong, T.Y. Retinal imaging in Alzheimer’s disease. J. Neurol. Neurosurg. Psychiatry 2021, 92, 983–994. [Google Scholar] [CrossRef]
  39. Keane, P.A.; Sadda, S.R. Retinal imaging in the twenty-first century: State of the art and future directions. Ophthalmology 2014, 121, 2489–2500. [Google Scholar] [CrossRef]
  40. Junior, K.J.; Carole, K.S.; Theodore Armand, T.P.; Kim, H.C.; Alzheimer’s Disease Neuroimaging Initiative. Alzheimer’s Multiclassification Using Explainable AI Techniques. Appl. Sci. 2024, 14, 8287. [Google Scholar] [CrossRef]
  41. Anzum, H.; Sammo, N.S.; Akhter, S. Leveraging transformers and explainable AI for Alzheimer’s disease interpretability. PLoS ONE 2025, 20, e0322607. [Google Scholar] [CrossRef]
  42. Viswan, V.; Shaffi, N.; Mahmud, M.; Subramanian, K.; Hajamohideen, F. Explainable artificial intelligence in Alzheimer’s disease classification: A systematic review. Cogn. Comput. 2024, 16, 1–44. [Google Scholar] [CrossRef]
  43. Alotaibi, S.D.; Alharbi, A.A. Enhancing automated detection and classification of dementia in individuals with cognitive impairment using artificial intelligence techniques. Sci. Rep. 2025, 15, 24659. [Google Scholar] [CrossRef] [PubMed]
  44. Vlontzou, M.E.; Athanasiou, M.; Dalakleidi, K.V.; Skampardoni, I.; Davatzikos, C.; Nikita, K. A comprehensive interpretable machine learning framework for mild cognitive impairment and Alzheimer’s disease diagnosis. Sci. Rep. 2025, 15, 8410. [Google Scholar] [CrossRef] [PubMed]
  45. Fathi, S.; Ahmadi, A.; Dehnad, A.; Almasi-Dooghaee, M.; Sadegh, M.; Alzheimer’s Disease Neuroimaging Initiative. A deep learning-based ensemble method for early diagnosis of Alzheimer’s disease using MRI images. Neuroinformatics 2024, 22, 89–105. [Google Scholar] [CrossRef]
  46. Cao, K.; Xia, Y.; Yao, J.; Han, X.; Lambert, L.; Zhang, T.; Tang, W.; Jin, G.; Jiang, H.; Fang, X.; et al. Large-scale pancreatic cancer detection via non-contrast CT and deep learning. Nat. Med. 2023, 29, 3033–3043. [Google Scholar] [CrossRef]
  47. Qiu, D.; Ju, J.; Ren, S.; Zhang, T.; Tu, H.; Xie, F. A deep-learning based cascade algorithm for pancreatic tumor segmentation. Front. Oncol. 2024, 14, 1328146. [Google Scholar] [CrossRef]
  48. Ozawa, M.; Sone, M.; Hijioka, S.; Hara, H.; Wakatsuki, Y.; Ishihara, T.; Hattori, S.; Hirano, R.; Ambo, S.; Esaki, M.; et al. Deep learning-based automatic detection of pancreatic ductal adenocarcinoma ≤2 cm with high-resolution computed tomography. Jpn. J. Radiol. 2025, 11, 1870–1877. [Google Scholar] [CrossRef]
  49. Alaca, Y. Machine learning via DARTS-Optimized MobileViT models for pancreatic cancer diagnosis with graph-based deep learning. BMC Med. Inform. Decis. Mak. 2025, 25, 81. [Google Scholar] [CrossRef]
  50. Alaca, Y.; Akmeşe, Ö.F. Pancreatic Tumor Detection From CT Images Converted to Graphs Using Whale Optimization and Classification Algorithms With Transfer Learning. Int. J. Imaging Syst. Technol. 2025, 35, e70040. [Google Scholar] [CrossRef]
  51. Lundervold, A.S.; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 2019, 29, 102–127. [Google Scholar] [CrossRef] [PubMed]
  52. Hossain, M.I.; Zamzmi, G.; Mouton, P.R.; Salekin, M.S.; Sun, Y.; Goldgof, D. Explainable AI for medical data: Current methods, limitations, and future directions. ACM Comput. Surv. 2025, 57, 1–46. [Google Scholar] [CrossRef]
  53. Rauniyar, A.; Hagos, D.H.; Jha, D.; Håkegård, J.E.; Bagci, U.; Rawat, D.B.; Vlassov, V. Federated learning for medical applications: A taxonomy, current trends, challenges, and future research directions. IEEE Internet Things J. 2023, 11, 7374–7398. [Google Scholar] [CrossRef]
  54. ADNI Data & Samples. Available online: https://adni.loni.usc.edu/data-samples/adni-data/ (accessed on 7 September 2025).
  55. Uraninjo. Augmented Alzheimer MRI Dataset. Available online: https://www.kaggle.com/datasets/uraninjo/augmented-alzheimer-mri-dataset/data (accessed on 14 November 2025).
  56. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
  57. Pampel, F.C. Logistic Regression: A Primer; Number 132; Sage Publications: London, UK, 2020. [Google Scholar]
  58. Agliari, E.; Alemanno, F.; Barra, A.; De Marzo, G. The emergence of a concept in shallow neural networks. Neural Netw. 2022, 148, 232–253. [Google Scholar] [CrossRef] [PubMed]
  59. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  60. Quach, L.D.; Quoc, K.N.; Quynh, A.N.; Thai-Nghe, N.; Nguyen, T.G. Explainable deep learning models with gradient-weighted class activation mapping for smart agriculture. IEEE Access 2023, 11, 83752–83762. [Google Scholar] [CrossRef]
  61. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  62. Sarraf, S.; Tofighi, G. Deep learning-based pipeline to recognize Alzheimer’s disease using fMRI data. In Proceedings of the 2016 Future Technologies Conference (FTC), San Francisco, CA, USA, 6–7 December 2016; IEEE: New York, NY, USA, 2016; pp. 816–820. [Google Scholar]
  63. Billones, C.D.; Demetria, O.J.L.D.; Hostallero, D.E.D.; Naval, P.C. DemNet: A convolutional neural network for the detection of Alzheimer’s disease and mild cognitive impairment. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 4–5 March 2016; IEEE: New York, NY, USA, 2016; pp. 3724–3727. [Google Scholar]
  64. Ortiz, A.; Munilla, J.; Gorriz, J.M.; Ramirez, J. Ensembles of deep learning architectures for the early diagnosis of the Alzheimer’s disease. Int. J. Neural Syst. 2016, 26, 1650025. [Google Scholar] [CrossRef] [PubMed]
  65. Lian, C.; Liu, M.; Zhang, J.; Shen, D. Hierarchical fully convolutional network for joint atrophy localization and Alzheimer’s disease diagnosis using structural MRI. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 880–893. [Google Scholar] [CrossRef] [PubMed]
Figure 1. A toy example of a hybrid deep ensemble learning framework with XAI using ResNet50 and a dense meta-learner for Alzheimer’s disease classification. Grad-CAM highlights key brain regions in both coronal and horizontal MRI sections, distinguishing healthy controls from Alzheimer’s patients.
Figure 2. Workflow of the proposed hybrid deep ensemble model for early diagnosis of Alzheimer’s disease. The pipeline begins with the preprocessing and slicing of MRI images (gray matter and white matter).
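The workflow in Figure 2 combines the three fine-tuned base networks through weighted averaging and a stacked meta-learner. The sketch below illustrates both combination strategies on synthetic base-model probability outputs; the data, the accuracy-proportional weights, and the logistic-regression meta-learner are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the probability outputs of the three fine-tuned
# base networks (ResNet50, NASNet, MobileNet) on 200 slices.
y = rng.integers(0, 2, size=200)
base_probs = np.column_stack([
    np.clip(y + rng.normal(0, 0.4, size=200), 0, 1)  # each base model seen
    for _ in range(3)                                # as a noisy view of y
])

# Weighted averaging: weights proportional to each base model's accuracy.
acc = np.array([((p > 0.5).astype(int) == y).mean() for p in base_probs.T])
w = acc / acc.sum()
avg_pred = (base_probs @ w > 0.5).astype(int)

# Stacking: a logistic-regression meta-learner on the base probabilities.
meta = LogisticRegression().fit(base_probs, y)
stack_pred = meta.predict(base_probs)

print(f"weighted-average accuracy: {(avg_pred == y).mean():.3f}")
print(f"stacked meta-learner accuracy: {(stack_pred == y).mean():.3f}")
```

In practice the meta-learner would be trained on out-of-fold base-model predictions rather than in-sample ones, to avoid leaking the base models' training data into the stacking stage.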
Figure 3. ROC curve comparison for eResNet50, eNasNet, eMobileNet, eEnsemble, and the proposed hybrid ensemble with 10-fold CV.
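The ROC comparison in Figure 3 is built from 10-fold cross-validated scores. A minimal sketch of that evaluation pattern, using synthetic features and a logistic-regression classifier purely as stand-ins for the CNN models, pools the out-of-fold scores before computing a single AUC:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic features standing in for CNN-derived slice representations.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Collect out-of-fold scores across 10 stratified folds, then compute one
# pooled AUC, so every sample is scored by a model that never saw it.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = np.zeros(len(y))
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print(f"pooled 10-fold AUC: {roc_auc_score(y, scores):.3f}")
```

An alternative is to compute one ROC curve per fold and average them; pooling, as above, yields a single curve per model, which is convenient for side-by-side comparison plots like Figure 3.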
Figure 4. Grad-CAM heatmaps for three representative brain MRI slices. Top: original T1-weighted MRI images for No AD, MCI, and AD. Bottom: corresponding Grad-CAM overlays from the hybrid ensemble network highlighting class-discriminative regions.
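The overlays in Figure 4 come from Grad-CAM, which weights a convolutional layer's feature maps by the gradients of the target class score. The sketch below implements only the final weighting step of Selvaraju et al.'s method, assuming the activations and gradients have already been extracted by the network's backward pass; the toy 7×7×8 arrays are hypothetical stand-ins for real feature maps.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM localisation map from one convolutional layer.

    activations, gradients: (H, W, C) feature maps and the gradients of
    the target class score with respect to them (assumed precomputed).
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(0, 1))                       # shape (C,)
    # Weighted sum over channels, then ReLU (keep positive evidence only).
    cam = np.maximum((activations * weights).sum(axis=-1), 0)   # shape (H, W)
    # Normalise to [0, 1] so the map can be rendered as a heatmap overlay.
    return cam / cam.max() if cam.max() > 0 else cam

# Toy feature maps and gradients standing in for one MRI slice's conv layer.
rng = np.random.default_rng(1)
acts = rng.random((7, 7, 8))
grads = np.zeros((7, 7, 8))
grads[2:5, 2:5, :] = 1.0        # gradient mass concentrated in the centre
cam = grad_cam(acts, grads)
print(cam.shape, float(cam.max()))
```

In a full pipeline the resulting low-resolution map is upsampled to the input slice size and blended over the original MRI image to produce overlays like those in Figure 4.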
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mostafa, F.; Hossain, K.; Das, D.; Khan, H. Deep Learning Approaches with Explainable AI for Differentiating Alzheimer’s Disease and Mild Cognitive Impairment. AppliedMath 2025, 5, 171. https://doi.org/10.3390/appliedmath5040171
