1. Introduction
Alzheimer’s disease (AD) is the leading cause of dementia. It is a degenerative pathology of the brain manifested by a range of debilitating symptoms, most notably memory loss and cognitive decline, which can become severe enough to hinder daily functioning. AD accounts for 60–80% of all reported dementia cases [1]. This high prevalence underscores the scale of the associated cognitive impairment. AD is also widely recognized as a costly neurodegenerative disorder that imposes a substantial economic burden. A study conducted in 2006 [2] estimated that approximately 26.6 million individuals worldwide were living with AD.
Given the increasing societal and economic ramifications of AD, preventive measures and interventions are essential. Although certain signs of AD may resemble the cognitive decline associated with aging, dementia, and AD in particular, is not a normal or inherent part of the aging process. The clinical manifestations of dementia [3] are progressive. Currently, no definitive treatment for AD exists. The primary objectives of care are to slow the advancement of the disease, ameliorate behavioral complications, and improve overall quality of life [4,5]. Existing pharmacological therapies can temporarily slow the progression of dementia, particularly when its characteristic indicators are detected early. Thus, the pursuit of better therapeutic approaches, preventive strategies, and an ultimate cure constitutes a pivotal and enduring objective.
Recent studies have made substantial advances in detecting and monitoring AD progression through the identification and use of biomarkers. Notably, brain imaging technologies are essential for detecting and visualizing the pathophysiological changes associated with AD over periods ranging from months to decades. In addition to amyloid markers, many other biomarkers that assess neurodegenerative processes have been used, including tau protein levels in the cerebrospinal fluid (CSF-tau), fluorodeoxyglucose positron emission tomography (FDG-PET) [6,7], and structural MRI (sMRI). The continued reliance on postmortem examination underscores the need for continuous research to develop accurate, noninvasive diagnostic tools that facilitate early detection and intervention in living patients. Despite the extensive body of research dedicated to AD, reliable diagnostic tools are still urgently needed because diagnosing and treating this condition is intricate and challenging.
The scale of the problem is significant. The World Alzheimer’s Report [8] estimates that diagnosed cases will rise from 55 million to 78 million by 2030. The condition is characterized by clinical symptoms such as memory loss, confusion, and visuospatial abnormalities [9]. Although treatment options are limited and rely on symptom monitoring alone, there are ongoing efforts to improve early identification and diagnosis. One such approach involves the discovery of specific CSF biomarkers [10]. However, this approach requires invasive procedures that can put patients at risk [11]. Advanced imaging modalities, such as positron emission tomography (PET) and MRI, play a crucial role in identifying structural and molecular biomarkers associated with AD [12,13]. Owing to its noninvasive nature, MRI has emerged as a crucial tool for understanding the morphological and functional brain alterations associated with AD and is therefore indispensable in clinical practice. However, integrating large-scale, multimodal, and high-dimensional data from new neuroimaging methods remains challenging, which has driven a notable increase in interest in integrative analysis using machine learning techniques.
Conventional machine learning algorithms, although successful in disease classification, require labor-intensive and computationally intensive preprocessing. A typical pipeline involves four steps: feature extraction, feature selection, dimension reduction, and selection of a feature-based classification algorithm [14]. Scientists are therefore exploring deep learning (DL) algorithms as potential alternatives. DL algorithms are a type of representation learning technique that can generate effective representations from unprocessed data without requiring prior feature selection [15,16,17]. Using a hierarchical structure of multiple layers with sequential nonlinear transformations, DL [18,19] has demonstrated promise in various domains, including medical imaging.
The convolutional neural network (CNN), a widely used DL architecture, has garnered attention in medical image analysis because of its notable performance in image classification [20,21,22,23,24]. However, the CNN design can be further enhanced to identify AD more accurately. Motivated by the success of DL methods in medical imaging, this study proposes an enhanced CNN to detect and classify AD from MRI images.
Despite notable advances in deep learning-based AD classification, several challenges remain. Existing high-performing architectures often involve substantial computational demands, which may limit practical deployment. Additionally, several studies have employed image-level rather than subject-level data partitioning, which can introduce data leakage and overestimate true generalization performance [18]. Interpretability also remains a concern, as the anatomical basis for model decisions is frequently not examined. The present study attempts to address these aspects by proposing a residual CNN that operates on 2D sMRI slices with subject-level stratified partitioning, 10-fold cross-validation, and Grad-CAM-based visualization. The key contributions include: (1) a relatively lightweight architecture; (2) subject-level data partitioning to mitigate leakage; (3) Grad-CAM analysis to examine anatomically relevant activations; and (4) multiclass classification across AD, EMCI, LMCI, and CN groups using the ADNI dataset.
The remainder of this paper is structured as follows. Section 2 describes the dataset, the proposed architecture, and the evaluation criteria. Section 3 evaluates the performance, compares it with other methods, and discusses the results. Section 4 draws conclusions.
2. Materials and Methods
2.1. MRI Acquisition Protocol
The proposed model was evaluated on the ADNI dataset. Of 600 subjects, 150 were diagnosed with AD, 150 with EMCI, 150 with LMCI, and 150 were cognitively normal (CN). The 3D MRI scan of each subject had dimensions of 256 × 256 × 170 voxels. Two-dimensional images were extracted from the axial, coronal, and sagittal planes, and blank or non-informative images were automatically discarded. All remaining images were resized to 96 × 96 pixels using bilinear interpolation.
To ensure complete subject-level independence and avoid data leakage, the dataset was partitioned by subject rather than by image slice, so that all slices derived from a single subject were kept within the same set. Specifically, a stratified split was performed within each diagnostic category (AD, EMCI, LMCI, CN) so that 70% of subjects were assigned to the training set, 15% to the validation set, and 15% to the test set. To robustly assess model performance, subject-level 10-fold cross-validation was performed exclusively within the 70% training partition. The validation set (15%) and test set (15%) were held out entirely before any fold iteration and were not used at any stage of cross-validation. Within the training partition, subjects were randomly divided into 10 equal folds at the subject level, ensuring that all slices from a single subject remained within the same fold. The model was trained on 9 folds and validated on the remaining fold, rotating until each fold had served as the validation fold once. Final model performance was reported on the held-out test set after the cross-validation procedure was completed.
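The partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the rounding rule, and the fixed random seed are hypothetical choices, and the input is assumed to be a mapping from subject ID to diagnostic label.

```python
import random
from collections import defaultdict


def subject_level_split(subject_labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Stratified train/val/test split of subject IDs (not slices).

    subject_labels: dict mapping subject ID -> diagnostic label.
    The rounding rule and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for subject, label in sorted(subject_labels.items()):
        by_class[label].append(subject)

    train, val, test = [], [], []
    for label in sorted(by_class):
        subjects = by_class[label]
        rng.shuffle(subjects)
        n_train = round(fractions[0] * len(subjects))
        n_val = round(fractions[1] * len(subjects))
        train += subjects[:n_train]
        val += subjects[n_train:n_train + n_val]
        test += subjects[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def subject_level_folds(train_subjects, k=10, seed=0):
    """Randomly divide training subjects into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(train_subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]
```

Slices are then assigned to a set or fold by looking up the subject they came from, which is what keeps every slice of a subject on the same side of each split.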
This study adopts a 2D slice-based processing approach. In our preprocessing pipeline, raw 3D MRI scans (256 × 256 × 170 voxels) from the ADNI dataset are processed in several distinct steps. First, 2D slices are extracted from the axial, coronal, and sagittal planes. To eliminate non-informative slices, we apply an intensity-based threshold: any slice with a mean pixel intensity below 5% of the maximum signal, or that fails a connectivity analysis for sufficient brain tissue, is automatically discarded. The remaining slices are then resized to 96 × 96 pixels using bilinear interpolation to ensure spatial consistency. Next, each image is normalized by subtracting the training-set mean and dividing by the training-set standard deviation, so that the data have approximately zero mean and unit variance. In addition, to increase the diversity of the training data and reduce overfitting, we employ data augmentation strategies including random rotations (±10°), horizontal flipping, and random intensity shifts. Standard skull stripping and MNI space registration were not applied, as the intensity-based quality filtering and slice-level normalization were deemed sufficient for the 2D classification approach. This choice reduces preprocessing complexity while maintaining classification performance, though it represents a limitation relative to volumetric methods. Documenting each preprocessing decision in this way supports reproducibility and robust model training.
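The intensity filter and the normalization step can be illustrated with a short sketch. It is written in plain Python on nested lists for readability and rests on several assumptions: the function names are hypothetical, the "maximum signal" is taken to be a value supplied by the caller (for example, the maximum of the source volume), and the connectivity analysis is not reproduced.

```python
from statistics import mean, pstdev


def keep_slice(slice_2d, max_signal, threshold=0.05):
    """Return True if the slice's mean intensity reaches threshold * max_signal.

    max_signal is assumed to be supplied by the caller; the source does not
    state whether it is computed per slice or per volume.
    """
    pixels = [p for row in slice_2d for p in row]
    return mean(pixels) >= threshold * max_signal


def fit_normalizer(train_slices):
    """Compute the mean and standard deviation over training-set pixels only."""
    pixels = [p for s in train_slices for row in s for p in row]
    mu, sigma = mean(pixels), pstdev(pixels)
    if sigma == 0:
        raise ValueError("training pixels have zero variance")
    return mu, sigma


def normalize(slice_2d, mu, sigma):
    """Apply training-set statistics to any slice (train, validation, or test)."""
    return [[(p - mu) / sigma for p in row] for row in slice_2d]
```

Fitting the statistics on the training partition alone and reusing them for validation and test slices keeps information from held-out subjects out of the preprocessing.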
2.2. Data and Participants
T1-weighted MRI data, obtained from individuals enrolled in the ADNI over a span of 24 months, were examined. The study sample consisted of 150 individuals diagnosed with AD, 150 with EMCI, 150 with LMCI, and 150 categorized as CN, totaling 600 subjects. Demographic details for each group are provided in Table 1. The structural MRI images were preprocessed before quality assessment. This study used data from ADNI, the main goal of which is to determine whether longitudinal MRI and PET imaging, along with other biological markers and comprehensive clinical and neuropsychological evaluations, can be integrated to track the progression of mild cognitive impairment and early-stage AD. The ADNI database (https://adni.loni.usc.edu, accessed on 20 January 2025) was established in 2003 as a public–private partnership and can be accessed by approved researchers.
2.3. 2D Slice Extraction from 3D MRI Volumes
Because MRI data are inherently volumetric, a 3D convolutional neural network is a natural choice for deep learning models. However, the computational effort and time required to train 3D CNN models are considerably higher than those required for 2D CNN models because of the high-dimensional input. A further challenge is the limited size of available medical datasets, which makes it difficult to train a deep network that generalizes well in complex problem domains. In this study, the 3D MRI volumes could not be fed directly to a 2D CNN because of their dimensionality. Two-dimensional slices were therefore extracted directly from the raw 3D MRI volumes (256 × 256 × 170 voxels) along the sagittal, coronal, and axial planes without prior volumetric resampling, and the outermost (initial and final) slices containing no relevant brain information were discarded. The slices were then resized and normalized to zero mean and unit standard deviation. The 2D CNN model was subsequently trained on randomly chosen axial, coronal, and sagittal slices.
Figure 1 shows a collection of MRI slices from individuals with different cognitive states.
2.4. Network Architecture
The proposed CNN architecture integrates conventional convolutional layers with residual skip connections. Although 7 × 7 kernels are larger than the 3 × 3 kernels typical of lightweight models, they were selected for the initial convolutional layers to capture broader spatial context from MRI slices, where clinically relevant features such as hippocampal boundaries and ventricular enlargement span larger receptive fields. Despite the larger kernel size, the overall parameter count of 28.93 M remains substantially lower than that of VGG-16 (~138 M parameters), which supports the claim of relative computational efficiency. More compact architectures such as MobileNet (~4 M parameters) and EfficientNet-B0 (~5 M parameters) exist, but these models were originally designed for natural image classification and have not been optimized for neuroimaging tasks involving subtle morphological differences between diagnostic categories such as EMCI and LMCI. The higher parameter count of the proposed model relative to these lightweight networks is justified by the need to capture fine-grained spatial features across multiple MRI planes, which are critical for distinguishing transitional AD stages in 4-class classification from sMRI data. Residual (skip) connections are incorporated to facilitate feature mixing, mitigate the vanishing-gradient problem, and capture both local and global spatial features; they are interleaved with max pooling, batch normalization, and dropout layers (dropout rate = 0.4). These residual blocks enable the network to learn complex representations by combining outputs from earlier layers with deeper-layer activations, as detailed in the architectural diagram, which specifies the number of layers, filter dimensions, activation functions (ReLU), and dropout parameters.
Furthermore, we compare the total trainable parameters, memory usage, and inference time of the proposed architecture with benchmark models such as VGG-16 and ResNet-50 to substantiate the claim of lower computational cost.
This section describes the proposed CNN architecture and the associated methodologies employed in this study. Although 3D CNNs can directly analyze volumetric MRI data, our approach converts 3D scans into 2D slices to significantly reduce computational complexity and training time. This 2D processing allows us to leverage well-established image processing and data augmentation techniques (rotations, flips, and intensity shifts) while making efficient use of the limited training data per class. Each input underwent these procedures, enabling the development of classifiers for both binary and multiclass categorization. Network performance benefits from the core properties of convolutional layers: shift invariance, local connectivity, and shared weights. The proposed deep CNN framework aims to detect multiple AD biomarkers using slices drawn from the complete image volume. For comparison with the proposed approach, we also employed widely recognized CNN architectures, namely VGG-Net and ResNet, for the classification tasks. These models have demonstrated efficacy in a diverse array of applications, including image classification, identification, labeling, and detection. We used cross-entropy as the loss function, which is appropriate because the dataset is class-balanced. Although the 2D slice-based strategy significantly reduces the computational burden and memory requirements, it inevitably removes the inter-slice spatial dependencies present in full 3D volumes, which may lead to a loss of contextual anatomical information. We consider the gains in efficiency to justify this design choice; hybrid 2D architectures and slice-position encoding may help preserve volumetric context while maintaining efficiency.
2.4.1. VGG-Net Model
VGG-Net was first proposed by Simonyan and Zisserman [25]. The model was developed by the Visual Geometry Group (VGG) and trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, which consists of 1.3 million training images and an additional 50,000 validation images covering 1000 widely ranging classes. The VGG-13 to VGG-19 variants are distinguished by their 13–19 weight layers. VGG-Net routinely demonstrated higher performance than other state-of-the-art models of the same period. The architectural design stacks convolutional layers followed by fully connected layers, enabling sophisticated feature extraction. We employ average pooling to down-sample the feature maps before the Softmax activation used for classification. Using VGG-16 as our reference architecture, we then assess its performance on the ADNI dataset in distinguishing AD.
Figure 2 illustrates the VGG-16 architectural layout.
2.4.2. ResNet
ResNet quickly established itself as a leading architecture for classification, localization, and recognition tasks during the ILSVRC competition [22]. To increase representational capacity, researchers investigated whether adding more layers to a neural network improves performance. During experimentation, they encountered a phenomenon known as degradation, wherein conventional models such as VGG exhibited a decline in performance instead of an improvement when the number of layers exceeded a certain threshold. To address this challenge, the residual function was devised as the fundamental element of ResNet. In this study, the ResNet model employed the 50-layer non-bottleneck design shown in Figure 3. The configuration comprises a series of shortcut connections across stages of progressively increasing size. These connections can be categorized into two types: identity links (A), which do not involve padding, and projection links (B), which employ convolutions with a 1 × 1 filter (kernel) size. AD classification was performed using the ResNet-50 model on the ADNI dataset. The diagram illustrates the fundamental structure of ResNet using a simplified 34-layer representation for clarity: a plain network with 34 parameter layers alongside a residual network with 34 parameter layers. Both networks were trained on the ImageNet dataset [26].
2.5. Proposed Methodology
Convolutional layers are fundamental to deep CNNs; by integrating activation functions, our deep CNN autonomously learns and extracts discriminative features from whole-brain MRI scans to support accurate Alzheimer’s disease diagnosis. Figure 4 outlines the overall pipeline, which encompasses three main phases: slicing of 3D volumes, scaling of brain volumes, and CNN processing. Detailed layer configurations, including dimensions, are provided in Table 2 (the complete layer-by-layer architecture of the proposed model) and Table S1 (the layer-by-layer parameter count derivation for the proposed architecture). All max pooling layers use a 2 × 2 kernel with stride 2, halving the spatial dimensions at each stage. The three residual skip connections (ADD 1, ADD 2, ADD 3) are identity connections that add the input of each convolutional block directly to its output without spatial modification, as the spatial dimensions are preserved within each block. The global residual connection (ADD 4) spans the entire network from the input to the final feature map. Since the input (96 × 96 × 1) and the final feature map (12 × 12 × 256) differ in both spatial dimensions and channel depth, a projection convolutional layer with a 1 × 1 kernel and stride 8 is used to match dimensions before the element-wise addition. A Global Average Pooling layer follows ADD 4 to reduce the spatial feature maps to a 1D vector before the fully connected output layer. Our method combines conventional convolutional layers with skip connections, allowing the model to learn and integrate features at multiple hierarchical levels from MRI images.
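The dimension matching for the global residual connection can be checked with the standard output-size formula for convolution and pooling layers. The sketch below is only an arithmetic illustration: the helper name is hypothetical, and the count of three pooling stages is an assumption inferred from the reduction from 96 to 12 with stride-2 pooling, not a figure stated in the text.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Main path: assumed three 2 x 2 max pooling stages with stride 2.
main_path = 96
for _ in range(3):
    main_path = conv_output_size(main_path, kernel=2, stride=2)

# Shortcut path: 1 x 1 projection convolution with stride 8 on the input.
shortcut = conv_output_size(96, kernel=1, stride=8)

print(main_path, shortcut)
```

Both paths arrive at a 12 × 12 spatial grid, so once the projection also maps 1 input channel to 256 output channels, the element-wise addition at ADD 4 is well defined.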
The CNN model is structured to refine feature extraction in a robust and efficient manner. Initially, convolution layers use 7 × 7 filters with 256 channels and ReLU activation to capture detailed spatial features from the input images. These convolution layers are followed by max pooling layers (2 × 2 kernel, stride 2) that down-sample the feature maps while preserving the most relevant information. To further stabilize training and reduce overfitting, batch normalization is applied together with dropout at a rate of 0.4. The architecture also incorporates residual blocks with skip connections, which enhance gradient flow and allow deeper feature integration by combining outputs from earlier layers with those from later layers. Finally, a fully connected layer serves as the output layer, mapping the extracted features to four distinct classes corresponding to the diagnostic categories in Alzheimer’s disease. The typical convolution procedure is depicted in Figure 5, where $D_F$ denotes the height and width of the square input feature map, and M and N denote the number of input and output feature map channels, respectively. In a convolutional layer, the input feature map I is convolved with the layer’s filters to produce the output G. The height and width of the convolution kernel are denoted by $D_K$.
The basic convolution computation is based on feature mapping. The output feature map G is given by the following equation:
$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n}\, I_{k+i-1,\,l+j-1,\,m} \qquad (1)$$
In Equation (1), K represents the convolution kernel, G represents the output feature map, and I denotes the input feature map. Here, i and j index spatial positions within the convolutional kernel, while k and l index spatial positions within the output feature map; the corresponding input positions are k + i − 1 and l + j − 1. Additionally, m and n are the channel indices for the input and output feature maps, running over the M input and N output channels, respectively, so that the convolution operation takes multi-channel information into account.
The following formula calculates the number of trainable parameters F in a standard convolutional layer:
$$F = D_K \cdot D_K \cdot M \cdot N \qquad (2)$$
where F is the total number of parameters, M is the number of input channels, N is the number of output channels, and $D_K$ is the spatial width and height of the convolutional kernel.
The following equation calculates the total computational cost G of a standard convolutional layer (not to be confused with the output feature map G of Equation (1)):
$$G = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \qquad (3)$$
where G is the computing cost; here, $D_F$ is the spatial width and height of the input feature map, and M, N, and $D_K$ are as defined in Equation (2). The key distinction between Equations (2) and (3) is that the parameter count F depends only on the kernel and channel dimensions, whereas the computational cost G additionally scales with the spatial size of the input feature map $D_F$.
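As a worked illustration of Equations (2) and (3), the sketch below evaluates both quantities for a single layer. The function and argument names are hypothetical. The example layer (a 7 × 7 kernel mapping 1 input channel to 256 output channels on a 96 × 96 feature map) is an assumption chosen to match the first-layer description, and bias terms are excluded, as in the formulas.

```python
def conv_params(kernel, in_channels, out_channels):
    """Trainable weights of a standard convolutional layer (biases excluded)."""
    return kernel * kernel * in_channels * out_channels


def conv_cost(kernel, in_channels, out_channels, feature_size):
    """Multiply-accumulate count: the parameter count scaled by the
    spatial size of the (square) feature map."""
    return conv_params(kernel, in_channels, out_channels) * feature_size ** 2


params = conv_params(kernel=7, in_channels=1, out_channels=256)
cost = conv_cost(kernel=7, in_channels=1, out_channels=256, feature_size=96)
print(params)  # 12544
print(cost)    # 115605504
```

The example makes the distinction concrete: halving the feature-map side through pooling leaves the parameter count unchanged but reduces the cost of the layer by a factor of four.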
2.6. Implementation Details
The experiments were executed on an NVIDIA RTX 3090 GPU running Ubuntu 20.04-x64 with Python 3.9.13, using TensorFlow and Keras for model implementation. A two-dimensional CNN was developed using 2D slices extracted from 3D structural MRI scans (Figure 6). To support reproducibility, the open-source preprocessing scripts and detailed data split configurations are available from the corresponding author upon reasonable request. Categorical cross-entropy was used as the loss function, as it is appropriate for multiclass classification tasks involving more than two classes, whereas binary cross-entropy applies only to binary classification scenarios. The use of categorical cross-entropy is further supported by the balanced class distribution across all four diagnostic categories shown in Table 3.
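For reference, categorical cross-entropy for a single sample is the negative log of the probability that the softmax output assigns to the true class. The following sketch computes it in plain Python; the function names and the clipping constant are illustrative assumptions and do not reproduce the Keras implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Loss for one sample; eps guards against log(0)."""
    return -math.log(max(probs[true_index], eps))


# Four classes in a hypothetical order (AD, EMCI, LMCI, CN).
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = categorical_cross_entropy(probs, true_index=0)
```

With uninformative scores, each of the four classes receives probability 0.25 and the loss equals ln 4, which is the value a four-class model starts from before it has learned anything.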
Weight initialization was performed using the Xavier (Glorot) method to maintain gradient stability. The preprocessing pipeline automatically extracted 2D slices from the axial, coronal, and sagittal views, discarded slices whose mean intensity was below 5% of the maximum, and resized the remaining slices to 96 × 96 pixels via bilinear interpolation. Each image was normalized by subtracting the training-set mean and dividing by the training-set standard deviation, and data augmentation techniques including random rotations (±10°), horizontal flips, and random intensity shifts were used to enhance model generalization. For robust evaluation, performance metrics were computed with 95% confidence intervals derived from 1000 bootstrap resamples, and statistical significance was assessed via p-values when comparing against benchmark architectures. This description and the accompanying table consolidate our implementation and hyperparameter configurations in a concise format to support reproducibility.
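A bootstrap confidence interval of this kind can be sketched as follows. The percentile method, the resampling unit (individual test predictions), and the function name are assumptions made for illustration; the text does not specify which bootstrap variant was used or whether resampling was done at the slice or subject level.

```python
import random


def bootstrap_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    correct: list of 0/1 flags, one per test prediction (1 = correct).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Accuracy of each resample drawn with replacement, sorted ascending.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes for illustration only: 90 correct out of 100.
outcomes = [1] * 90 + [0] * 10
low, high = bootstrap_ci(outcomes)
```

Because slices from one subject are correlated, resampling whole subjects rather than individual slices would generally give a more conservative interval.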
2.7. Performance Evaluation
The terms true positive (TP), true negative (TN), false positive (FP), and false negative (FN) describe the predicted outcomes of the diagnostic tasks. A positive sample that is correctly predicted is a true positive; a negative sample that is correctly predicted is a true negative. FP denotes a negative sample misclassified as positive, and FN denotes a positive sample misclassified as negative. The diagnostic model was evaluated using a set of widely used measures: accuracy, specificity, sensitivity, precision, and the F1 score. Accuracy, given in Equation (4), is the proportion of correctly identified samples among all test samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
To compare our method with baselines more rigorously, we applied paired statistical tests, including McNemar’s test for classification.
As shown in Equation (5), specificity is the proportion of negative samples that are correctly identified; as shown in Equation (6), sensitivity is the proportion of positive samples of the specified class that are correctly identified. In the context of AD patients, sensitivity is also called recall.
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (6)$$
As shown in Equation (7), precision is calculated as the ratio of true positive predictions to the total number of positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (7)$$
As shown in Equation (8), the F1 score is the harmonic mean of precision and sensitivity.
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \qquad (8)$$
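The five measures follow directly from the four outcome counts, using their standard definitions. The sketch below uses a hypothetical function name, and the counts in the test are made up for illustration; they are not results from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity, precision, and F1 from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    # Harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }
```

Note that the F1 score is a harmonic rather than an arithmetic mean, so it is pulled toward the lower of precision and sensitivity.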
Multiclass classification performance was evaluated using a confusion matrix, as shown in Table 4. This matrix displays the predicted versus the actual outputs for each class, allowing for a detailed assessment of the classifier’s performance. To support the interpretability claim, we generated Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations for correctly classified AD, LMCI, EMCI, and CN samples. The feature maps of the final convolutional layer were used to compute activation heatmaps, which were overlaid on their corresponding MRI slices. This analysis helps identify the discriminative anatomical regions influencing model decisions, particularly the hippocampal formation, the entorhinal cortex, and the enlarged ventricles, which are known to be associated with AD pathology.
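In the four-class setting, the per-class TP, TN, FP, and FN counts can be read off the confusion matrix in a one-vs-rest fashion. The sketch below assumes that rows are actual classes and columns are predicted classes; that convention and the function name are assumptions, and the matrix in the test is made up for illustration rather than taken from Table 4.

```python
def one_vs_rest_counts(confusion, class_index):
    """Return (tp, tn, fp, fn) for one class of a square confusion matrix.

    confusion[i][j] = number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp                 # rest of the row
    fp = sum(row[class_index] for row in confusion) - tp  # rest of the column
    tn = total - tp - fn - fp
    return tp, tn, fp, fn
```

Computing the counts separately for AD, EMCI, LMCI, and CN allows the binary measures to be reported for each class and then averaged if a single summary figure is needed.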