Article

A Novel Deep Learning Approach for Alzheimer’s Disease Detection: Attention-Driven Convolutional Neural Networks with Multi-Activation Fusion

1 School of Information and Physical Sciences, The University of Newcastle, Newcastle 2308, Australia
2 Department of Computer Science, College of Khurma University College, Taif University, Taif 21944, Saudi Arabia
3 Centre for Artificial Intelligence Research and Optimisation, Design and Creative Technology Vertical, Torrens University, Ultimo 2007, Australia
4 Data61, Commonwealth Scientific and Industrial Research Organisation, Canberra 3169, Australia
* Authors to whom correspondence should be addressed.
AI 2025, 6(12), 324; https://doi.org/10.3390/ai6120324
Submission received: 22 October 2025 / Revised: 3 December 2025 / Accepted: 4 December 2025 / Published: 10 December 2025

Abstract

Alzheimer’s disease (AD) affects over 50 million people worldwide, making early and accurate diagnosis essential for effective treatment and care planning. Diagnosing AD through neuroimaging continues to face challenges, including reliance on subjective clinical evaluations, the need for manual feature extraction, and limited generalisability across diverse populations. Recent advances in deep learning, especially convolutional neural networks (CNNs) and vision transformers, have improved diagnostic performance, but many models still depend on large labelled datasets and high computational resources. This study introduces an attention-enhanced CNN with a multi-activation fusion (MAF) module and evaluates it using the Alzheimer’s Disease Neuroimaging Initiative dataset. The channel attention mechanism helps the model focus on the most important brain regions in 3D MRI scans, while the MAF module, inspired by multi-head attention, uses parallel fully connected layers with different activation functions to capture varied and complementary feature patterns. This design improves feature representation and increases robustness across heterogeneous patient groups. The proposed model achieved 92.1% accuracy and 0.99 AUC, with precision, recall, and F1-scores of 91.3%, 89.3%, and 92%, respectively. Ten-fold cross-validation confirmed its reliability, showing consistent performance with 91.23% accuracy, 0.93 AUC, 90.29% precision, and 88.30% recall. Comparative analysis also shows that the model outperforms several state-of-the-art deep learning approaches for AD classification. Overall, these findings highlight the potential of combining attention mechanisms with multi-activation modules to improve automated AD diagnosis and enhance diagnostic reliability.

1. Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder caused by the abnormal accumulation of proteins in the brain [1]. This protein buildup gradually damages neurons, leading to memory loss, cognitive decline, and difficulty performing everyday tasks [2]. As the disease progresses, it severely affects intellectual and social functioning and ultimately reduces a person’s independence [2,3]. AD is the most common form of dementia and poses a major global health challenge [4,5]. Early and accurate diagnosis is essential for initiating appropriate treatment and improving patient outcomes. Currently, over fifty million people worldwide are living with AD [6], and this number will continue to grow as the global population ages. The prevalence of AD varies widely based on factors such as age, genetic inheritance, and lifestyle. According to Alzheimer’s Disease International, China and India currently represent approximately two-thirds of global AD cases [7]. Improved healthcare in these countries has contributed to longer life expectancy among individuals with AD, who previously had limited access to treatment and care. The prevalence of dementia in low-income countries is approximately 7% [8], and for people aged 65 years and older, this rate is similar to that found in high-income countries [8]. Figure 1 illustrates the distinction between healthy brain tissue and that affected by AD. In AD, brain tissue volume progressively decreases over time, with this reduction being accompanied by enlarged ventricular spaces and significant atrophy of the cerebral cortex and hippocampus.
The pathological hallmarks of AD include the formation of amyloid plaques and neurofibrillary tangles. These pathological formations reduce the number of functional nerve cells in the brain and restrict communication between brain cells, leading to increased neural damage. This process results in shrinkage of the hippocampus and brain lobes, as well as enlargement of the ventricles [9]. The exact cause of Alzheimer’s disease remains unknown. However, studies indicate that a combination of genetic, environmental, and lifestyle factors influences its development. Despite ongoing research, no medications or therapies are currently available to prevent or cure dementia [10,11]. Early diagnosis can identify individuals with mild cognitive impairment (MCI), an early form of AD that is treatable in its initial stages [10]. Accurate identification and diagnosis of AD is, therefore, critical for physicians [8]. Physicians may use neuroimaging techniques to identify the initial phases of AD; however, these methods have limited precision. These techniques provide valuable insights into the brain by revealing important details about its structure and function. Computed tomography (CT) scans are among the most commonly used neuroimaging techniques, utilising X-rays to provide a comprehensive view of the brain [12].
CT scans can help detect cognitive impairment caused by conditions such as stroke or tumours, but they are generally insufficient for detecting the subtle changes associated with AD [13]. Positron emission tomography (PET) involves injecting a radioactive tracer into the bloodstream, where it accumulates in metabolically active areas of the brain. However, PET scans require ionising radiation and are more expensive than other imaging procedures. Magnetic resonance imaging (MRI) is a versatile and informative neuroimaging method for AD detection. Unlike CT scans, MRI creates comprehensive brain images using electromagnetic fields and radiofrequency waves instead of ionising radiation. This makes MRI a safer alternative, especially for repeated scans needed to monitor disease progression. MRI is particularly effective in detecting subtle structural changes, such as temporal lobe atrophy, which affects an area critical for memory function [14].
Traditional diagnostic approaches for AD have primarily relied on clinical assessments, cognitive testing, and analysis of structural MRI or PET scans. However, these methods are often time-consuming, subjective, or have limited sensitivity to early-stage changes [10,13]. In recent years, machine learning (ML) techniques have been explored for automated AD classification. Early studies employed handcrafted features extracted from MRI scans, followed by conventional classifiers such as support vector machines (SVMs), random forests, and k-nearest neighbours algorithms [3,15]. Although moderately effective, these approaches suffered from limited feature representation and required extensive pre-processing. Deep learning (DL) models, particularly convolutional neural networks (CNNs), have demonstrated significant improvements by learning hierarchical features directly from imaging data [16,17,18]. Both 2D and 3D CNNs have been applied to capture spatial features from MRI volumes. While 2D CNNs are computationally efficient, they lose inter-slice spatial context that may be important for accurate diagnosis.
In contrast, 3D CNNs can preserve volumetric information and have shown superior performance in AD classification tasks [19,20]. More recently, attention mechanisms and transformer-based architectures have been introduced to enhance model interpretability and to focus on the most relevant brain regions [21]. These models improve spatial feature prioritisation, but are often computationally demanding and require large datasets. Additionally, limited research has investigated the role of activation function diversity in enhancing model performance. Most existing models rely on a single activation function, potentially overlooking the complementary advantages of different nonlinear functions. Fusion strategies and parallel activation branches, such as the approach proposed in this study, remain relatively unexplored in the context of neuroimaging applications.
Despite these advances, several critical gaps remain in the current approaches for automated AD diagnosis. First, existing attention mechanisms primarily focus on spatial regions rather than leveraging channel-specific attention to enhance feature discriminability across different brain structures. Second, the potential of combining multiple activation functions to capture complementary feature representations has been largely unexplored in neuroimaging applications. Third, most current models rely heavily on pre-trained weights or transfer learning, which may not capture the specific patterns unique to AD pathology. These limitations motivate the need for a novel approach that addresses feature channel prioritisation, activation function diversity, and domain-specific learning.
This study presents a deep learning approach for diagnosing Alzheimer’s disease (AD) at the initial stage using structural MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The pipeline comprises four key stages: (i) preprocessing of ADNI MRI scans through skull stripping, spatial normalisation, and resizing; (ii) volumetric feature extraction via a lightweight 3D Convolutional Neural Network (3D-CNN); (iii) channel-specific attention to emphasise discriminative brain regions; and (iv) a multi-activation fusion (MAF) block integrating GELU, SiLU, and ReLU functions to enhance nonlinear feature representation. For convenient referencing, all acronyms are listed in Table 1.
Despite progress in 3D CNN and attention-based models, key limitations remain in MRI-based AD classification. Existing methods typically employ spatial attention while overlooking channel-wise feature recalibration, and they rely on a single activation function, which restricts nonlinear diversity for modelling subtle structural changes. Furthermore, many recent architectures depend on heavy pre-trained transformer models that are not optimised for volumetric neuroimaging. These gaps motivate the need for a lightweight 3D framework that integrates channel-wise attention with activation-level diversity to improve representational capacity and diagnostic robustness.
The aims of this work are to develop an efficient 3D architecture capable of detecting subtle AD-related structural changes, reduce computational complexity relative to transformer-based models, and improve robustness through activation-fusion and attention-enhanced feature learning.
The primary findings of this work are listed below:
  • The study introduces a hybrid model that combines a 3D CNN with a vision transformer (ViT)-style attention mechanism applied to extracted feature channels, effectively capturing spatial dependencies and outperforming traditional CNNs and transformer-based methods in 3D image classification tasks.
  • The proposed MAF block employs GELU, SiLU, and ReLU activation functions in parallel to capture diverse and complementary feature representations, analogous to the multi-head structure in Vision Transformers. This design enhances fine-grained feature discrimination and improves the model’s adaptability to the heterogeneity present in MRI-based AD datasets.
  • The proposed model is trained end-to-end from scratch, avoiding the reliance on pre-trained weights and thereby learning highly specific, meaningful patterns for AD detection. This approach reduces computational constraints and improves generalisability across unseen data.
The integration of attention-driven convolutional blocks and MAF aims to enhance both spatial feature prioritisation and the richness of nonlinear representations, thereby enabling more accurate and discriminative analysis of Alzheimer-related brain structures.
The remainder of the work is arranged as follows: Section 2 summarises related research on the proposed study. Section 3 presents the methodology of the study and explains the proposed model for the classification of AD, detailing the dataset and the components of the approach. Section 4 outlines a comprehensive visualisation process and evaluates the model against the latest approaches. Finally, Section 5 concludes the study, and Section 6 highlights the limitations and future objectives.

2. Related Work

In recent years, integrating CNNs with transformers has become a significant research focus. This is particularly true for complex medical challenges such as Alzheimer’s Disease. CNNs have been extensively employed for their efficiency in capturing spatial information from imaging data. However, their shortcomings in grasping temporal dynamics can impede thorough analyses. Researchers have proposed various innovative solutions to enhance the efficacy of models in interpreting medical imaging. This section will examine studies on the architecture of transformers and CNNs.

2.1. Transformers in CNNs

Researchers have been exploring ML and DL methods to analyse 2D and 3D data, which has led to the development of transformer networks. Studies such as the Vision Transformer (ViT) [22] and the Data-efficient Image Transformer (DeiT) [23] have been influential in image classification, the Detection Transformer [24] for object detection, the Video Vision Transformer [25] for video processing, the Video Transformer Network [26] for video analysis, and transformer-based segmentation methods [27] for medical applications. Notably, ViT addressed the image classification challenge by segmenting images into non-overlapping patches and applying visual tokenisation to each patch. ViT demonstrated that a transformer model trained on massive datasets could achieve superior results in image analysis. However, ViT's performance declines when training data are insufficient due to its low inductive bias. DeiT [23] mitigated this problem by incorporating a regularisation and augmentation pipeline trained on ImageNet-1K. Furthermore, transformer methods [22] have been applied in medical image segmentation [28,29], 2D medical image classification, image denoising, and image reconstruction [30].
To harness the benefits of transformers, channel-wise attention mechanisms over CNN layers have been demonstrated in the field of computer vision [31,32,33]. Various ablation studies have shown that CNN-transformer hybrids achieve competitive performance in the computer vision domain compared to other network combinations, including multilayer perceptrons. These outcomes demonstrate that the combination of CNNs and transformers, each fulfilling different roles, can efficiently perform vision tasks.
Recent transformer-based and hybrid CNN-Transformer models used in medical imaging impose substantial computational demands. For instance, the Vision Transformer (ViT-B/16) comprises 86 million parameters and requires about 55 GFLOPs per image, while Swin-B contains 88 million parameters and 47 GFLOPs. Video-based transformers are even more resource-intensive: ViViT models reach 89 million parameters with 398–450 GFLOPs, and VTN includes 75 million parameters, requiring roughly 329 GFLOPs and ~45 ms inference per video on a Tesla V100 GPU. Even strong 3D CNN baselines, such as 3D-ResNet18 (33.3 M parameters, 109 GFLOPs) and 3D-DenseNet121 (7.98 M parameters, >100 GFLOPs), remain computationally heavy. These quantitative comparisons underscore the novelty of our approach, which delivers competitive performance through a substantially more lightweight 3D architecture.

2.2. Role of Activation Functions

DL has demonstrated considerable promise in medical imaging. However, existing approaches for AD detection face several technological limitations. One critical factor is the choice of activation function, which significantly influences how neural networks learn and represent complex patterns. Most current models rely on a single activation function throughout the network, which may constrain their capacity to capture the diverse and subtle structural variations associated with neurodegeneration. Furthermore, the lack of interpretable attention mechanisms limits clinical usability, as it becomes difficult to identify the brain regions that most influence the model’s diagnostic decisions [34,35]. This lack of transparency presents a barrier to clinical trust and hinders integration into real-world diagnostic workflows. These limitations underscore the need for more advanced architectures that enhance diagnostic accuracy while offering interpretable insights that are consistent with clinical reasoning.
Although channel-attention mechanisms improve CNN sensitivity to informative feature channels, they do not address the limited nonlinear diversity inherent in single-activation networks. Similarly, hybrid CNN-Transformer architectures enhance global context modelling but remain computationally expensive and dependent on large annotated datasets. A major challenge in Alzheimer's neuroimaging is that no existing AD classification framework simultaneously enhances channel-wise feature discrimination and enriches activation-level diversity. This gap is critical, as Alzheimer-related structural changes are subtle and heterogeneous, requiring both selective feature amplification and diverse nonlinear transformations. The proposed attention-driven convolution block (ADCB) introduces lightweight 3D channel attention optimised for MRI volumes, while the MAF block incorporates parallel activation branches (ReLU, SiLU, GELU) to capture complementary feature patterns. This combination addresses a previously unaddressed gap by jointly improving attention-driven feature selection and nonlinear expressiveness within a computationally efficient architecture.

2.3. Hybrid and Ensemble Architectures

Various DL models have been developed for AD classification to create efficient diagnostic systems [36]. Jang et al. [37] proposed an innovative approach for the classification of 3D MRI data, integrating CNNs with transformer architectures. In their method, the 3D MRI images were initially processed using a 3D CNN to extract relevant features. Subsequently, the extracted features were passed through transformer blocks, with each block considering all the 2D slices of the 3D image. Jo et al. [38] evaluated various DL techniques for AD classification using multimodal neuroimaging data by combining recent ML models for classification with a stacked auto-encoder for feature selection. The best performance was observed when multimodal neuroimaging and fluid biomarkers were combined. However, the study emphasises that DL approaches are still evolving and must incorporate more hybrid data types and explainable methods to improve transparency and understanding of disease mechanisms.
Mora-Rubio et al. [39] discuss various deep learning architectures, including EfficientNet, DenseNet, and Vision Transformer, that have been employed to classify MRI scans across different AD stages. EfficientNet, in particular, is known for achieving better classification performance with fewer features by simultaneously scaling the network’s depth, width, and resolution. The study reports detection percentages for different stages of AD, achieving approximately 89% for AD vs. Control, 80% for Late MCI vs. Control, 66% for MCI vs. Control, and 67% for Early MCI vs. Control. Similarly, Ravi et al. [40] utilise the Alzheimer’s Disease Neuroimaging Initiative (ADNI) sMRI dataset to classify AD stages using deep learning algorithms, particularly Convolutional Neural Networks (CNNs). The best-performing model, ResNet-50v2, achieved an accuracy of 91.84%.
Ma et al. [41] conducted research on AD, exploring the use of a deep Q-network (DQN) to identify AD patients through brain imaging data from 1360 subjects. Key features such as the amplitude of low-frequency fluctuation and fractional amplitude of low-frequency fluctuation were analysed, and a DQN classifier was trained to differentiate between AD patients and healthy controls. The results showed a high accuracy of 86.66%, indicating that DQN could effectively assist in diagnosing AD by analysing local brain activity. Conversely, Francis et al. [42] proposed a model that leverages a Squeeze-and-Excitation Network (SENet) designed to enhance feature extraction by focusing on channel interdependencies. Necessary preprocessing techniques, such as skull removal and image normalisation, were employed to improve image quality, enabling more accurate feature extraction. This approach ensures that the network prioritises the most significant features for classification.
Suh et al. [43] developed a deep learning algorithm using 3D T1-weighted magnetic resonance images for brain segmentation and AD classification. The approach involved a CNN and an XGBoost classifier, which showed strong diagnostic capabilities. Feng et al. [44] applied 3D-CNN and 3D-CNN-SVM models to MRI data for AD classification. The 3D-CNN-SVM model demonstrated superior performance in classification tasks. Li et al. [45] explored the use of 4D fMRI data for AD detection with a C3d-LSTM model, which combined 3D CNNs for spatial feature extraction and LSTM for temporal information capture, yielding better results compared to methods that use only 2D or 3D fMRI data.

2.4. Advanced Deep Learning Techniques

Attention mechanism techniques were applied by George et al. [46]. This work presents a unique 3D CNN framework incorporating attention mechanisms to classify Alzheimer's disease from MRI scans. The proposed model utilises both channel and spatial attention mechanisms to enhance feature extraction, achieving an average precision of 79% across three classes and 87% for distinguishing AD from the other classes. At the same time, Wen et al. [47] reviewed more than 30 studies using CNNs to classify AD from anatomical magnetic resonance imaging. The review identified four main approaches: 2D slice-level, 3D patch-level, ROI-based, and 3D subject-level CNNs. The paper presented an open-source framework for thorough assessment and underlined the difficulties with reproducibility and data leakage in current research.
Zhang et al. [48] presented a densely connected CNN enhanced with a connection-wise attention mechanism for Alzheimer's disease classification. The study underscored the potential of attention mechanisms in improving classification performance while noting the challenges of high computational demands and the need for large datasets. Buvaneswari et al. [49] present a deep learning-based approach that uses SegNet to identify AD-relevant brain features from structural MRI (sMRI) and ResNet-101 for subsequent AD classification. The study is limited by the relatively small dataset and the potential for overfitting due to the high complexity of the models used.
An et al. [50] introduce a deep ensemble learning framework that integrates multiple DL models to classify AD. The framework ranks base classifiers using a deep belief network and features using sparse auto-encoders. Despite its promising results, the complexity and computational demands of the model limit its practical application in clinical settings. Similarly, Ortiz et al. [51] explore the use of deep belief networks (DBNs) trained on 3D patches of brain regions defined by the Automated Anatomical Labelling (AAL) atlas. The method also performed well in classifying subjects with MCI. However, the model's reliance on 3D patches and the complexity of the ensemble approach may pose challenges for real-time applications. Various other studies have used advanced architectures for the diagnosis of AD. Lim et al. [52] propose a multiclass classification method using 3D T1-weighted brain MRI images and a CNN built from scratch, alongside VGG-16 and ResNet-50 models, emphasising the potential of CNNs for early-stage AD detection; one challenge is the computational load of training deep CNNs and the need for very large datasets. Liu et al. [53] propose a multitask CNN model for joint hippocampal segmentation and AD classification, demonstrating that the multitask approach outperforms single-model methods; however, two major limitations are the complexity of the multitask model and the need for large computational resources. Yee et al. [54] proposed a 3D CNN model for classifying AD using MRI images. Training and evaluation on 1500 MRI images taken from the ADNI database showed that DL models are well suited to early-stage AD diagnosis. However, such models suffer from generalisation issues due to demographic homogeneity.

2.5. Challenges in Medical Imaging

Accurately classifying medical images presents significant challenges due to the complexities involved in obtaining medical datasets [55]. Unlike other types of datasets, medical datasets are compiled by experts and include sensitive patient information that cannot be publicly shared. For example, the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [56] and the OASIS [57] require researchers to submit an application and agree to specific terms before they can access the datasets. This restriction ensures that the data are used solely for research purposes [58,59,60,61,62]. Furthermore, medical datasets are often highly imbalanced because it is challenging to collect equal samples from patients with and without diseases. Addressing this imbalance poses considerable challenges [63,64,65,66].
Table 2 compares state-of-the-art classification models for Alzheimer’s disease, demonstrating that both conventional deep learning and transfer learning can achieve high precision even with unbalanced datasets.

3. Proposed Methodology

This section describes the detailed flow and evaluation metrics used in this study. The methodology is visually presented in Figure 2.
As illustrated in Figure 2, the methodology diagram presents the overall workflow of the proposed model. It begins with data preprocessing to prepare the MRI inputs, followed by a series of learning and feature extraction stages using attention-driven convolutional blocks. These extracted features are then refined through a multi-activation fusion process to capture diverse representations.

3.1. Dataset Overview

The dataset utilised in this study was obtained from the ADNI. The training dataset comprised 1175 MRI samples, with the class-wise distribution detailed in Table 3. All scans were acquired using a 1.5 Tesla (1.5 T) MRI scanner to ensure uniform imaging conditions. Restricting the dataset to a single field strength avoids the systematic variability introduced when mixing 1.5 T and 3 T acquisitions, thereby providing more homogeneous imaging conditions and supporting reliable model training and evaluation. The independent ADNI test set consisted of 117 cases, including 68 cognitively normal (CN) subjects and 49 AD patients. This separation ensured a clear distinction between training and testing cohorts, thereby supporting an unbiased evaluation of the model’s generalisation capability. Patients in the dataset range in age from 56 to 91. Figure 3 shows the age distribution with respect to the group. This study used publicly available, fully anonymised MRI data from ADNI. No new data were collected for this research.

3.2. Image Pre-Processing

The collected dataset contained images with skull information, which is irrelevant for Alzheimer’s classification. Variations in depth, height, and width were observed across the images. To address these challenges, preprocessing techniques similar to those described in [37] were implemented. Skull stripping was applied to the entire dataset using the high-definition brain extraction tool (HD-BET) [69], which generated a mask to identify brain tissues. By overlaying this mask on the original images, the skull regions were effectively removed.
Subsequently, all images were standardised to the same voxel spacing of $1.75\,\mathrm{mm} \times 1.75\,\mathrm{mm} \times 1.75\,\mathrm{mm}$ and resized to a matrix size of $128 \times 128 \times 128$. The voxel intensities were normalised through the zero-mean unit-variance approach to achieve uniformity across the dataset; this normalisation was applied independently to each split. A 3D CNN block was then applied to the preprocessed input data. This network accepts input volumes of size $128 \times 128 \times 128$ and extracts 3D feature representations with $C$ channels, where $C$ denotes the number of feature channels, through convolutional operations, as shown in Figure 4.
Although ADNI MRI scans can be affected by scanner-related noise, motion artefacts, and variability in acquisition protocols, the preprocessing pipeline was designed to reduce these effects. Voxel-wise normalisation and uniform resampling help minimise intensity fluctuations and geometric inconsistencies across subjects. In addition, the channel-specific attention mechanism can down-weight unstable or noise-dominated feature channels, improving robustness to residual artefacts in the ADNI dataset.
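For concreteness, the resampling and intensity-normalisation steps above can be sketched as follows. This is a minimal illustration assuming each skull-stripped scan is available as a NumPy array with known voxel spacing; the function name and the centre-crop/pad strategy are assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(volume, spacing, target_spacing=1.75,
                      target_shape=(128, 128, 128)):
    """Resample to 1.75 mm isotropic voxels, fit to 128^3, normalise intensities."""
    # Resample so each voxel measures target_spacing mm along every axis.
    factors = [s / target_spacing for s in spacing]
    volume = zoom(volume, factors, order=1)

    # Centre-crop or zero-pad each axis to the fixed 128x128x128 grid.
    out = np.zeros(target_shape, dtype=np.float32)
    src, dst = [], []
    for dim, tgt in zip(volume.shape, target_shape):
        n = min(dim, tgt)
        src.append(slice((dim - n) // 2, (dim - n) // 2 + n))
        dst.append(slice((tgt - n) // 2, (tgt - n) // 2 + n))
    out[tuple(dst)] = volume[tuple(src)]

    # Zero-mean, unit-variance normalisation (statistics computed per volume
    # here; the paper applies the normalisation independently to each split).
    return (out - out.mean()) / (out.std() + 1e-8)
```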

3.3. Data Split

To ensure a reliable evaluation and prevent any data leakage, the dataset was split into subsets before training using a strict patient-based strategy. Of the total 1175 MRI scans, 10% (117 scans) were allocated to the validation set, and another 10% (117 scans) were reserved as an independent test set. The remaining 941 scans were used to train the model. Across these splits, the training set contained 265 unique subjects, the validation set contained 30 unique subjects, and the test set contained 33 unique subjects. Because ADNI includes multiple timepoints and repeated acquisitions for many participants, the number of MRI scans is larger than the number of subjects in each subset. This patient-based splitting approach ensured that all scans belonging to the same subject were assigned to a single subset only, preventing any overlap between the training, validation, and test sets and guaranteeing subject-level independence. The validation set was used during training to optimise model parameters and reduce overfitting, while the independent test set, comprising completely unseen subjects, was used only once for the final evaluation. This strategy provides an unbiased assessment of the model's generalisation capability. The same strict patient-based assignment was also applied during the 10-fold cross-validation, ensuring that all scans from each subject were always placed within a single fold.
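The subject-level split can be reproduced in spirit with scikit-learn's group-aware splitters, where the grouping key is the subject ID so that repeated scans of one participant never straddle subsets. A minimal sketch, assuming a `subject_ids` array with one entry per scan; note that `test_size` here is a fraction of subjects, so scan counts only approximate the 10% figures above, and `GroupKFold` can be used analogously for the 10-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_based_split(subject_ids, seed=42):
    idx = np.arange(len(subject_ids))
    # Hold out ~10% of subjects (all of their scans) as the test set.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=seed)
    train_val, test = next(outer.split(idx, groups=subject_ids))
    # From the remainder, hold out roughly another 10% of the original
    # data as the validation set (~0.111 of the remaining subjects).
    inner = GroupShuffleSplit(n_splits=1, test_size=0.111, random_state=seed)
    train, val = next(inner.split(train_val, groups=subject_ids[train_val]))
    return train_val[train], train_val[val], test

# Sanity check on toy IDs: no subject may appear in more than one subset.
subject_ids = np.array(["s01", "s01", "s02", "s03", "s03", "s04", "s05",
                        "s06", "s07", "s08", "s09", "s10", "s10", "s11"])
tr, va, te = patient_based_split(subject_ids)
assert set(subject_ids[tr]).isdisjoint(subject_ids[va])
assert set(subject_ids[tr]).isdisjoint(subject_ids[te])
assert set(subject_ids[va]).isdisjoint(subject_ids[te])
```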

3.4. Proposed Model Architecture

The detailed model architecture is visually presented in Figure 4, providing an overview of the attention-driven 3D CNN and its key components. This research proposes a novel CNN model, trained from scratch and centred on channel-wise attention mechanisms, to perform accurate Alzheimer's disease classification. Along with its superior performance, the model is highly optimised compared to other architectures. The proposed architecture consists of five attention-driven convolutional blocks (ADCB), where each ADCB uses a Gaussian error linear unit (GELU) activation with a kernel size of (3, 3, 3) at each 3D-convolution layer. Furthermore, the architecture includes a multi-activation fusion block and a sigmoid activation function in the output layer. Moreover, L2 regularisation was applied to the kernels of each layer, with a lambda value of 0.02.
The proposed model for AD classification is detailed in terms of its network architecture, and a comprehensive model summary with parameter counts is presented in Table 4. Additionally, the hyperparameters used for effective training of the model are presented in Table 5.
The combination of attention mechanisms and MAF plays a pivotal role in enhancing the model’s ability to distinguish subtle structural changes, including hippocampal atrophy and cortical thinning in MRI volumes. The channel-wise attention mechanisms enable the network to selectively focus on clinically significant brain regions, such as the hippocampus and cortex, by assigning higher weights to the most informative feature channels. Meanwhile, the MAF block enriches the learned representations by processing the same feature vector through multiple activation functions—ReLU, SiLU, and GELU—with each function capturing distinct nonlinear patterns. This design ensures that spatially localised patterns and diverse nonlinear characteristics are emphasised, resulting in improved robustness and classification performance in heterogeneous patient populations.

3.5. Attention-Driven Convolution Block (ADCB)

The ADCB block employs a series of layers for feature extraction. After the 3D-CNN layer extracts the features, the results are fed into an attention block. Here, GAP is applied to compute a vector, which is then multiplied by each channel. This process increases the weight of the channels that the attention mechanism focuses on more. The output is subsequently passed through a batch normalisation layer, followed by a max-pooling layer. A detailed explanation of each layer’s structure is provided below.

3.5.1. 3D Convolutional Neural Network (3D CNN)

To extract features from 3D images, a 3D-CNN layer was utilised. Each MRI $I \in \mathbb{R}^{L \times W \times H}$ is processed by the 3D CNN layer, where $L$, $W$, and $H$ represent the length, width, and height of the input image, respectively. The layer $D_{3D}: \mathbb{R}^{L \times W \times H} \rightarrow \mathbb{R}^{C_{3D} \times L \times W \times H}$ is designed with multiple layers of $3 \times 3 \times 3$ convolution kernels, along with GELU activation functions.
The spatial features extracted through this process retain the structure of the input image, ensuring the preservation of volumetric information. After applying the 3D CNN layer, the extracted 3D feature $X \in \mathbb{R}^{C_{3D} \times L \times W \times H}$ is obtained, where $C_{3D}$ is the number of feature channels.
This block helps to capture spatial hierarchies in three dimensions, which are crucial for distinguishing between complex patterns in Alzheimer’s MRI scans. Figure 4 illustrates the 3D CNN architecture used in this study.

3.5.2. Channel Attention

The channel attention mechanism enhances feature representation by emphasising the most relevant feature channels within the 3D feature maps. Following feature extraction from the 3D-CNN layer, GAP is first applied to aggregate spatial information and produce a channel descriptor vector $z \in \mathbb{R}^{C}$. This descriptor captures the global contextual importance of each feature channel.
To model nonlinear interactions between channels, the aggregated vector is passed through two fully connected (FC) layers with a bottleneck structure similar to the Squeeze-and-Excitation (SE) block [70]. The first FC layer reduces the dimensionality by a ratio $r$ (set to 8 in this study), followed by a ReLU activation to introduce nonlinearity, and the second FC layer restores the original channel dimension. Finally, a sigmoid activation is applied to generate the channel attention weights $A \in \mathbb{R}^{C}$:
$A = \sigma(W_2 \cdot \delta(W_1 \cdot z))$
where $z$ is the GAP output, $W_1$ and $W_2$ are the learnable parameters of the two FC layers, $\delta(\cdot)$ represents the ReLU activation, and $\sigma(\cdot)$ denotes the sigmoid function.
The resulting attention weights $A$ are used to recalibrate the original feature map $F$ through channel-wise multiplication:
$F' = A \cdot F$
where $\cdot$ represents element-wise multiplication between the attention weights and the feature maps.
Compared to classic channel attention mechanisms such as SE-Net, the proposed ADCB implementation integrates this lightweight yet expressive channel attention directly into the 3D convolutional pipeline. This design enables efficient parameter usage while maintaining the ability to capture nonlinear inter-channel dependencies critical for distinguishing subtle anatomical variations in MRI data associated with Alzheimer’s disease.
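A minimal PyTorch sketch of this channel attention, assuming the SE-style bottleneck described above (GAP, an FC pair with reduction ratio r = 8, sigmoid gating, then channel-wise rescaling); it illustrates the mechanism rather than reproducing the authors' exact code.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """SE-style channel attention for 3D feature maps: F' = A * F."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)               # z = GAP(F)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: bottleneck by ratio r
            nn.ReLU(inplace=True),                       # delta: ReLU nonlinearity
            nn.Linear(channels // reduction, channels),  # W2: restore dimension
            nn.Sigmoid(),                                # sigma: attention weights A
        )

    def forward(self, f):
        b, c = f.shape[:2]
        z = self.gap(f).view(b, c)                       # channel descriptor, shape (B, C)
        a = self.fc(z).view(b, c, 1, 1, 1)               # weights broadcast over space
        return a * f                                     # channel-wise recalibration

# Example: recalibrate a batch of 32-channel volumetric feature maps.
attn = ChannelAttention3D(32)
features = torch.randn(2, 32, 16, 16, 16)
print(attn(features).shape)  # torch.Size([2, 32, 16, 16, 16])
```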

3.5.3. 3D Batch Normalization (3D-BN)

To enhance and expedite the training process, 3D-BN was applied following each 3D convolution layer. The output of the 3D convolution layer was normalised using the BN layer for each minibatch by adjusting the mean and variance of the feature maps accordingly.
$\hat{X} = \dfrac{X - \mu}{\sqrt{\sigma^{2} + \epsilon}}$
where $\sigma^{2}$ and $\mu$ represent the variance and mean of the feature maps, and $\epsilon$ is a small constant that prevents division by zero. After normalisation, the learnable scaling ($\gamma$) and offset ($\beta$) parameters are applied, enabling the network to adjust the normalised output as needed.
The use of 3D batch normalisation reduces internal covariate shift and makes the network less sensitive to initialisation, thereby improving generalisation.

3.5.4. 3D MaxPooling

3D max-pooling layers were employed to shrink the spatial dimensions and preserve only the most significant features. The operation essentially down-samples the data by choosing the maximum value from non-overlapping 2 × 2 × 2 sections within the feature maps.
By preserving the most significant features and reducing computational load, the 3D max-pooling operation contributes to hierarchical feature extraction while minimising overfitting.

3.6. Global Average Pooling (GAP)

GAP was utilised as a down-sampling technique to aggregate spatial information across the feature maps. For a feature map $F \in \mathbb{R}^{C \times H \times W \times D}$, GAP computes the mean of all spatial elements for each channel:
$z_c = \dfrac{1}{HWD} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{D} F_{c,i,j,k}$
where $z_c$ is the aggregated value for channel $c$. The resulting vector $Z \in \mathbb{R}^{C}$ serves as a condensed global descriptor that is both computationally efficient and effective in reducing spatial redundancy, as depicted in Figure 5.

3.7. Activation Functions

3.7.1. Rectified Linear Unit (ReLU)

Among the most frequently employed activation functions, ReLU is known for its simplicity and effectiveness. It is defined as:
$f(x) = \max(0, x)$
Here, $x$ is the activation function's input. ReLU gives the model nonlinearity so that it may learn complex relationships between inputs and outputs. ReLU also guarantees that gradients stay non-zero for positive input values, helping to prevent the vanishing gradient problem. However, it can suffer from the "dying ReLU" problem, in which some neurons become permanently inactive if they consistently output zero during training.

3.7.2. Sigmoid Linear Unit (SiLU)

Often referred to as the Swish activation function, SiLU can be described as:
$f(x) = x \cdot \sigma(x)$
where $\sigma(x)$ is the sigmoid function:
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$
SiLU combines the properties of both the sigmoid and linear functions. This results in smooth, non-monotonic behaviour that enhances gradient flow during training, especially in biomedical imaging. Research has shown that this activation function performs well across a variety of deep learning tasks by facilitating better feature representation and improving learning dynamics compared to other activation functions.

3.7.3. Gaussian Error Linear Unit (GELU)

The GELU function is a smooth approximation of the ReLU function with stochastic regularisation properties. GELU is mathematically written as:
$f(x) = x \cdot \Phi(x)$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution:
$\Phi(x) = \dfrac{1}{2}\left[1 + \operatorname{erf}\!\left(\dfrac{x}{\sqrt{2}}\right)\right]$
Alternatively, GELU can be approximated for computational efficiency as:
$f(x) = 0.5\,x\left[1 + \tanh\!\left(\sqrt{\dfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right]$
GELU allows for smooth activation that retains input values based on their significance, unlike ReLU, which truncates all negative values to zero. This property makes GELU especially effective in transformer-based architectures and large-scale models.
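The exact and tanh-approximated forms above agree closely; a quick numerical check, assuming nothing beyond the two formulas just given:

```python
import math

def gelu_exact(x):
    # f(x) = x * Phi(x), with Phi the standard normal CDF via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation quoted above.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
# The two forms typically differ by less than 1e-3 over this range.
```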
The hyperparameters used in the proposed model were selected through a controlled validation-based tuning procedure. Each hyperparameter was explored within a predefined range informed by common practice in 3D CNN and attention-based architectures. Specifically, the L2-regularisation coefficient $\lambda$ was evaluated over the range {0.001, 0.005, 0.01, 0.02, 0.05}, with $\lambda = 0.02$ providing the best trade-off between stability and overfitting control. The channel-attention expansion factor was tested over {4, 6, 8, 10}, and the value of 8 offered the most consistent convergence behaviour.
The initial learning rate of 0.01 in Table 5 represents the starting point of a cosine-annealing schedule. During fine-tuning, this learning rate was gradually reduced to a minimum of $1 \times 10^{-5}$, allowing the model to refine its weights with minimal oscillation. Other hyperparameters (batch size, optimiser parameters, activation combinations) were tested within narrow ranges and fixed once optimal validation performance was observed. This systematic tuning process ensured that all selected values reflected validation-driven optimisation rather than arbitrary choices.
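As a sketch, the schedule described above maps directly onto PyTorch's built-in cosine-annealing scheduler; the optimiser and the number of epochs (`T_max`) used here are illustrative assumptions, while only the initial rate (0.01) and the floor (1e-5) are taken from the text.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder model parameters
optimizer = torch.optim.SGD(params, lr=0.01)           # optimiser choice is assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)                # decay 0.01 -> 1e-5 over T_max epochs

for epoch in range(100):
    # ... forward/backward pass and optimizer.step() for one epoch ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])                 # ~1e-5 at the end of training
```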

3.8. Multi-Activation Fusion (MAF) Block

The MAF block improved the network’s capacity to record various characteristic features. Three FC layers processed the GAP feature vector. Each layer used a different activation function: GELU, SiLU, and ReLU. These activation functions were selected for their ability to capture nonlinear relationships in the data.
The outputs from the three dense layers were concatenated to form a unified feature vector:
$F_{\mathrm{MAF}} = \operatorname{Concat}(F_{\mathrm{GELU}}, F_{\mathrm{SiLU}}, F_{\mathrm{ReLU}})$
This fusion technique utilises the strengths of every activation function. It offers a more expressive representation of the MRI data. Before classification, an FC layer further reduced the fused vector’s dimensions. A dropout layer was added to prevent overfitting.
The MAF component integrates multiple activation functions (GELU, ReLU, and SiLU) to capture complementary nonlinear transformations in MRI data, enriching feature representations and enhancing diagnostic accuracy. This fusion mitigates the limitations of any single activation function. For example, ReLU introduces sparsity but suppresses all negative values, whereas GELU and SiLU retain graded responses in both positive and negative domains. This diversity improves the network’s robustness and generalisation, particularly in detecting subtle regional variations that are characteristic of early-stage AD.
The GELU activation is probabilistic, weighting each input by the Gaussian cumulative distribution function. Unlike deterministic activations such as ReLU and SiLU, GELU preserves small negative or near-zero values with probability proportional to their magnitude. This probabilistic smoothing allows weak structural patterns—often diffuse in the early stages of Alzheimer’s disease—to be retained rather than abruptly discarded. Although GELU does not directly enhance interpretability in the same way as attention mechanisms, its smoother activation profile yields more stable and anatomically consistent feature responses, indirectly supporting interpretability in MRI-based analysis.
Overall, this fusion strategy broadens the expressive capacity of the model by combining activation functions with distinct gradient behaviours. Their parallel application expands the effective function space the network can represent, improving its ability to learn subtle structural variations in brain MRI data. This contributes to more stable convergence and higher classification accuracy. The MAF block, therefore, spans sparse (ReLU), smooth (SiLU), and probabilistic (GELU) activation regimes, promoting robust learning under heterogeneous imaging conditions.
While conceptually analogous to multi-headed attention in its aim to capture diverse aspects of feature representations, the MAF block differs fundamentally in mechanism. Multi-headed attention partitions feature embeddings into subspaces and learns attention weights to model contextual relationships. In contrast, MAF enhances nonlinearity diversity by applying multiple activation functions to the same feature vector, thereby enriching representational capacity without introducing attention parameters or inter-feature dependencies. Thus, MAF focuses on functional diversity across activations, whereas multi-headed attention emphasises relational diversity across features.
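A minimal PyTorch sketch of the MAF block as described: three parallel FC branches over the same GAP vector, one per activation, concatenated and then reduced by a further FC layer with dropout. The branch width is an assumption; the 20% dropout rate follows the value reported later in the paper.

```python
import torch
import torch.nn as nn

class MultiActivationFusion(nn.Module):
    """F_MAF = Concat(F_GELU, F_SiLU, F_ReLU), then FC reduction + dropout."""

    def __init__(self, in_features, branch_features=128, dropout=0.2):
        super().__init__()
        self.gelu_branch = nn.Sequential(nn.Linear(in_features, branch_features), nn.GELU())
        self.silu_branch = nn.Sequential(nn.Linear(in_features, branch_features), nn.SiLU())
        self.relu_branch = nn.Sequential(nn.Linear(in_features, branch_features), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Linear(3 * branch_features, branch_features),  # reduce fused vector
            nn.Dropout(dropout),                              # mitigate overfitting
        )

    def forward(self, z):
        fused = torch.cat([self.gelu_branch(z),
                           self.silu_branch(z),
                           self.relu_branch(z)], dim=-1)      # F_MAF = Concat(...)
        return self.fuse(fused)

# Example: fuse a batch of 512-dimensional GAP descriptors.
maf = MultiActivationFusion(512)
print(maf(torch.randn(4, 512)).shape)  # torch.Size([4, 128])
```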

3.9. Implementation Details

The decision to employ five convolutional attention (ADCB) blocks was guided by empirical evaluation as well as the inherent spatial constraints of the 3D MRI volumes (128 × 128 × 128). Each block includes a convolution and a 3D max-pooling operation that halves the spatial resolution, meaning that after five sequential pooling stages, the feature map reaches a size of 2 × 2 × 2 with 512 feature channels. This represents the deepest feasible spatial compression before collapsing to non-viable dimensions (e.g., 1 × 1 × 1) or losing meaningful volumetric context.
Preliminary experiments with three and four blocks showed that shallower configurations preserved larger spatial grids but failed to capture high-level structural dependencies relevant to Alzheimer’s pathology, producing weaker discrimination between AD and non-AD patterns. Furthermore, reducing the architecture to four blocks was found to be computationally unfavourable due to the excessively large tensor size produced before the dense layers. Specifically, an architecture with four blocks yields a 2 × 2 × 2 × 256 feature map, which, when flattened and multiplied with the subsequent 512-unit dense layer, results in 28,311,552 parameters—substantially increasing memory consumption and training cost without corresponding gains in performance. This parameter explosion not only makes the four-block configuration inefficient but also raises the risk of overfitting.
Conversely, extending the model to six blocks was not possible because the additional pooling step would reduce the spatial dimension below 2 × 2 × 2, causing the tensor to collapse and preventing stable feature extraction. Therefore, five blocks provided the optimal balance between depth, representational capacity, and computational feasibility for 3D MRI-based AD classification.
On the other hand, the use of a uniform $3 \times 3 \times 3$ kernel across all convolutional layers was motivated by its strong performance in 3D medical imaging tasks and by theoretical advantages demonstrated in the volumetric CNN literature. A $3 \times 3 \times 3$ kernel provides the smallest receptive field capable of capturing local anatomical variations while maintaining a manageable number of trainable parameters, which is crucial for end-to-end training without pre-training. Larger kernels (e.g., $5 \times 5 \times 5$) were avoided because they dramatically increase computational load and risk over-smoothing the fine-grained structural cues characteristic of AD-related atrophy. Smaller kernels (e.g., $1 \times 1 \times 1$), although beneficial for channel mixing, are insufficient for modelling spatial continuity in 3D neuroimaging. Using a consistent kernel size also stabilises optimisation and ensures that later layers, especially after multiple down-sampling stages, maintain a consistent and interpretable receptive field relative to the original MRI volume. Thus, the $3 \times 3 \times 3$ kernel choice offers an effective compromise between spatial context, computational efficiency, and model generalisability.
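Putting the pieces together, the five-block design above can be sketched end to end. With unpadded $3 \times 3 \times 3$ convolutions and five $2 \times 2 \times 2$ poolings, a $128^3$ input reduces to the $2 \times 2 \times 2 \times 512$ map described earlier. The per-block channel widths are assumptions consistent with that final width, and the sketch reuses the `ChannelAttention3D` and `MultiActivationFusion` classes from the earlier snippets; it illustrates the layer ordering, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ADCB(nn.Module):
    """Attention-driven convolution block: conv -> GELU -> channel attention -> BN -> pool."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3)  # unpadded 3x3x3 kernel
        self.act = nn.GELU()
        self.attn = ChannelAttention3D(out_ch, reduction=8)
        self.bn = nn.BatchNorm3d(out_ch)
        self.pool = nn.MaxPool3d(2)                          # 2x2x2 down-sampling

    def forward(self, x):
        return self.pool(self.bn(self.attn(self.act(self.conv(x)))))

class ADMAFNet(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256, 512)):     # assumed channel doubling
        super().__init__()
        blocks, in_ch = [], 1
        for w in widths:
            blocks.append(ADCB(in_ch, w))
            in_ch = w
        self.features = nn.Sequential(*blocks)              # 128^3 -> 2^3 x 512
        self.gap = nn.AdaptiveAvgPool3d(1)
        self.maf = MultiActivationFusion(widths[-1])
        self.head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, 1, 128, 128, 128)
        z = self.gap(self.features(x)).flatten(1)           # (B, 512)
        return self.head(self.maf(z))                       # AD probability

model = ADMAFNet()
print(model(torch.randn(1, 1, 128, 128, 128)).shape)       # torch.Size([1, 1])
```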

4. Results and Discussion

The experimental setup utilised a high-performance personal computer to execute and evaluate the proposed model. The system was configured with two Intel Xeon 2687W v4 CPUs, providing significant computational power for parallel processing, supported by 64 GB of RAM to handle the substantial memory requirements of deep learning operations. Additionally, the graphical computations were accelerated using an NVIDIA RTX-3090 GPU with 24 GB of dedicated VRAM, ensuring efficient handling of complex 3D convolutional operations and large-scale data.
A test set, created by splitting the dataset before training, was used for model evaluation. This method guarantees that the test data remain unseen during the training process, offering an objective evaluation of the model's generalisability and performance. The assessment focused on several performance criteria so as to capture the efficiency and dependability of the model from multiple angles.
Using several evaluation criteria, the aim was to comprehensively benchmark the model's performance and pinpoint its strengths and potential limitations. These measures highlight the model's capacity to balance sensitivity, precision, and overall classification accuracy, providing a complete picture of its predictive qualities. The analysis of these results is central to establishing the success and dependability of the model's training, as well as its applicability to real-world Alzheimer's disease detection scenarios.
The main metrics used to assess the classifier are discussed below, together with their relevance and how they help to understand the performance of the model.

4.1. Accuracy

Accuracy is the first consideration when examining the performance of a classification model. Although accuracy is useful for an initial assessment, other metrics, such as precision, recall, and the F1-score, should also be taken into account to fully appreciate the model's performance. Accuracy is computed as the ratio of correctly classified instances to the total number of instances in the dataset, as shown in Equation (12).
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
  • True Positive (TP): The count of positive cases the model correctly identified as positive.
  • True Negative (TN): The count of negative cases the model correctly categorised as negative.
  • False Positive (FP): The count of negative cases the model misinterpreted as positive.
  • False Negative (FN): The count of positive cases the model misclassified as negative.

4.2. Precision

Precision is critical for evaluating the reliability of a model's positive predictions. It is computed as the ratio of TP predictions to the overall count of positive predictions, encompassing both TP and FP. A high precision score indicates a low rate of false-positive errors, reflecting the dependability of the model when producing positive identifications.
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
If the precision is equal to 1, it signifies perfect accuracy in recognising positive samples, meaning the model correctly identifies all positive instances without misclassifying any negative samples as positive.

4.3. Recall

Recall, often known as sensitivity or the true positive rate, is a crucial indicator when assessing a model's performance in identifying positive instances. It reflects the model's capacity to correctly detect TP by measuring the percentage of actual positive cases the model identifies. Although it ignores FP, a high recall value shows that the model correctly classifies most positive instances.
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
This metric is especially significant in scenarios where the primary objective is to capture as many TP as possible, even if it results in more FP.

4.4. Area Under Curve (AUC)

Another metric for evaluating model performance is the AUC. Mathematically, it can be expressed as:
$\mathrm{AUC} = \displaystyle\int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(t)\right)\, dt$
In this context, FPR signifies the false positive rate, and $\mathrm{FPR}^{-1}(t)$ refers to the inverse of the false positive rate at threshold $t$. The AUC score ranges from 0 to 1, with higher values indicating better model performance.
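For reference, all of the metrics above can be computed with scikit-learn; a small sketch on hypothetical labels and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])            # hypothetical ground truth
y_prob = np.array([0.92, 0.10, 0.75, 0.40, 0.20, 0.65, 0.88, 0.05])
y_pred = (y_prob >= 0.5).astype(int)                   # threshold sigmoid outputs

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))     # AUC uses probabilities, not labels
```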

4.5. Loss Function

The loss function used in this model is the weighted binary cross-entropy (BCE). It is typically used for binary classification tasks and penalises incorrect predictions based on their probability. The weight assigned to each class can be adjusted to address class imbalance, giving greater importance to minority class samples. The weighted BCE loss is mathematically defined as:
$\mathrm{Loss} = -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \left[ w_1\, y_i \log(p_i) + w_2\, (1 - y_i) \log(1 - p_i) \right]$
In this context, $N$ represents the number of samples in the batch, $y_i$ refers to the true label of each sample, and $p_i$ denotes the predicted probability of class 1. The weights for the positive and negative classes are represented by $w_1$ and $w_2$, respectively. This function helps the model prioritise accurate predictions for the underrepresented class, especially in situations of class imbalance.
In this study, the class weights were computed based on the inverse of class frequencies in the training dataset to counteract the imbalance between AD and CN samples. Specifically, the weights were calculated as:
$w_c = \dfrac{N}{2 \times N_c}$
where $N_c$ denotes the number of samples in class $c$ and $N$ is the total number of training samples. This formulation ensures that both classes contribute equally during optimisation, despite the unequal number of samples. Preliminary experiments confirmed that these analytically derived weights achieved stable convergence without the need for additional empirical tuning. This weighting improves sensitivity to Alzheimer's cases while maintaining balanced overall performance.
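A minimal PyTorch sketch of the weighted loss and the inverse-frequency weights defined above, assuming sigmoid outputs `p` and binary labels `y`; the class counts used here are hypothetical.

```python
import torch

def class_weights(n_pos, n_neg):
    n = n_pos + n_neg
    return n / (2 * n_pos), n / (2 * n_neg)            # w_c = N / (2 * N_c)

def weighted_bce(p, y, w1, w2):
    eps = 1e-7                                         # guard against log(0)
    p = p.clamp(eps, 1 - eps)
    return -(w1 * y * torch.log(p)
             + w2 * (1 - y) * torch.log(1 - p)).mean()

w1, w2 = class_weights(n_pos=400, n_neg=541)           # hypothetical AD/CN scan counts
p = torch.tensor([0.90, 0.30, 0.70])                   # predicted probabilities
y = torch.tensor([1.0, 0.0, 1.0])                      # true labels
print(weighted_bce(p, y, w1, w2))
```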

4.6. Self-Comparisons and Model Performance Analysis

The proposed model involved an iterative process of tuning multiple hyperparameters and carefully analysing its performance at each step. During this process, all intermediate performance values were computed exclusively on the validation set, and the independent test set was not accessed until the final model was completed. Initially, the model was trained on a simple network configuration, yielding a validation precision of 100% but a recall of only 2.35%. This stark disparity highlighted issues such as data imbalance and the need for a more suitable architecture aligned with the dataset’s nuances.
To address these challenges, the input data was refined by removing skull regions using the HD-BET tool [69]. This preprocessing step was followed by data augmentation techniques, including rotations (±15°), shear transformations (shear ratio: 0.1), and translations (±5 pixels). The architecture was then restructured using a hybrid strategy combining CNNs and vision transformers, incorporating five 3D-CNN blocks with channel- and pixel-level attention, batch normalisation, and 3D max-pooling (MaxPool3D) layers.
These modifications improved the validation accuracy to 75.12%, which further increased to 76.35% with augmented data. However, training stability remained an issue, as the loss curve exhibited significant fluctuations and poor convergence. To mitigate this, a “reduce-on-plateau” learning rate scheduler was employed. Since the minority Alzheimer’s class remained difficult to learn, data augmentation alone proved insufficient. A weighted loss strategy was therefore implemented using analytically derived class weights (Equation (16)). This adjustment resulted in validation accuracies of 79.10% (without augmentation) and 74.92% (with augmentation).
Removing the spatial attention mechanism further increased the validation accuracy to 86.84%, indicating its limited utility for this task. Although GradCAM visualisation confirmed that the model focused on clinically relevant regions, the classifier block underutilised the extracted features. To address this, the MAF block was introduced, yielding a validation accuracy of 84.33%. To reduce overfitting, L2 regularisation ($\lambda$ = 0.02) and 20% dropout were applied within and after the MAF block (Figure 4). Only after finalising this architecture was the model retrained from scratch on the training set and evaluated once on the independent hold-out test set. This final evaluation achieved a test accuracy of 92.1%, demonstrating the robustness and effectiveness of the proposed approach.
Through this systematic and validation-driven refinement process, the final model addressed the initial challenges and achieved state-of-the-art performance.

4.7. Evaluation with Recent Studies

Recent advancements in biomedical imaging have showcased significant improvements facilitated by the application of deep learning models. These state-of-the-art models, such as 3D-MobileNetV2, 3D-VGG19, Inception V3-3D, 3D-DenseNet121, 3D-VGG16, 3D-ResNet18, 3D-RegNet, and M3T, have been extensively employed for tasks like medical image classification and segmentation. Each of these architectures brings unique features, such as depthwise separable convolutions in MobileNetV2 for efficiency or densely connected layers in DenseNet121 for gradient flow and feature reuse, contributing to performance enhancement in diverse biomedical applications.
All models in the comparative analysis were trained and evaluated on the same ADNI dataset, with identical preprocessing steps, and executed on the same hardware. This setup ensures a fair and consistent benchmarking environment. Detailed performance metrics, including accuracy, AUC, precision, recall, F1-score, and computational cost, were calculated and analysed for each model. The proposed framework combines 3D-CNNs, attention mechanisms, and an MAF block, aiming to outperform existing architectures by enhancing feature extraction and prioritising diagnostically relevant information through channel-wise attention modules.
Table 6 summarises the findings of these comparative analyses. All experiments were conducted under identical conditions using a single NVIDIA RTX 3090 GPU with 24 GB of memory, the same batch size, and identical preprocessing and data-loading pipelines to ensure fairness in the computational cost evaluation. This consistent setup allows an equitable comparison of both accuracy and efficiency among models. Highlighting the superior accuracy and efficiency of the proposed framework, the table offers a complete picture of how each model performed on the test set. By systematically assessing these recent advances, this comparison shows that the proposed method achieves state-of-the-art performance in Alzheimer’s disease classification; the following subsections complement it with ablation and statistical analyses.
The results reported in Table 6 indicate that the proposed model achieves strong overall performance, with an accuracy of 92.10% and an AUC of 0.99. In addition, the model achieved a sensitivity (recall) of 89.3% and a specificity of 94.1% on the independent ADNI test set. When considered alongside the previously reviewed methods in Table 2, it becomes clear that many existing CNN-based and hybrid architectures achieve comparable accuracy only with substantially higher computational cost or larger parameter counts. In contrast, the proposed attention-driven 3D CNN with MAF attains competitive or superior performance while maintaining a lightweight design trained entirely from scratch. The improvements in sensitivity and F1-score further suggest that the combination of channel-wise attention and multi-activation fusion enables the model to capture subtle AD-related structural variations more effectively than several state-of-the-art approaches. These findings highlight the contribution of this work by demonstrating that high diagnostic performance can be achieved with significantly reduced architectural complexity, offering a more efficient and scalable solution for MRI-based AD classification.

4.8. Evaluation with Different Activation Functions

The motivation behind using multiple activation functions lies in their complementary nonlinear properties, which are particularly advantageous in neuroimaging tasks where both positive and negative voxel intensities contain clinically meaningful information. ReLU introduces sparsity and works well for high-dimensional data; however, it completely discards negative values, potentially removing subtle but important variations in MRI intensity patterns. SiLU, in contrast, is a smooth and non-monotonic activation that preserves negative inputs with gentle slopes, enabling more stable gradient flow in deeper architectures. GELU probabilistically retains inputs based on their magnitude, offering smoother gating behaviour than ReLU and introducing distinctive curvature that is effective for modelling fine-grained anatomical differences.
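A minimal sketch of such a module is given below, assuming three parallel fully connected branches of width 512 whose activated outputs are concatenated; the layer sizes follow Table 4, but implementation details beyond that are assumptions (the released codebase is authoritative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiActivationFusion(nn.Module):
    """Three parallel FC branches with distinct activations, concatenated."""

    def __init__(self, in_features: int = 512):
        super().__init__()
        self.branch_relu = nn.Linear(in_features, in_features)
        self.branch_gelu = nn.Linear(in_features, in_features)
        self.branch_silu = nn.Linear(in_features, in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch applies its own linear map and nonlinearity, so the
        # concatenated output mixes sparse (ReLU) and smooth (GELU, SiLU)
        # transformations of the same pooled features.
        return torch.cat(
            [
                torch.relu(self.branch_relu(x)),
                F.gelu(self.branch_gelu(x)),
                F.silu(self.branch_silu(x)),
            ],
            dim=1,
        )  # shape: (batch, 3 * in_features) = (batch, 1536)

maf = MultiActivationFusion()
features = maf(torch.randn(2, 512))  # -> torch.Size([2, 1536])
```

With these sizes, the three 512 → 512 branches contribute 3 × (512 × 512 + 512) = 787,968 parameters, matching the MAF entry in Table 4.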
Although the SiLU and GELU activation curves appear visually similar over the interval $-2 \le x \le 2$, their gradients, curvature behaviour, and treatment of low-magnitude inputs differ in meaningful ways. These differences become more relevant in high-dimensional MRI feature spaces, where small nonlinear variations can amplify important structural details. As a result, SiLU and GELU provide complementary transformations when applied within the MAF block, strengthening its representational power for AD-related features. To evaluate this fusion strategy, we conducted ablation experiments using several activation-function combinations across the three MAF branches. The performance metrics (accuracy, AUC, precision, recall, and F1-score) are reported in Table 7. The best performance was consistently achieved when combining ReLU, GELU, and SiLU, which outperformed all other configurations across every metric. In contrast, using the same activation function in all branches (e.g., ReLU × 3) led to noticeably weaker performance, suggesting limited nonlinear diversity.
These findings confirm that mixing nonlinearities allows the model to capture a broader range of MRI feature patterns than any single activation alone. To ensure the robustness of these observations, each experiment was repeated five times with different random seeds. The superior performance of the ReLU + GELU + SiLU configuration remained consistent across all runs. Furthermore, this behaviour was supported by the 10-fold cross-validation results presented in Section 4.10. Collectively, these results demonstrate that diverse activation functions within the MAF block generate richer and more expressive feature representations, ultimately improving the model’s ability to detect subtle AD-related structural changes.

4.9. Evaluation of Proposed Model

The evaluation of the proposed model revealed its exceptional performance, marked by significant improvements throughout the training process. The model achieved a validation accuracy of 96.36% with a corresponding validation loss of 0.2822, as illustrated in Figure 6 and Figure 7, respectively.
In the validation dataset, the model attained a precision of 95.56%, a recall of 95.56%, and an area under the curve (AUC) of 99.54%. These metrics reflect the model’s ability to balance precision and recall effectively while maintaining high discriminative power. The progression of these metrics during training and validation is visualised in Figure 8, Figure 9 and Figure 10, which highlight the consistent and robust performance of the model.
This comprehensive evaluation demonstrates that the proposed model not only excels in accuracy but also maintains a high level of reliability and generalisation, confirming its suitability for the task at hand.

4.10. K-Fold Cross Validation

To evaluate the generalisability of the proposed approach, a 10-fold cross-validation strategy was implemented. This method ensures a comprehensive evaluation of the model using multiple subsets of the data, helping to reduce the risk of overfitting and produce a more reliable performance assessment. In the K-fold approach, the dataset described in Section 3.1 is divided into ten subsets. During each fold, one subset is reserved for testing, another for validation, and the remaining eight subsets for training. Once an iteration concludes, the subsequent subset is selected for testing, whilst the others are redistributed to serve as validation and training sets. Importantly, all folds were constructed using a strict patient-based strategy, ensuring that all scans belonging to the same subject remained within a single fold and never appeared across multiple folds during cross-validation.
In the generic k-fold formulation, the dataset is split into K equal parts (K = 10 in this case). At each iteration, one part is set aside for validation and the remaining K − 1 parts are used for training. This procedure is repeated K times, so that every subset acts exactly once as the validation set. A robust estimate of the proposed model’s performance is obtained by averaging the evaluation measures over all K repetitions.
The k-fold cross-validation estimate has the mathematical form

$$\text{Validation Metric} = \frac{1}{K}\sum_{k=1}^{K}\text{Metric}_k,$$

where $\text{Metric}_k$ represents the evaluation metric for the $k$-th fold, and $K$ is the number of folds (10 in this case).
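Patient-level fold construction can be enforced with scikit-learn’s GroupKFold, as sketched below; the data shapes, patient counts, and the logistic-regression placeholder are illustrative, not the actual ADNI pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((200, 16))              # 200 scans, 16 features (placeholder)
y = rng.integers(0, 2, size=200)       # labels: AD = 1, CN = 0
groups = np.repeat(np.arange(100), 2)  # 100 patients, 2 scans each

fold_acc = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    # No patient may appear in both the training and test partitions.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))

print(np.mean(fold_acc))  # the averaged metric, as in the equation above
```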
Following comprehensive experimentation and tuning, the optimal model was evaluated through 10-fold cross-validation. The findings indicated robust performance, with a mean test loss of 0.7713, a mean test accuracy of 91.23%, a mean test AUC of 92.75%, a mean test precision of 90.29%, and a mean test recall of 88.30%. These metrics underscore the model’s robustness and dependability for classifying Alzheimer’s disease. Employing 10-fold cross-validation supports the credibility of the performance metrics by mitigating bias from specific data divisions.
The higher AUC on the independent test set (0.99) reflects the relative homogeneity of that split, whereas the lower cross-validation AUC (0.93) results from increased scanner and demographic variability across folds. Consequently, cross-validation provides a more conservative and realistic estimate of model performance under heterogeneous clinical conditions.

4.11. Wilcoxon Signed-Rank Test

A statistical analysis was performed to assess the significance of the results and determine whether they could be attributed to random chance. We used Wilcoxon signed-rank tests to compute p-values for each model comparison. This test is commonly used to compare paired samples when the data do not meet normality assumptions. By evaluating pairwise differences across multiple observations, the analysis assessed variations in population median ranks.
The findings, summarised in Table 8, indicate that the proposed model significantly outperformed the other models tested. Specifically, the p-values for comparisons between the proposed model and the nine competing models were all below 0.05, indicating that the proposed model provides statistically significant improvements over its competitors.
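For illustration, such a paired comparison of fold-wise accuracies can be run with scipy.stats.wilcoxon; the values below are placeholders, not the study’s per-fold results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired fold-wise accuracies for two models (illustrative values only).
proposed = np.array([0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.92, 0.93, 0.90, 0.91])
baseline = np.array([0.86, 0.87, 0.85, 0.88, 0.86, 0.87, 0.85, 0.86, 0.88, 0.87])

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(proposed, baseline)
print(f"W = {stat:.1f}, p = {p_value:.6f}")  # p < 0.05 -> significant
```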

4.12. Uncertainty Quantification and Cross-Validation Variability

To provide a rigorous assessment of model performance, we report uncertainty estimates for both the independent test set and the 10-fold cross-validation procedure.

Independent Test-Set Confidence Intervals

For all proportion-based metrics (accuracy, sensitivity, specificity, precision, and F1-score), 95% confidence intervals were computed using the Wilson binomial method based on the observed confusion-matrix counts. The resulting confidence intervals are summarised in Table 9. ROC AUC confidence intervals were computed using DeLong’s analytic method, which provides a non-parametric estimate of the standard error with minimal computational overhead.
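For reference, the Wilson interval can be computed directly from confusion-matrix counts, as in the sketch below. The example counts (42 true positives out of 47 AD cases) are inferred from the reported sensitivity and reproduce the corresponding interval in Table 9; they are a reconstruction, not published values.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% for z = 1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Sensitivity = TP / (TP + FN) = 42 / 47 ~= 0.8936.
print(wilson_ci(42, 47))  # ~(0.7741, 0.9537), matching Table 9
```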
To characterise performance variability across data partitions, fold-wise results are reported together with their Wilson 95% confidence intervals, as shown in Table 10. These values illustrate the performance dispersion introduced by the 10-fold cross-validation procedure.
Finally, Table 11 summarises the mean and standard deviation of each metric across the 10 folds, providing an aggregate measure of cross-validation variability. These results demonstrate that the model exhibits stable performance across folds with limited variability.

4.13. Visualisation Through Gradient-Weighted Class Activation Map (Grad-CAM)

Grad-CAM is a visual tool for interpreting the decisions of deep learning models. It shows which areas of an input image most strongly affect the model’s predictions. In medical imaging, and especially in 3D MRI analysis for AD, the activation maps produced by Grad-CAM are essential for understanding which spatial regions drive the network’s classification. These visualisations are indispensable because they let viewers confirm whether the model emphasises anatomically important areas known to show structural changes in AD patients, including the hippocampus, ventricles, and cortex.
The Grad-CAM process is driven by computing the gradients of the target class score with respect to the model’s convolutional feature maps. These gradients are then weighted and aggregated to create a coarse heatmap emphasising the image’s salient areas. Superimposed on the original input, this heatmap provides a clear visual depiction of the regions that most influence the classification decision. For example, in Figure 11, the activation map focuses primarily on AD-specific structures such as the hippocampus and ventricular areas, as well as regions showing cortical shrinkage.
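For completeness, the standard Grad-CAM formulation (stated here for 2D feature maps; the 3D case used in this work adds a third spatial index) is

$$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^{k}}, \qquad L_{\text{Grad-CAM}}^{c} = \text{ReLU}\!\left(\sum_{k}\alpha_k^{c} A^{k}\right),$$

where $y^c$ is the score for class $c$, $A^{k}$ is the $k$-th feature map of the selected convolutional layer, and $Z$ is the number of spatial positions averaged over.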
Figure 11a displays heatmaps in which the activated regions are widely distributed throughout the brain. This indicates that the proposed model can comprehensively analyse AD-related abnormalities across the brain. The capability to generate such extensive activation areas is one of the key strengths of transformer networks, attributable to their large receptive field.
Furthermore, Figure 11b presents the activation map for an AD case on a 3D brain template. The heatmap primarily focuses on the hippocampus in the coronal plane and the ventricle region in the axial view. In particular, the right hippocampus shows a stronger emphasis than the left, consistent with previous studies reporting more pronounced shrinkage in the right hippocampus of AD patients [71,72]. These findings confirm that the proposed model effectively captures and highlights AD-related structural alterations in the brain.
Although AD–CN classification achieved an AUC of 92.75% and a precision of 90.29% under cross-validation, the most clinically significant challenges relate to the early detection of MCI and the prediction of conversion from MCI to AD. These tasks generally require longitudinal data, multimodal biomarkers, and harmonised multi-centre cohorts, which fall beyond the scope of the present work. The AD–CN setting used here provides a controlled framework for evaluating the proposed architectural components and analysing feature representation.
Future work will extend the framework to more clinically oriented scenarios, including MCI subtyping, multimodal integration, cross-scanner harmonisation, and longitudinal prediction. Further steps, such as external validation, enhanced interpretability, and alignment with clinical workflows, will be required to support eventual clinical deployment.

5. Conclusions

This study introduced an attention-driven 3D convolutional neural network enhanced with an MAF module for AD classification using structural MRI data. Leveraging 1175 MRI scans drawn from the ADNI dataset, the proposed model integrates channel-wise attention with activation-level diversity to capture a richer set of discriminative structural features. The framework achieved an accuracy of 92.1%, an AUC of 0.99, a precision of 91.3%, a recall of 89.3%, and an F1-score of 92%. Its reliability was further supported by 10-fold patient-level cross-validation, yielding an average accuracy of 91.23%, an AUC of 92.75%, a precision of 90.29%, and a recall of 88.30%.
Compared with recent deep learning approaches, the proposed model demonstrates competitive or superior performance while maintaining a relatively lightweight architecture. These results highlight the value of combining channel attention with multi-activation fusion for improved structural MRI interpretation. Overall, this work underscores the potential of deep learning to support early identification of AD-related brain changes and contribute to more accurate computer-aided diagnosis pipelines.
The accompanying codebase further enhances transparency and supports reproducibility, enabling future research to refine and extend the presented approach.

6. Limitations and Future Directions

Although the proposed model demonstrates strong classification performance, several limitations must be acknowledged. First, the evaluation relied exclusively on the ADNI dataset. While ADNI provides high-quality, well-structured imaging data, the lack of external validation limits conclusions about generalisability. Differences in scanner protocols, preprocessing pipelines, and demographic characteristics prevented the incorporation of datasets such as OASIS or AIBL, and these variations also contributed to the performance gap between the independent test-set AUC (0.99) and the cross-validation AUC (0.93). This discrepancy reflects ongoing challenges in robust generalisation under heterogeneous imaging conditions.
Second, despite class balancing strategies, dataset imbalance and limited representation of diverse populations remain potential sources of bias. Third, although computationally lighter than many transformer-based architectures, the proposed framework still requires substantial GPU resources, posing barriers to deployment in resource-limited clinical environments. Finally, interpretability remains a key challenge in deep learning for medical imaging; improved transparency is essential for clinical adoption.
Future work will focus on several directions. Expanding validation to external and demographically diverse cohorts will be critical for assessing population-level robustness. Integrating additional modalities—such as PET scans, genetic markers, and fluid biomarkers—may enhance diagnostic accuracy and enable multimodal disease staging. Advances in model explainability, including region-level attribution and clinically guided saliency methods, can improve clinician trust and interpretability. Additionally, evaluating the framework in longitudinal settings could support the detection of MCI, the prediction of conversion to AD, and the assessment of disease progression over time. Ultimately, these efforts aim to move the proposed system closer to real-world clinical integration.

Author Contributions

M.G.A.: Conceptualisation, methodology, software implementation, experimentation, and original draft preparation. S.L., W.Z., K.S. and J.L.: Conceptualisation, supervision, manuscript review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it used fully anonymised, publicly available data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). All ADNI procedures were conducted in accordance with the Declaration of Helsinki and were approved by the Institutional Review Boards of the participating institutions.

Informed Consent Statement

Written informed consent was obtained from all participants by the ADNI investigators as part of the original study. No new human data were collected for this research.

Data Availability Statement

This study used publicly available datasets for Alzheimer’s disease classification. Structural MRI data were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), accessible upon registration at https://adni.loni.usc.edu (21 October 2025). The source code used for data preprocessing, model training, and evaluation is openly available on GitHub at: https://github.com/MAlsubaie/Attention-Driven-CNNs-for-AD-Detection-via-MAF (21 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Siuly, S.; Zhang, Y. Medical big data: Neurological diseases diagnosis through medical data analysis. Data Sci. Eng. 2016, 1, 54–64.
2. Gong, C.H.; Sato, S. Can mild cognitive impairment with depression be improved merely by exercises of recall memories accompanying everyday conversation? A longitudinal study 2016–2019. Qual. Ageing Older Adults 2022, 23, 26–35.
3. Doi, K. Computer-aided diagnosis in medical imaging: Historical review, current status and future potential. Comput. Med. Imaging Graph. 2007, 31, 198–211.
4. Tiwari, S.; Atluri, V.; Kaushik, A.; Yndart, A.; Nair, M. Alzheimer’s disease: Pathogenesis, diagnostics, and therapeutics. Int. J. Nanomed. 2019, 14, 5541–5554.
5. Tufail, A.B.; Ma, Y.K.; Zhang, Q.N. Binary classification of Alzheimer’s disease using sMRI imaging modality and deep learning. J. Digit. Imaging 2020, 33, 1073–1090.
6. Przedborski, S.; Vila, M.; Jackson-Lewis, V. Neurodegeneration: What is it and where are we? J. Clin. Investig. 2003, 111, 3–10.
7. Sosa-Ortiz, A.L.; Acosta-Castillo, I.; Prince, M.J. Epidemiology of dementias and Alzheimer’s disease. Arch. Med. Res. 2012, 43, 600–608.
8. Nichols, E.; Steinmetz, J.D.; Vollset, S.E.; Fukutaki, K.; Chalek, J.; Abd-Allah, F.; Abdoli, A.; Abualhasan, A.; Abu-Gharbieh, E.; Akram, T.T.; et al. Estimation of the global prevalence of dementia in 2019 and forecasted prevalence in 2050: An analysis for the Global Burden of Disease Study 2019. Lancet Public Health 2022, 7, e105–e125.
9. Apostolova, L.G.; Green, A.E.; Babakchanian, S.; Hwang, K.S.; Chou, Y.Y.; Toga, A.W.; Thompson, P.M. Hippocampal atrophy and ventricular enlargement in normal aging, mild cognitive impairment (MCI), and Alzheimer Disease. Alzheimer Dis. Assoc. Disord. 2012, 26, 17–27.
10. De la Torre, J.C. Alzheimer’s disease is incurable but preventable. J. Alzheimer’s Dis. 2010, 20, 861–870.
11. Casey, D.A.; Antimisiaris, D.; O’Brien, J. Drugs for Alzheimer’s disease: Are they effective? Pharm. Ther. 2010, 35, 208.
12. Shubayr, N.; Alashban, Y. Estimation of radiation doses and lifetime attributable risk of radiation-induced cancer in the uterus and prostate from abdomen pelvis CT examinations. Front. Public Health 2023, 10, 1094328.
13. Haass, C.; Suárez-Calvet, M.; Kleinberger, G.; Araque Caballero, M.Á.; Ewers, M. F5-02-04: CSF STREM2 Levels Increase in Early Stages of Autosomal Dominant Alzheimer’s Disease (ADAD) and are Associated with Markers of Neuronal Injury. Alzheimer’s Dement. 2016, 12, P369–P370.
14. Dulcu, I. Automatism and Voluntariness: Towards a New Framework for Assigning Criminal Responsibility. Ph.D. Thesis, University of Sussex, Brighton, UK, 2021.
15. Shahbaz, M.; Ali, S.; Guergachi, A.; Niazi, A.; Umer, A. Classification of Alzheimer’s Disease Using Machine Learning Techniques. In Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), Prague, Czech Republic, 26–28 July 2019; pp. 296–303.
16. Ebrahimighahnavieh, M.A.; Luo, S.; Chiong, R. Deep learning to detect Alzheimer’s disease from neuroimaging: A systematic literature review. Comput. Methods Programs Biomed. 2020, 187, 105242.
17. Tanveer, M.; Richhariya, B.; Khan, R.U.; Rashid, A.H.; Khanna, P.; Prasad, M.; Lin, C.T. Machine learning techniques for the diagnosis of Alzheimer’s disease: A review. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2020, 16, 1–35.
18. Alsubaie, M.G.; Luo, S.; Shaukat, K. Alzheimer’s disease detection using deep learning on neuroimaging: A systematic review. Mach. Learn. Knowl. Extr. 2024, 6, 464–505.
19. Nawaz, A.; Anwar, S.M.; Liaqat, R.; Iqbal, J.; Bagci, U.; Majid, M. Deep convolutional neural network based classification of Alzheimer’s disease using MRI data. In Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
20. Wu, H.; Luo, J.; Lu, X.; Zeng, Y. 3D transfer learning network for classification of Alzheimer’s disease with MRI. Int. J. Mach. Learn. Cybern. 2022, 13, 1997–2011.
21. Liang, G.; Xing, X.; Liu, L.; Zhang, Y.; Ying, Q.; Lin, A.L.; Jacobs, N. Alzheimer’s disease classification using 2d convolutional neural networks. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Guadalajara, Mexico, 1–5 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3008–3012.
22. Yin, H.; Vahdat, A.; Alvarez, J.M.; Mallya, A.; Kautz, J.; Molchanov, P. A-vit: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10818.
23. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357.
24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
25. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846.
26. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3163–3172.
27. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part I 24. Springer: Cham, Switzerland, 2021; pp. 36–46.
28. Eo, T.; Jun, Y.; Kim, T.; Jang, J.; Lee, H.J.; Hwang, D. KIKI-net: Cross-domain convolutional neural networks for reconstructing undersampled magnetic resonance images. Magn. Reson. Med. 2018, 80, 2188–2201.
29. Wyburd, M.K.; Dinsdale, N.K.; Namburete, A.I.; Jenkinson, M. TEDS-Net: Enforcing diffeomorphisms in spatial transformers to guarantee topology preservation in segmentations. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Cham, Switzerland, 2021; pp. 250–260.
30. Korkmaz, Y.; Dar, S.U.; Yurt, M.; Özbey, M.; Cukur, T. Unsupervised MRI reconstruction via zero-shot learned adversarial transformers. IEEE Trans. Med. Imaging 2022, 41, 1747–1763.
31. Qin, Z.; Liu, Z.; Guo, Q.; Zhu, P. 3D convolutional neural networks with hybrid attention mechanism for early diagnosis of Alzheimer’s disease. Biomed. Signal Process. Control 2022, 77, 103828.
32. Chen, L.; Wan, L. CTUNet: Automatic pancreas segmentation using a channel-wise transformer and 3D U-Net. Vis. Comput. 2023, 39, 5229–5243.
33. Zhu, X.; Wang, X.; Shi, Y.; Ren, S.; Wang, W. Channel-wise attention mechanism in the 3D convolutional network for lung nodule detection. Electronics 2022, 11, 1600.
34. Xu, J.; Yuan, C.; Ma, X.; Shang, H.; Shi, X.; Zhu, X. Interpretable medical deep framework by logits-constraint attention guiding graph-based multi-scale fusion for Alzheimer’s disease analysis. Pattern Recognit. 2024, 152, 110450.
35. Zhu, J.; Tan, Y.; Lin, R.; Miao, J.; Fan, X.; Zhu, Y.; Liang, P.; Gong, J.; He, H. Efficient self-attention mechanism and structural distilling model for Alzheimer’s disease diagnosis. Comput. Biol. Med. 2022, 147, 105737.
36. Maqsood, M.; Nazir, F.; Khan, U.; Aadil, F.; Jamal, H.; Mehmood, I.; Song, O.y. Transfer learning assisted classification and detection of Alzheimer’s disease stages using 3D MRI scans. Sensors 2019, 19, 2645.
37. Jang, J.; Hwang, D. M3T: Three-dimensional Medical image classifier using Multi-plane and Multi-slice Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20718–20729.
38. Jo, T.; Nho, K.; Saykin, A.J. Deep learning in Alzheimer’s disease: Diagnostic classification and prognostic prediction using neuroimaging data. Front. Aging Neurosci. 2019, 11, 220.
39. Mora-Rubio, A.; Bravo-Ortíz, M.A.; Arredondo, S.Q.; Torres, J.M.S.; Ruz, G.A.; Tabares-Soto, R. Classification of Alzheimer’s disease stages from magnetic resonance images using deep learning. PeerJ Comput. Sci. 2023, 9, e1490.
40. Srividhya, L.; Vishvanathan, S.; Ravi, V.; Gopalakrishnan, E.A.; Kp, S. Deep learning-based approach for multi-stage diagnosis of Alzheimer’s disease. Multimed. Tools Appl. 2024, 83, 16799–16822.
41. Ma, H.; Wang, Y.; Hao, Z.; Yu, Y.; Jia, X.; Li, M.; Chen, L. Classification of Alzheimer’s disease: Application of a transfer learning deep Q-network method. Eur. J. Neurosci. 2024, 59, 2118–2127.
42. Francis, A.; Pandian, S.; Sagayam, K.M.; Dang, L.; Anitha, J.; Dinh, L.; Pomplun, M.; Dang, H. Early detection of Alzheimer’s disease using squeeze and excitation network with local binary pattern descriptor. Pattern Anal. Appl. 2024, 27, 54.
43. Suh, C.; Shim, W.; Kim, S.; Roh, J.; Lee, J.H.; Kim, M.J.; Park, S.; Jung, W.; Sung, J.; Jahng, G.H.; et al. Development and validation of a deep learning–based automatic brain segmentation and classification algorithm for Alzheimer disease using 3D T1-weighted volumetric images. Am. J. Neuroradiol. 2020, 41, 2227–2234.
44. Feng, C.; Elazab, A.; Yang, P.; Wang, T.; Zhou, F.; Hu, H.; Xiao, X.; Lei, B. Deep learning framework for Alzheimer’s disease diagnosis via 3D-CNN and FSBi-LSTM. IEEE Access 2019, 7, 63605–63618.
45. Li, W.; Lin, X.; Chen, X. Detecting Alzheimer’s disease based on 4D fMRI: An exploration under deep learning framework. Neurocomputing 2020, 388, 280–287.
46. George, A.; Abraham, B.; George, N.; Shine, L.; Ramachandran, S. An Efficient 3D CNN Framework with Attention Mechanisms for Alzheimer’s Disease Classification. Comput. Syst. Sci. Eng. 2023, 47, 2097–2118.
47. Wen, J.; Thibeau-Sutre, E.; Diaz-Melo, M.; Samper-González, J.; Routier, A.; Bottani, S.; Dormont, D.; Durrleman, S.; Burgos, N.; Colliot, O.; et al. Convolutional neural networks for classification of Alzheimer’s disease: Overview and reproducible evaluation. Med. Image Anal. 2020, 63, 101694.
48. Zhang, J.; Zheng, B.; Gao, A.; Feng, X.; Liang, D.; Long, X. A 3D densely connected convolution neural network with connection-wise attention mechanism for Alzheimer’s disease classification. Magn. Reson. Imaging 2021, 78, 119–126.
49. Buvaneswari, P.; Gayathri, R. Deep learning-based segmentation in classification of Alzheimer’s disease. Arab. J. Sci. Eng. 2021, 46, 5373–5383.
50. An, N.; Ding, H.; Yang, J.; Au, R.; Ang, T.F. Deep ensemble learning for Alzheimer’s disease classification. J. Biomed. Inform. 2020, 105, 103411.
51. Ortiz, A.; Munilla, J.; Gorriz, J.M.; Ramirez, J. Ensembles of deep learning architectures for the early diagnosis of the Alzheimer’s disease. Int. J. Neural Syst. 2016, 26, 1650025.
52. Lim, B.Y.; Lai, K.W.; Haiskin, K.; Kulathilake, K.S.H.; Ong, Z.C.; Hum, Y.C.; Dhanalakshmi, S.; Wu, X.; Zuo, X. Deep learning model for prediction of progressive mild cognitive impairment to Alzheimer’s disease using structural MRI. Front. Aging Neurosci. 2022, 14, 876202.
53. Liu, M.; Li, F.; Yan, H.; Wang, K.; Ma, Y.; Alzheimer’s Disease Neuroimaging Initiative; Shen, L.; Xu, M. A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer’s disease. Neuroimage 2020, 208, 116459.
54. Yee, E. 3D Convolutional Neural Networks for Alzheimer’s Disease Classification. Master’s Thesis, Simon Fraser University, Burnaby, BC, Canada, 2020. Available online: https://summit.sfu.ca/item/20357 (accessed on 6 March 2025).
55. Alzubaidi, L.; Fadhel, M.A.; Al-Shamma, O.; Zhang, J.; Santamaría, J.; Duan, Y.; Oleiwi, S.R. Towards a better understanding of transfer learning for medical imaging: A case study. Appl. Sci. 2020, 10, 4523.
56. Weiner, M.W.; Veitch, D.P.; Aisen, P.S.; Beckett, L.A.; Cairns, N.J.; Green, R.C.; Harvey, D.; Jack, C.R.; Jagust, W.; Liu, E.; et al. The Alzheimer’s disease neuroimaging initiative: A review of papers published since its inception. Alzheimer’s Dement. 2013, 9, e111–e194.
57. Marcus, D.; Wang, T.; Parker, J.; Csernansky, J.G.; Morris, J.C.; Buckner, R.L. OASIS: Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 2007, 19, 1498–1507.
58. Li, J.; Fong, S.; Mohammed, S.; Fiaidhi, J. Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J. Supercomput. 2016, 72, 3708–3728.
59. El-Aal, S.A.; Ghali, N.I. A proposed recognition system for Alzheimer’s disease based on deep learning and optimization algorithms. J. Southwest Jiaotong Univ. 2021, 56, 241–252.
60. Mohammed, B.A.; Senan, E.M.; Rassem, T.H.; Makbol, N.M.; Alanazi, A.A.; Al-Mekhlafi, Z.G.; Almurayziq, T.S.; Ghaleb, F.A. Multi-method analysis of medical records and MRI images for early diagnosis of dementia and Alzheimer’s disease based on deep learning and hybrid methods. Electronics 2021, 10, 2860.
61. Pradhan, A.; Gige, J.; Eliazer, M. Detection of Alzheimer’s disease (AD) in MRI images using deep learning. Int. J. Eng. Res. Technol. (IJERT) 2021, 10, 580–585.
62. Vasukidevi, G.; Ushasukhanya, S.; Mahalakshmi, P. Efficient image classification for Alzheimer’s disease prediction using capsule network. Ann. Rom. Soc. Cell Biol. 2021, 25, 806–815.
63. Battineni, G.; Chintalapudi, N.; Amenta, F.; Traini, E. Deep learning type convolution neural network architecture for multiclass classification of Alzheimer’s disease. In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021)-Volume 2: BIOIMAGING, Virtual, 11–13 February 2021; pp. 209–215.
64. Islam, J.; Zhang, Y. Brain MRI analysis for Alzheimer’s disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Inform. 2018, 5, 2.
65. Islam, J.; Zhang, Y. An ensemble of deep convolutional neural networks for Alzheimer’s disease detection and classification. arXiv 2017, arXiv:1712.01675.
66. Islam, J.; Zhang, Y. Early diagnosis of Alzheimer’s disease: A neuroimaging study with deep learning architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1881–1883.
67. Hu, Z.; Wang, Y.; Xiao, L. Alzheimer’s disease diagnosis by 3D-SEConvNeXt. J. Big Data 2025, 12, 15.
68. Khan, I.J.; Amin, M.F.B.; Deepu, M.D.S.; Hira, H.K.; Mahmud, A.; Chowdhury, A.M.; Islam, S.; Mukta, M.S.H.; Shatabda, S.; Alzheimer’s Disease Neuroimaging Initiative; et al. Enhanced ROI guided deep learning model for Alzheimer’s detection using 3D MRI images. Inform. Med. Unlocked 2025, 56, 101650.
69. Isensee, F.; Schell, M.; Pflueger, I.; Brugnara, G.; Bonekamp, D.; Neuberger, U.; Wick, A.; Schlemmer, H.P.; Heiland, S.; Wick, W.; et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum. Brain Mapp. 2019, 40, 4952–4964.
70. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
71. Barnes, J.; Scahill, R.I.; Schott, J.M.; Frost, C.; Rossor, M.N.; Fox, N.C. Does Alzheimer’s disease affect hippocampal asymmetry? Evidence from a cross-sectional and longitudinal volumetric MRI study. Dement. Geriatr. Cogn. Disord. 2005, 19, 338–344.
72. Geroldi, C.; Laakso, M.; DeCarli, C.; Beltramello, A.; Bianchetti, A.; Soininen, H.; Trabucchi, M.; Frisoni, G.B. Apolipoprotein E genotype and hippocampal asymmetry in Alzheimer’s disease: A volumetric MRI study. J. Neurol. Neurosurg. Psychiatry 2000, 68, 93–96.
Figure 1. Healthy vs. diseased brain structure.
Figure 2. Methodology diagram of the proposed model.
Figure 3. Dataset description of class distribution with respect to age.
Figure 4. Architecture diagram of the proposed approach using channel-wise attention and the MAF technique.
Figure 5. Global Average Pooling.
Figure 6. Training & Validation Accuracy of the proposed model.
Figure 7. Training & Validation Loss of the proposed model.
Figure 8. Training & Validation Recall of the proposed model.
Figure 9. Training & Validation AUC of the proposed model.
Figure 10. Training & Validation Precision of the proposed model.
Figure 11. Visualisation of AD-related activation maps. The results include multi-plane images and activation maps generated from the ADNI test dataset. The heatmap uses a jet colourmap, where red indicates high activation values (close to 1) and blue represents low activation values (close to 0).
Table 1. List of Acronyms.

Acronym | Definition | Acronym | Definition
AD | Alzheimer’s disease | MCI | mild cognitive impairment
ADCB | attention-driven convolutional block | MAF | multi-activation fusion
ADNI | Alzheimer’s disease neuroimaging initiative | MRI | magnetic resonance imaging
AUC | area under the curve | ReLU | rectified linear unit
CNN | convolutional neural network | SiLU | sigmoid linear unit
CN | cognitively normal | ViT | vision transformer
DL | deep learning | GAP | global average pooling
GELU | Gaussian error linear unit | HD-BET | high-definition brain extraction tool
FC | fully connected | BN | batch normalisation
3D-BN | 3D batch normalisation | SE | squeeze-and-excitation
BCE | binary cross-entropy | SGD | stochastic gradient descent
L2 | L2 regularisation | OASIS | open access series of imaging studies
AIBL | Australian imaging biomarkers and lifestyle study | GPU | graphics processing unit
VRAM | video random-access memory
Table 2. Summary of Research on AD Detection Using Machine Learning and Deep Learning Techniques.

Ref | Approaches | Source | Modalities | Year | Accuracy
Mora-Rubio et al. [39] | EfficientNet, DenseNet, Vision Transformer | MRI Scans | MRI | 2024 | 89%
Ravi et al. [40] | ResNet-50v2 (CNN) | ADNI | MRI | 2024 | 91.84%
Ma et al. [41] | DQN (CNN) | ADNI | MRI | 2024 | 86.66%
Francis et al. [42] | SENet | ADNI | MRI | 2024 | 86%
George et al. [46] | 3D-CNN + Transformer | ADNI | MRI | 2023 | 87%
Jang et al. [37] | 3D CNN + Transformer | ADNI, OASIS | MRI | 2022 | 91.61%
Hu et al. [67] | 3D-SEConvNeXt | ADNI | MRI | 2025 | 89.71%
Yee et al. [54] | 3D CNN model | ADNI (1500 images) | MRI | 2020 | 92%
Khan et al. [68] | 3D-ResNet50 | ADNI | MRI | 2025 | 88%
Suh et al. [43] | 3D-CNN + LSTM | 3D T1-weighted MRI | MRI | 2020 | 87% AUC 1
Li et al. [45] | C3d-LSTM model | 4D fMRI | fMRI | 2020 | 89.47%
An et al. [50] | Sparse autoencoders + ML | NACC | MRI | 2020 | 83.9%
Liu et al. [53] | Multi-task CNN model | ADNI | MRI | 2020 | 88.9%
Ortiz et al. [51] | Deep belief networks | 3D patches (AAL) | MRI | 2016 | 90% (NC/AD), 0.95 AUC 2
1 Reported as Area Under the Curve (AUC), not classification accuracy. 2 Results given for Normal Control vs. AD classification; AUC also reported.
Table 3. Class Distribution of Images.

Class Name | No. of Images
Cognitive Normal (CN) | 699
Alzheimer’s Disease (AD) | 476
Table 4. Detailed Model Summary.

Layer Type | Output Shape | Parameters
Input Layer | (None, 128, 128, 128, 1) | 0
Attention-Driven Block-1 | (None, 63, 63, 63, 32) | 2080
Attention-Driven Block-2 | (None, 30, 30, 30, 64) | 59,776
Attention-Driven Block-3 | (None, 14, 14, 14, 128) | 238,336
Attention-Driven Block-4 | (None, 6, 6, 6, 256) | 951,808
Attention-Driven Block-5 | (None, 2, 2, 2, 512) | 3,804,160
Global Average Pooling | (None, 512) | 0
Multi-Activation Fusion Block | (None, 1536) | 787,968
Dropout | (None, 1536) | 0
Hidden Layer | (None, 128) | 196,736
Output Layer | (None, 1) | 129
Total Parameters | | 6,040,993
Trainable Parameters | | 6,039,009
Non-Trainable Parameters | | 1984
Note: The output shape is (None, 1) because the model uses a single sigmoid neuron for binary classification (AD vs. CN). Although the task involves two classes, a one-dimensional output is sufficient, as the sigmoid predicts the probability of the positive class, while the probability of the second class is its complement. This design is standard practice for binary medical imaging classifiers.
Table 5. Training Configuration Parameters.

Exp. # | Parameter Name | Parameter Value
1 | Optimizer | SGD
2 | Initial Learning Rate | 0.01
3 | Batch Size | 4
4 | Number of Epochs | 100
Table 6. Comparison of Different Models for Alzheimer’s Disease Classification.

Exp# | Model | Acc (%) | AUC (%) | Prec (%) | Rec (%) | F1 (%) | GPU (s) | CPU (s) | Param (M)
1 | MobileNetV2 | 71.23 | 72.50 | 69.50 | 50.30 | 70.60 | 0.1403 | 4.844 | 2.747
2 | VGG19 | 70.89 | 70.10 | 69.40 | 50.10 | 70.00 | 0.7354 | 5.3849 | 221
3 | Inception V3 | 71.45 | 70.60 | 70.20 | 52.30 | 70.80 | 0.1020 | 0.5542 | 13
4 | DenseNet121 | 72.18 | 70.80 | 71.20 | 51.90 | 71.50 | 0.1082 | 0.6587 | 11.617
5 | VGG16 | 73.42 | 72.30 | 72.80 | 52.60 | 73.10 | 0.6892 | 4.851 | 195
6 | ResNet18 | 76.34 | 78.40 | 74.20 | 60.50 | 75.30 | 0.1262 | 0.7225 | 33.437
7 | RegNet | 79.65 | 81.20 | 77.90 | 65.40 | 78.70 | 0.1262 | 0.7951 | 32
8 | M3T | 84.23 | 85.10 | 83.50 | 71.40 | 83.70 | 0.5205 | 2.102 | 8.18
9 | 3D-CNN + Attention | 86.84 | 93.18 | 89.39 | 88.05 | 87.10 | 0.1012 | 0.5578 | 5.15
10 | Proposed Framework (3D-CNN + Attention + MAF) | 92.10 | 99.00 | 91.30 | 89.30 | 92.00 | 0.1028 | 0.5772 | 6.04
All values are reported as percentages unless otherwise stated.
Table 7. Comparison of Different Activation Function Combinations.

Exp# | Activation Configuration | Acc (%) | AUC (%) | Prec (%) | Rec (%) | F1 (%)
1 | ReLU × 3 | 82.46 | 90.56 | 86.57 | 84.06 | 85.29
2 | ReLU × 2 + SiLU | 85.96 | 93.22 | 85.07 | 86.55 | 85.69
3 | ReLU × 2 + GELU | 84.21 | 91.76 | 82.09 | 90.16 | 85.94
4 | GELU × 2 + SiLU | 88.60 | 96.09 | 88.06 | 89.25 | 90.08
5 | ReLU + GELU + SiLU | 92.10 | 99.00 | 91.30 | 89.30 | 92.00
All values are reported as percentages.
Table 8. Statistical Results with Different Models.

Pairwise Comparison | p-Value | Significant (p < 0.05)
MobileNetV2 | 0.0000031 | pass
VGG19 | 0.0000022 | pass
Inception V3 | 0.0000015 | pass
DenseNet121 | 0.0000006 | pass
VGG16 | 0.0000092 | pass
ResNet18 | 0.0000021 | pass
RegNet | 0.000016 | pass
M3T | 0.000253 | pass
3D-CNN + Attention | 0.003172 | pass
Table 9. Test-set performance with 95% confidence intervals.

Metric | Estimate | 95% CI Low | 95% CI High
Accuracy | 0.9211 | 0.8567 | 0.9579
Sensitivity | 0.8936 | 0.7741 | 0.9537
Specificity | 0.9403 | 0.8563 | 0.9765
Precision | 0.9130 | 0.7968 | 0.9657
F1-Score | 0.9032 | 0.8352 | 0.9590
ROC AUC (DeLong) | 0.963 | 0.921 | 0.994
Table 10. Fold-wise performance metrics with Wilson 95% confidence intervals.

Fold | Sensitivity | CI Low | CI High | Specificity | CI Low | CI High | Precision | CI Low | CI High
1 | 0.8936 | 0.7741 | 0.9537 | 0.9254 | 0.8369 | 0.9677 | 0.8936 | 0.7741 | 0.9537
2 | 0.9149 | 0.8007 | 0.9664 | 0.9254 | 0.8369 | 0.9677 | 0.8958 | 0.7783 | 0.9547
3 | 0.8936 | 0.7741 | 0.9537 | 0.9254 | 0.8369 | 0.9677 | 0.8936 | 0.7741 | 0.9537
4 | 0.9149 | 0.8007 | 0.9664 | 0.9254 | 0.8369 | 0.9677 | 0.8958 | 0.7783 | 0.9547
5 | 0.8936 | 0.7741 | 0.9537 | 0.9403 | 0.8563 | 0.9765 | 0.9130 | 0.7968 | 0.9657
6 | 0.8723 | 0.7483 | 0.9402 | 0.8955 | 0.7997 | 0.9485 | 0.8542 | 0.7283 | 0.9275
7 | 0.8511 | 0.7231 | 0.9259 | 0.9403 | 0.8563 | 0.9765 | 0.9091 | 0.7884 | 0.9641
8 | 0.8511 | 0.7231 | 0.9259 | 0.9552 | 0.8764 | 0.9847 | 0.9302 | 0.8139 | 0.9759
9 | 0.8936 | 0.7741 | 0.9537 | 0.9403 | 0.8563 | 0.9765 | 0.9130 | 0.7968 | 0.9657
10 | 0.8511 | 0.7231 | 0.9259 | 0.9552 | 0.8764 | 0.9847 | 0.9302 | 0.8139 | 0.9759
Table 11. Mean and standard deviation of 10-fold cross-validation performance metrics.

Metric | Mean | SD
Sensitivity | 0.883 | 0.025
Specificity | 0.933 | 0.018
Precision | 0.903 | 0.022
F1 Score | 0.892 | 0.013