1. Introduction
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder caused by the abnormal accumulation of proteins in the brain [1]. This protein buildup gradually damages neurons, leading to memory loss, cognitive decline, and difficulty performing everyday tasks [2]. As the disease progresses, it severely affects intellectual and social functioning and ultimately reduces a person’s independence [2,3]. AD is the most common form of dementia and poses a major global health challenge [4,5]. Early and accurate diagnosis is essential for initiating appropriate treatment and improving patient outcomes. Currently, over fifty million people worldwide are living with AD [6], and this number is expected to continue to grow as the global population ages. The prevalence of AD varies widely with factors such as age, genetic inheritance, and lifestyle. According to Alzheimer’s Disease International, China and India currently account for approximately two-thirds of global AD cases [7]. Improved healthcare in these countries has contributed to longer life expectancy among individuals with AD, who previously had limited access to treatment and care. The prevalence of dementia in low-income countries is approximately 7% [8], and for people aged 65 years and older, this rate is similar to that found in high-income countries [8].
Figure 1 illustrates the distinction between healthy brain tissue and that affected by AD. In AD, brain tissue volume progressively decreases over time, with this reduction being accompanied by enlarged ventricular spaces and significant atrophy of the cerebral cortex and hippocampus.
The pathological hallmarks of AD include the formation of amyloid plaques and neurofibrillary tangles. These pathological formations reduce the number of functional nerve cells in the brain and restrict communication between brain cells, leading to increased neural damage. This process results in shrinkage of the hippocampus and brain lobes, as well as enlargement of the ventricles [9]. The exact cause of Alzheimer’s disease remains unknown. However, studies indicate that a combination of genetic, environmental, and lifestyle factors influences its development. Despite ongoing research, no medications or therapies are currently available to prevent or cure dementia [10,11]. Early diagnosis can identify individuals with mild cognitive impairment (MCI), an early form of AD that is treatable in its initial stages [10]. Accurate identification and diagnosis of AD are, therefore, critical for physicians [8]. Physicians may use neuroimaging techniques to identify the initial phases of AD; however, these methods have limited precision. These techniques provide valuable insights into the brain by revealing important details about its structure and function. Computed tomography (CT) scans are among the most commonly used neuroimaging techniques, utilising X-rays to provide a comprehensive view of the brain [12].
CT scans can help detect cognitive impairment caused by conditions such as stroke or tumours, but they are generally insufficient for detecting the subtle changes associated with AD [13]. Positron emission tomography (PET) involves injecting a radioactive tracer into the bloodstream, where it accumulates in metabolically active areas of the brain. However, PET scans require ionising radiation and are more expensive than other imaging procedures. Magnetic resonance imaging (MRI) is a versatile and informative neuroimaging method for AD detection. Unlike CT scans, MRI creates comprehensive brain images using electromagnetic fields and radiofrequency waves instead of ionising radiation. This makes MRI a safer alternative, especially for repeated scans needed to monitor disease progression. MRI is particularly effective in detecting subtle structural changes, such as temporal lobe atrophy, which affects an area critical for memory function [14].
Traditional diagnostic approaches for AD have primarily relied on clinical assessments, cognitive testing, and analysis of structural MRI or PET scans. However, these methods are often time-consuming, subjective, or have limited sensitivity to early-stage changes [10,13]. In recent years, machine learning (ML) techniques have been explored for automated AD classification. Early studies employed handcrafted features extracted from MRI scans, followed by conventional classifiers such as support vector machines (SVMs), random forests, and k-nearest neighbours algorithms [3,15]. Although moderately effective, these approaches suffered from limited feature representation and required extensive pre-processing. Deep learning (DL) models, particularly convolutional neural networks (CNNs), have demonstrated significant improvements by learning hierarchical features directly from imaging data [16,17,18]. Both 2D and 3D CNNs have been applied to capture spatial features from MRI volumes. While 2D CNNs are computationally efficient, they lose inter-slice spatial context that may be important for accurate diagnosis.
In contrast, 3D CNNs can preserve volumetric information and have shown superior performance in AD classification tasks [19,20]. More recently, attention mechanisms and transformer-based architectures have been introduced to enhance model interpretability and to focus on the most relevant brain regions [21]. These models improve spatial feature prioritisation but are often computationally demanding and require large datasets. Additionally, limited research has investigated the role of activation function diversity in enhancing model performance. Most existing models rely on a single activation function, potentially overlooking the complementary advantages of different nonlinear functions. Fusion strategies and parallel activation branches, such as the approach proposed in this study, remain relatively unexplored in the context of neuroimaging applications.
Despite these advances, several critical gaps remain in the current approaches for automated AD diagnosis. First, existing attention mechanisms primarily focus on spatial regions rather than leveraging channel-specific attention to enhance feature discriminability across different brain structures. Second, the potential of combining multiple activation functions to capture complementary feature representations has been largely unexplored in neuroimaging applications. Third, most current models rely heavily on pre-trained weights or transfer learning, which may not capture the specific patterns unique to AD pathology. These limitations motivate the need for a novel approach that addresses feature channel prioritisation, activation function diversity, and domain-specific learning.
This study presents a deep learning approach for diagnosing Alzheimer’s disease (AD) at the initial stage using structural MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The pipeline comprises four key stages: (i) preprocessing of ADNI MRI scans through skull stripping, spatial normalisation, and resizing; (ii) volumetric feature extraction via a lightweight 3D Convolutional Neural Network (3D-CNN); (iii) channel-specific attention to emphasise discriminative brain regions; and (iv) a multi-activation fusion (MAF) block integrating GELU, SiLU, and ReLU functions to enhance nonlinear feature representation. For convenient referencing, all acronyms are listed in
Table 1.
Despite progress in 3D CNN and attention-based models, key limitations remain in MRI-based AD classification. Existing methods typically employ spatial attention while overlooking channel-wise feature recalibration, and they rely on a single activation function, which restricts nonlinear diversity for modelling subtle structural changes. Furthermore, many recent architectures depend on heavy pre-trained transformer models that are not optimised for volumetric neuroimaging. These gaps motivate the need for a lightweight 3D framework that integrates channel-wise attention with activation-level diversity to improve representational capacity and diagnostic robustness.
The aims of this work are to develop an efficient 3D architecture capable of detecting subtle AD-related structural changes, reduce computational complexity relative to transformer-based models, and improve robustness through activation-fusion and attention-enhanced feature learning.
The primary findings of this work are listed below:
The study introduces a hybrid model that combines a 3D CNN with a vision transformer (ViT)-inspired, attention-driven mechanism applied to the extracted feature channels, effectively capturing spatial dependencies and outperforming traditional CNNs and transformer-based methods in 3D image classification tasks.
The proposed MAF block employs GELU, SiLU, and ReLU activation functions in parallel to capture diverse and complementary feature representations, analogous to the multi-head structure in Vision Transformers. This design enhances fine-grained feature discrimination and improves the model’s adaptability to the heterogeneity present in MRI-based AD datasets.
The proposed model is trained end-to-end from scratch, avoiding reliance on pre-trained weights and thereby learning highly specific, meaningful patterns for AD detection. This approach reduces computational requirements and improves generalisability to unseen data.
The integration of attention-driven convolutional blocks and MAF aims to enhance both spatial feature prioritisation and the richness of nonlinear representations, thereby enabling more accurate and discriminative analysis of Alzheimer’s-related brain structures.
The remainder of the work is arranged as follows:
Section 2 summarises related research on the proposed study.
Section 3 presents the methodology of the study and explains the proposed model for the classification of AD, detailing the dataset and the components of the approach.
Section 4 outlines a comprehensive visualisation process and evaluates the model against the latest approaches. Finally,
Section 5 concludes the study, and
Section 6 highlights the limitations and future objectives.
3. Proposed Methodology
This section describes the detailed flow and evaluation metrics used in this study. The methodology is visually presented in
Figure 2.
As illustrated in
Figure 2, the methodology diagram presents the overall workflow of the proposed model. It begins with data preprocessing to prepare the MRI inputs, followed by a series of learning and feature extraction stages using attention-driven convolutional blocks. These extracted features are then refined through a multi-activation fusion process to capture diverse representations.
3.1. Dataset Overview
The dataset utilised in this study was obtained from the ADNI. The training dataset comprised 1175 MRI samples, with the class-wise distribution detailed in
Table 3. All scans were acquired using a 1.5 Tesla (1.5 T) MRI scanner to ensure uniform imaging conditions. Restricting the dataset to a single field strength avoids the systematic variability introduced when mixing 1.5 T and 3 T acquisitions, thereby providing more homogeneous imaging conditions and supporting reliable model training and evaluation. The independent ADNI test set consisted of 117 cases, including 68 cognitively normal (CN) subjects and 49 AD patients. This separation ensured a clear distinction between training and testing cohorts, thereby supporting an unbiased evaluation of the model’s generalisation capability. Patients in the dataset range in age from 56 to 91 years.
Figure 3 shows the age distribution with respect to the group. This study used publicly available, fully anonymised MRI data from ADNI. No new data were collected for this research.
3.2. Image Pre-Processing
The collected dataset contained images with skull information, which is irrelevant for Alzheimer’s classification. Variations in depth, height, and width were observed across the images. To address these challenges, preprocessing techniques similar to those described in [
37] were implemented. Skull stripping was applied to the entire dataset using the high-definition brain extraction tool (HD-BET) [
69], which generated a mask to identify brain tissues. By overlaying this mask on the original images, the skull regions were effectively removed.
Subsequently, all images were resampled to a uniform voxel spacing and resized to a matrix size of 128 × 128 × 128. The voxel intensities were normalised through the zero-mean unit-variance approach to achieve uniformity across the dataset; this normalisation was applied independently to each data split. A 3D CNN block was then applied to the preprocessed input data. This network accepted input images of size 128 × 128 × 128 and extracted 3D feature representations with C channels, where C denotes the number of feature channels, through convolutional operations, as shown in Figure 4.
Although ADNI MRI scans can be affected by scanner-related noise, motion artefacts, and variability in acquisition protocols, the preprocessing pipeline was designed to reduce these effects. Voxel-wise normalisation and uniform resampling help minimise intensity fluctuations and geometric inconsistencies across subjects. In addition, the channel-specific attention mechanism can down-weight unstable or noise-dominated feature channels, improving robustness to residual artefacts in the ADNI dataset.
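The zero-mean unit-variance step described above can be sketched in NumPy. This is a minimal illustration, not the exact ADNI pipeline code; the function name and the choice of split-level statistics are assumptions.

```python
import numpy as np

def zscore_normalise(volumes, eps=1e-8):
    """Zero-mean unit-variance normalisation of a list of 3D volumes,
    using statistics computed from this split only (no cross-split leakage)."""
    stack = np.stack(volumes).astype(np.float64)
    mu, sigma = stack.mean(), stack.std()
    return [(v - mu) / (sigma + eps) for v in stack]

# Random volumes stand in for the MRI scans of one split.
train_split = [np.random.rand(8, 8, 8) for _ in range(4)]
train_norm = zscore_normalise(train_split)
```

Computing the mean and standard deviation separately for each split mirrors the paper's statement that normalisation is applied independently per split.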
3.3. Data Split
To ensure a reliable evaluation and prevent data leakage, the dataset was split into subsets before training using a strict patient-based strategy. Of the total 1175 MRI scans, 10% (117 scans) were allocated to the validation set, and another 10% (117 scans) were reserved as an independent test set. The remaining 941 scans were used to train the model. Across these splits, the training set contained 265 unique subjects, the validation set 30 unique subjects, and the test set 33 unique subjects. Because ADNI includes multiple timepoints and repeated acquisitions for many participants, the number of MRI scans exceeds the number of subjects in each subset. This patient-based splitting ensured that all scans belonging to the same subject were assigned to a single subset only, preventing any overlap between the training, validation, and test sets and guaranteeing subject-level independence. The validation set was used during training to optimise model parameters and reduce overfitting, while the independent test set, comprising completely unseen subjects, was used only once for the final evaluation. This strategy provides an unbiased assessment of the model’s generalisation capability. The same strict patient-based assignment was also applied during the 10-fold cross-validation, ensuring that all scans from each subject were always placed within a single fold.
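The patient-level assignment can be illustrated with a short sketch. The `scans` mapping and the subject-level proportions are hypothetical; the actual ADNI split also balanced scan counts as described above.

```python
import random

def patient_based_split(scans, val_frac=0.1, test_frac=0.1, seed=42):
    """Assign ALL scans of each subject to exactly one subset.
    `scans` maps subject_id -> list of scan ids (illustrative structure)."""
    subjects = sorted(scans)
    random.Random(seed).shuffle(subjects)
    n_test = round(len(subjects) * test_frac)
    n_val = round(len(subjects) * val_frac)
    test_ids = set(subjects[:n_test])
    val_ids = set(subjects[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for subj in subjects:
        key = "test" if subj in test_ids else "val" if subj in val_ids else "train"
        split[key].extend(scans[subj])
    return split

# Toy example: subjects with repeated acquisitions across timepoints.
scans = {f"subj{i:03d}": [f"subj{i:03d}_scan{j}" for j in range(1 + i % 3)]
         for i in range(20)}
split = patient_based_split(scans)
```

Because whole subjects are assigned to a subset, no scan of any subject can appear in two subsets, which is exactly the leakage guarantee described in the text.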
3.4. Proposed Model Architecture
The detailed model architecture is visually presented in
Figure 4, providing an overview of the attention-driven 3D CNN and its key components. This research proposes a novel CNN model, trained from scratch, that applies channel-wise attention mechanisms to perform accurate Alzheimer’s disease classification. Along with its superior performance, the model is highly optimised compared with other architectures. The proposed architecture consists of five attention-driven convolutional blocks (ADCB), where each ADCB uses a Gaussian error linear unit (GELU) activation with a kernel size of (3, 3, 3) at each 3D-convolution layer. Furthermore, the architecture includes a multi-activation fusion block and a sigmoid activation function in the output layer. Moreover, L2 regularisation was applied to the kernels of each layer, with a lambda value of 0.02.
The proposed model for AD classification is detailed in terms of its network architecture, and a comprehensive model summary with parameter counts is presented in
Table 4. Additionally, the hyperparameters used for effective training of the model are presented in
Table 5.
The combination of attention mechanisms and MAF plays a pivotal role in enhancing the model’s ability to distinguish subtle structural changes, including hippocampal atrophy and cortical thinning in MRI volumes. The channel-wise attention mechanisms enable the network to selectively focus on clinically significant brain regions, such as the hippocampus and cortex, by assigning higher weights to the most informative feature channels. Meanwhile, the MAF block enriches the learned representations by processing the same feature vector through multiple activation functions—ReLU, SiLU, and GELU—with each function capturing distinct nonlinear patterns. This design ensures that spatially localised patterns and diverse nonlinear characteristics are emphasised, resulting in improved robustness and classification performance in heterogeneous patient populations.
3.5. Attention-Driven Convolution Block (ADCB)
The ADCB block employs a series of layers for feature extraction. After the 3D-CNN layer extracts the features, the results are fed into an attention block. Here, global average pooling (GAP) is applied to compute a channel descriptor vector, which is then multiplied by each channel. This process increases the weight of the channels on which the attention mechanism focuses. The output is subsequently passed through a batch normalisation layer, followed by a max-pooling layer. A detailed explanation of each layer’s structure is provided below.
3.5.1. 3D Convolutional Neural Network (3D CNN)
To extract features from 3D images, a 3D-CNN layer was utilised. Each MRI volume is processed by the 3D CNN, where L, W, and H represent the length, width, and height of the input image, respectively. The layer comprises multiple convolutional kernels, each followed by a GELU activation function.
The spatial features extracted through this process retain the structure of the input image, ensuring the preservation of volumetric information. After applying the 3D CNN layer, a 3D feature map with C channels is obtained, where C is the number of feature channels.
This block helps to capture spatial hierarchies in three dimensions, which are crucial for distinguishing between complex patterns in Alzheimer’s MRI scans.
Figure 4 illustrates the 3D CNN architecture used in this study.
3.5.2. Channel Attention
The channel attention mechanism enhances feature representation by emphasising the most relevant feature channels within the 3D feature maps. Following feature extraction from the 3D-CNN layer, GAP is first applied to aggregate spatial information and produce a channel descriptor vector of dimension C. This descriptor captures the global contextual importance of each feature channel.
To model nonlinear interactions between channels, the aggregated vector is passed through two fully connected (FC) layers with a bottleneck structure similar to the Squeeze-and-Excitation (SE) block [70]. The first fully connected (FC) layer reduces the dimensionality by a ratio r (set to 8 in this study), followed by a ReLU activation to introduce nonlinearity, and the second FC layer restores the original channel dimension. Finally, a sigmoid activation is applied to generate the channel attention weights A:

A = σ(W₂ δ(W₁ z))

where z is the GAP output, W₁ and W₂ are the learnable parameters of the two FC layers, δ represents the ReLU activation, and σ denotes the sigmoid function.
The resulting attention weights A are used to recalibrate the original feature map F through channel-wise multiplication:

F′ = A · F

where · represents element-wise multiplication between the attention weights and the feature maps.
Compared to classic channel attention mechanisms such as SE-Net, the proposed ADCB implementation integrates this lightweight yet expressive channel attention directly into the 3D convolutional pipeline. This design enables efficient parameter usage while maintaining the ability to capture nonlinear inter-channel dependencies critical for distinguishing subtle anatomical variations in MRI data associated with Alzheimer’s disease.
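The squeeze-and-excite recalibration can be sketched in NumPy. This is an illustrative forward pass only; the weight matrices here are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """SE-style channel attention for a feature map F of shape (L, W, H, C).
    W1: (C, C//r) bottleneck weights, W2: (C//r, C) expansion weights."""
    z = F.mean(axis=(0, 1, 2))                  # GAP -> channel descriptor (C,)
    a = sigmoid(np.maximum(z @ W1, 0.0) @ W2)   # FC -> ReLU -> FC -> sigmoid
    return F * a                                # channel-wise recalibration

rng = np.random.default_rng(0)
F = rng.random((4, 4, 4, 16))
W1 = rng.standard_normal((16, 2))   # reduction ratio r = 8
W2 = rng.standard_normal((2, 16))
F_att = channel_attention(F, W1, W2)
```

Since each attention weight lies in (0, 1), every channel is scaled down in proportion to its estimated importance, leaving the feature-map shape unchanged.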
3.5.3. 3D Batch Normalization (3D-BN)
To enhance and expedite the training process, 3D-BN was applied following each 3D convolution layer. The output of the 3D convolution layer was normalised for each mini-batch by adjusting the mean and variance of the feature maps:

x̂ = (x − μ) / √(σ² + ε)

where σ² and μ represent the variance and mean of the feature maps, and ε is a very small constant that prevents division by zero. After normalisation, the learnable scaling (γ) and offset (β) parameters are applied, enabling the network to adjust the normalised output as needed:

y = γ x̂ + β
The use of 3D batch normalisation reduces internal covariate shift and makes the network less sensitive to initialisation, thereby improving generalisation.
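The per-channel normalisation can be sketched directly in NumPy. The identity scale and zero offset used here are illustrative stand-ins for the learnable γ and β.

```python
import numpy as np

def batchnorm3d(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch of 3D feature maps x: (N, D, H, W, C)
    per channel, then apply learnable scale (gamma) and offset (beta)."""
    mu = x.mean(axis=(0, 1, 2, 3), keepdims=True)
    var = x.var(axis=(0, 1, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(1).standard_normal((2, 4, 4, 4, 8))
y = batchnorm3d(x, gamma=np.ones(8), beta=np.zeros(8))
```

With γ = 1 and β = 0, each channel of the output has approximately zero mean and unit variance across the batch and spatial dimensions.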
3.5.4. 3D MaxPooling
3D max-pooling layers were employed to shrink the spatial dimensions and preserve only the most significant features. The operation essentially down-samples the data by choosing the maximum value from non-overlapping sections within the feature maps.
By preserving the most significant features and reducing computational load, the 3D max-pooling operation contributes to hierarchical feature extraction while minimising overfitting.
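The non-overlapping max-pooling operation can be expressed compactly with a reshape, assuming a single-channel layout of (D, H, W, C); the 2 × 2 × 2 window matches the down-sampling described above.

```python
import numpy as np

def maxpool3d(x, k=2):
    """Non-overlapping k x k x k max pooling over x: (D, H, W, C)."""
    d, h, w, c = x.shape
    x = x[:d // k * k, :h // k * k, :w // k * k]   # crop to a multiple of k
    x = x.reshape(d // k, k, h // k, k, w // k, k, c)
    return x.max(axis=(1, 3, 5))                   # max within each window

x = np.arange(4 * 4 * 4 * 1, dtype=float).reshape(4, 4, 4, 1)
pooled = maxpool3d(x)   # shape (2, 2, 2, 1)
```

Each output voxel holds the maximum of its 2 × 2 × 2 window, halving every spatial axis while keeping the most salient responses.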
3.6. Global Average Pooling (GAP)
GAP was utilised as a down-sampling technique to aggregate spatial information across the feature maps. For a feature map F of size L × W × H × C, GAP computes the mean of all spatial elements for each channel:

z_c = (1 / (L · W · H)) Σ_{i,j,k} F(i, j, k, c)

where z_c is the aggregated value for channel c. The resulting vector z serves as a condensed global descriptor that is both computationally efficient and effective in reducing spatial redundancy, as depicted in Figure 5.
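The pooling in the equation above amounts to a single mean over the spatial axes:

```python
import numpy as np

def gap3d(F):
    """Global average pooling: (L, W, H, C) -> (C,)."""
    return F.mean(axis=(0, 1, 2))

# Channel c holds the constant value c, so GAP should recover [0, 1, 2, 3, 4].
F = np.ones((2, 3, 4, 5)) * np.arange(5)
z = gap3d(F)
```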
3.7. Activation Functions
3.7.1. Rectified Linear Unit (ReLU)
Among the most frequently employed activation functions, ReLU is known for its simplicity and effectiveness. It is defined as:

f(x) = max(0, x)

Here, x is the activation function’s input. ReLU introduces nonlinearity, allowing the model to learn complex relationships between inputs and outputs, and it keeps gradients non-zero for positive input values, helping to prevent the vanishing gradient problem. However, it can suffer from the “dying ReLU” problem, in which some neurons become inactive for the entire training cycle if they consistently produce zero.
3.7.2. Sigmoid Linear Unit (SiLU)
Often referred to as the Swish activation function, SiLU is defined as:

f(x) = x · σ(x)

where σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

SiLU combines the properties of both the sigmoid and linear functions, producing smooth, non-monotonic behaviour that enhances gradient flow during training, especially in biomedical imaging. Research has shown that this activation function performs well across a variety of deep learning tasks by facilitating better feature representation and improving learning dynamics compared with other activation functions.
3.7.3. Gaussian Error Linear Unit (GELU)
The GELU function is a smooth approximation of the ReLU function with stochastic regularisation properties. GELU is mathematically written as:

GELU(x) = x · Φ(x)

where Φ(x) is the cumulative distribution function of the standard normal distribution:

Φ(x) = (1/2) [1 + erf(x / √2)]

Alternatively, GELU can be approximated for computational efficiency as:

GELU(x) ≈ 0.5 x [1 + tanh(√(2/π) (x + 0.044715 x³))]

GELU allows for smooth activation that retains input values based on their significance, unlike ReLU, which truncates all negative values to zero. This property makes GELU especially effective in transformer-based architectures and large-scale models.
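The three activations, including the exact and tanh-approximate GELU forms, can be compared directly in NumPy:

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def silu(x):                       # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu(x):                       # exact form: x * Phi(x)
    return x * 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):                  # tanh approximation
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

x = np.linspace(-3.0, 3.0, 13)
```

Over this range the tanh approximation tracks the exact GELU to within roughly a thousandth, which is why it is commonly used for efficiency.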
The hyperparameters used in the proposed model were selected through a controlled validation-based tuning procedure. Each hyperparameter was explored within a predefined range informed by common practice in 3D CNN and attention-based architectures. Specifically, the L2-regularisation coefficient (λ) was evaluated over the range {0.001, 0.005, 0.01, 0.02, 0.05}, with λ = 0.02 providing the best trade-off between stability and overfitting control. The channel-attention reduction ratio was tested over {4, 6, 8, 10}, and a value of 8 offered the most consistent convergence behaviour.
The initial learning rate of 0.01 in Table 5 represents the starting point of a cosine-annealing schedule. During fine-tuning, the learning rate was gradually reduced to a small minimum value, allowing the model to refine its weights with minimal oscillation. Other hyperparameters (batch size, optimiser parameters, activation combinations) were tested within narrow ranges and fixed once optimal validation performance was observed. This systematic tuning process ensured that all selected values reflected validation-driven optimisation rather than arbitrary choices.
3.8. Multi-Activation Fusion (MAF) Block
The MAF block improved the network’s capacity to capture diverse characteristic features. The GAP feature vector was processed by three FC layers in parallel, each using a different activation function: GELU, SiLU, and ReLU. These activation functions were selected for their ability to capture complementary nonlinear relationships in the data.
The outputs from the three dense layers were concatenated to form a unified feature vector:

f_MAF = Concat(f_GELU, f_SiLU, f_ReLU)

This fusion technique utilises the strengths of every activation function and offers a more expressive representation of the MRI data. Before classification, an FC layer further reduced the dimensionality of the fused vector, and a dropout layer was added to prevent overfitting.
The MAF component integrates multiple activation functions (GELU, ReLU, and SiLU) to capture complementary nonlinear transformations in MRI data, enriching feature representations and enhancing diagnostic accuracy. This fusion mitigates the limitations of any single activation function. For example, ReLU introduces sparsity but suppresses all negative values, whereas GELU and SiLU retain graded responses in both positive and negative domains. This diversity improves the network’s robustness and generalisation, particularly in detecting subtle regional variations that are characteristic of early-stage AD.
The GELU activation is probabilistic, weighting each input by the Gaussian cumulative distribution function. Unlike deterministic activations such as ReLU and SiLU, GELU preserves small negative or near-zero values with probability proportional to their magnitude. This probabilistic smoothing allows weak structural patterns—often diffuse in the early stages of Alzheimer’s disease—to be retained rather than abruptly discarded. Although GELU does not directly enhance interpretability in the same way as attention mechanisms, its smoother activation profile yields more stable and anatomically consistent feature responses, indirectly supporting interpretability in MRI-based analysis.
Overall, this fusion strategy broadens the expressive capacity of the model by combining activation functions with distinct gradient behaviours. Their parallel application expands the effective function space the network can represent, improving its ability to learn subtle structural variations in brain MRI data. This contributes to more stable convergence and higher classification accuracy. The MAF block, therefore, spans sparse (ReLU), smooth (SiLU), and probabilistic (GELU) activation regimes, promoting robust learning under heterogeneous imaging conditions.
While conceptually analogous to multi-headed attention in its aim to capture diverse aspects of feature representations, the MAF block differs fundamentally in mechanism. Multi-headed attention partitions feature embeddings into subspaces and learns attention weights to model contextual relationships. In contrast, MAF enhances nonlinearity diversity by applying multiple activation functions to the same feature vector, thereby enriching representational capacity without introducing attention parameters or inter-feature dependencies. Thus, MAF focuses on functional diversity across activations, whereas multi-headed attention emphasises relational diversity across features.
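A minimal sketch of the fusion follows, with random placeholder weights standing in for the three trained dense branches:

```python
import numpy as np

def gelu(x):   # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def maf_block(z, Wg, Ws, Wr):
    """Multi-activation fusion: the same GAP vector z flows through three
    parallel dense branches (GELU, SiLU, ReLU) and is then concatenated."""
    return np.concatenate([gelu(z @ Wg), silu(z @ Ws),
                           np.maximum(0.0, z @ Wr)])

rng = np.random.default_rng(0)
z = rng.random(32)                               # GAP descriptor
Wg, Ws, Wr = (rng.standard_normal((32, 16)) for _ in range(3))
fused = maf_block(z, Wg, Ws, Wr)                 # shape (48,)
```

Because each branch applies a different nonlinearity to the same input, the concatenated vector spans sparse, smooth, and probabilistic activation regimes in a single representation, as described above.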
3.9. Implementation Details
The decision to employ five convolutional attention (ADCB) blocks was guided by empirical evaluation as well as the inherent spatial constraints of the 3D MRI volumes (128 × 128 × 128). Each block includes an unpadded 3 × 3 × 3 convolution followed by a 2 × 2 × 2 max-pooling operation, roughly halving the spatial resolution at each stage; after five sequential blocks, the feature map reaches a size of 2 × 2 × 2 with 512 feature channels. This represents the deepest feasible spatial compression before collapsing to non-viable dimensions (e.g., 1 × 1 × 1) or losing meaningful volumetric context.
Preliminary experiments with three and four blocks showed that shallower configurations preserved larger spatial grids but failed to capture high-level structural dependencies relevant to Alzheimer’s pathology, producing weaker discrimination between AD and non-AD patterns. Furthermore, reducing the architecture to four blocks was found to be computationally unfavourable due to the excessively large tensor size produced before the dense layers. Specifically, an architecture with four blocks yields a 6 × 6 × 6 × 256 feature map, which, when flattened (55,296 values) and connected to the subsequent 512-unit dense layer, results in 28,311,552 parameters, substantially increasing memory consumption and training cost without corresponding gains in performance. This parameter explosion not only makes the four-block configuration inefficient but also raises the risk of overfitting.
Conversely, extending the model to six blocks was not possible because the additional pooling step would reduce the spatial dimension below 2 × 2 × 2, causing the tensor to collapse and preventing stable feature extraction. Therefore, five blocks provided the optimal balance between depth, representational capacity, and computational feasibility for 3D MRI-based AD classification.
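The depth arithmetic can be checked quickly; this assumes each ADCB applies an unpadded 3 × 3 × 3 convolution (reducing each axis by 2) followed by 2 × 2 × 2 pooling, which yields a 6 × 6 × 6 map after four blocks and 2 × 2 × 2 after five:

```python
size, sizes = 128, []
for _ in range(5):
    size = (size - 2) // 2     # unpadded 3x3x3 conv, then 2x2x2 max pooling
    sizes.append(size)
# sizes is now [63, 30, 14, 6, 2]

# Dense-layer parameter count when stopping at four blocks (256 channels):
params_four_blocks = 6 * 6 * 6 * 256 * 512   # 28,311,552
```

A sixth block would require pooling a 2 × 2 × 2 map below viable dimensions, matching the argument above for stopping at five blocks.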
On the other hand, the use of a uniform 3 × 3 × 3 kernel across all convolutional layers was motivated by its strong performance in 3D medical imaging tasks and by theoretical advantages demonstrated in the volumetric CNN literature. A 3 × 3 × 3 kernel provides the smallest receptive field capable of capturing local anatomical variations while maintaining a manageable number of trainable parameters, which is crucial for end-to-end training without pre-training. Larger kernels (e.g., 5 × 5 × 5) were avoided because they dramatically increase computational load and risk over-smoothing the fine-grained structural cues characteristic of AD-related atrophy. Smaller kernels (e.g., 1 × 1 × 1), although beneficial for channel mixing, are insufficient for modelling spatial continuity in 3D neuroimaging. Using a consistent kernel size also stabilises optimisation and ensures that later layers, especially after multiple down-sampling stages, maintain a consistent and interpretable receptive field relative to the original MRI volume. Thus, the 3 × 3 × 3 kernel offers an effective compromise between spatial context, computational efficiency, and model generalisability.
4. Results and Discussion
The experimental setup utilised a high-performance personal computer to execute and evaluate the proposed model. The system was configured with two Intel Xeon 2687W v4 CPUs, providing significant computational power for parallel processing, supported by 64 GB of RAM to handle the substantial memory requirements of deep learning operations. Additionally, the graphical computations were accelerated using an NVIDIA RTX-3090 GPU with 24 GB of dedicated VRAM, ensuring efficient handling of complex 3D convolutional operations and large-scale data.
A dedicated test set, created by splitting the dataset before training, was used for model evaluation. This approach guarantees that the test data remain unseen during the training process, providing an objective assessment of the model’s performance and generalisability. The assessment focused on several performance criteria in order to capture the efficiency and reliability of the model from multiple angles.
Using several evaluation criteria, the aim was to benchmark the model’s performance comprehensively and to pinpoint its strengths and potential limitations. These measures highlight the model’s capacity to balance sensitivity, precision, and overall classification accuracy, providing a complete picture of its predictive qualities. The analysis of these results is central to establishing the success and dependability of the model’s training, as well as its applicability to real-world Alzheimer’s disease detection scenarios.
The main metrics used to assess the classifier are discussed in the following sections, together with their relevance and their contribution to understanding the model’s performance.
4.1. Accuracy
Examining the performance of a classification model begins with accuracy. It is found as the ratio of correctly classified instances to the total count of instances in the dataset, as given by Equation (12):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (12)

Although accuracy is crucial for a first review, other factors such as precision, recall, and the F1 score should also be taken into account to fully appreciate the performance of the model. The quantities in Equation (12) are defined as follows:
True Positive (TP): The count of positive cases the model correctly identified as positive.
True Negative (TN): The count of negative cases the model correctly categorised as negative.
False Positive (FP): The count of negative cases the model misinterpreted as positive.
False Negative (FN): The count of positive cases the model misclassified as negative.
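The counts defined above map directly onto the accuracy formula; a minimal sketch (the counts in the example call are hypothetical):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correctly classified samples / all samples (Equation (12))."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for a binary AD/CN classifier:
print(accuracy(tp=45, tn=48, fp=3, fn=5))
```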
4.2. Precision
Precision evaluates the reliability of a model's positive predictions. It is computed as the ratio of TP predictions to the total number of positive predictions, encompassing both TP and FP. A high precision score indicates a low rate of false positive errors, reflecting the dependability of the model when it produces a positive identification.
A precision of 1 signifies perfect reliability of the positive predictions: every sample the model labels as positive is truly positive, with no negative samples misclassified as positive.
4.3. Recall
Recall, also known as sensitivity or the true positive rate, is a crucial indicator of a model's ability to identify positive cases. It measures the percentage of real positive instances the model detects. A high recall value shows that the model correctly classifies most positive instances, although the metric ignores FP.
This metric is especially significant in scenarios where the primary objective is to capture as many true positives as possible, even if doing so produces more false positives.
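These definitions, together with the F1 score mentioned earlier, can be sketched as follows (the zero-denominator guards are a small practical addition):

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of real positives that the model detects."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```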
4.4. Area Under Curve (AUC)
Another metric for evaluating model performance is the AUC. Mathematically, it can be expressed as:

AUC = ∫₀¹ TPR(FPR⁻¹(t)) dt

In this context, TPR denotes the true positive rate, FPR signifies the false positive rate, and FPR⁻¹(t) refers to the inverse of the false positive rate at the threshold t. The AUC score ranges from 0 to 1, with higher values indicating better model performance.
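Equivalently, the AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (the Mann–Whitney interpretation of the ROC integral). A minimal sketch of this rank-based computation:

```python
def auc_score(labels, scores):
    """AUC as P(score_positive > score_negative), with ties counted as 0.5.

    labels: 0/1 ground-truth values; scores: predicted probabilities.
    Quadratic in sample count, so only suitable as an illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```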
4.5. Loss Function
The loss function used in this model is the weighted binary cross-entropy (BCE). It is typically used for binary classification tasks and penalises incorrect predictions in proportion to their predicted probability. The weight assigned to each class can be adjusted to address class imbalance, giving greater importance to minority-class samples. The weighted BCE loss is mathematically defined as:

L_BCE = −(1/N) Σᵢ₌₁ᴺ [w₊ yᵢ log(ŷᵢ) + w₋ (1 − yᵢ) log(1 − ŷᵢ)]

In this context, N represents the number of batch samples, yᵢ refers to the true label for each sample, and ŷᵢ denotes the predicted probability of class 1. The weights for the positive and negative classes are represented by w₊ and w₋, respectively. This function helps the model to prioritise accurate predictions for the underrepresented class, especially in situations of class imbalance.
In this study, the class weights were computed from the inverse of the class frequencies in the training dataset to counteract the imbalance between AD and CN samples. Specifically, the weights were calculated as:

w_c = N / (2 · N_c)    (16)

where N is the total number of training samples and N_c denotes the number of samples in class c. This formulation ensures that both classes contribute equally during optimisation, despite the unequal number of samples. Preliminary experiments confirmed that these analytically derived weights achieved stable convergence without the need for additional empirical tuning, improving sensitivity to Alzheimer's cases while maintaining balanced overall performance.
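The weighting scheme and loss can be sketched as follows. The normalisation w_c = N / (2 · N_c) is a common convention for inverse-frequency weighting and is an assumption here, as is the clipping constant used for numerical stability:

```python
import math

def class_weights(labels):
    """Inverse-frequency weights so each class contributes ~equally.

    Returns (w_neg, w_pos). Assumes the w_c = N / (2 * N_c) convention.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return n / (2 * n_neg), n / (2 * n_pos)

def weighted_bce(labels, probs, w_neg, w_pos, eps=1e-7):
    """Weighted binary cross-entropy averaged over the batch."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip for numerical stability
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(labels)
```

With this normalisation, w_c · N_c = N / 2 for both classes, so the minority AD class carries the same total weight in the loss as the majority CN class.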
4.6. Self-Comparisons and Model Performance Analysis
The proposed model involved an iterative process of tuning multiple hyperparameters and carefully analysing its performance at each step. During this process, all intermediate performance values were computed exclusively on the validation set, and the independent test set was not accessed until the final model was completed. Initially, the model was trained on a simple network configuration, yielding a validation precision of 100% but a recall of only 2.35%. This stark disparity highlighted issues such as data imbalance and the need for a more suitable architecture aligned with the dataset’s nuances.
To address these challenges, the input data was refined by removing skull regions using the HD-BET tool [69]. This preprocessing step was followed by data augmentation techniques, including rotations (±15°), shear transformations (shear ratio: 0.1), and translations (±5 pixels). The architecture was then restructured using a hybrid strategy combining CNNs and vision transformers, incorporating five 3D-CNN blocks with channel- and pixel-level attention, batch normalisation, and 3D max-pooling (MaxPool3D) layers.
These modifications improved the validation accuracy to 75.12%, which further increased to 76.35% with augmented data. However, training stability remained an issue, as the loss curve exhibited significant fluctuations and poor convergence. To mitigate this, a “reduce-on-plateau” learning rate scheduler was employed. Since the minority Alzheimer’s class remained difficult to learn, data augmentation alone proved insufficient. A weighted loss strategy was therefore implemented using analytically derived class weights (Equation (16)). This adjustment resulted in validation accuracies of 79.10% (without augmentation) and 74.92% (with augmentation).
Removing the spatial attention mechanism further increased the validation accuracy to 86.84%, indicating its limited utility for this task. Although Grad-CAM visualisation confirmed that the model focused on clinically relevant regions, the classifier block underutilised the extracted features. To address this, the MAF block was introduced, yielding a validation accuracy of 84.33%. To reduce overfitting, L2 regularisation (λ = 0.02) and 20% dropout were applied within and after the MAF block (Figure 4). Only after finalising this architecture was the model retrained from scratch on the training set and evaluated once on the independent hold-out test set. This final evaluation achieved a test accuracy of 92.1%, demonstrating the robustness and effectiveness of the proposed approach.
Through this systematic and validation-driven refinement process, the final model addressed the initial challenges and achieved state-of-the-art performance.
4.7. Evaluation with Recent Studies
Recent advancements in biomedical imaging have showcased significant improvements facilitated by the application of deep learning models. These state-of-the-art models, such as 3D-MobileNetV2, 3D-VGG19, Inception V3-3D, 3D-DenseNet121, 3D-VGG16, 3D-ResNet18, 3D-RegNet, and M3T, have been extensively employed for tasks like medical image classification and segmentation. Each of these architectures brings unique features, such as depthwise separable convolutions in MobileNetV2 for efficiency or densely connected layers in DenseNet121 for gradient flow and feature reuse, contributing to performance enhancement in diverse biomedical applications.
All models in the comparative analysis were trained and evaluated using the same ADNI dataset, identical preprocessing steps, and executed on the same hardware. This setup ensures a fair and consistent benchmarking environment. Detailed performance metrics, including accuracy, AUC, precision, recall, F1-score, and computational cost, were calculated and analysed for each model. The proposed framework combines 3D-CNNs, attention mechanisms, and an MAF block, aiming to outperform existing architectures by enhancing feature extraction and prioritising diagnostically relevant information through channel-wise attention modules.
Table 6 summarises the findings of these comparative analyses. All experiments were conducted under identical conditions using a single NVIDIA RTX 3090 GPU with 24 GB memory, the same batch size, and identical preprocessing and data loading pipelines to ensure fairness in computational cost evaluation. This consistent setup allows for an equitable comparison of both accuracy and efficiency among models. The table offers a complete picture of how each model performed on the test set, emphasising the superior accuracy and efficiency of the proposed framework. By methodically assessing recent advances, this work shows how the proposed method achieves state-of-the-art performance in the classification of Alzheimer’s disease. The following subsections complement this benchmarking with further analyses of the model.
The results reported in Table 6 indicate that the proposed model achieves strong overall performance, with an accuracy of 92.10% and an AUC of 0.99. In addition, the model achieved a sensitivity (recall) of 89.3% and a specificity of 94.1% on the independent ADNI test set. When considered alongside the methods previously reviewed in Table 2, it becomes clear that many existing CNN-based and hybrid architectures achieve comparable accuracy only with substantially higher computational cost or larger parameter counts. In contrast, the proposed attention-driven 3D CNN with MAF attains competitive or superior performance while maintaining a lightweight design trained entirely from scratch. The improvements in sensitivity and F1-score further suggest that the combination of channel-wise attention and multi-activation fusion enables the model to capture subtle AD-related structural variations more effectively than several state-of-the-art approaches. These findings highlight the contribution of this work by demonstrating that high diagnostic performance can be achieved with significantly reduced architectural complexity, offering a more efficient and scalable solution for MRI-based AD classification.
4.8. Evaluation with Different Activation Functions
The motivation behind using multiple activation functions lies in their complementary nonlinear properties, which are particularly advantageous in neuroimaging tasks where both positive and negative voxel intensities contain clinically meaningful information. ReLU introduces sparsity and works well for high-dimensional data; however, it completely discards negative values, potentially removing subtle but important variations in MRI intensity patterns. SiLU, in contrast, is a smooth and non-monotonic activation that preserves negative inputs with gentle slopes, enabling more stable gradient flow in deeper architectures. GELU probabilistically retains inputs based on their magnitude, offering smoother gating behaviour than ReLU and introducing distinctive curvature that is effective for modelling fine-grained anatomical differences.
Although the SiLU and GELU activation curves appear visually similar near the origin, their gradients, curvature behaviour, and treatment of low-magnitude inputs differ in meaningful ways. These differences become more relevant in high-dimensional MRI feature spaces, where small nonlinear variations can amplify important structural details. As a result, SiLU and GELU provide complementary transformations when applied within the MAF block, enhancing its representational power for AD-related features. To evaluate this fusion strategy, we conducted ablation experiments using several activation-function combinations across the three MAF branches. The performance metrics (accuracy, AUC, precision, recall, and F1-score) are reported in Table 7. The best performance was consistently achieved when combining ReLU, GELU, and SiLU, which outperformed all other configurations across every metric. In contrast, using the same activation function in all branches (e.g., ReLU × 3) led to noticeably weaker performance, suggesting limited nonlinear diversity.
These findings confirm that mixing nonlinearities allows the model to capture a broader range of MRI feature patterns than any single activation alone. To ensure the robustness of these observations, each experiment was repeated five times with different random seeds, and the superior performance of the ReLU + GELU + SiLU configuration remained consistent across all runs. This behaviour was further supported by the 10-fold cross-validation results presented in Section 4.10. Collectively, these results demonstrate that diverse activation functions within the MAF block generate richer and more expressive feature representations, ultimately improving the model’s ability to detect subtle AD-related structural changes.
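To make the fusion idea concrete, the three nonlinearities can be applied to the same features and their outputs combined. The NumPy sketch below uses concatenation along the channel axis and the tanh approximation of GELU; both choices are illustrative assumptions, not the paper's exact MAF implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    # x * sigmoid(x): smooth, non-monotonic, preserves small negative inputs
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU: probabilistic gating by input magnitude
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def maf_fuse(features):
    """Multi-activation fusion sketch: apply each nonlinearity to the same
    features and concatenate the results along the channel axis (axis 0)."""
    return np.concatenate([relu(features), gelu(features), silu(features)], axis=0)
```

Note how the ReLU branch zeroes negative inputs while the SiLU and GELU branches retain them with different curvature, which is the source of the nonlinear diversity the ablation study measures.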
4.9. Evaluation of Proposed Model
The evaluation of the proposed model revealed its exceptional performance, marked by significant improvements throughout the training process. The model achieved a validation accuracy of 96.36% with a corresponding validation loss of 0.2822, as illustrated in Figure 6 and Figure 7, respectively.
On the validation dataset, the model attained a precision of 95.56%, a recall of 95.56%, and an area under the curve (AUC) of 99.54%. These metrics reflect the model’s ability to balance precision and recall effectively while maintaining high discriminative power. The progression of these metrics during training and validation is visualised in Figure 8, Figure 9 and Figure 10, which highlight the consistent and robust performance of the model.
This comprehensive evaluation demonstrates that the proposed model not only excels in accuracy but also maintains a high level of reliability and generalisation, confirming its suitability for the task at hand.
4.10. K-Fold Cross Validation
To evaluate the generalisability of the proposed approach, a 10-fold cross-validation strategy was implemented. This method evaluates the model on multiple subsets of the data, helping to reduce the risk of overfitting and to produce a more reliable performance assessment. The dataset described in Section 3.1 is divided into ten subsets. During each fold, one subset is reserved for testing, another for validation, and the remaining eight subsets for training. Once an iteration concludes, the next subset is selected for testing, whilst the others are redistributed to serve as validation and training sets, so that every subset acts exactly once as the test set. Importantly, all folds were constructed using a strict patient-based strategy, ensuring that all scans belonging to the same subject remained within a single fold and never appeared across multiple folds during cross-validation. A robust estimate of the performance of the proposed model is obtained by averaging the evaluation metrics over all ten repetitions.
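The patient-based fold construction can be sketched as follows. This is a simplified illustration: the `patient_of` mapping and the round-robin assignment are assumptions, and a real split might additionally balance class labels across folds.

```python
def patient_level_folds(scan_ids, patient_of, k=10):
    """Assign scans to k folds so all scans of one patient share a fold.

    scan_ids: list of scan identifiers.
    patient_of: dict mapping scan id -> patient id.
    Patients are distributed round-robin over the k folds.
    """
    patients = sorted({patient_of[s] for s in scan_ids})
    fold_of_patient = {p: i % k for i, p in enumerate(patients)}
    folds = [[] for _ in range(k)]
    for s in scan_ids:
        folds[fold_of_patient[patient_of[s]]].append(s)
    return folds
```

Keeping all scans of a subject inside one fold prevents data leakage: the model is never tested on a scan of a patient it has already seen during training.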
The K-fold cross-validation process has the mathematical form:

M̄ = (1/K) Σₖ₌₁ᴷ Mₖ

where Mₖ represents the evaluation metric for the k-th fold, and K is the number of folds (10 in this case).
Following comprehensive experimentation and adjustments, the optimal model underwent evaluation through 10-fold Cross-Validation. The findings indicated the model’s robust performance, with a mean test loss of 0.7713, a mean test accuracy of 91.23%, a mean test AUC of 93.75%, a mean test precision of 90.29%, and a mean test recall of 88.30%. These metrics underscore the model’s robustness and dependability for classifying Alzheimer’s disease. Employing 10-fold cross-validation supports the credibility of the performance metrics, mitigating bias from specific data divisions.
The higher AUC on the independent test set (0.99) reflects the relative homogeneity of that split, whereas the lower cross-validation AUC (0.93) results from increased scanner and demographic variability across folds. Consequently, cross-validation provides a more conservative and realistic estimate of model performance under heterogeneous clinical conditions.
4.11. WILCOXON Signed-Rank Test
A statistical analysis was performed to assess the significance of the results and determine whether they could be attributed to random chance. Wilcoxon signed-rank tests were used to compute p-values for each model comparison. This test is commonly applied to paired samples when the data do not meet normal-distribution assumptions. By evaluating pairwise differences across multiple observations, the analysis assessed variations in population median ranks.
The findings, summarised in Table 8, indicate that the proposed model significantly outperformed the other models tested. Specifically, the p-values for comparisons between the proposed model and the other five models were all less than 0.05, indicating that the proposed model provides statistically significant improvements over its competitors.
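For illustration, the signed-rank statistic and a normal-approximation p-value can be computed as follows. This is a sketch: library implementations such as `scipy.stats.wilcoxon` handle zeros, ties, and exact small-sample distributions more carefully, and the paper's exact procedure is not reproduced here.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (W+, p) where W+ is the sum of ranks of positive differences.
    Zero differences are dropped; tied |differences| receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks within runs of tied |differences|
        j = i
        while j + 1 < n and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[ranked[t]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

A p-value below 0.05, as reported in Table 8, indicates that the paired per-fold differences between two models are unlikely under the null hypothesis of equal median performance.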
4.12. Uncertainty Quantification and Cross-Validation Variability
To provide a rigorous assessment of model performance, we report uncertainty estimates for both the independent test set and the 10-fold cross-validation procedure.
Independent Test-Set Confidence Intervals
For all proportion-based metrics (accuracy, sensitivity, specificity, precision, and F1-score), 95% confidence intervals were computed using the Wilson binomial method based on the observed confusion-matrix counts. The resulting confidence intervals are summarised in Table 9. ROC AUC confidence intervals were computed using DeLong’s analytic method, which provides a non-parametric estimate of the standard error with minimal computational overhead.
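A Wilson score interval for a binomial proportion can be computed as follows; this is the standard formulation, shown with z = 1.96 for a 95% interval (the exact counts behind Table 9 are not reproduced here):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion successes/n."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)
```

Unlike the simpler Wald interval, the Wilson interval remains well behaved for proportions near 0 or 1, which matters when sensitivity or specificity is high, as in this study.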
To characterise performance variability across data partitions, fold-wise results are reported together with their Wilson 95% confidence intervals, as shown in Table 10. These values illustrate the performance dispersion introduced by the 10-fold cross-validation procedure.
Finally, Table 11 summarises the mean and standard deviation of each metric across the 10 folds, providing an aggregate measure of cross-validation variability. These results demonstrate that the model exhibits stable performance across folds with limited variability.
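The per-metric aggregation reported in Table 11 amounts to a mean and sample standard deviation across folds; a minimal sketch with hypothetical fold accuracies:

```python
import statistics

def fold_summary(metric_values):
    """Mean and sample standard deviation of one metric across folds."""
    return statistics.mean(metric_values), statistics.stdev(metric_values)

# Hypothetical per-fold accuracies (not the paper's actual values):
accs = [0.912, 0.905, 0.918, 0.909, 0.921, 0.903, 0.915, 0.910, 0.917, 0.913]
mean_acc, sd_acc = fold_summary(accs)
```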
4.13. Visualisation Through Gradient-Weighted Class Activation Map (Grad-CAM)
Grad-CAM is a visual tool for interpreting the decisions of deep learning models. It offers insight into which areas of an input image most affect the model’s predictions. In medical imaging, and particularly in 3D MRI analysis for AD, the activation maps produced by Grad-CAM are essential for understanding which spatial regions drive the network’s classification. These visualisations are indispensable because they let viewers confirm whether the model emphasises anatomically important areas known to show structural changes in AD patients, including the hippocampus, ventricles, and cortex.
The Grad-CAM process is driven by computing the gradients of the target class score with respect to the model’s convolutional feature maps. These gradients are then weighted and aggregated to create a coarse heatmap emphasising the image’s salient areas. Superimposed on the original input, this heatmap provides a clear visual depiction of the areas most influencing the classification decision. For example, in Figure 11, the activation map highlights regions of the brain, focusing primarily on AD-specific structures such as the hippocampus and ventricular areas, as well as regions showing cortical shrinkage.
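The weighting-and-aggregation step described above can be sketched in NumPy as follows. This is an illustrative simplification of Grad-CAM for a 3D network; in practice the gradients are obtained by backpropagating the class score through the trained model.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM sketch for 3D features.

    feature_maps, gradients: arrays of shape (channels, D, H, W), where
    `gradients` holds d(class score)/d(feature map).
    """
    # One importance weight per channel: spatial mean of its gradient.
    alphas = gradients.mean(axis=(1, 2, 3))
    # Weighted sum of feature maps -> coarse localisation map of shape (D, H, W).
    cam = np.tensordot(alphas, feature_maps, axes=1)
    # Keep only positive influence on the target class.
    cam = np.maximum(cam, 0.0)
    if cam.max() > 0:
        cam /= cam.max()  # normalise to [0, 1] for overlay on the MRI volume
    return cam
```

The resulting map is then upsampled to the input resolution and overlaid on the MRI scan, producing heatmaps like those in Figure 11.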
Figure 11a displays heatmaps showing that the activated regions are widely distributed throughout the brain. This indicates that the proposed model can comprehensively analyse AD-related abnormalities across the brain. The capability to generate such extensive activation areas is one of the key strengths of transformer networks, attributed to their large receptive field.
Furthermore, Figure 11b presents the activation map for an AD case in a 3D brain template. The heatmap primarily focuses on the hippocampus in the coronal plane and the ventricle region in the axial view. In particular, the right hippocampus shows a stronger emphasis than the left, which is consistent with previous studies reporting more significant shrinkage in the right hippocampus of AD patients [71,72]. These findings confirm that the proposed model effectively captures and highlights AD-related structural alterations in the brain.
Although AD–CN classification achieved an AUC of 92.75% and a precision of 90.29%, the clinically most significant challenges relate to the early detection of MCI and the prediction of conversion from MCI to AD. These tasks generally require longitudinal data, multimodal biomarkers, and harmonised multi-centre cohorts, which fall beyond the scope of the present work. The AD–CN setting used here provides a controlled framework for evaluating the proposed architectural components and analysing feature representations.
Future work will extend the framework to more clinically oriented scenarios, including MCI subtyping, multimodal integration, cross-scanner harmonisation, and longitudinal prediction. Further steps, such as external validation, enhanced interpretability, and alignment with clinical workflows, will be required to support eventual clinical deployment.
5. Conclusions
This study introduced an attention-driven 3D convolutional neural network enhanced with an MAF module for AD classification using structural MRI data. Leveraging 1175 MRI scans drawn from the ADNI dataset, the proposed model integrates channel-wise attention with activation-level diversity to capture a richer set of discriminative structural features. The framework achieved an accuracy of 92.1%, an AUC of 0.99, a precision of 91.3%, a recall of 89.3%, and an F1-score of 92%. Its reliability was further supported by 10-fold patient-level cross-validation, yielding an average accuracy of 91.23%, an AUC of 92.75%, a precision of 90.29%, and a recall of 88.30%.
Compared with recent deep learning approaches, the proposed model demonstrates competitive or superior performance while maintaining a relatively lightweight architecture. These results highlight the value of combining channel attention with multi-activation fusion for improved structural MRI interpretation. Overall, this work underscores the potential of deep learning to support early identification of AD-related brain changes and contribute to more accurate computer-aided diagnosis pipelines.
The accompanying codebase further enhances transparency and supports reproducibility, enabling future research to refine and extend the presented approach.
6. Limitations and Future Directions
Although the proposed model demonstrates strong classification performance, several limitations must be acknowledged. First, the evaluation relied exclusively on the ADNI dataset. While ADNI provides high-quality, well-structured imaging data, the lack of external validation limits conclusions about generalisability. Differences in scanner protocols, preprocessing pipelines, and demographic characteristics prevented the incorporation of datasets such as OASIS or AIBL, and these variations also contributed to the performance gap between the independent test-set AUC (0.99) and the cross-validation AUC (0.93). This discrepancy reflects ongoing challenges in robust generalisation under heterogeneous imaging conditions.
Second, despite class balancing strategies, dataset imbalance and limited representation of diverse populations remain potential sources of bias. Third, although computationally lighter than many transformer-based architectures, the proposed framework still requires substantial GPU resources, posing barriers to deployment in resource-limited clinical environments. Finally, interpretability remains a key challenge in deep learning for medical imaging; improved transparency is essential for clinical adoption.
Future work will focus on several directions. Expanding validation to external and demographically diverse cohorts will be critical for assessing population-level robustness. Integrating additional modalities—such as PET scans, genetic markers, and fluid biomarkers—may enhance diagnostic accuracy and enable multimodal disease staging. Advances in model explainability, including region-level attribution and clinically guided saliency methods, can improve clinician trust and interpretability. Additionally, evaluating the framework in longitudinal settings could support the detection of MCI, the prediction of conversion to AD, and the assessment of disease progression over time. Ultimately, these efforts aim to move the proposed system closer to real-world clinical integration.