3.2. Preprocessing
Dataset preprocessing is a crucial step in computer vision: it enhances images and suppresses noise so that discriminative features can be extracted and classification accuracy improves. In this study, we implemented a preprocessing step to produce high-quality images and boost the model’s ability to generalize across the varied brain images encountered in different patients. Cleaner inputs can also reduce computational complexity and make large-scale image analysis more practical. In addition, preprocessing supports quality control by detecting and eliminating artifacts and extraneous information from neuroimaging data, so that only high-quality data are used for analysis and diagnosis. Image preprocessing is therefore essential when using neuroimaging data for Alzheimer’s disease (AD) research, diagnosis, and therapy monitoring.
In this work, we propose a max-color mathematical formulation for contrast enhancement. In the initial step, the gray mean value of each color channel is computed by Equation (1), taken over the row and column pixel positions of the original image. The max-gray scale mean is then computed by Equation (2) from the max-value and max-ref mean functions. To further correct each channel in a piecewise fashion, Equation (3) applies a gain correction factor to each channel and yields the color-corrected image. A few resultant images are shown in Figure 5a.
In this study, to address class imbalance, particularly for the moderate dementia class, several augmentation techniques were applied to the MRI dataset. These techniques included horizontal and vertical flipping to introduce spatial variations, zooming by 20% to simulate different scales, and random rotations (±20 degrees) to improve robustness to rotation. Additionally, images were scaled between 80% and 120% of their original size to simulate variations in brain structure sizes, and elastic deformations were applied to mimic natural anatomical variations in brain tissue. These augmentation methods were selected to preserve real disease variability while avoiding synthetic patterns that may not reflect actual disease progression. Some example images are presented in Figure 5b.
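As an illustration of the augmentation step, the following minimal NumPy sketch implements flipping and centre rescaling for a single 2-D slice. It is not the pipeline used in this study: the function names (`flip`, `zoom_center`, `augment`), the nearest-neighbour sampling, and the 50% flip probability are assumptions made for the example, and rotation and elastic deformation are omitted because they require an interpolation library.

```python
import numpy as np


def flip(image, horizontal=True):
    """Mirror a 2-D slice left-right (horizontal) or top-bottom (vertical)."""
    return np.fliplr(image) if horizontal else np.flipud(image)


def zoom_center(image, factor):
    """Rescale a 2-D slice about its centre, keeping the output shape.

    factor > 1 magnifies (1.2 is a 20% zoom); factor < 1 shrinks the content
    and repeats border pixels. Nearest-neighbour sampling is an assumption.
    """
    h, w = image.shape
    # For every output pixel, locate the source pixel it is sampled from.
    rows = (np.arange(h) - (h - 1) / 2) / factor + (h - 1) / 2
    cols = (np.arange(w) - (w - 1) / 2) / factor + (w - 1) / 2
    rows = np.clip(np.rint(rows), 0, h - 1).astype(int)
    cols = np.clip(np.rint(cols), 0, w - 1).astype(int)
    return image[np.ix_(rows, cols)]


def augment(image, rng):
    """Apply random flips and a random 80%-120% rescaling to one slice."""
    if rng.random() < 0.5:  # assumed flip probability
        image = flip(image, horizontal=True)
    if rng.random() < 0.5:
        image = flip(image, horizontal=False)
    return zoom_center(image, rng.uniform(0.8, 1.2))
```

Calling `augment(slice_2d, np.random.default_rng(seed))` repeatedly on images of the minority class would produce additional training samples of the same shape as the input.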
3.3. Transfer Learning Models
Transfer learning is a powerful approach in deep learning in which a model trained on one task (usually with a large dataset) is reused and fine-tuned for a related task with less data. Pre-trained CNN architectures, namely DenseNet201 and MobileNetV2, are employed here to extract discriminative local and structural features from MRI scans. These models serve as the backbone networks within the proposed X-ViTCNN framework, enabling efficient learning and improved generalization.
3.3.1. DenseNet201
DenseNet201 connects each layer to every subsequent layer within a dense block, which maximizes information flow through the network and allows it to extract richer features. This connectivity brings several benefits: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the parameter count. Some stages, such as LMCI and EMCI, exhibit extremely similar characteristics, making them difficult to differentiate after several convolutional layers, because in a conventional deep network information may degrade along the extended path between the input and output layers. Dense connections shorten this path.
DenseNet201 was designed to tackle the loss of accuracy caused by vanishing gradients in very deep neural networks [33], which makes it an appropriate choice for AD categorization in this research. Furthermore, the model concatenates feature maps to avoid feature loss, which leads to better performance at reduced computational cost. By feeding the concatenated feature maps from all previous layers into the next layer, DenseNet201 enables efficient training and superior performance. The kth layer receives the feature maps of all preceding layers and generates its own feature maps using Equation (4), whose input is the concatenation of the feature maps generated in the preceding layers. In addition to convolution, the composite function applied to this concatenation comprises batch normalization [34] and rectified linear unit (ReLU) [35]. Using Equation (5), the batch normalization layer normalizes each activation with the mini-batch statistics, which keeps the distribution of activations consistent across the entire network.
The mini-batch mean and variance are defined by Equations (6) and (7), respectively. Additionally, a trainable scale and shift, which set the standard deviation and mean of the normalized output, are specified in Equation (8).
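The batch normalization steps described here can be sketched as follows. This is the standard training-time formulation written under assumed names (`gamma` and `beta` for the trainable scale and shift, `eps` for the numerical-stability constant); it is an illustration, not the implementation used inside DenseNet201.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a mini-batch.

    x: array of shape (batch, features).
    gamma, beta: trainable scale and shift, shape (features,).
    """
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # trainable scale and shift
```

After normalization, each feature of the output has a mean close to `beta` and a standard deviation close to `gamma`, regardless of the scale of the input.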
To improve the training process, the ReLU activation function was specified as f(x) = max(0, x). For feature extraction, the convolution layer convolves the input representation with a kernel and adds a bias term, as specified by Equation (9). To introduce bottleneck layers, a 1 × 1 convolution was performed before each 3 × 3 convolution, which reduces the computational complexity across the many layers. The hyperparameters, including kernel size, stride, and padding, were selected so that the dimensions of the feature map remain consistent throughout the block. A transition layer consisting of a 2 × 2 average pooling layer, batch normalization, and a 1 × 1 convolution was implemented to reduce dimensions. For fine-tuning, the initial layers and their network weights were transferred, and the final layers were modified to match the number of classes and to include additional convolutional layers. The Adam optimizer [36] was used to optimize the learnable parameters by minimizing the cross-entropy loss, as defined by Equation (10).
where X and Y indicate the number of samples and classes, respectively, and the loss compares the predicted output with the actual output for each sample and class. The parameters are updated using Equations (11)–(13), which apply the bias-corrected mean and the bias-corrected variance of the gradients.
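For illustration, the cross-entropy loss and a single Adam update can be sketched as below. The code follows the standard published form of Adam; the default values of `lr`, `beta1`, `beta2`, and `eps`, the averaging of the loss over samples, and the function names are assumptions and are not taken from this study.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Mean cross-entropy over samples.

    probs: (samples, classes) predicted class probabilities.
    labels: (samples,) integer class indices (the actual output).
    """
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]   # probability of the true class
    return -np.mean(np.log(picked + 1e-12))


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)              # bias-corrected variance
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step the bias correction cancels the initial zero state, so each parameter moves by approximately `lr` in the direction opposite to the sign of its gradient.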
3.3.2. MobileNetV2
MobileNetV2 utilizes depthwise separable convolution layers, a form of factorized convolution that decomposes a standard convolution into two stages: depthwise and pointwise. The depthwise stage applies a single filter to each input channel. A pointwise (1 × 1) convolution then combines the outputs of the depthwise stage, whereas a standard convolution filters the inputs and combines them into a new set of outputs in a single step. By splitting a standard convolution into a filtering part and a combining part, depthwise separable convolutions greatly reduce both computation cost and model size.
Figure 6 illustrates how a standard convolution (a) can be represented as depthwise convolution (c) and pointwise convolution 1 × 1 (b).
A standard convolutional layer takes as input a feature map F of dimensions DF × DF × M and produces an output feature map of dimensions DG × DG × N, where DF is the height and width of the input feature map, M is the number of input channels (depth), DG is the height and width of the output feature map, and N is the number of output channels (depth).
The standard convolution layer is parameterized by a convolution kernel K of size DK × DK × M × N, where DK is the spatial size of the 2D convolution kernel and M and N are the number of input and output channels, respectively (as defined earlier). The output feature map for a standard convolution with a stride of 1 is characterized mathematically by Equations (14) and (15).
The computational cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size DK × DK, and the feature map size DF × DF. MobileNet addresses each of these terms and their interactions. In particular, depthwise separable convolutions break the interaction between the kernel size and the number of output channels. A standard convolution both filters features with the convolution kernel and combines them to produce a new representation. Depthwise separable convolutions (a form of factorized convolution) split the filtering and combining steps into two distinct operations, which substantially reduces the computational cost.
A depthwise separable convolution is therefore made up of two layers. The depthwise convolution applies a single filter to each input channel, producing one filtered output per channel. A pointwise (1 × 1) convolution then forms a linear combination of these depthwise outputs.
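The two stages can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: stride 1, no padding ("valid" output), channels-last arrays, and hypothetical function names. The `standard_conv` function is included only to show that a depthwise stage followed by a pointwise stage equals a standard convolution whose kernel factorizes as `K[i, j, m, n] = K_dw[i, j, m] * K_pw[m, n]`.

```python
import numpy as np


def depthwise_conv(F, K_dw):
    """Filtering stage: one k x k filter per input channel.

    F: (H, W, M) input feature map; K_dw: (k, k, M) depthwise kernel.
    Returns an array of shape (H - k + 1, W - k + 1, M).
    """
    H, W, M = F.shape
    k = K_dw.shape[0]
    out = np.zeros((H - k + 1, W - k + 1, M))
    for i in range(k):
        for j in range(k):
            # Each channel m is multiplied only by its own filter value.
            out += F[i:i + H - k + 1, j:j + W - k + 1, :] * K_dw[i, j, :]
    return out


def pointwise_conv(G, K_pw):
    """Combining stage: 1 x 1 convolution, a per-pixel linear map (M -> N)."""
    return G @ K_pw


def standard_conv(F, K):
    """Standard convolution with kernel K of shape (k, k, M, N)."""
    H, W, M = F.shape
    k, _, _, N = K.shape
    out = np.zeros((H - k + 1, W - k + 1, N))
    for i in range(k):
        for j in range(k):
            out += F[i:i + H - k + 1, j:j + W - k + 1, :] @ K[i, j]
    return out
```

A general standard kernel does not factorize in this way, which is why the separable form is a restricted but much cheaper family of convolutions.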
MobileNetV2 applies both batch normalization and a ReLU nonlinearity after each of the two layers of a depthwise separable convolution. A single filter is applied to each input channel (input depth), and the depthwise convolution is defined in Equation (16), where the kernel is the depthwise convolutional kernel and the mth channel of the filtered output feature map is the result of applying the mth filter to the mth input channel. The computational cost of depthwise convolution is defined in Equation (17):
Although depthwise convolution is more efficient than standard convolution, it only filters the input channels and does not combine them to produce new features. An additional layer is therefore required: a 1 × 1 (pointwise) convolution that computes a linear combination of the depthwise outputs. This combination is referred to as depthwise separable convolution, as initially introduced in [37]. The computational cost of depthwise separable convolutions is expressed by Equation (18), which is the sum of the costs of the depthwise and 1 × 1 pointwise convolutions:
By expressing convolution as a two-step procedure of filtering and combining, the reduction in computation relative to standard convolution is given by Equation (19).
With 3 × 3 depthwise separable convolutions, MobileNetV2 uses 8 to 9 times less computation than standard convolutions. Because depthwise convolutions already account for very little of the computation, additional factorization in the spatial dimension, as in [38,39], does not yield substantial further savings.
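The cost comparison can be checked numerically. The sketch below counts multiply-accumulate operations using the standard expressions from the MobileNet literature, which are assumed here to correspond to the cost equations of this section; the function names and the layer size in the usage note are illustrative assumptions.

```python
def standard_cost(d_k, m, n, d_f):
    """Multiply-accumulates of a standard convolution."""
    return d_k * d_k * m * n * d_f * d_f


def separable_cost(d_k, m, n, d_f):
    """Multiply-accumulates of a depthwise separable convolution."""
    depthwise = d_k * d_k * m * d_f * d_f   # one filter per input channel
    pointwise = m * n * d_f * d_f           # 1 x 1 combination of channels
    return depthwise + pointwise


def reduction_ratio(d_k, n):
    """Separable cost divided by standard cost: 1/N + 1/D_K^2."""
    return 1 / n + 1 / d_k ** 2
```

For a hypothetical layer with a 3 × 3 kernel, 512 input channels, 512 output channels, and a 14 × 14 feature map, the ratio is 1/512 + 1/9, which corresponds to roughly 8.8 times fewer operations. With fewer output channels the saving is smaller, because the 1/N term grows.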
3.4. Custom ViT Architecture
Let S = {(Xi, Yi)}, i = 1, …, r, be a collection of r MRI images, where Xi is an individual image and Yi is the class label assigned to Xi. Each label takes a value Yi ∈ {1, 2, …, m}, where m is the number of classes defined for this collection of images.
The goal of the ViT model is to learn the mapping from a sequence of image patches to the corresponding semantic class label. The ViT is built solely on the vanilla transformer architecture described in [40]. The transformer has gained popularity in recent years owing to its remarkable success in areas such as natural language processing (NLP) [41]. A characteristic of the transformer is its encoder-decoder structure, which processes sequential data in parallel rather than step by step as in a recurrent neural network. Its self-attention mechanism makes transformer-based approaches effective at modeling long-range relationships between the elements of a sequence.
The Vision Transformer (ViT) was introduced as an extension of the classic transformer to image classification. Its main goal is to apply transformer-based techniques to modalities beyond text without relying on data-specific architectures. The transformer encoder module in the ViT handles classification, transforming a sequence of image patches into a semantic label. Furthermore, unlike standard CNN architectures, which typically use filters with a local receptive field, the attention mechanism allows the ViT to attend to different regions of the image and integrate information from the entire image.
The overall structure of the model is displayed in Figure 7. The model consists of an embedding layer, an encoder module, and a classification head. Initially, a non-overlapping patching strategy is applied to the training image X, where each extracted patch serves as a separate input token for the transformer. Each patch has dimensions c × p × p and is extracted from an image of dimensions c × h × w, where h is the height, w is the width, and c is the number of channels. The patches form the sequence x1, x2, …, xn, with n = hw/p². Typically, the patch size p × p is 16 × 16 or 32 × 32.
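The non-overlapping patching step can be sketched as follows. This is a minimal NumPy illustration, assuming a channels-first image whose height and width are divisible by p; the function name `extract_patches` and the 224 × 224 input size in the usage note are assumptions, not values from this study.

```python
import numpy as np


def extract_patches(image, p):
    """Split a (c, h, w) image into n = h*w / p**2 flattened patches.

    Returns an array of shape (n, c * p * p), ordered row by row.
    """
    c, h, w = image.shape
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by p")
    # (c, h/p, p, w/p, p): split each spatial axis into (block, offset).
    patches = image.reshape(c, h // p, p, w // p, p)
    # Bring the two block axes to the front: (h/p, w/p, c, p, p).
    patches = patches.transpose(1, 3, 0, 2, 4)
    return patches.reshape(-1, c * p * p)
```

For a hypothetical 3 × 224 × 224 input with p = 16, this yields n = 196 tokens, each of length 3 · 16 · 16 = 768.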
We chose DenseNet201 and MobileNetV2 in combination with the ViT for their complementary strengths in feature extraction. DenseNet201 excels in capturing local structural features due to its dense connectivity pattern, while MobileNetV2 provides a lightweight architecture that is computationally efficient, making it suitable for resource-constrained environments. The customized ViT, on the other hand, is designed to capture long-range dependencies and global features, making it an ideal candidate to supplement the CNN-based models. This combination allows the model to capture both fine-grained local features and global contextual information, which is critical for accurately predicting multi-stage Alzheimer’s disease from MRI scans.
The ViT, a core component of the X-ViTCNN architecture, was tailored for AD prediction using MRI scans. It features 12 encoder layers and an embedding dimension of 768, which balances model complexity and computational efficiency. This configuration allows the model to capture both fine-grained local and long-range global features essential for accurate AD classification. A 16 × 16 patch size was chosen to balance computational cost and the preservation of spatial details in the MRI scans. Smaller patch sizes would increase computational demands, while larger ones could miss important features. The selected patch size effectively captures both local and global information. A key design choice was the use of a custom ViT architecture rather than a pre-trained ViT backbone. This decision was made because the MRI scans in Alzheimer’s research have unique characteristics that may not align well with general pre-trained models designed for natural images. A custom ViT allows for better adaptation to these specific features and provides more control over the model’s learning process. Additionally, pre-trained models typically require large and diverse datasets like ImageNet, which are not always representative of the smaller, specialized datasets in medical imaging. The custom ViT architecture was specifically tailored for the challenge of Alzheimer’s disease prediction, ensuring more accurate and relevant feature extraction from MRI data.
3.4.1. Model Parameters and Specification
Table 4 summarizes the parameters, core components, strengths, and limitations of the models used in this study, including DenseNet201, MobileNetV2, the customized Vision Transformer, and the proposed fusion architecture.
3.4.2. Linear Embedding Layer
A learned embedding matrix E projects each flattened patch into a d-dimensional vector before the sequence enters the transformer encoder. For the classification task, these embedded representations are concatenated with an additional trainable classification token υclass. The transformer processes the entire collection of patches at once and is, by itself, unaware of the order in which they are arranged.
To preserve the spatial arrangement of the patches in the original image, positional information is therefore encoded into the patch representations. The resulting encoded patch sequence, including the initial classification token, is denoted z0 and given in Equation (20).
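The construction of the input sequence z0 can be sketched as below, following the standard ViT formulation: project the patches, prepend the classification token, and add a position embedding. The name `E_pos` for the position-embedding matrix and the use of a learned additive embedding are assumptions, since the exact positional encoding is not specified in this text.

```python
import numpy as np


def embed_patches(patches, E, v_class, E_pos):
    """Build the encoder input z0 from flattened patches.

    patches: (n, c*p*p) flattened patches.
    E:       (c*p*p, d) embedding matrix.
    v_class: (d,) trainable classification token.
    E_pos:   (n + 1, d) position embeddings (assumed learned and additive).
    Returns z0 of shape (n + 1, d).
    """
    tokens = patches @ E                          # linear projection to d dims
    z0 = np.vstack([v_class[None, :], tokens])    # prepend the class token
    return z0 + E_pos                             # inject positional information
```

The classification token occupies position 0 of the sequence, so it is the first element that the encoder later passes to the classification head.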
3.4.3. Vision Transformer Encoder
The z0 sequence is then fed into the transformer encoder. As illustrated in Figure 6b, the transformer encoder is constructed from a stack of identical layers. Each layer contains two primary subcomponents: a multi-head self-attention (MSA) block, defined in Equation (21), and a fully connected feedforward MLP block, defined in Equation (22), which consists of two fully connected layers separated by a GELU activation function. In every layer, layer normalization (LN) and a residual skip connection are applied around each of the two subcomponents.
In the last phase of the encoder, the first element of the output sequence is passed to an external head classifier to predict the class label.
The MSA block is the core component of the transformer encoder. Its purpose is to determine how strongly each patch embedding is related to the others in the sequence. The MSA block consists of several components: a self-attention layer, a linear layer, a concatenation layer (which combines the outputs of the multiple attention heads), and a final linear layer, as illustrated in Figure 6c. For each element of the input sequence z, the attention output is a weighted sum of all values in the sequence. As described in Figure 6, the self-attention (SA) head derives the attention weights from the query (Q), key (K), and value (V) representations computed for each position of the input sequence. Q, K, and V are generated by multiplying the input sequence by a learned matrix UQKV (cf. Equation (23)). The relevance of one element to every other element is measured by the dot product between its Q vector and the K vectors of the other elements, so the resulting values indicate the relative importance of the different patches in the sequence. In Equation (24), the scaled dot products are passed through a softmax function. Each value vector is then multiplied by the softmax output, so that the patches with the highest attention scores contribute most to the result, as described in Equation (25). The dot products are scaled according to the key dimension, as defined in Equation (26). Together, these equations describe how regions with high attention are located from the attention scores:
The MSA block applies the scaled dot-product attention independently in each of the h heads. The outputs of the individual attention heads are concatenated and then projected to the desired dimension by a feedforward layer with learnable weights W. This operation is given in Equation (27):
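A minimal NumPy sketch of multi-head self-attention is given below for illustration. It follows the standard scaled dot-product formulation, with the scores divided by the square root of the per-head dimension, which is assumed here to match the scaling described above. The matrix names `U_qkv` and `W` mirror the text; biases, dropout, and layer normalization are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(z, U_qkv, W, h):
    """z: (n, d) token sequence; U_qkv: (d, 3d); W: (d, d); h divides d.

    Returns the projected output (n, d) and attention weights (h, n, n).
    """
    n, d = z.shape
    d_h = d // h
    q, k, v = np.split(z @ U_qkv, 3, axis=-1)   # each of shape (n, d)

    def split_heads(x):
        # (n, d) -> (h, n, d_h): one slice of the features per head.
        return x.reshape(n, h, d_h).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)   # scaled dot products
    A = softmax(scores, axis=-1)                       # attention weights
    heads = A @ v                                      # weighted sum of values
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # concatenate the heads
    return concat @ W, A
```

Each row of the attention weights sums to one, so every output token is a convex combination of the value vectors before the final projection.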
Table 5 presents a comparative overview of the adopted architectures, focusing on their structural design, feature extraction capabilities, and relative computational complexity.
3.4.4. Network-Level Fusion
The custom ViT and the pre-trained networks are fused into a single network to improve the learning capability for AD prediction. First, the DenseNet201 and MobileNetV2 architectures are fused using a depth concatenation (DC) layer. The fused CNN output is then concatenated with the custom ViT model, again using a DC layer, to form the X-ViTCNN architecture. After the fusion, additional layers were added, such as dropout, softmax, and classification output. Mathematically, this process is defined as follows:
Given the global average pooling layer of DenseNet201 and the average pooling layer of MobileNetV2, their depth concatenation is defined in Equation (28).
The output of this concatenation is further fused with the MLP layer of the custom ViT model to refine the results of the proposed model. After the final fusion, we added a fully connected layer, a dropout layer, a second fully connected layer, a softmax layer, and a classification output layer, as described in Equation (29).
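The two-stage depth concatenation can be sketched as below. The feature sizes in the usage note (1920 for DenseNet201 and 1280 for MobileNetV2 after global average pooling, 768 for the ViT embedding) are the standard output sizes of those architectures and are assumed to apply here. The head is reduced to a single fully connected layer followed by softmax; the dropout and second fully connected layer are omitted, and all function and variable names are hypothetical.

```python
import numpy as np


def depth_concat(*features):
    """Concatenate feature vectors along the channel (depth) axis."""
    return np.concatenate(features, axis=-1)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_and_classify(f_dense, f_mobile, f_vit, W, b):
    """Fuse CNN and ViT features, then apply a simplified classifier head.

    f_dense, f_mobile, f_vit: (batch, features) pooled feature vectors.
    W, b: weights and bias of one fully connected layer (simplified head).
    """
    f_cnn = depth_concat(f_dense, f_mobile)   # first fusion: CNN backbones
    f_fused = depth_concat(f_cnn, f_vit)      # second fusion: add ViT features
    return f_fused, softmax(f_fused @ W + b)
```

With the assumed sizes, the fused vector has 1920 + 1280 + 768 = 3968 features per image, and `W` has shape (3968, number of classes).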
3.4.5. Model Training for AD Classification
The proposed X-ViTCNN fused model was trained on the selected datasets. Training requires several hyperparameters, such as the initial learning rate, momentum, regularization factor, number of epochs, and optimizer. Researchers typically set hyperparameters based on values reported in the literature; however, such values are not necessarily suited to a new model or dataset. In this work, we employed Bayesian Optimization (BO) to select the hyperparameters. With BO, the initial learning rate was determined to be 0.000121, the momentum was set at 0.705, the batch size was 128, and the optimizer used was SGD. The trained model is subsequently used to extract features from its activations, which are passed to the softmax classifier for the final classification results.
The hyperparameters optimized by BO were the learning rate, momentum, and batch size. The search space was 1 × 10⁻⁶ to 1 × 10⁻² for the learning rate, 0.1 to 0.9 for momentum, and 16 to 256 for batch size. The BO process was carried out over 50 iterations, using the Expected Improvement (EI) acquisition function to balance exploration and exploitation. The optimization was stopped after five consecutive iterations with less than 0.5% improvement in validation accuracy.
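To make the acquisition function and the stopping rule concrete, the sketch below gives the standard closed form of Expected Improvement for a Gaussian posterior (maximization) together with one possible reading of the stopping criterion. The surrogate model that supplies `mu` and `sigma` is not shown; the exploration parameter `xi`, the interpretation of "0.5%" as percentage points of validation accuracy, and all names are assumptions.

```python
import math

# Search space as stated in the text (bounds only; the sampling scheme is not specified).
SEARCH_SPACE = {
    "learning_rate": (1e-6, 1e-2),
    "momentum": (0.1, 0.9),
    "batch_size": (16, 256),
}


def expected_improvement(mu, sigma, best, xi=0.0):
    """EI of a candidate with posterior mean mu and standard deviation sigma.

    best: best validation accuracy observed so far (maximization).
    """
    if sigma <= 0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    # Exploitation term (mean gain) plus exploration term (uncertainty).
    return (mu - best - xi) * cdf + sigma * pdf


def should_stop(val_acc_history, patience=5, min_gain=0.5):
    """True once `patience` consecutive iterations each improve the best
    validation accuracy by less than `min_gain` percentage points."""
    best = float("-inf")
    stale = 0
    for acc in val_acc_history:
        stale = stale + 1 if acc - best < min_gain else 0
        best = max(best, acc)
        if stale >= patience:
            return True
    return False
```

A candidate with high posterior uncertainty can receive a large EI even when its predicted mean does not exceed the current best, which is how the acquisition function balances exploration against exploitation.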
3.4.6. Interpret Fused Model
This work used Gradient-weighted Class Activation Mapping (Grad-CAM) [42], an extension of class activation mapping (CAM), to explain the predictions of the X-ViTCNN fused model. CAM is tied to a particular CNN architecture in which the global average pooling (GAP) layer feeds directly into a softmax layer. Grad-CAM, in contrast, can inspect any convolutional layer: it uses backpropagation to compute the gradient of the class score with respect to the feature maps of that layer and applies GAP to these gradients to weight each feature map, as:
$$\alpha_k^c = \frac{1}{Z}\sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \qquad (30)$$
$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big) \qquad (31)$$
In Equations (30) and (31), $Z$ denotes the total number of pixels in the feature map, $y^c$ is the class score, and $A^k$ denotes the feature map activation of feature $k$. The feature maps are weighted and summed before being passed through a ReLU function, which ensures that only positive contributions to the class are presented. The resulting maps highlight the image regions that most strongly influence the prediction.
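Given the activations of a convolutional layer and the gradients of the class score with respect to them, the Grad-CAM map can be computed as in the sketch below. It follows the standard Grad-CAM formulation; obtaining the gradients requires a deep learning framework with automatic differentiation and is not shown, and the function name and array layout are assumptions.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one image and one class.

    activations: (K, H, W) feature maps of the chosen convolutional layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Returns a non-negative heatmap of shape (H, W).
    """
    # Global average pooling of the gradients gives one weight per feature map.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the K channels.
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only positive contributions to the class.
    return np.maximum(cam, 0.0)
```

In practice the heatmap is usually upsampled to the input resolution and overlaid on the MRI slice for visual inspection.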