Resizer Swin Transformer-Based Classiﬁcation Using sMRI for Alzheimer’s Disease

: Structural magnetic resonance imaging (sMRI) is widely used in the clinical diagnosis of diseases due to its advantages: high-deﬁnition and noninvasive visualization. Therefore, computer-aided diagnosis based on sMRI images is broadly applied in classifying Alzheimer’s disease (AD). Due to the excellent performance of the Transformer in computer vision, the Vision Transformer (ViT) has been employed for AD classiﬁcation in recent years. The ViT relies on access to large datasets, while the sample size of brain imaging datasets is relatively insufﬁcient. Moreover, the preprocessing procedures of brain sMRI images are complex and labor-intensive. To overcome the limitations mentioned above, we propose the Resizer Swin Transformer (RST), a deep-learning model that can extract information from brain sMRI images that are only brieﬂy processed to achieve multi-scale and cross-channel features. In addition, we pre-trained our RST on a natural image dataset and obtained better performance. We achieved 99.59% and 94.01% average accuracy on the ADNI and AIBL datasets, respectively. Importantly, the RST has a sensitivity of 99.59%, a speciﬁcity of 99.58%, and a precision of 99.83% on the ADNI dataset, which are better than or comparable to state-of-the-art approaches. The experimental results prove that RST can achieve better classiﬁcation performance in AD prediction compared with CNN-based and Transformer models.


Introduction
In the past few decades, the field of medical and computer science research has ushered in rapid developments. Therefore, more and more researchers are trying to integrate computer technology into the medical process, providing valuable guidance for improving the utilization rate of limited medical resources [1] and the timely diagnosis of patients' diseases. Among them, classification algorithms applied to medical imaging have been a major field of research. The recent success of deep-learning techniques has inspired new research and development efforts to improve classification performance and develop novel models for various complex clinical tasks [2][3][4][5].
Alzheimer's disease (AD) is an irreversible chronic neurodegenerative disease that progressively impairs cognitive and behavioral functions. Numerous techniques, including brain and spinal cord aspiration, genetic testing, and neuroimaging, can be used to diagnose AD. Because of the high-definition and noninvasive visualization, structural magnetic resonance imaging (sMRI) is one of the most common imaging techniques for AD identification in both clinical and research settings [6]. Therefore, deep-learning algorithms have been applied increasingly frequently for AD classification using sMRI images since they can replace time-and labor-consuming procedures such as feature extraction. In particular, there is no need for feature selection in the deep-learning model, but it can also automatically learn sophisticated features by itself [7]. The recent rise of Transformer-based deep models has also had a significant impact on the field of deep learning.
Furthermore, a majority of current AD-related classification methods require preprocessing procedures like calibration, skull stripping, and alignment to standard templates [8].
In particular, skull stripping is regarded as a crucial preprocessing step that cannot be skipped since it can lessen the influence of irrelevant data on classification outcomes and simplify the computing process of the model [9]. However, insufficient skull stripping or over-processing caused by the toolkit might result in the loss of edge information, and the data must then be manually checked after processing, which takes time and significantly reduces the already-limited dataset relevant to AD. As an improvement of the Transformer, the Swin Transformer [10] has been widely used since its introduction. Some studies have added a time-attention block to the Swin Transformer to measure the feature changes in longitudinal mild cognitive impairment (MCI) data. At the same time, the shifted-window mechanism further integrates spatial features [11]. An adaptive resize-residual network was proposed for formatting the shape and size of image data before feeding them into the backbone model, which fully excavates the CNN's powerful image feature extraction ability [12].
In this study, we propose a deep-learning model for AD classification called the Resizer Swin Transformer (RST) using sMRI images. The Resizer module creates learnable image scaling, acquiring characteristics supporting Swin Transformer classification. The crosswindow connection realized by the moving-window mechanism of the Swin Transformer and patch merging provides multi-scale learning. Additionally, a convolutional neural network (CNN) model is employed to colorize the sMRI images to enable multi-channel learning [13]. The RST can extract sufficient features and achieve the accurate classification of AD with minimal data processing.
The rest of the contents of this paper are as follows: The RST model is briefly introduced in Section 2, along with a summary of some of the relevant work on sMRI-based categorization. The structure of the RST model is further explained in the third part. The fourth part presents the findings of the experiment together with its specifics and provides an analysis of the results. The fifth section gives an overview of all the work in this paper, suggests ways to fix the problems in this experiment, and offers a roadmap for future research.

CNN-Based Classification for AD
The main types of existing AD classification techniques based on the CNN model include the region of interest (ROI), voxel, patch, and attention mechanism. Notably, this categorization does not imply that the four approaches mentioned above are entirely distinct.
ROI-based methods require the pre-segmentation of brain regions based on prior knowledge, such as brain atlases. For instance, Wang et al. segmented the hippocampal area, one of the most AD-sensitive regions, and employed a dense convolutional neural network (Dense CNN) model to classify normal control (NC) and AD samples [14]. On the other hand, one study tried each brain region when training a 3D-CNN ensemble model [15].
The voxel-based technique, which does not require any prior knowledge or laborious preparation, obtains features directly from sMRI images and fully exploits the global characteristics [16]. For instance, Hazarika et al. employed LeNet, AlexNet, VGG, DenseNet, and other models to classify AD while evaluating their effectiveness [17]. In addition, the classification accuracy may be increased by combining the CNN model with transfer learning [18] and data augmentation [19].
Others noticed that only localized brain areas in early AD patients exhibit minor structural abnormalities, which leads to a possibility that features obtained at the voxel or region level cannot contribute to AD identification completely. Patch, the intermediate level between voxel and region, has gained more attention. With flexibility in size and location, patch-based models improve classification accuracy and avoid laborious preprocessing procedures. However, the choice of patches significantly impacts the categorization outcomes. By using anatomical marker detectors, Liu et al. first identified the patches discriminative of AD and then trained a CNN model to learn from those patches [20]. The landmark-based deep multi-instance learning (LDMIL) system was introduced the following year to learn local patch information as well as global information from all patches [21].
Another frequently employed module recently is the attention mechanism, which is quite useful for pinpointing AD-sensitive areas. For example, Zhang et al. added the attention mechanism to the ResNet framework, which effectively enhances the gray matter feature information and increases the accuracy of AD diagnosis [22].

Transformer-Based Classification for AD
One of the strongest deep-learning models available today is the Transformer, and a key component is the attention mechanism [23]. Despite the original purpose of the Transformer being for natural language processing (NLP), abundant studies have since demonstrated that Transformer-based models may reach superior performance in computer vision (Vision Transformer, ViT). Because of the multi-head attention mechanism of the ViT, Li et al. integrated a CNN to capture the relationships between distant brain areas. In this study, the ViT received input from feature maps extracted using the convolutional layer [24]. Another approach proposed by Jang et al. combined the ViT with a CNN structure that has an inductive bias, and the feature maps were generated using 3D ResNet. The 3D information provided by sMRI images can efficiently assess local aberrant characteristics associated with AD and link markers from multiplanar and multilayer slices to gather distant details [25]. Due to the inherent lack of inductive bias in the ViT-related study, a significant quantity of data is needed to train the model. Natural images share similar fundamental properties with brain sMRI images, including texture, edges, shape, etc. Hence, Lyu et al. applied the ImageNet [26] dataset to pre-train the ViT model using joint transfer learning first to address the limited brain imaging data [27].

Limitations of Current Methods
1. ImageNet is a natural image dataset in which each image contains three colors, although the coronal slices taken from 3D sMRI images are only gray-scale. Therefore, using sMRI images directly as the input of the ViT model implies delivering the same images into all three channels, which is a total waste of computational resources. In addition, the gray-scale image deviates significantly from the original RGB color image in each channel.
2. The majority of the current methods demand strict procedures for sMRI image preparation. Skull stripping is one of the crucial components. However, there are still some issues with employing SPM12 for skull stripping, such as partial stripping or loss of edge information due to over-processing. Therefore, the data must be visually checked and manually selected afterward. While eliminating the background (black region) of sMRI images has also become a critical step of preprocessing, the capacity of the deep-learning models is hampered by the growing complexity of the preprocessing process.
3. The ViT model can only extract features at the same scale because its computational complexity is proportional to the square of the image size. In addition, it can only extract features on one scale at a time. However, the damaged brain areas of AD patients are typically subtle, making it possible to overlook some regionally specific details.

Slice Options
The center slices of the sMRI images were chosen for this study because they contain most of the brain information. The classification effect of the coronal plane was found to be slightly inferior to that of the sagittal and axial planes in the comparison experiments. Still, the classification performance of the sagittal and axial planes was not significantly different. Therefore, 30 consecutive 2D slices of size 127 × 181 in the axial plane were employed in this study.

Overview of Resizer Swin Transformer Network
Li et al. applied joint transfer learning approaches to classify AD by first using 3D gray matter volume images as input, extracting local information via ResNet to generate feature maps, and finally adding location encoding to the maps that were input to the ViT [25]. The Swin Transformer performs multi-scale learning through cross-window connection and patch merging, as well as decreases the computational complexity of the ViT from the square level to the linear level [28]. Therefore, this study proposes the ReSwin Transformer (RST), a novel network structure that combines a CNN and the Swin Transformer to realize AD classification using sMRI images.
The RST model proposed in this study is shown in Figure 1. Axial-plane sMRI slices X are the input to the RST. A CNN model is first utilized to map the gray-scale image from one channel to three channels to generate color images. The Resizer module then scales the input image proportionally and accentuates the information essential to classification according to the Swin Transformer during the training procedure. The images are separated into non-overlapping patches using the patch-splitting module. Each patch is regarded as a separate token. A linear embedding layer then maps the token to a size C channel as the Swin Transformer input. Finally, a Softmax layer summarizes the AD and NC classification predictions from all the slices. The RST uses cross-entropy losses during network training, as follows: Li et al. applied joint transfer learning approaches to classify AD by first u gray matter volume images as input, extracting local information via ResNet to feature maps, and finally adding location encoding to the maps that were input t [25]. The Swin Transformer performs multi-scale learning through cross-window tion and patch merging, as well as decreases the computational complexity of from the square level to the linear level [28]. Therefore, this study proposes the Transformer (RST), a novel network structure that combines a CNN and the Sw former to realize AD classification using sMRI images.
The RST model proposed in this study is shown in Figure 1. Axial-plane sM are the input to the RST. A CNN model is first utilized to map the gray-sca from one channel to three channels to generate color images. The Resizer mod scales the input image proportionally and accentuates the information essential fication according to the Swin Transformer during the training procedure. The im separated into non-overlapping patches using the patch-splitting module. Each regarded as a separate token. A linear embedding layer then maps the token to channel as the Swin Transformer input. Finally, a Softmax layer summarizes the NC classification predictions from all the slices. The RST uses cross-entropy losse network training, as follows:  [29], ResNet [30], and EfficientNet [31], are three-channel model the single-channel gray-scale images are often repeated three times as input when transfer learning algorithms are involved. Inspired by the colorization of lung MR to restore the color images of lungs observed by human eyes proposed in [32] an dress the issue of wasted computational resources, our method converts brain s ages to color images using the CNN to achieve cross-channel learning from single to three-channel imaging [13]. Figure 2 illustrates the CNN structure.

CNN Module
sMRI scans only contain gray-scale information, so sMRI is generally known as a single-channel imaging technique. On the other side, various widely used CNN models, such as AlexNet [29], ResNet [30], and EfficientNet [31], are three-channel models. Hence, the single-channel gray-scale images are often repeated three times as input when the joint transfer learning algorithms are involved. Inspired by the colorization of lung MRI images to restore the color images of lungs observed by human eyes proposed in [32] and to address the issue of wasted computational resources, our method converts brain sMRI images to color images using the CNN to achieve cross-channel learning from single-channel to three-channel imaging [13]. Figure 2 illustrates the CNN structure.
In particular, the real color is converted into a vector via = ℊ −1 ( ), and is the weight that measures the rarity of the color class and thus rebalances the loss. nally, the probability distribution � is mapped to color values by the function � = ℊ(

Resizer Module
In the field of deep learning for image processing, the input image size is typica scaled to 224 × 224, and both training and inference are carried out at that resolution. C rently, image scaling often uses both bilinear and trilinear interpolation. At the same tim the sMRI slices need to be resized to fit as input to the network. In actuality, this mod cation does not improve the image of the network and even somewhat reduces the p formance of the model [33]. Therefore, this experiment substitutes the learnable Resiz module for the original linear interpolation after the CNN transfers the single-channel the three-channel color space. Resizer seeks to significantly improve the classification p formance by learning attributes favorable to Swin categorization by collaborative traini with the backbone network, in contrast to other approaches that resize images to enhan human eye perception. The Resizer module is shown in Figure 3. The bilinear Resiz model allows features calculated at the original resolution to be incorporated into t According to the given brightness Y on the gray-scale image, two chroma channels, a and b, were generated based on the CIE Lab color space. Then, the brightness-chroma color space was transformed into an RGB color space using the image and OpenCV library. For a given brightness channel X ∈ R H×W×1 , we converted it to Y ∈ R H×W×2 via the map Y = F (X), where H and W are the dimensions of the image. In addition, for a given X, its probability distribution was also obtained:Ẑ = H(X), whereẐ ∈ [0, 1] H x W x Q and Q represent the number of output spaces from channels a and b. The following equation compares the true value withẐ through polynomial cross-entropy loss L cl (·, ·): In particular, the real color Y is converted into a vector Z via Z = ℊ −1 gt (Y), and v(·) is the weight that measures the rarity of the color class and thus rebalances the loss. Finally, the probability distributionẐ is mapped to color values by the functionŶ = ℊ Ẑ .

Resizer Module
In the field of deep learning for image processing, the input image size is typically scaled to 224 × 224, and both training and inference are carried out at that resolution. Currently, image scaling often uses both bilinear and trilinear interpolation. At the same time, the sMRI slices need to be resized to fit as input to the network. In actuality, this modification does not improve the image of the network and even somewhat reduces the performance of the model [33]. Therefore, this experiment substitutes the learnable Resizer module for the original linear interpolation after the CNN transfers the single-channel to the three-channel color space. Resizer seeks to significantly improve the classification performance by learning attributes favorable to Swin categorization by collaborative training with the backbone network, in contrast to other approaches that resize images to enhance human eye perception. The Resizer module is shown in Figure 3. The bilinear Resizer model allows features calculated at the original resolution to be incorporated into the model. It acts as an inverse bottleneck (up-scaling). The skip connection in Figure 3 combines the bilinearly resized image with the CNN features.

Swin Transformer
The Swin Transformer utilizes the within-window calculation of self-attention to increase modeling efficiency. Beginning in the top-left corner, the window evenly and nonoverlappingly divides the image into sections. Assuming that there are × patches in a window, the next module utilizes a different window than the previous layer and moves an (� 2 �, � 2 �) patch from the original window when the window-based self-attention module completes its computation. The calculation process of the Swin Transformer block is as follows:

Swin Transformer
The Swin Transformer utilizes the within-window calculation of self-attention to increase modeling efficiency. Beginning in the top-left corner, the window evenly and nonoverlappingly divides the image into sections. Assuming that there are M × M patches in a window, the next module utilizes a different window than the previous layer and moves an ( M 2 , M 2 ) patch from the original window when the window-based self-attention module completes its computation. The calculation process of the Swin Transformer block is as follows:Ẑ l =W-MSA LN Z l−1 + Z l−1 , whereẐ l and Z l are multi-headed self-attentive based on the window and shifted window, respectively, and characterized by the output of the multilayer perceptron (MLP) module. Window movement enables the reciprocal learning of patches between several windows, thus achieving the goal of global modeling. Figure 4 depicts the model of the Swin Transformer structure as well as the calculation procedure of the block.
where � and are multi-headed self-attentive based on the window and shifted window, respectively, and characterized by the output of the multilayer perceptron (MLP) module. Window movement enables the reciprocal learning of patches between several windows, thus achieving the goal of global modeling. Figure 4 depicts the model of the Swin Transformer structure as well as the calculation procedure of the block.
Window multi-head self-attention (W-MSA) adds a relative position bias ∈ ℝ 2 × 2 for each head in the calculation of multi-headed self-attention as follows: where , , and ∈ ℝ 2 × are the query, key, and value matrices, respectively. is the dimension of the key. The number of patches in the window is 2 . Given that the relative location is within [− + 1, − 1], the bias matrix � ∈ ℝ (2 −1)×(2 −1) is set. The values of are derived from � . , , and are calculated from , , and by applying a linear transformation of − .
On the other hand, shifted-window multi-head self-attention (SW-MSA) uses the circular window movement rule to compute the multi-headed self-attentiveness.  Window multi-head self-attention (W-MSA) adds a relative position bias B ∈ R M 2 ×M 2 for each head in the calculation of multi-headed self-attention as follows: where Q, K, and V ∈ R M 2 ×d are the query, key, and value matrices, respectively. On the other hand, shifted-window multi-head self-attention (SW-MSA) uses the circular window movement rule to compute the multi-headed self-attentiveness.

Introduction of the Datasets
The Australian Imaging, Biomarker & Lifestyle (AIBL, aibl.csiro.au) [34] and the Alzheimer's Disease Neuroimaging Initiative (ADNI, adni.loni.usc.edu) [35] were used in this study. The National Institutes of Health and the National Institute on Aging provided funds to establish the ADNI, the premier data center for AD research. Both datasets gather sMRI, fMRI (functional MRI), and positron emission computed tomography (PET) from AD and NC participants.
The data used in the study were collected from an MRI scanner built by the MRI manufacturer SIEMENS. The slice thickness is 1.2 mm; the field strength is 3.0 Tesla. All the sMRI images were downloaded from ADNI-GO, ADNI1, and AIBL. Standard sMRI image preprocessing was conducted using SPM12 (fil.ion.ucl.ac.uk) on Matlab (R2022a), including format conversion, AC-PC correction, non-parametric non-uniform intensity normalization (N3), and alignment to the MNI standard template. The reconstructed images are 181 × 217 × 217, the voxel size is 1 × 1 × 1 mm 3 , and normalized intensity values are in the range of [0,1]. Table 1 displays the demographic information of the datasets. The ADNI and AIBL datasets are openly accessible, but access to the information still needs official authorization. Additionally, no data may be shared without consent, and only approved researchers may utilize it for studies. The summary of the demographics of the dataset is illustrated in Table 1.

Training Setup
This experiment used the PyTorch deep-learning framework to construct the proposed network model. Two NVIDIA 3080 TI GPUs were implemented on a server for training the classification task. The model was first transferred to the sMRI dataset after being pretrained on the ImageNet-1K dataset. Since the Resizer module may shrink the images to eliminate a tiny amount of incorrect information (dark parts of sMRI images), cropping the input images is unnecessary. We tested various batch sizes and learning rates to determine the settings that would produce the best experimental results. The experimental outcomes are displayed in Figure 5. Most curves are generally stable at epoch = 50, yet some still vary remarkably. It is the most stable when the batch size is 16 and the learning rate is 5 × 10 −5 . Therefore, the batch size was set to 16, and the learning rate was set at 5 × 10 −5 , considering parameters like processing speed and classification performance. With a patch size of 4 × 4, and using cross-entropy loss as a loss function, all networks were trained using the Adam optimizer. The training for each stage was performed up to 50 epochs on 70% of randomly selected data, and the best model was selected based on separate randomly selected 10% of the data for validation. The results of this study are based on the remaining unseen test set of 20% with the epoch.

Experimental Results
The classification performance was assessed using the following four common specificity indicators: accuracy (ACC), sensitivity (SEN), specificity (SPE), and precision (PRE). True positive, true negative, false positive, and false negative are labeled TP, TN, FP, and FN. Then, ACC, SEN, SPE, and PRE can be expressed as In order to demonstrate the superiority of this method, we compared various studies of different types in this field on ADNI public datasets, as shown in Table 2. Experimental results show that our method has excellent performance. At the same time, we compared studies based on Transformers. These studies also performed well, benefiting from the ability of Transformers to capture distant information. Compared with the research in [21], which also combines transfer learning technology, the performance of the RST architecture is significantly improved. It can be seen that our proposed RST architecture can effectively classify AD and NC. The abbreviations of the models are taken from the corresponding research. ROI means the model is based on the region of interest (ROI), and the rest are similar. Accuracy (ACC), sensitivity (SEN), specificity (SPE), and precision (PRE) are the four indicators used to assess the classification performance. The bold denotes our method.

Ablation Experiments
We carried out several ablation experiments to maximize the experimental outcomes. Experiment 1: We used sMRI slices in three different orientations. In Table 3, the experimental findings are displayed. The worst result was obtained when using the coronal plane slice without the skull-stripping preprocessing procedure, which differed greatly from the findings of the other two orientations. Since most experiments were centered on the axial plane and the differences between the experimental outcomes in the axial and sagittal planes were not particularly apparent, axial plane slices were employed in this experiment.  Table 4. The RST model, as implemented by a CNN, performs noticeably better than the other experiments, as shown in the table. In contrast, the step of skull stripping only slightly improves the performance of the RST model and is not proportional to the additional expense it incurs. We used sMRI's axial, sagittal, and coronal slices as the three input channels of the RST model, but surprisingly, the RST model did not perform well on 2.5D image data. Our analysis results may be due to 2.5D images containing more information, and the model cannot effectively extract features from them. At the same time, we also conducted a comparison with the Swin Transformer, and the experimental results show that the RST is still superior to the Swin Transformer on the ADNI dataset. Experiment 3: Investigations were carried out on several datasets to confirm the robustness of our proposed model. The outcomes of our trials are shown in Figure 6. The table illustrates that our model performs better for various datasets. The analysis results may be caused by the unevenness of the datasets and the image discrepancies between the datasets because the findings of AIBL, on the other hand, are worse than those of ADNI. Experiment 3: Investigations were carried out on several datasets to confirm bustness of our proposed model. The outcomes of our trials are shown in Figure  table illustrates that our model performs better for various datasets. The analysis may be caused by the unevenness of the datasets and the image discrepancies b the datasets because the findings of AIBL, on the other hand, are worse than t ADNI.

Conclusions and Future Work
This paper presents a Resizer Swin Transformer architecture that combine channel learning with a CNN and extracts features from two-dimensional axia slices of brain MRI. The analysis and visualization of the experimental data demon the accuracy with which the RST architecture can accomplish the categorization of well as its strong adaptability to various datasets. This experiment does still hav flaws, though. The RST model has comparatively high model parameters, and cros nel learning in conjunction with the CNN enhances the classification performanc the model. Meanwhile, we discovered that the accuracy of the experimental find lower than 85% when the model trained on ADNI is directly assessed using AIB

Conclusions and Future Work
This paper presents a Resizer Swin Transformer architecture that combines crosschannel learning with a CNN and extracts features from two-dimensional axial-plane slices of brain MRI. The analysis and visualization of the experimental data demonstrated the accuracy with which the RST architecture can accomplish the categorization of AD, as well as its strong adaptability to various datasets. This experiment does still have some flaws, though. The RST model has comparatively high model parameters, and cross-channel learning in conjunction with the CNN enhances the classification performance using the model. Meanwhile, we discovered that the accuracy of the experimental findings is lower than 85% when the model trained on ADNI is directly assessed using AIBL. The model may undergo further development to overcome these problems and provide a lighter and more scalable model. And we hope that in the future, we can better connect image preprocessing with the deep-learning model to construct an end-to-end model that can actually be used in clinical diagnosis.