Vision Transformer Approach for Classification of Alzheimer’s Disease Using 18F-Florbetaben Brain Images

Abstract: Dementia is a degenerative disease that is increasingly prevalent in an aging society. Alzheimer’s disease (AD), the most common type of dementia, is best mitigated via early detection and management. Deep learning is an artificial intelligence technique that has been used to diagnose and predict diseases by extracting meaningful features from medical images. The convolutional neural network (CNN) is a representative application of deep learning, serving as a powerful tool for the diagnosis of AD. Recently, vision transformers (ViT) have yielded classification performance exceeding that of CNNs in some diagnostic image classification tasks. Because the brain is a very complex network with interrelated regions, ViT, which captures direct relationships between image regions, may be more effective for brain image analysis than CNNs. Therefore, we propose a method for classifying dementia images by applying 18F-Florbetaben positron emission tomography (PET) images to ViT. Data were evaluated via binary (normal control and abnormal) and ternary (healthy control, mild cognitive impairment, and AD) classification. VGG19 was selected as the CNN model for the performance comparison. ViT yielded more effective performance than VGG19 in binary classification. However, in ternary classification, the performance of ViT cannot be considered excellent. These results show that it is hard to argue that the ViT model is better at AD classification than the CNN model.


Introduction
Dementia is a degenerative disease that is increasing in prevalence within an aging population [1]. Alzheimer's disease (AD) is the most common type of dementia, accounting for 60-80% of dementia cases, and is one of the leading causes of death worldwide [2]. AD begins with mild declines in memory, thinking, and learning processes and may lead to severe loss of consciousness and difficulty with physical abilities due to brain damage [3]. Although it is possible to prevent and delay the onset of AD using FDA-approved therapeutic approaches, there is currently no treatment that can dramatically reverse the pathological changes following onset [4]. Therefore, early detection and management are the best ways to slow the progression of AD, and early diagnosis is especially crucial.
AD biomarkers (biological markers of diseases such as amyloid and tau) can be utilized for the early identification of diseases in people with mild or no cognitive impairment [5]. Amyloid accumulation in the brain, which is one of the causes of AD, is known to occur when an abnormal form of amyloid is deposited in the brain due to a metabolic problem [6]. Through an amyloid positron emission tomography (PET) test, an amyloid biomarker is

Related Works
Because prevention and delay are so important, many studies have addressed the diagnostic classification of AD. For example, Hu et al. [26] proposed VGG-TSwinformer, a model based on a convolutional neural network (CNN) and a transformer. It classified stable MCI (sMCI) versus progressive MCI (pMCI) with an accuracy of 77.2%, sensitivity of 79.97%, specificity of 71.59%, and AUC of 0.8153. VGG-TSwinformer is a deep learning model for short-term longitudinal studies of MCI that models the progression of brain atrophy from longitudinal MRI images and improves diagnostic efficiency compared with algorithms that use only cross-sectional sMRI images. Yin et al. [19] proposed the SMIL-DeiT network for classification among three groups: AD, MCI, and normal control (NC). The vision transformer is the basic structure of their work; pre-training was performed using DINO, a self-supervised technique, and the downstream classification task was performed by multi-instance learning. The model reached an accuracy of 93.2% on the Alzheimer's Disease Neuroimaging Initiative (ADNI) MRI dataset, higher than the 90.1% of a transformer and the 90.8% of a CNN.
Carcagnì et al. [27] studied three deep convolutional models (ResNet, DenseNet, and EfficientNet) and two transformer-based architectures (MAE and DeiT) to improve the automatic detection of dementia in MRI brain images. Experiments showed that the very deep ResNet and DenseNet models performed better than the shallow ResNet and VGG versions tested in the literature. The significant improvement in accuracy (up to 7%) motivated the authors to consider the computer-aided diagnosis (CAD) approach in real-world applications. Lyu et al. [17] proposed a slice-wise convolutional embedding method to improve the standard patching operation in vanilla ViT. Their cross-domain transfer learning method classified AD and CN with an accuracy of 95.3%, recall of 94.4%, and precision of 90.0%, comparable to the most recent research. Kadri et al. [28] proposed a multimodal method based on MRI and PET for the diagnosis of Alzheimer's disease, combining EfficientNet V2 and a vision transformer and enhancing them with a novel data augmentation based on self-attention generative adversarial networks (SAGAN). The method achieved 96% accuracy by combining the main advantages of the vision transformer and EfficientNet V2, and it was validated using ADNI and the Open Access Series of Imaging Studies (OASIS).
Jang et al. [29] proposed a three-dimensional medical image classifier using Multi-plane and Multi-slice Transformer (M3T) networks to classify Alzheimer's disease in three-dimensional MRI images. They trained on the ADNI dataset of MRI images and validated on datasets from three institutions (AIBL, OASIS, and ADNI). In validation, ADNI achieved an AUC of 0.9634 and ACC of 93.21%, AIBL an AUC of 0.9258 and ACC of 93.27%, and OASIS an AUC of 0.8961 and ACC of 85.26%, which demonstrated the feasibility of efficiently combining CNN and transformer for 3D medical imaging. Kushol et al. [30] analyzed the performance of a network of multiple vision transformers for detecting AD based on features extracted from a set of 2D coronal slices. The coronal 2D slices were selected so that a model trained on ImageNet could be used through transfer learning. The classification between AD and CN showed an ACC of 88.2%, a recall of 95.6%, and a specificity of 77.4%. Zhu et al. [31] proposed an advanced deep learning architecture called Brain Informer (BraInf) based on an efficient self-attention mechanism. The model integrated representation learning, feature extraction, and classifier modeling into a unified framework. Its effectiveness was validated using the ADNI dataset, on which it achieved 97.97% and 91.89% accuracy on the Alzheimer's disease and mild cognitive impairment classification tasks, respectively.
Liu et al. [32] proposed a novel transformer for disease classification based on multimodal data, the Multi-Modal Mixing Transformer (3MT). Labeled medical images are already scarce, and the performance of data-driven methods such as deep learning is severely hampered when data are missing; multimodal methods that can seamlessly handle missing data in various clinical settings are therefore highly desirable. The model was tested on AD and NC classification using neuroimaging data, gender, age, and MMSE scores. It used a novel cascaded modality transformer architecture with cross-attention to integrate multimodal information for prediction. After training on the ADNI dataset, 3MT was directly applied to AIBL and achieved a test accuracy of 92.5% without fine-tuning. Wang et al. [33] proposed a hybrid machine learning framework consisting of multiple convolutional neural networks and a linear support vector classifier that uses the extracted image features along with non-image information to make robust final predictions. The model achieved an ACC of 88% and an AUC of 0.95 in classifying sMCI and pMCI. On a completely different cohort dataset collected from a different population, it achieved an ACC of 84% and an AUC of 0.91. Eroglu et al. [34] proposed an mRMR-based hybrid CNN. First, they extracted MRI features from the Darknet53, InceptionV3, and Resnet101 models. The extracted features were then concatenated and optimized using the mRMR method. SVM and KNN classifiers were used to classify the optimized features, achieving an accuracy of 99.1%.

Organization of Article
In the introduction, the topic and related studies are examined. In the second part, the dataset used in the article is described, followed by the models and methods. In the third part, the experiments and results are presented. In the fourth part, the results are discussed. Finally, the fifth part presents the conclusions.

Data Acquisition
This study included subjects with dual FBB images who underwent FBB testing between 1 April 2016 and 30 June 2022 in the Dong-A University cohort. In total, 716 subjects underwent FBB testing during this period. We included 383 subjects, excluding those with neurological, medical, or psychiatric disorders, as well as cases of unavailable or damaged images. The 383 subjects were classified according to their diagnoses into 220 patients with AD, 113 patients with MCI, and 37 subjects as HC (Table 1, Figure 1). Each phase of an FBB image was confirmed by a nuclear medicine physician following collection to ensure that the Aβ distribution labels were accurate. The brain amyloid plaque load (BAPL) score is assigned by a physician based on visual assessment of amyloid deposition. BAPL is a three-grade scoring system: BAPL score 1 indicates no amyloid-β load, BAPL score 2 indicates minor amyloid-β load, and BAPL score 3 indicates significant amyloid-β load [35]. During binary classification, subjects with AD and MCI were assigned to the "abnormal group", whereas HC subjects were assigned to the "normal group". The Dong-A University Hospital Institutional Review Board (DAUHIRB), with the members listed in the Institutional Review Board Membership List, reviewed and approved this study protocol (DAUHIRB-17-108).

Image Acquisition and Preprocessing
All PET scans were performed using a Biograph 40 mCT Flow PET/CT scanner (Siemens Healthcare, Knoxville, TN, USA). Images were acquired without an intravenous contrast agent at 100 kVp and 228 mA with a rotation time of 0.5 s. The skull was scanned from apex to base using Ultra HD-PET (True X-TOF) from 90 to 110 min after injection.
Image preprocessing was performed using PMOD software (version 3.613, PMOD Technologies Ltd., Zurich, Switzerland). Using PMOD's Fuse It program, the CT and PET images were loaded together and matched. Using PMOD's Neuro program, the images were cropped so that the CT image was not cut off. Using PMOD's Fusion program, transformation matrix (tx) files were saved by matching the cropped CT images to standard CT images. The matrix file was then applied to the PET image to perform spatial normalization. The PET image was loaded in PMOD's View program, and count normalization was performed with the cerebellum as the reference. Finally, skull stripping was performed so that the model classified only brain tissue, yielding the preprocessed 3D image (size 91 × 109 × 91) (Figure 2).

Conversion of 3D Images to 2D Images and Data Augmentation
The final image obtained through preprocessing is a 3D image. Because the model accepts 2D images as input, each 3D image was converted to a set of 2D images. The 3D image was divided into 91 slices in the axial direction, and the 28 slices in the middle were selected. Consequently, 28 2D images were obtained for each subject. Data augmentation techniques were applied to enlarge the dataset and prevent overfitting. In addition, we resized every image to 224 × 224 pixels as the input for the ViT model. For the data analysis in the experiment, an MW Digitbox (single processor, 4 GPUs) located at the Neuroscience Translational Research Solution Center (Busan, Republic of Korea) was used.
Transfer learning can alleviate the scarcity of training samples. Although models trained by transfer learning are known to be less sensitive to sample size [14], the sample size still affects transfer performance. Accordingly, we applied data augmentation to increase the amount of training data [36]. Data augmentation is a technique that increases the amount of data through various algorithms using machine learning and deep learning techniques. Among the affine transforms, we applied only image rotation, considering that amyloid plaques are identified across the entire brain when diagnosing AD with FBB images. Rotations of ±5°, ±10°, and ±15° were applied to each original image.
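The slice selection and rotation scheme described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used in the study: the function names, the centring rule used to pick "the middle" 28 slices, and the choice to keep the unrotated image alongside the six rotations are all assumptions.

```python
# Hypothetical helpers; names and the exact centring rule are assumptions.
ROTATION_ANGLES_DEG = (-15, -10, -5, 5, 10, 15)  # rotations of +/-5, +/-10, +/-15 degrees


def middle_slice_indices(n_slices=91, n_keep=28):
    """Return the indices of the n_keep central axial slices."""
    if not 0 < n_keep <= n_slices:
        raise ValueError("n_keep must be between 1 and n_slices")
    start = (n_slices - n_keep) // 2  # assumed centring: floor of the margin
    return list(range(start, start + n_keep))


def augmentation_plan(slice_indices, angles=ROTATION_ANGLES_DEG, keep_original=True):
    """List (slice_index, angle_in_degrees) pairs for one subject."""
    all_angles = ((0,) if keep_original else ()) + tuple(angles)
    return [(idx, angle) for idx in slice_indices for angle in all_angles]


indices = middle_slice_indices()
plan = augmentation_plan(indices)
print(indices[0], indices[-1], len(indices))  # first index, last index, slice count
print(len(plan))                              # images per subject after augmentation
```

Each pair in the plan names one 2D image to generate; the rotation itself would be done by whatever image library is in use.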

The test set was selected separately, with BAPL scores 1, 2, and 3 distributed as indicated in Table 2a, and the training/validation set was configured in consideration of the ratio of BAPL scores 1, 2, and 3 in Table 2b. The original and augmented datasets were constructed as shown in Table 2c. For the test data, 30 subjects were selected, with 10 subjects each representing AD, MCI, and HC. Based on the test data, the original dataset was allocated according to a 6:2:2 ratio. The data were randomly sampled to configure the training and validation sets, and the test set for the augmented data was prepared in the same way as that for the original data. The augmented data were configured by random sampling after applying rotations of ±5°, ±10°, and ±15° to the remaining data, excluding the test data.

Pretrained Models Used in the Study
The architectures used in this study are the Vision Transformer and VGG19 models.

ViT Architecture
The transformer is a model first proposed in the paper 'Attention Is All You Need' [14], published by Google in 2017. The encoder-decoder, characterized by a sequence-to-sequence structure, has the disadvantage that some information in the input sequence is lost when the sequence is compressed into a single vector. Attention compensates for this loss by letting the model refer back to the entire input sequence rather than to a single compressed vector. The architecture of the ViT used in this study is illustrated in Figure 3 [14]. First, image patches are created. Transformers, which originated in the field of NLP, take a sequence of one-dimensional embeddings as input. An image of shape (224, 224, 1) yields 28 × 28 patches of shape (8, 8, 1). The following steps are the creation of patch embeddings, the addition of a class token, and the addition of positional embeddings. The conversion of each patch to one dimension is known as linear projection; the pixels of each patch are laid out in a single row to ensure one-dimensionality.
$x \in \mathbb{R}^{H \times W \times C}$ (1) represents the original image, where C is the number of channels.
$x_l \in \mathbb{R}^{N \times (P^2 \times C)}$ (2) is the input to the ViT after flattening the original image, and $N = HW / P^2$ (3) represents the number of patches, with N being the sequence length of the transformer. P is the side length of each square patch. The resolution of the original image is (H, W), and the resolution of each patch is (P, P).
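As a concrete check of these shapes, the following sketch splits an image into flattened patches using plain Python lists. It is only an illustration under stated assumptions: `patchify` is a hypothetical helper, and a real implementation would use an array library and would follow the flattening with the learned linear projection.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened P x P patches.

    Returns N = (H * W) / P**2 rows, each of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append all C channel values
            patches.append(flat)
    return patches


# A 224 x 224 x 1 image of zeros stands in for one preprocessed slice.
image = [[[0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=8)
print(len(patches), len(patches[0]))  # N and P^2 * C
```

With H = W = 224, C = 1, and P = 8, this gives N = 784 patches (the 28 × 28 grid) of 64 values each.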
A learnable class token is added to the front of the embedded patch. When this class token passes through several encoder layers of the transformer and emerges as a final output, it serves as a one-dimensional representation vector for the image. Finally, a position embedding of the same dimension is added to the vector, and order information is added to the embedding. Consequently, the entire image is defined as a one-dimensional embedding vector and input into the transformer's encoder.
Layer normalization, multi-head self-attention (MSA), and a residual connection are then applied. All image embeddings are layer-normalized on a channel basis. To perform self-attention on the patch and position embeddings, a key (k), query (q), and value (v) are obtained for each embedding, and attention values are computed from them; the outputs of the individual heads are concatenated along the feature dimension to form the multi-head attention. Subsequently, a residual connection is made by adding the input embeddings to the multi-head attention output.
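The attention step can be illustrated with a small pure-Python sketch. It assumes the scaled dot-product form from the original transformer paper (scores divided by the square root of the key dimension), which the text above does not spell out, and it omits the learned q, k, v projections by feeding the same toy tokens in all three roles. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Row i holds the softmax of q_i . k_j / sqrt(d) over all tokens j."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def self_attention(queries, keys, values):
    """Single-head attention: each output row is a weighted sum of the values."""
    weights = attention_weights(queries, keys)
    n_features = len(values[0])
    return [
        [sum(w * v[f] for w, v in zip(row, values)) for f in range(n_features)]
        for row in weights
    ]


def concat_heads(head_outputs):
    """Concatenate per-head outputs along the feature dimension."""
    n_tokens = len(head_outputs[0])
    return [sum((head[t] for head in head_outputs), []) for t in range(n_tokens)]


# Three toy tokens of dimension 2; projections are omitted (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = self_attention(tokens, tokens, tokens)
multi = concat_heads([head, head])  # two identical heads, for shape only
print(len(multi), len(multi[0]))    # token count and concatenated feature size
```

In a trained model each head has its own projection matrices, so the heads produce different outputs before concatenation.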
Layer normalization, a multilayer perceptron (MLP), and a residual connection are then applied. The output of the residual connection is normalized on a channel basis, as previously described. The MLP consists of two linear layers: the embedding size is expanded in the first layer and restored to its original size in the second. Subsequently, the matrices are added to generate the final output feature. The process of creating the final output feature from the input embedding can be summarized in equation form. The MLP classifier can be considered the output stage of the transformer and is functionally identical to the image classifier of a general CNN. Its peculiarity is that only the class token is used. When the class token has passed through the encoder layers and the transformer's layer normalization to yield the final output, it serves as a one-dimensional representation vector for the image.
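For reference, the encoder steps described above correspond to the formulation commonly given for ViT. The block below is a sketch in that standard notation, not a reproduction of this article's own equations. The symbols are assumptions: $x^{i}$ is the i-th flattened patch, $E$ the patch projection, $E_{pos}$ the position embedding, $D$ the embedding dimension, $K$ the number of encoder layers, and $z_K^{0}$ the class-token row of the last layer.

```latex
\begin{align}
z_0  &= [\,x_{\mathrm{class}};\; x^{1}E;\; x^{2}E;\; \dots;\; x^{N}E\,] + E_{pos},
      & E \in \mathbb{R}^{(P^2 \times C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \\
z'_k &= \mathrm{MSA}(\mathrm{LN}(z_{k-1})) + z_{k-1}, & k = 1, \dots, K \\
z_k  &= \mathrm{MLP}(\mathrm{LN}(z'_k)) + z'_k,       & k = 1, \dots, K \\
y    &= \mathrm{LN}(z_K^{0})
\end{align}
```

The two middle lines are the attention and MLP sub-blocks with their residual connections, and the last line is the class-token representation passed to the classifier.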

VGG19 Architecture
VGGNet is a model developed by the Visual Geometry Group (VGG) at the University of Oxford, which was the runner-up in the 2014 ImageNet image recognition competition. VGGNet refers to a model with 16 or 19 layers, and the model used in this study is VGG19 [37]. Among the CNN models available for comparison with the ViT model, we chose VGG19 because it performs well on various tasks and uses small (3 × 3) kernels [38] instead of large ones. The input image has a size of 224 × 224 × 3, and the convolutional kernel dimension is 3 × 3. The layers use max pooling for downsampling and the rectified linear unit (ReLU) as the activation function. By selecting the largest value in each image region as that region's pooled value, features can be extracted with minimal image distortion (Figures 4 and 5) [39].
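The pooling and activation operations can be illustrated on a toy feature map. This is a minimal sketch with hypothetical function names; the 2 × 2 window with stride 2 matches the usual VGG configuration, and the input values are made up for illustration.

```python
def relu(feature_map):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2d(feature_map, window=2, stride=2):
    """Max pooling over a 2D feature map given as a list of rows."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - window + 1, stride):
        pooled_row = []
        for left in range(0, width - window + 1, stride):
            region = [
                feature_map[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            pooled_row.append(max(region))  # keep the strongest response
        pooled.append(pooled_row)
    return pooled


# Toy 4 x 4 feature map (illustrative values only).
feature_map = [
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 7.0],
    [0.0, 1.0, 2.0, 3.0],
    [-1.0, -2.0, -3.0, -4.0],
]
pooled = max_pool_2d(relu(feature_map))
print(pooled)  # 2 x 2 map: one maximum per 2 x 2 region
```

Each pooling step halves the height and width while keeping the largest activation of every region.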

Experiments and Results
To compare the efficiency of the pre-trained models (VGG19, ViT), experiments were conducted based on classification tasks of two and three classes.

Experimental Setting
We compared the ViT and VGG19 architectures. All models were trained using the hyperparameters listed in Table 3 to equalize the experimental conditions. There were 28 2D slices per subject, and each slice was classified separately. Accordingly, the subject-level classification criteria were defined as follows. In binary classification, a subject was classified as abnormal if more than five of the 28 slices were classified as abnormal. In three-class classification, a subject was classified as AD if more than five of the 28 slices were classified as AD, as MCI if more than five slices were classified as MCI, and as HC if fewer than five slices were classified as AD and fewer than five as MCI.
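The subject-level rule can be written as a short voting function. The sketch below is one possible reading of the criteria: the text does not say what happens when both AD and MCI exceed the threshold or when a count is exactly five, so checking AD first and treating every remaining case as HC are assumptions, as are the function names and label strings.

```python
from collections import Counter

SLICE_THRESHOLD = 5  # "more than five" of the 28 slices


def classify_subject_binary(slice_labels, threshold=SLICE_THRESHOLD):
    """Return 'abnormal' if more than `threshold` slices are abnormal."""
    n_abnormal = sum(1 for label in slice_labels if label == "abnormal")
    return "abnormal" if n_abnormal > threshold else "normal"


def classify_subject_ternary(slice_labels, threshold=SLICE_THRESHOLD):
    """Apply the AD / MCI / HC rule to the slice-level predictions."""
    counts = Counter(slice_labels)
    if counts["AD"] > threshold:   # assumption: AD is checked before MCI
        return "AD"
    if counts["MCI"] > threshold:
        return "MCI"
    return "HC"                    # assumption: every remaining case is HC


# Hypothetical slice-level predictions for one subject (28 slices).
slices = ["MCI"] * 7 + ["AD"] * 3 + ["HC"] * 18
print(classify_subject_ternary(slices))
```

Because the threshold is far below half of the 28 slices, this rule favours the abnormal classes: a subject needs only six abnormal slices to be labelled abnormal.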

Classification Performance
The classification performance of the model settings introduced above was reported and analyzed. The models were evaluated using the following metrics: accuracy, recall, precision, and F1 score [40]. With abnormal as the positive class, each prediction falls into one of four categories: true positive (TP, an abnormal case correctly predicted as abnormal), true negative (TN, a normal case correctly predicted as normal), false positive (FP, a normal case incorrectly predicted as abnormal), and false negative (FN, an abnormal case incorrectly predicted as normal).
Accuracy is the proportion of correct predictions among all predictions, $(TP + TN) / (TP + TN + FP + FN)$; it is most informative when the positive and negative groups are of similar size.
Recall (=sensitivity) is the percentage of correctly predicted positive observations out of all actual positive observations, $TP / (TP + FN)$.
Precision is the positive predictive value of the classification model, that is, the percentage of positive predictions that are correct, $TP / (TP + FP)$.
Accuracy does not take the class distribution into account. The F1 score, the harmonic mean of precision and recall, $2 \times \mathrm{precision} \times \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$, is used to account for an imbalanced distribution.
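These four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The function name is hypothetical, the counts in the example are invented for illustration and are not results from this study, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Illustrative counts only: 10 abnormal and 10 normal subjects.
metrics = classification_metrics(tp=8, tn=6, fp=4, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy is 0.7 while precision is only about 0.67, which shows how false positives lower precision even when most abnormal subjects are found.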
In Table 4, we confirm that the ViT model performed better than the VGG19 model for the classification of normal and abnormal subjects. However, contrary to the expectations, we observe that the augmented dataset exhibited worse results than the original dataset irrespective of the model. This suggests that the augmented data applied to the two models were not effective. When the augmented dataset was used, recall and precision were very low, indicating that there were many false positives. We believe that the model was trained to match the abnormal group, as the quantity of abnormal data increased proportionally when data augmentation was performed with a 1:2 ratio between the normal and abnormal groups. The confusion matrix for the binary classification in Table 4 can be found in Table 5.
In Table 6, VGG19 shows better classification performance for AD, MCI, and HC with the original dataset, whereas ViT shows better classification performance with the augmented data. However, both models generally exhibited a low classification performance of less than 0.7. The confusion matrix for the three-class classification in Table 6 can be found in Table 7. As shown in Table 4, classification performance with the augmented dataset is better than that with the original dataset.
Because performance with the augmented dataset is poorer than that with the original dataset, we attempted to check the classification performance in training and validation. As shown in Table 8, the validation accuracy of the ViT and VGG19 models was higher with the augmented dataset than the original dataset. In the binary case, the validation accuracy of VGG19 was higher than that of ViT, and in the three-class case, ViT and VGG19 both exhibited higher validation accuracy.

Discussion
The idea that ViT can capture the function of different regions within the brain has great potential for future work. In this study, the model's performance was compared with that of the CNN-based model VGG19. Although ViT exhibited higher performance than VGG19 in the binary classification task, its performance was low in the three-class classification.
Furthermore, when data augmentation was applied, classification performance was lower than with the original data regardless of the model. When the validation accuracy after training was compared, a higher value was obtained with the augmented data, which suggests that data augmentation resulted in overfitting.
There are several possible explanations for the failure of ViT. The first is the augmented data problem. Data augmentation resulted in lower classification performance regardless of the model. We attempted data augmentation through image rotation to solve the problem of data scarcity. Consequently, the training set increased in volume from 60 subjects to 630 subjects. When the augmented data were applied, the validation accuracy after training increased. However, performance on the test set was lower than before data augmentation. Furthermore, the classification performance for normal subjects deteriorated as the data were augmented.
The second is that we did not perform proper fine-tuning. Although ViT consumes relatively less time per epoch than VGG19, it requires significant trial and error to optimize the hyperparameters. There is no certainty that we achieved optimal performance, and the conditions needed to attain it can only be found through further experimentation.
The third reason for poor performance is that we had to set a very small batch size to fit the model into GPU memory. With a small batch size, the batch normalization statistics become unstable, which degrades model performance.
Although ViT has been reported to perform better than some other models, classification performance cannot be judged reliably across studies, because the techniques are not implemented under the same conditions, such as the number of data samples, data format, preprocessing technique, and database. Moreover, the best and most accurate technique for diagnosing Alzheimer's disease has yet to be determined. Deep learning models, such as CNNs, appear promising for AD diagnosis, especially given that they can use transfer learning to overcome the limited availability of medical images [20].
This study has several limitations. First, it was conducted with data from the Dong-A University Hospital cohort only. Table 1 shows the imbalance of the classes included in this study and of the BAPL scores within each class. We expected the model to learn the final AD classification through training, as opposed to the BAPL score classification typically performed with PET images. However, the results suggest that the amount of data was not large enough for the model to learn the final AD classification rather than the BAPL score classification. Not only were both HC subjects with a BAPL score of 2 misclassified, but the majority of the MCI group (61 subjects in BAPL 1, 17 subjects in BAPL 2, and 35 subjects in BAPL 3) were also misclassified as either AD or HC. To address this data imbalance and lack of data, we augmented the data by rotating the images, but this only made the data more imbalanced, resulting in worse classification performance. Second, we did not train a variety of models and compared AD classification performance using only two models, ViT and VGG19. Many different ViT and CNN models currently exist for medical image classification, and it is difficult to say that our study found and compared the most suitable among them.
In the future, we plan to conduct a comparative study between the advanced ViT and CNN models by acquiring more data. In addition, we plan to apply dual PET images by adding early PET images [41], as it is considered that there are limitations in classifying HC, MCI, and AD with PET images alone.

Conclusions
In this study, augmented FBB PET image data were applied to a pre-trained ViT model for Alzheimer's disease diagnosis. We evaluated accuracy, recall, precision, and F1 score across different classification tasks and dataset sizes. However, the classification performance of this model was not ideal, possibly owing to overfitting and a lack of inductive bias, given the limitations of the PET image data.
We hypothesized that the model would yield accurate AD classification results from amyloid PET images, beyond amyloid accumulation alone. In fact, the limitations of amyloid PET imaging have been identified previously. As a result, this study did not find that ViT could outperform CNN in PET image analysis. In addition, clinical classification through PET images can separate two classes (normal and abnormal groups); however, it is limited in separating the three classes of HC, MCI, and AD. It is therefore difficult to claim that the ViT model is better at AD classification than the CNN model. Further research is needed to acquire enough PET image data or to add multimodal data to supplement the lack of image data [42,43].

Informed Consent Statement:
Patient consent was waived because this was a retrospective study, with the permission of the IRB.

Data Availability Statement:
The data used for this study are available upon request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.