Vision-Transformer-Based Transfer Learning for Mammogram Classification

Breast mass identification is a crucial step in mammogram-based early breast cancer diagnosis. However, it is difficult to determine whether a breast lump is benign or cancerous at early stages. Convolutional neural networks (CNNs) have been applied to this problem and have provided useful advancements. However, CNNs focus only on a certain portion of the mammogram while ignoring the rest, and they are computationally complex because of their many convolution operations. Recently, vision transformers have been developed to overcome such limitations of CNNs, ensuring better or comparable performance in natural image classification. However, the utility of this technique has not been thoroughly investigated in the medical image domain. In this study, we developed a transfer-learning technique based on vision transformers to classify breast mass mammograms. The area under the receiver operating characteristic curve of the new model was estimated as 1 ± 0, thus outperforming the CNN-based transfer-learning models and vision transformer models trained from scratch. The technique can, hence, be applied in clinical settings to improve the early diagnosis of breast cancer.


Introduction
Breast cancer is the most prevalent cancer in women in the United States, accounting for 30% (or 1 in 3) of all new cases of female cancer each year, excluding skin cancers [1,2]. Incidence rates have risen by 0.5% annually in recent years; however, there has been a steady decrease in the number of breast cancer deaths, with an overall decrease of 43% from 1989 to 2020 [1,3]. Better treatment options as well as earlier detection through screening and awareness campaigns are considered the reasons for the decline in the death rate [4][5][6]. Mammography (MG) plays a major role in the early detection of breast cancer [7]. MG can detect breast cancer at early stages, even small tumors that cannot be felt as lumps [8]. However, false diagnoses may occur because of the complexity of MGs and the high number of tests performed by radiologists [9]. To provide radiologists with an unbiased perspective, computer-aided detection (CAD), which applies image-processing methods and pattern recognition, has been developed [10].
Studies have demonstrated the value of data augmentation, the generation of rearranged image data from the original image, thereby increasing the number and variety of the training image datasets [43]. It includes operations such as noise addition, rotation, translation, contrast, saturation, color augmentation, brightness, scaling, and cropping. Transfer learning utilizes pre-trained weights from selected datasets as a starting point for training on another dataset [19,50]. This approach enables leveraging the knowledge learned from previous tasks for the target task [47]. Almost all CNN-based DL approaches for mammographic breast cancer detection utilize a transfer-learning approach to compensate for the lack of large datasets and to utilize an optimized model with prior feature knowledge for new tasks [51].
In this study, we developed a deep-learning approach for mammographic breast cancer detection using transfer learning based on vision transformers. This study makes two major contributions to the literature. The first is the image-data-balancing module used to solve the class imbalance problem in a mammogram dataset. The dataset utilized for this study is composed of two categories, those from benign and malignant tissues, with unequal sample sizes. In other words, there is a class imbalance that could lead to bias in model learning. To overcome this problem, we propose augmentation-based class balancing. Second, we designed a vision-transformer-based transfer-learning method for mammogram classification. This new transfer-learning approach improves on the shortcomings of CNN-based transfer-learning methods by leveraging the self-attention approach of transformers.

Related Works
DL based on CNNs is widely employed to aid the early detection of breast cancer using MG. As a result, a few artificial intelligence (AI) tools have been approved by the Food and Drug Administration (FDA) to aid radiologists in decision making. However, owing to the numerous convolution operations within their network layers, CNNs are computationally complex and require high computational power as the quantity of data increases. Additionally, when analyzing mammograms, CNNs concentrate on a particular region (the region where a tumor is suspected) and disregard the rest of the image; as a result, they can miss crucial details that would have been discovered if the entire image were examined at once. Vision transformers (ViTs) have recently gained prominence in the field of computer vision, surpassing CNNs in tasks that require natural image classification. Owing to their lower computational complexity and their ability to overcome the CNN limitation of focusing on only a small portion of an image, ViTs have outperformed the most advanced CNN models.
The ViT concept is a development of the original text-based transformer concept. With a minor adjustment in the code to accommodate the different data modality, it is simply a transformer applied to the image domain; a ViT specifically employs several tokenization and embedding techniques, but the general architecture is the same. A source image is divided into a collection of image patches known as visual tokens. The visual tokens are embedded into a collection of fixed-dimension encoded vectors. The position of a patch in the image, together with its encoded vector, is fed to the transformer encoder network, which is essentially the same as the one in charge of processing text input. The ViT encoder is composed of several blocks, each of which has three main processing components: the layer norm, the multi-head self-attention network (MSA), and the multi-layer perceptron (MLP). The layer norm keeps the training process on track and lets the model adapt to variations in the training images. The MSA network is in charge of creating attention maps from the embedded visual tokens; these attention maps help the network concentrate on the most crucial areas of the image, such as the object(s). The MLP is a two-layer network with a GELU (Gaussian Error Linear Unit) nonlinearity. The last MLP block, also referred to as the MLP head, serves as the transformer's output; a Softmax layer can be applied to this output to produce classification labels (i.e., if the application is image classification).
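To make the encoder structure concrete, the following is a minimal sketch of a single ViT encoder block in TensorFlow/Keras, the framework used later in this study; the layer sizes shown (12 heads, 768-dimensional embeddings, 3072-unit MLP) follow the common ViT-Base configuration and are illustrative assumptions rather than the exact settings of the models evaluated here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vit_encoder_block(x, num_heads=12, embed_dim=768, mlp_dim=3072, rate=0.1):
    # Sub-block 1: layer norm -> multi-head self-attention (MSA),
    # with a residual connection around it.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads,
                                  dropout=rate)(h, h)
    x = layers.Add()([x, h])

    # Sub-block 2: layer norm -> two-layer MLP with GELU,
    # again with a residual connection.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dropout(rate)(h)
    h = layers.Dense(embed_dim)(h)
    return layers.Add()([x, h])
```

Stacking several such blocks and reading off the class-token output yields the encoder described above.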
A few studies have investigated the use of ViTs in classifying mammograms for the early diagnosis of breast cancer. Lee et al. [52] proposed a transformer-based DL method that tackles the challenges of mammogram normalization and inter-reader variance in grading. Their approach uses a photometric transformer network (PTN) as a programmable normalization module to predict the normalization parameters of an input MG. It connects seamlessly to the primary prediction network, allowing joint learning of the optimal normalization and density grade. In principle, the PTN resembles a spatial transformer network [53]; however, the PTN seeks a set of photometric transformation parameters that are best for predicting breast density, whereas the spatial transformer network predicts suitable geometric transformation parameters. Tulder et al. [45] proposed a novel token-based and pixel-wise cross-view transformer technique and applied it to two public MG datasets. Their approach joins views at the feature-map level without requiring pixel-by-pixel correspondences, using cross-view attention rather than self-attention to transfer information across views, unlike conventional transformers, which process information within a single sequence. For image segmentation and breast mass detection in digital mammograms, Su et al. [54] proposed the YOLO-LOGO transformer model. This involved two steps: first, YOLOv5 detected the breast mass ROI, which was cropped directly from the high-resolution image to increase training effectiveness; thereafter, an updated version of the local-global (LOGO) segmentation strategy significantly increased the segmentation resolution at the original pixel level. Garrucho et al. [55] evaluated the domain generalization of MG models by comparing eight cutting-edge detection techniques, including transformer-based models, trained in a single domain and tested in five unseen domains. They observed that transformer-based models were more robust and generalized better than the others across mammogram domains. Chen et al. [56] used a multi-view transformer (MVT) model to detect breast cancer segments on mammograms. The MVT was composed of two main components: local and global transformer blocks. Local transformer blocks individually analyze data from each view image, whereas global transformer blocks combine data from the four-view mammograms. Self-attention, multi-head attention, and a multilayer perceptron were the three main components of both the local and global transformer blocks, which shared the same design.

Dataset
In this study, we used the Digital Database for Screening Mammography (DDSM) dataset to train and test our vision-transformer-based transfer-learning system for the early identification of breast cancer. This dataset is publicly available at https://data.mendeley.com/datasets/ywsbh3ndr8/2 (accessed on 12 September 2022). The dataset includes 13,128 images: 5970 from benign and 7158 from malignant tissues. Sample images from the dataset are shown in Figure 1.

Class Balancing
The dataset retrieved for this study was class imbalanced; that is, the numbers of images from malignant and benign tissues in the dataset were not equal. The ratio of malignant-to-benign samples in the DDSM dataset was 0.65:0.35. This data distribution may affect the learning of the designed algorithm and had to be fixed first. Thus, we performed a novel data-balancing method using data augmentation; to the best of our knowledge, this class-balancing approach for mammogram images was first used by our group [36]. First, the dataset was split into 80% training and 20% testing sets. To balance the dataset for 5-fold cross-validation (nested cross-validation), we used five image augmentations, namely color jitter, gamma correction, horizontal flip, salt-and-pepper noise, and sharpening, as in [36]. The dataset was divided into five folds, each of which included training and validation datasets. In the DDSM dataset, the first four folds therefore contained 1145 malignant-tumor images each, whereas the fifth fold contained 1146; similarly, for the benign class, the first four folds had 955 images each while the fifth fold had 956. To balance the data between the two classes, images of the benign class underwent five image augmentations, whereas malignant mass images underwent only one augmentation. Finally, post-augmentation, 1146 images were present in every fold for both the benign and malignant classes, as shown in Figure 2.
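The following sketch illustrates the balancing idea under stated assumptions: the minority (benign) images are expanded with simple augmentations until each fold holds the same number of images per class. The three transforms shown are simplified stand-ins for three of the five augmentations named above (gamma correction, horizontal flip, and salt-and-pepper noise); color jitter and sharpening are omitted for brevity, and the exact parameters used in the study are not reproduced here.

```python
import numpy as np

def gamma_correction(img, gamma=0.8):
    # img is a float array scaled to [0, 1]
    return np.clip(img ** gamma, 0.0, 1.0)

def horizontal_flip(img):
    return img[:, ::-1, ...]

def salt_and_pepper(img, amount=0.01, rng=np.random.default_rng(0)):
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0.0          # pepper pixels
    out[mask > 1.0 - amount / 2] = 1.0    # salt pixels
    return out

def balance_fold(benign, malignant, augmentations):
    """Augment the minority class until both classes match in size."""
    extra, i = [], 0
    while len(benign) + len(extra) < len(malignant):
        aug = augmentations[i % len(augmentations)]
        extra.append(aug(benign[i % len(benign)]))
        i += 1
    return benign + extra, malignant

# Example use (hypothetical lists of image arrays):
# benign, malignant = balance_fold(benign_images, malignant_images,
#     [gamma_correction, horizontal_flip, salt_and_pepper])
```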

Preprocessing
The pixel sizes of the mammograms in the dataset varied considerably; therefore, we resized all images to a size of 224 × 224 pixels, which is the preferred size for patch generation from the input images.
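A minimal version of this step, assuming TensorFlow, is shown below; the [0, 1] intensity scaling is an assumption, as the text specifies only the target size.

```python
import tensorflow as tf

def preprocess(image):
    # image: [H, W, C] tensor of any spatial size
    image = tf.image.resize(image, (224, 224))   # preferred patch-generation size
    return tf.cast(image, tf.float32) / 255.0    # assumed [0, 1] scaling
```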

The Proposed Method
In this study, we performed a vision-transformer-based transfer-learning method to classify mammograms as being from benign or malignant tissues. Therefore, vision transformer models pre-trained on natural images (ImageNet dataset) were used for mammogram classification.

Vision Transformer Architecture
Vision transformers are derived from the original transformer model used in natural language processing (NLP), where the input is a one-dimensional sequence of word tokens. However, images are two-dimensional, so vision transformer models partition images into smaller two-dimensional patches and feed the patches in the way the original NLP transformer models consume word tokens. The input image of height $H$, width $W$, and number of channels $C$ is divided into smaller two-dimensional patches to arrange the input image data in a way similar to how the input is structured in the NLP domain. This produces $N = HW/P^2$ patches with a pixel size of $P \times P$ [57]; for example, a 224 × 224 image with $P = 16$ yields $N = 196$ patches. Prior to providing the patches to the transformer encoder, flattening, patch embedding, learnable class embedding, and positional embedding were performed in the following order:
•	Every patch was flattened into a vector $\mathbf{x}_p^n$ of length $P^2 \times C$, for $n = 1, \ldots, N$.
•	The flattened patches were mapped to $D$ dimensions using a trainable linear projection $\mathbf{E}$, producing a series of embedded image patches.
•	The sequence of embedded image patches was prefixed with a learnable class embedding $\mathbf{x}_{class}$, whose state at the encoder output corresponds to the classification outcome $Y$.
•	Finally, one-dimensional positional embeddings $\mathbf{E}_{pos}$, which are also learned during training, were added to the patch embeddings to inject positioning information into the input.
The embedding vectors produced by these operations are given by (1):

$$\mathbf{z}_0 = \left[\mathbf{x}_{class};\, \mathbf{x}_p^1\mathbf{E};\, \mathbf{x}_p^2\mathbf{E};\, \ldots;\, \mathbf{x}_p^N\mathbf{E}\right] + \mathbf{E}_{pos} \qquad (1)$$

We fed $\mathbf{z}_0$ to the transformer-encoder network, a stack of $L$ identical layers, to conduct the classification. The classification head was then fed the value of $\mathbf{x}_{class}$ at the $L$th layer of the encoder output. The classification head was implemented by an MLP with a single hidden layer during pre-training and by a single linear layer during fine-tuning; the MLP implements the GELU nonlinearity.
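A sketch of this embedding pipeline, assuming TensorFlow and the 224 × 224, $P = 16$ setting used later (so $N = 196$), is given below; the class token and positional embeddings appear as zero-valued stand-ins for illustration, whereas in a real model they are trainable parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

P, D = 16, 768                                    # patch size and embedding dim

def embed_patches(images):                        # images: [B, 224, 224, C]
    b = tf.shape(images)[0]
    patches = tf.image.extract_patches(
        images, sizes=[1, P, P, 1], strides=[1, P, P, 1],
        rates=[1, 1, 1, 1], padding="VALID")      # [B, 14, 14, P*P*C]
    n = 14 * 14                                   # N = HW / P^2 = 196
    x = tf.reshape(patches, (b, n, -1))           # flatten each patch
    x = layers.Dense(D)(x)                        # trainable projection E
    cls = tf.zeros((b, 1, D))                     # stand-in for learnable x_class
    x = tf.concat([cls, x], axis=1)               # prepend the class token
    e_pos = tf.zeros((1, n + 1, D))               # stand-in for learned E_pos
    return x + e_pos                              # z_0 of Equation (1)
```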
Overall, the vision transformer uses the encoder components of the original NLP transformer architecture. The encoder receives a sequence of embedded image patches of size 16 × 16 as input, together with positional information and a learnable class embedding prepended to the sequence. The smaller the patch size, the higher the performance, but also the higher the computational cost; the 16 × 16 patch size was therefore chosen, as in [58], because of its robustness against performance degradation and computational complexity. The learnable class-embedding value is sent to a classification head coupled to the output of the encoder, which uses its state to produce a classification output. Figure 3 shows the general structure of the vision-transformer-based transfer-learning architecture. The original vision transformer model, pre-trained on the ImageNet dataset, was used in such a way that the last layer was replaced with a flattening layer followed by batch normalization and an output dense layer.

Transfer Learning
Transfer learning was employed such that vision transformer models pre-trained on the large ImageNet natural image dataset were utilized as a starting point for training on the mammogram dataset. The objective was to use the vision transformer's knowledge from the large natural image dataset to classify breast mammograms into two classes: benign and malignant. For this, we detached the pre-trained prediction head and replaced it with a $D \times K$ feedforward layer, where $K = 2$ is the number of classes in the downstream task. With transfer learning, we sought to enhance the learning of the target function $f_t(\cdot)$ in the target domain $\mathcal{D}_t$ by utilizing the knowledge from the source domain $\mathcal{D}_s$ and the learning task $\mathcal{T}_s$. The ImageNet dataset has $m$ training samples $(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m)$, where $x_i$ and $y_i$ represent the $i$th input and label, respectively. The weights of the ImageNet pre-trained vision transformer model, $W_0$, were then utilized as a starting point during transfer learning to generate $W_1$ by minimizing the objective function in (2), where $\hat{y}_{ij}(x_{ij}, W_0, W_1, b)$ is the Softmax output probability function and $b$ is the bias.
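A hedged sketch of this head replacement in Keras follows; `backbone` stands for any of the pre-trained ViT variants with its original prediction head removed (how it is loaded is implementation-specific and not shown), and the flatten, batch normalization, and output dense layers mirror the description accompanying Figure 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(backbone, num_classes=2):
    # backbone: pre-trained ViT feature extractor (weights W_0), head removed
    x = layers.Flatten()(backbone.output)
    x = layers.BatchNormalization()(x)
    out = layers.Dense(num_classes, activation="softmax",
                       kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    return models.Model(backbone.input, out)   # fine-tuning yields W_1
```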
In this study, we utilized three state-of-the-art vision transformer models: the vision transformer (ViT) model proposed by Dosovitskiy et al. [58], the Swin transformer (Swin-T) model proposed by Liu et al. [59], and the pyramid vision transformer (PVT) model proposed by Wang et al. [60]. Swin-T improves locality using local or window attention by applying self-attention to nonoverlapping windows. By gradually merging the windows, window-to-window communication in the following layer creates a hierarchical representation, as shown in Figure 4. There are four variants of the Swin transformer, Swin-tiny, Swin-small, Swin-base, and Swin-large; in our case, we utilized Swin-small and Swin-base owing to their improved performance and reduced computational complexity.
PVT uses a type of self-attention known as spatial-reduction attention (SRA), which is characterized by a spatial reduction of both keys and values, to reduce the quadratic complexity of the attention mechanism [60]. SRA gradually reduces the spatial dimensionality of the features across the entire model. In addition, it applies positional embeddings to all transformer blocks, strengthening the notion of order. The PVT architecture is shown in Figure 5. There are four variants of PVT, PVT-tiny, PVT-small, PVT-medium, and PVT-large; in our case, we utilized PVT-medium and PVT-large for improved performance.
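The window-attention idea behind Swin-T can be made concrete with a small partitioning sketch, assuming TensorFlow; self-attention is then computed within each M × M window independently rather than across all patches, which is what keeps the cost from growing quadratically with image size.

```python
import tensorflow as tf

def window_partition(x, M=7):
    """Split a [B, H, W, C] feature map into [B * num_windows, M*M, C] windows."""
    b = tf.shape(x)[0]
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    x = tf.reshape(x, (b, h // M, M, w // M, M, c))
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))     # group the windows together
    return tf.reshape(x, (-1, M * M, c))        # attention runs per window
```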

Experimental Settings
The performance of the proposed method was evaluated using five experimental settings. The first is a comparison of the performance of the proposed transfer-learning method using three state-of-the-art vision transformer architectures. Second, we trained vision transformer models on the mammogram dataset from scratch using these three architectures and compared them with their transfer learning counterparts. Third, we compared transfer learning using vision transformers with CNNs. In the fourth experimental setting, the computational cost of each vision transformer model was evaluated. Fifth, we compared the performance of the proposed method with those using previous methods on the same dataset.

Implementation Details
The models in this study were trained for 50 epochs with a learning rate of 0.0001, using the Adam optimizer. These parameters were chosen based on prior studies on the same dataset and the same hardware and software settings [19]. We applied an exponential learning rate decay and a batch size of 64. We divided our datasets into training and testing groups in an 8:2 ratio. For the vision transformer models, GELU was used as the activation function, together with an L2 regularizer. A rectified linear unit (ReLU) was used in the CNNs, along with an L2 regularizer. To prevent bias in the results, the same parameter settings were used for all comparisons. Five-fold cross-validation was used to compare model performances. RTX 3090 GPUs were used to implement the proposed transfer-learning model, using Python version 3.6 and the TensorFlow framework.
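The stated optimizer settings translate into roughly the following configuration; the decay steps and rate are assumptions, since the text specifies only "exponential decay".

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,   # learning rate stated above
    decay_steps=1000,             # assumed; not reported
    decay_rate=0.96)              # assumed; not reported
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# Typical usage with a compiled Keras model (sketch):
# model.compile(optimizer=optimizer, loss="binary_crossentropy",
#               metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, batch_size=64)
```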

Performance Metrics
Model performances were determined in terms of quantitative machine learning performance metrics and statistical measures. The metrics included accuracy, area under the receiver operating characteristic curve (AUC), F1-score, precision, recall, Matthews correlation coefficient (MCC) [61], and kappa score, all of which were calculated with a 95% confidence interval. Table 1 provides the details of the performance metrics.
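For reference, these metrics can be computed as below with scikit-learn; `y_true` holds the ground-truth labels (0 = benign, 1 = malignant) and `y_prob` the predicted malignancy probabilities, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score,
                             matthews_corrcoef, cohen_kappa_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),
        "f1":        f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "kappa":     cohen_kappa_score(y_true, y_pred),
    }
```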

Results
The proposed vision-transformer-based transfer-learning model exhibited superior performance on the DDSM dataset, as shown in Table 2. The six models used were from three different vision-transformer-based architectures and performed uniformly in terms of all the metrics used to evaluate performance. Consequently, the proposed vision-transformer-based transfer-learning model provided an accuracy, AUC, F1 score, precision, recall, MCC, and kappa value of 1 ± 0 on the DDSM dataset. This provides strong evidence that vision-transformer-based transfer learning is effective in improving the DL approach for breast mammograms, thereby improving early diagnostic techniques for breast cancer.
Figure 6 depicts the training time in seconds (s) (Figure 6a) and the loss value (Figure 6b) of each model. Even though the loss value for the six models is between 0.4 and 0.5, the training time needed for each model varies. The ViT-large model needed 7400 s, being the slowest to train. On the other hand, the PVT-medium model took only 2900 s, being the fastest model to train on the DDSM dataset. This shows that some models take longer to train to achieve the best performance, as in the case of ViT-large, while others need less training time to achieve the same performance, as in the case of PVT-medium. This is critical when choosing a model to deploy, especially in clinical settings where a large number of images are processed every day.
An investigation was conducted to determine the computational complexity of the proposed method. To do so, we used the number of floating-point operations (FLOPs) to compare the computational costs of the different vision-transformer-based transfer-learning models. FLOPs measure the number of operations needed to run a single instance of a given model, for instance, how many operations are required to run a single instance of a ViT model. The larger the FLOPs, the higher the computational cost, and vice versa; thus, a model with a smaller FLOPs count is preferred. Figure 7 depicts the number of parameters in millions (M) (Figure 7a) trained for each model and the corresponding FLOPs in giga operations (G) (Figure 7b) for all six vision-transformer-based transfer-learning models. As can be seen from the figure, models with a large number of parameters have larger FLOPs, and vice versa. In our case, PVT-medium with 44 million parameters had the smallest FLOPs count of 7 G. In contrast, ViT-large with 309 million parameters had the highest FLOPs count of 59 G. Therefore, PVT-medium, with the smallest FLOPs count, was the most efficient choice for vision-transformer-based transfer learning on the DDSM dataset, although its accuracy is the same as that of the other five models.
We trained vision transformer models built from scratch, as in Dosovitskiy et al. [58], Liu et al. [59], and Wang et al. [60], to compare their performance against the pre-trained vision transformer models (the proposed transfer-learning method) for classifying breast images. We used pre-trained vision transformer models with the final layer changed to reflect the number of classes we wanted to categorize into (two in our case); apart from that, we used the original models (with the same number of layers as the original models utilized in this paper) as they are. For a fair comparison, we utilized the same optimizers and corresponding learning rates in the models trained from scratch as in the proposed model. Table 3 shows the results of the transformer models trained from scratch on the DDSM dataset. The PVT-medium model provided the highest performance among all the vision transformer models trained from scratch, with an accuracy, AUC, F1 score, precision, recall, MCC, and kappa score of 0.78 ± 0.02, 0.77 ± 0.02, 0.78 ± 0.01, 0.78 ± 0.02, 0.78 ± 0.02, 0.77 ± 0.01, and 0.77 ± 0.02, respectively. This result is far inferior to those achieved by the vision-transformer-based transfer-learning models in Table 2. Hence, vision-transformer-based transfer-learning models provide improved performance compared with vision transformer models trained from scratch on the DDSM dataset.
To compare the performance of the proposed vision-transformer-based transfer learning with CNN-based transfer learning, we ran experiments using state-of-the-art CNN models with the same settings as the vision transformer models, except for using ReLU and GELU as the activation functions for the CNN-based and vision-transformer-based models, respectively. The results of the CNN-based transfer-learning models on the DDSM dataset are presented in Table 4. Compared with the results achieved by the vision-transformer-based transfer-learning models in Table 2, with the highest AUC of 1 ± 0, the results of the CNN-based transfer-learning models, with the highest AUC of 0.95 ± 0.01 for ResNet50, indicate that CNN-based transfer-learning models perform much worse. This indicates that vision transformers are better suited to mammograms than CNNs.

Discussion
Vision-transformer-based transfer learning for breast mammogram classification was developed in this study. We implemented image-augmentation-based class-wise data balancing to compensate for the imbalance in the number of benign and malignant samples within the DDSM dataset. This helps the proposed model avoid bias toward the class with the greater amount of data, which would otherwise negatively affect the detection outcome. State-of-the-art vision transformer architectures, namely the vision transformer model proposed by Dosovitskiy et al. [58], the Swin vision transformer model proposed by Liu et al. [59], and the pyramid vision transformer model proposed by Wang et al. [60], were utilized to evaluate the performance of the developed vision-transformer-based transfer learning for breast mammogram classification. Consequently, the vision-transformer-based transfer-learning approach provided the highest quantitative and statistical measures for classifying breast mammograms as benign or malignant. This demonstrates the effectiveness of the vision-transformer-based transfer-learning approach for detecting breast cancer from mammograms. The prime reasons for the better performance of vision transformers are their ability to capture global information from the early layers and the deep self-attention mechanism, which enables the features in each patch to be carefully analyzed for decision making.
Additionally, our study showed that vision transformer models are more effective when used for transfer learning on the DDSM dataset than when trained from scratch, because of the small number of images in the DDSM dataset. DL models require a large amount of training data and have a large number of parameters, which results in overfitting when the training dataset is small, as with the DDSM data. Therefore, transfer learning provided better results, as it used weights pre-trained on large datasets, such as ImageNet, and leveraged that knowledge to learn from small datasets, such as DDSM, during training.
We further investigated the effectiveness of vision-transformer-based transfer learning by comparing it directly with CNN-based transfer learning for classifying breast mammograms as benign or malignant. In summary, we observed that vision-transformer-based transfer learning outperformed CNN-based transfer learning on the DDSM dataset. Moreover, PVT-based transfer-learning models were computationally less expensive, providing the same performance as the other models, including ViTs, at a lower computational cost for breast mammogram classification. Finally, we compared our approach with models from published works and found that our approach produced the best performance results (Table 5). Details about the models in Table 5 can be found in Section 2.

Conclusions
We have presented a vision-transformer-based transfer-learning approach for breast mammogram classification. A detailed evaluation using different vision transformer models and variants was performed. Consequently, we found that vision-transformer-based transfer learning is effective for breast mammogram image classification, providing superior performance with less computational complexity. Vision-transformer-based transfer learning outperformed convolutional neural network-based transfer learning for breast mammogram classification. However, this result was obtained from training on a single dataset obtained from one source, and further studies utilizing different datasets from different sources should be conducted to generalize the results obtained in this study. Future studies should also consider the use of various deep learning parameters to investigate their effect on vision-transformer-based transfer learning for breast mammogram image classification.