Identifying Malignant Breast Ultrasound Images Using ViT-Patch

: Recently, the Vision Transformer (ViT) model has been used for various computer vision tasks, due to its advantages to extracting long-range features. To better integrate the long-range features useful for classiﬁcation, the standard ViT adds a class token, in addition to patch tokens . Despite state-of-the-art results on some traditional vision tasks, the ViT model typically requires large datasets for supervised training, and thus, it still face challenges in areas where it is difﬁcult to build large datasets, such as medical image analysis. In the ViT model, only the output corresponding to the class token is fed to a Multi-Layer Perceptron (MLP) head for classiﬁcation, and the outputs corresponding to the patch tokens are exposed. In this paper, we propose an improved ViT architecture (called ViT-Patch), which adds a shared MLP head to the output of each patch token to balance the feature learning on the class and patch tokens. In addition to the primary task, which uses the output of the class token to discriminate whether the image is malignant, a secondary task is introduced, which uses the output of each patch token to determine whether the patch overlaps with the tumor area. More interestingly, due to the correlation between the primary and secondary tasks, the supervisory information added to the patch tokens help with improving the performance of the primary task on the class token. The introduction of secondary supervision information also improves the attention interaction among the class and patch tokens. And by this way, ViT reduces the demand on dataset size. The proposed ViT-Patch is validated on a publicly available dataset, and the experimental results show its effectiveness for both malignant identiﬁcation and tumor localization.


Introduction
As humans understand more about pathology, medical practitioners expect more accurate algorithms to assist doctors to obtain precise information from medical images.Zheng et al. [1] used sparse representation to separate the structural information of lowdose CT images from noise and artifact information.In [2], Nikolaev et al. proposed a novel automated cone-based breast ultrasound system which can facilitate the volumetric ultrasound acquisition of the breast in a prone position.Xu et al. [3] applied a sparse angle CBCT reconstruction algorithm based on Guided Image Filtering (GIF) to reduce the radiation dose without damaging the reconstruction quality.
In the last decade, deep learning has been widely applied in the field of computer vision, image processing, and Natural Language Processing (NLP).For some medical image analysis tasks, deep learning has also become a mainstream technique.Various deep-learning models have been designed for disease classification, lesion detection and segmentation, and multi-model medical image registration.
Deep belief networks have been used earlier to classify Alzheimer's disease based on brain Magnetic Resonance Imaging (MRI) [4,5].In [6], Cao et al. systematically evaluated the performances of several state-of-the-art target detection methods based on deep learning.Most of the deep-learning models used for medical image analysis are based on the convolutional neural networks (CNNs).However, as indicated in [7], the CNN-based models are not good at extracting long-range features, which are exactly what is needed for many medical image analysis tasks.Deep-learning models are required to make judgments based on global features of the entire image, just like a real doctor.
Recently, the transformer model [8], which is more adaptable at extracting longrange features, has been introduced from the field of NLP to computer vision.In 2020, Dosovitskiy et al. [9] proposed a Vision Transformer (ViT) model for image classification, based on the self-attention mechanism of the transformer, and has achieved SOTA results.Based on ViT, the Deit [10] model adds a distillation token as a teacher to improve classification performance.Wang et al. [11] built a PVT model by simulating the feature pyramid in convolutional networks to make ViT with multi-scale feature information.The ViT has also been introduced for medical image analysis because of its long-range feature extraction capability [12].Wu et al. [13] used ViT to classify CT images for emphysema.Gao et al. [14] proposed ViT-based models to classify COVID-19 CT images in the MIA-COV19 challenge.ViT models were also used for medical image segmentation [15][16][17].In 2021, Gheflati and Rivaz [18] first applied the ViT for classifying breast ultrasound images.In [19], Shamshad et al. provided a comprehensive review of the applications of transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues.
Although ViT models have achieved significant success in many medical images analysis tasks, they still face challenges.In [20], Tu et al. indicated that ViT is weak in extracting local information, and proposed the maxvit model, which can extract both local and global information.Yi et al. [21] proposed a local-global transformer (LGT) to preserve the intra-unit information and inter-unit correlation.Yuan et al. [22] proposed tokens-to-token to imitate the convolution by reshaping the tokens in the transformer, which is useful for the interaction of tokens.In [23], Liu et al. proposed Swin-Transformer, to assist cross-window connection by the Shifted windows.And, as indicated by [9], compared to the CNN-based models, the ViT-based models usually require larger datasets to support their training, due to a more number of learnable parameters.However, for many medical related tasks such as the ultrasound image analysis also focused on in this paper, it is often difficult to build a large dataset like the natural image dataset ImageNet, due to the difficulties in acquiring and labeling data, which has resulted in the ViT models performing less efficiently than expected on many medical image analysis tasks.
The standard ViT architecture for image classification adds a class token in addition to the patch tokens to better aggregate the features extracted from each image patch.At the end of the transformer encoder, only the output corresponding to the class token is fed to a Multi-Layer Perceptron (MLP) head for classification, while all of the outputs corresponding to the patch tokens are directly discarded.To improve the training of ViT models on small training datasets, this work proposes an improved ViT architecture, called ViT-Patch, by introducing a secondary task on the patch tokens, which is related to the classification task (i.e., the primary task) on the class token.An MLP head is added for each patch token in ViT-Patch to determine whether the patch overlaps with the tumor area (i.e., the secondary task).Since the primary and secondary tasks are closely related, the proposed ViT-Patch can obtain more supervised information from a single image sample to train model parameters, thus reducing the requirement for training datasets, compared with the standard ViT that relies only on the output of the class token.
The proposed ViT-Patch is validated based on a publicly available breast ultrasound dataset.The experimental results show that ViT-Patch can effectively improve the efficiency of feature learning on the class token and patch tokens.Compared with the standard ViT, the ViT-Patch not only improves the performance of the primary task (malignant identifica-tion) on the class token, but also enables patch-based tumor localization using the outputs of patch tokens, which will be useful for subsequent detection or segmentation tasks.

Standard ViT for Malignant Identification
The ViT replaces the traditional convolutional computation with a self-attentive mechanism in vision tasks.Figure 1 shows the standard ViT, which is composed of three modules: patch and position embedding, transformer encoder, and MLP head.Patch and position embedding: Since the input to transformer is a sequence of vectors, the ViT first needs to divide the input image into different patches and encode them to form a sequence of embedded vectors, i.e., the initial patch tokens.Suppose that the input image is I ∈ R H×W×C , where C is the number of channels, and H and W represent the height and weight of the image, respectively.The size of each patch is S × S.There are N = HW/S 2 patches in total, and by patch and position embedding, N embedded vectors (patch tokens) are obtained.Similar to BERT [24], ViT adds a learnable token, i.e., the class token.All N + 1 tokens are fed to the transformer encoder for multi-headed self-attention computation.
The patch embedding can be implemented with a S × S × C × D convolutional layer of padding = 'valid', channel number = C and stride = S, where t n ∈ R D denotes the n-th patch token, and the dimension D of each token vector is determined by the number of output channels of the convolutional layer.To embed the position information, a learnable D-dimensional position encoding vector s n is added to each patch token, i.e., t n ← t n + s n .An initial token matrix T ∈ R (N+1)×D is finally obtained by concatenating the class token and the N patch tokens.
Transformer encoder: The transformer encoder is composed of multiple encoder blocks based on multi-headed attention.In each encoder block, there are two basic computational units, Multi-Headed Self-Attention (MHSA) and MLP, each preceded by a LayerNorm (LN) [25,26] and followed by a residual connection.Based on single attention function, MHSA use multi-head for multiple parallel attention calculations.
QKV represents the query Q ∈ R (N+1)×d , keyword K ∈ R (N+1)×d , and value V ∈ R (N+1)×d .Finally, MHSA consider h attention "heads", where h represents the number of heads, and Let T be the input token matrix of an encoder block, i.e., the output of the previous block.The data flow is shown in Figure 1b.The feedforward calculation of the encoder block is written as For more detail about the MHSA, please refer to [8].
MLP head on class token: For malignant identification, an MLP head is added on the top of the class token (see Figure 1), which contains a hidden layer with a GeLU activation, and an output layer of size 1 with a sigmoid activation.The desired output of the MLP head should be 1 if the input image contains a malignant tumor, and 0 if it is normal or benign.The Binary Cross Entropy (BCE) is used as a loss function to measure the difference between the desired output L c and the predicted output P c of the MLP head on the class token,

Improved ViT-Patch Architecture
The standard ViT usually requires large training datasets due to the large number of parameters.This makes it difficult for ViT models to take advantage of long-range feature extraction on some medical image analysis tasks where data acquisition and labeling costs are high, resulting in performance that is often inferior to that of CNN-based models.
As can be seen from Figure 1, only the class token is covered by an MLP head, while all patch tokens are not roofed in the standard ViT.This single MLP-head architecture tends to lead to the unbalanced training of class token and patch tokens.During the training phase, the supervised information is only back-propagated from the class token to the transformer encoder, which may lead to weaker and weaker associations between the features learned by each patch token and its initial embedded features, as the number of encoding blocks increases, since the MHSA mechanism allows for the lateral flow and fusion of embedded features between tokens.As shown in Figure 2, each row of the last layer token of ViT is much similar with the others.At the deeper levels, the patch tokens and class tokens have similar functionalities, and their output features are prone to redundancy, which can cause the model to be prone to overfitting on small datasets.
We expect the model to processing image information like a real doctor, which can constantly interact with global and local information.And, we believe that the patch tokens contain valid information for category classification, so we hope to find a way to introduce the patch tokens into the training process.In this work, We try to let each token keep its own characteristic information, and so a secondary task is imposed on the patch tokens, which is related to the primary task (malignant identification) imposed on the class token.As shown in Figure 3, an MLP head is added to each patch token output by the transformer encoder to determine whether the patch overlaps with the tumor.Adding the secondary task can balance the training of the class and patch tokens, and make the flow and fusion of the image features between the tokens more reasonable.With the class token focusing on the long-range features used for the malignant identification of the whole image, and with each patch token focusing on the local features related to this patch, ViT-Patch reduces the feature redundancy between the tokens and avoids the overfitting problem.
The proposed ViT-Patch architecture not only adds a patched-based tumor localization function, but also improves on the classification performance of the primary task by introducing auxiliary supervision information on the patch tokens that is closely related to the primary task on the class token.MLP head on patch tokens: All patch tokens share the same MLP head for the secondary task, which has the exact same structure as that on the class token, but does not share weights with the class token.If a patch has an overlap with a tumor area, the desired output of the MLP head corresponding to that patch should be 1, otherwise, it should be 0.
Joint loss function: Since the secondary task is also binary classified for each patch token, the BCE loss is adopted.The total loss function is defined as where P n /L n denotes the predicted/desired output of the MLP head on the n-th patch token.

Dataset
The breast ultrasound dataset is from Al-Dhabyani et al. [27], which is composed of 780 images from 600 women divided into three categories: 133 normal images, 437 benign images, and 210 malignant images.In this work, the normal and benign images are combined into one class, and the malignant images are another class, corresponding to the binary classification of the primary task.Both the benign and malignant images in the dataset are labeled with segmentation masks for the tumor areas that can be used to generate the patch-token labels for the secondary task.
Figure 4 shows some sample images and their masks in the original dataset.The morphological characteristics of the breast ultrasound images are shown in Table 1.In general, the malignant lesions are more irregular in shape than the benign lesions, and they are more likely to be confused with normal human tissue.The image sizes in the original dataset are not standardized.In order to be able to input the ViT models, we unified the image sizes to 384 × 384 by reshaping it.We randomly selected 118 (about 15% of the total samples) images as the test set, of which 33 images are malignant, 65 images are benign, and 20 images are normal.The remaining 664 images are used as training samples.

Experimental Setting
Our experimental platform is Torch 1.10 and Python 3.7.7 @ an RTX 3090Ti GPU.All models haved trained with the Adam [28] optimizer, with an initial learning rate of 1 × 10 −4 and beta = (0.9, 0.999).
The transformer encoder in the standard ViT contains 12 encoder blocks, and the dimension of all tokens is set to D = 768.For comparison, the ViT-Patch did not make any changes to the other modules of the standard ViT except for adding the MLP header to the patch tokens.All MLP heads used in the ViT and ViT-Patch have the same structure, containing a hidden layer of size 4D with the GeLU activation.All ViT-based models in our experiments are trained for 50 epochs.To verify the effects of different patch divisions on the performances of ViT-based models, we divided the 384 × 384 ultrasound images into patches of 3 × 3, 6 × 6, 12 × 12, and 24 × 24, forming patches of sizes S = 128, 64, 32, and 16, respectively.In the following, ViT/ViT-Patch(S) is used to denote the standard ViT/the proposed ViT-Patch with patch size S × S, respectively.

Experimental Results
We validate the performance of the ViT-Patch for whole-image-based malignant identification and patch-based tumor localization, based on the breast ultrasound dataset.The experimental results for malignant identification are compared with the standard ViT, and some traditional models which have been widely used for comparison.The learning curves of the standard ViT and ViT-Patch models are compared, to validate the improvement of the proposed architecture for model training.In addition, the differences between the deep token features learned by the standard ViT and the ViT-Patch models are analyzed to show the optimization of the proposed architecture for self-attention-based feature learning.
Malignant identification: The primary task (malignant identification) is evaluated using Accuracy (ACC) and Sensitivity (SEN), where TP indicates the number of True Positive samples, FP indicates False Positive, FN indicates False Negative, and TN indicates True Negative.Table 2 gives the ACC and SEN evaluation of the involved models for the primary task.The top two results (bolded in the table) are all from the ViT-Patch architecture.Compared to the standard ViT counterparts, the ViT-Patch models have advantages at all patch sizes, suggesting that adding a suitable secondary task to the patch tokens can improve the classification performance on the class token.The performance of the ViTbased models decrease significantly for larger patch sizes (S = 64, 128), which is lower than that of the CNN-based baselines.The reason for the drop may be that the patch embedding leads to significant feature loss when the patch size is large, and it is difficult for the self-attention mechanism to take advantage of long-range feature extraction, based on the barren embedded vectors.In terms of SEN alone, the ViT-based models have some advantages over the CNN-based models.In Table 2, the ACCs of the ViT-based models with large patch are lower than those of the CNN-based models, but their SEN still remains comparable.

Model ACC (%) SEN (%) Params
ResNet50 [29] 83.1 48.5 -InceptionV3 [30] 86.4 54.5 -VGG16 [31] 88.1 66.7 -Swin-B [23] 86.4 60.6 88 M TNT-B [32] 83.1 63.6 65.6 M T2T-ViT-24 [22] 84.7 51.5 64.1 M PVT-Large [11] 85 Lesion localization using ViT-Patch: We also evaluated the tumor localization ability of the ViT-Patch, i.e., the performance for the secondary task, which is a free gift provided by the proposed architecture.Figure 5 shows the localization results of several testing images predicted by the ViT-Patch models of various patch sizes.In the predicted mosaic maps (from the third to the last columns), the yellow blocks denote TP, the green blocks denote FP, the blue blocks denote FN, and the purple blocks denote TN.By comparing the predicated mosaics with the labeled segmentation masks (the second column in the figure), it can be seen that the ViT-Patches with different patch sizes can provide relatively reliable patch-based tumor localization, which can be used for subsequent segmentation tasks, e.g., optimizing the RPN or ROI modules of the Faster-RCNN model [33].The DICE coefficient is used to quantitatively evaluate the secondary task, Table 3 shows the DICE results for the ViT-Patch with different patch sizes.In contrast to the performance on the primary task, the ViT-Patch with the largest patch size (S = 128) has the best DICE.The overall trend is that models with small patch sizes have a large deviation between the predicted mosaic maps and the ground truth.However, it must be noted that small patches imply higher spatial resolution.The difficulty of the second task itself increases dramatically as the patches size decreases.

Learned Features for ViT vs. ViT-Patch
We compare the ViT and ViT-Patch in terms of model training and feature learning.For both architectures, the patch size is set to S = 32, the size that performs best on the primary task (see Table 1).
Figure 6 compares the loss-epoch learning curves and ACC (on the test set)-epoch curves of the standard ViT and ViT-Patch for the primary task.Although the Vit-Patch is trained with the joint loss in (4), for comparison, the learning curve is calculated only for l class , the primary-task loss component on the class token.As shown in Figure 6a, the loss curve of the standard ViT decreases faster than that of the ViT-Patch as the number of epochs increases, and its converged loss on the training set is lower than that of the ViT-Patch.However, the ViT-Patch outperforms the standard ViT on the test set at almost all training epochs (see Figure 6b).The contrast on the training and test sets suggests that the standard ViT is prone to overfitting on small training sets under a single class-token supervision, and the proposed ViT-Patch can improve on the generalization of the model by introducing additional patch-token supervision.To quantitatively evaluate the similarity between tokens, the variances of 145 tokens on each feature channel is calculated (see the variances-channel curves with the average variances over 768 channels in the third column of Figure 7).Consistent with the visual evaluation of the token maps, the variances obtained by the ViT-Patch are significantly larger than those obtained by the ViT.Further, we computed the eigenvalues of the token maps (matrices), which are arranged from smallest to largest in the last column of Figure 7, and it can be seen that the eigenvalue distribution of the ViT-Patch is better than that of the ViT, which is very concentrated, indicating a clear linear correlation between its learned feature vectors.

Conclusions
This paper has proposed an improved ViT architecture, ViT-Patch, which introduces a secondary task on the patch tokens, in addition to the primary task on the class token.The standard ViT used for breast ultrasound image classification only constructs one MLP head on the class token for supervised training, while a larger number of patch tokens are in suspense.This unbalanced architecture tends to lead to an overfitting of the ViT models on small training datasets.The ViT-Patch constructs an MLP head for each patch token (with shared weights across patches), and introduces supervised information on whether each patch overlaps with the tumor areas as a secondary task.The improved architecture not only gives the model the ability to localize tumors based on patches, but also improves on the performance of the primary task (i.e., malignant identification) on the class token, because more task-related supervised information is provided.The ViT-Patch is validated on the public breast ultrasound image dataset.The experimental results show that in the malignant identification task, the ViT-Patch performs significantly better than the standard ViT at all patch sizes, and the ViT-Patch models of small patch sizes also outperform the CNN-based baseline models.Besides malignant identification, the ViT-Patch can provide patch-based mosaic maps for tumor localization.In the experimental part, we also compared and analyzed the learning curves and the learned token maps of the ViT and ViT-Patch, which demonstrate that the ViT-Patch architecture can improve the generalization of the ViT model and optimize the feature learning at deep levels.

Figure 1 .
Figure 1.Standard ViT architecture (with N = 9 patches as an example).(a) Data Flow of ViT.(b) Transformer Encoder Block.

Figure 2 .
Figure 2. Tokens of Last Attention layer.

Figure 4 .
Figure 4. Breast ultrasound image images and their masks.

Figure 5 .
Figure 5. Lesion localization using ViT-Patch models.Each row corresponds to one sample, and the columns from left to right are: original images, labeled mask maps, and mosaic maps predicted by ViT-Patch with patch size S = 16, 32, 64, and 128, respectively.

Figure 6 .
Figure 6.Learning curves and ACC curves for ViT vs. ViT-Patch (32).(a) Learning curves.(b) ACC (on test set) curves.To further show the improvement of the ViT-Patch on feature learning, we visualized the token vectors learned by the standard ViT and ViT-Patch.The patch size is still set to be 32, which means that each level of transformer encoder has 145 token vectors (1 class token and 144 patch tokens) that can be visualized as a 145 × 768 token map.Column 2 of Figure 7 compares the token maps of the penultimate transformer block of the ViT and ViT-Patch, where the penultimate block is chosen because the patch tokens of the last block of the standard ViT are not included in the backpropagation loop.As shown in the figure, the row vectors in the ViT's token map are very similar, indicating that the multi-headed attention mechanism in the single-task mode based on the class token leads to homogenization across patch tokens and significant redundancy in the learned deeper features.In contrast, by introducing a patch-based second task, the ViT-Patch can avoid the homogenization of patch tokens.

Figure 7 .
Figure 7.Comparison of deep token features learned by ViT and ViT-Patch.Each row corresponds to one sample.The columns from left to right are: original image, token maps, variance-channel curves, and eigenvalue curves.

Table 1 .
Morphological characteristics of breast ultrasound images.

Table 2 .
Quantitative evaluation and comparison for the main task.

Table 3 .
Quantitative evaluation for the secondary task of ViT-Patch models.