Batch Similarity Based Triplet Loss Assembled into Light-Weighted Convolutional Neural Networks for Medical Image Classification

In many medical image classification tasks, there is insufficient image data for deep convolutional neural networks (CNNs) to overcome the over-fitting problem. Light-weighted CNNs are easy to train, but they usually have relatively poor classification performance. To improve the classification ability of light-weighted CNN models, we propose a novel batch similarity-based triplet loss to guide the CNNs in learning their weights. The proposed loss utilizes the similarity among multiple samples in the input batches to evaluate the distribution of training data. Reducing the proposed loss increases the similarity among images of the same category and reduces the similarity among images of different categories. In addition, it can be easily assembled into regular CNNs. To evaluate the proposed loss, experiments were carried out on chest X-ray images and skin rash images, comparing it with several other losses on popular light-weighted CNN models such as EfficientNet, MobileNet, ShuffleNet and PeleeNet. The results demonstrate the applicability and effectiveness of our method in terms of classification accuracy, sensitivity and specificity.


Introduction
Medical image classification is one of the more basic and important tasks in computer-aided diagnosis (CAD). An efficient medical image classifier can help reduce the workload of doctors and guide inexperienced physicians. In recent years, deep learning (DL) methods, especially those based on convolutional neural networks (CNNs), have shown outstanding performance in image processing tasks. Therefore, researchers have developed and applied many heavy-weighted CNNs for medical image classification [1]. For example, AlexNet [2] is used for breast cancer recognition from histological images [3] and Alzheimer's disease diagnosis from MRI images [4]. Besides this, the visual geometry group (VGG) network [5] is utilized to identify papillary thyroid carcinomas in cytological images [6] and discover COVID-19 cases based on X-ray images [7]. In addition, Inception-V3 [8] is trained for distinguishing skin cancer images from normal ones [9], and differentiating benign and malignant renal tumors based on CT images [10]. Moreover, the residual network (ResNet) [11] is applied to HEp-2 cell classification [12] and the quality assessment of retinal OCT images [13]. Even though these heavy-weighted models can achieve good performance in some specific applications, they have limited capabilities in many medical applications in the case of small samples. The reason is that the effectiveness of these networks depends on the quality and quantity of training data, while there are usually not enough annotated image data to train very deep networks. Therefore, light-weighted networks such as DenseNet [14], MobileNet [15], ShuffleNet [16] and EfficientNet [17] have aroused researchers' interest, and many such models are applied to medical image classification tasks. For example, Yuan et al. used DenseNet to realize polyp recognition from wireless capsule endoscopy images [18]. Brehar et al. designed a new shallow CNN model to recognize hepatocellular carcinoma, and it provided a higher accuracy than the compared deep models [19]. Besides this, some researchers tried to diagnose COVID-19 using MobileNet V2, ShuffleNet, and EfficientNet based on X-ray images [20][21][22]. These light-weighted models have fewer adjustable weights, are easier to train and have better computational efficiency compared with the heavy-weighted ones, but their classification ability still needs to be improved to deal with complex medical applications, especially in the case of small amounts of training samples.
From the perspective of loss function, the mentioned CNN models take the traditional cross entropy (CE) as the loss function for the training process. However, the CE function only measures the difference between the predicted probability distribution and the target distribution [23], which means that CE-based models cannot analyze the distribution of samples and classes. Compared with regular CNN models, some models for few-shot recognition tasks adopt different losses where the deep metric learning (DML) technique is involved [24]. The DML losses take advantage of data distribution for discovering the differences among classes and finding the major common patterns for each category. In other words, the DML losses can assist the CE-based CNN models in the training process.
Generally, the DML-based CNNs use multi-inputs and produce multiple embedding vectors so that they can calculate a certain distance metric from the embedding vectors as the loss for training. By minimizing the loss, the CNNs can generate similar embedding vectors for images belonging to the same class, and dissimilar vectors for different categories of inputs [25]. For example, the Siamese network [26] proposed by Jane et al. took paired images as inputs so that it can use the distance of embedding vectors along with the ground truth to form the contrastive loss function. Hoffer et al. [27] designed the triplet network which used the ternary input to obtain distinguishing features from two different kinds of samples. To get more information from different classes, Sohn et al. [28] proposed multi-class N-pair loss, which utilized one sample from each class to identify each input example. Notice that this pair-based DML loss only observes several samples one at a time, which means the estimated data distribution varies a lot. As an improvement, Song et al. proposed a lifted structured feature-embedding method [29], which measured all the distances between every two samples in a training batch so that it could learn to discriminate embedding vectors according to the data distribution of the input batch.
Despite the fact that the DML-based CNNs can analyze the data distribution and improve the training process, they are designed for such few-shot recognition or image retrieval tasks as face recognition [28,30]. Even though some researchers have tried to apply DML loss to train CNNs in such medical applications as coronary heart disease classification [31] and COVID-19 diagnosis [32], these models need extra classifiers to accomplish the classification tasks. Besides this, it is hard to select a reasonable support set for regular medical image classification tasks. There are two simple ways to introduce DML into regular classification applications without a query image set. One is to use DML loss to train the CNN models to obtain distinguishing embedding vectors, and classify them with a traditional classifier. For example, Gupta et al. [33] proposed Siamese CNN, which was trained based on the triplet loss, and used the SVM to recognize mitotic HEp-2 cell images. This kind of scheme relies on the classification ability of the adopted classifier in the case of a small training dataset. Another idea is to combine the DML loss with the CE loss to train models. However, these mentioned pair-based DML losses involve specially designed rules for pairing the samples. The rules are incompatible for regular CE-based CNNs at the training stage. Sun et al. [34] tried to combine triple loss with the CE loss to overcome the problem that the herbal images are too diverse and complicated. Lei et al. [35] proposed a novel class-center-involved triplet loss, and combined it with the CE loss to deal with the Sensors 2021, 21, 764 3 of 21 imbalanced data problem for the skin disease classification. However, these DML losses are still pair-based and cannot effectively represent the data distribution of the dataset.
To more effectively utilize the DML loss to train regular CNNs in medical image classification applications, we have proposed a novel batch similarity-based triplet loss, denoted as "BSTriplet" loss for short. The proposed loss assesses the similarities among a whole batch of input images, instead of several pairs, to evaluate the distribution of training data. As shown in Figure 1, the proposed BSTriplet loss takes the embedding vectors produced by the CNN model to calculate a similarity matrix, which contains the similarities between every pair of samples in the input batch. For each sample, the BSTriplet loss analyzes the similarities between it and the rest of the samples. According to the ground truth, the BSTriplet loss converts the difference of samples from the same class and the similarity of the rest of the samples into a loss. By reducing the produced loss, the CNN model can achieve the goal of making the samples of the same class become closer in the embedding space, or farther apart if they are from different categories. The BSTriplet loss calculates the loss for each sample in parallel, which means that it is compatible with model training by CE loss. Since the proposed loss can only play a role in clustering, we integrated the softmax for classification and utilized the CE to evaluate the ability of classification. Moreover, we have designed a novel sample-mining method to build the input batch according to the distribution of the whole training dataset. In this way, the input batch will contain diverse samples of all classes, and still have better consistency compared with batches constructed by random selection. To evaluate the effectiveness of the proposed loss, we have used the BSTriplet loss combined with CE loss to train popular light-weighted networks, namely MobileNet-V3 [36], ShuffleNet-V2 [37], EfficientNet [17] and PeleeNet [38], on three different medical image datasets. The results show that the BSTriplet loss can guide the CNN models to cluster the embedding vectors. Moreover, it can help these light-weighted CNN models to achieve better performance compared with the traditional CE loss or other combined losses.
In this study, our main contributions are as follows:

1. We have introduced a novel batch similarity-based loss, which can be embedded into arbitrary CNN models to help the training process;
2. A reasonable sample mining strategy is designed to help CNN models in the better estimation of the distribution of the training dataset;
3. The proposed loss is combined with cross entropy to train several light-weighted CNN models, and its effectiveness has been demonstrated on different kinds of medical image datasets.
The remainder of this paper is organized as follows. Section 2 describes the details of the designed BSTriplet loss and the designed data mining technique. Section 3 presents the experiments performed to explore the characteristics of the proposed loss and to compare it with other DML losses. The conclusion is given in Section 4.
In the traditional similarity-based triplet loss $L^{s}_{trip}$, the similarities $s_i^{+}$ and $s_i^{-}$ involve the transposition of $\hat{f}_i$. By reducing $L^{s}_{trip}$, $s_i^{+}$ and $s_i^{-}$ will be close to 1 and 0, respectively. Notice that for $f_i$, $L^{s}_{trip}$ only uses two reference vectors, $f_i^{-}$ and $f_i^{+}$, to calculate the loss, which is disadvantageous for training due to the variety and complexity of the triplet inputs.

Batch Similarity Based Triplet Loss
In the BSTriplet, we aim at making full use of the input batch. Therefore, the similarities between every pair of samples in the batch are measured to evaluate the data distribution. After the embedding vectors $f \in \mathbb{R}^{N_b \times l}$ for the input batch containing $N_b$ samples are obtained, each vector $f_i$ is L2-normalized to produce $\hat{f}_i = f_i / \|f_i\|_2$, which is inherited from the traditional similarity-based triplet loss. The similarity matrix $S \in \mathbb{R}^{N_b \times N_b}$ is produced by $S = \hat{f}\hat{f}^{T}$, which stores the similarities of all possible pairs of samples in the batch. To analyze the similarity matrix $S$, the ground truth $y \in \mathbb{R}^{N_b \times 1}$ is necessary.
Each label $y_i \in \mathbb{R}^{1 \times 1}$ can be transformed into a row vector $\hat{y}_i \in \mathbb{R}^{1 \times C}$ according to the one-hot encoding method, where $C$ is the number of classes. As a result, $\hat{y}_i$ is a row vector full of zeros, except that the $y_i$-th element is one. A binary matrix $B \in \mathbb{R}^{N_b \times N_b}$ can be generated by $B = \hat{y}\hat{y}^{T}$. The binary matrix $B$ is utilized to distinguish the similarities for positive pairs from those for negative ones by calculating the discriminative similarity matrix $D$:

$$D = (S - I) \otimes (2B - \mathbf{1}),$$

where the symbol $\otimes$ denotes the Hadamard product [40], and $\mathbf{1} \in \mathbb{R}^{N_b \times N_b}$ and $I \in \mathbb{R}^{N_b \times N_b}$ are a matrix full of 1s and an identity matrix, respectively. Equivalently, $D$ is $S \otimes (2B - \mathbf{1})$ with its diagonal set to zero. Note that in the matrix $D$, the similarities of positive pairs are greater than 0, whereas those of negative pairs are less than 0. Moreover, the diagonal elements of $D$ are zero. The diagonal elements of $S$ are wiped out because they represent the similarity between each vector $\hat{f}_i$ and itself, and they do not help in evaluating the data distribution since $S_{i,i} = 1$. The process of evaluating the similarities among all samples for the BSTriplet loss is shown in Figure 2.
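The construction of $S$, $B$ and $D$ can be illustrated with a minimal sketch in plain Python. The function name `discriminative_similarity` and the toy inputs are hypothetical, the embeddings are assumed to be non-zero vectors, and integer labels are compared directly instead of multiplying one-hot vectors (which gives the same $B$).

```python
import math


def discriminative_similarity(embeddings, labels):
    """Return (S, B, D) for a batch of embedding vectors and integer labels."""
    # L2-normalise every embedding: f_hat_i = f_i / ||f_i||_2 (non-zero vectors assumed)
    normed = []
    for f in embeddings:
        norm = math.sqrt(sum(v * v for v in f))
        normed.append([v / norm for v in f])
    n = len(normed)
    # S = f_hat f_hat^T holds the similarity of every pair in the batch
    S = [[sum(a * b for a, b in zip(normed[i], normed[j])) for j in range(n)]
         for i in range(n)]
    # B is 1 where two samples share a label and 0 otherwise (same as y_hat y_hat^T)
    B = [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]
    # D keeps positive-pair similarities, negates negative-pair ones,
    # and zeroes the diagonal because self-similarity carries no information
    D = [[0.0 if i == j else S[i][j] * (2 * B[i][j] - 1) for j in range(n)]
         for i in range(n)]
    return S, B, D


if __name__ == "__main__":
    feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    S, B, D = discriminative_similarity(feats, [0, 0, 1, 1])
    for row in D:
        print([round(v, 3) for v in row])
```

In a real training pipeline the same three steps would normally be written as batched tensor operations so that gradients can flow through them; the loops here only make the indexing explicit.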
Based on the above discriminative similarity matrix $D$, the loss for each input sample $x_i \in X$ can be evaluated by observing the $i$-th row of $D$. The values $D_{i,j}$ in the $i$-th row represent the similarities between the vector $f_i$ and the other vectors in the batch (except the diagonal element $D_{i,i} = 0$). Clearly, there may be multiple similarities for positive and negative pairs in the $i$-th row. For convenience, we re-denote the similarities for positive pairs as $D^{+}_{i,j}$, $j = 1, 2, \ldots, N^{+}$, and those for negative pairs as $D^{-}_{i,k}$, $k = 1, 2, \ldots, N^{-}$. Please note that $N_b = N^{+} + N^{-} + 1$. The batch similarity-based triplet loss for $x_i$ is defined from a constant $m \in [0, 1]$ and two bracketed terms, which represent the average similarity for positive pairs and that for negative ones, respectively. Since $D_{i,j} \le 1$, the square of $D_{i,j}$ is adopted instead of a linear function so that the loss has a smoother gradient around the optimal solution. The loss for the input batch, $L^{b}_{trip}$, is the average of the losses for all $x_i$ in the batch. As shown in Figure 3, by reducing the loss $L^{b}_{trip}$, every embedding vector $f_i$ moves towards the center of the vectors $f^{+}_{i,j}$ and further away from the vectors of the other classes. In addition, for a well-trained model, $m$ sets a lower bound; in other words, $m$ controls the strictness of the constraint for clustering, and it should be near 1. The reason for not setting $m = 1$ directly is that it is unnecessary and unrealistic to let all the embedding vectors in the same class be the same.
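The role of $m$ and of the squared similarities can be illustrated with a small sketch. The exact expression of the per-sample loss is not restated in this text, so the code below assumes a hinge form, max(0, m - mean of squared positive similarities + mean of squared negative similarities), which is only one plausible reading of the description. The function name `bstriplet_loss`, the default `m=0.9` and the convention that an empty positive or negative set contributes 0 are all assumptions made for illustration.

```python
def bstriplet_loss(D, labels, m=0.9):
    """Batch loss under an ASSUMED hinge form (not confirmed by the source):
    per sample, max(0, m - mean(D_pos ** 2) + mean(D_neg ** 2)); then averaged."""
    n = len(D)
    losses = []
    for i in range(n):
        # Squared similarities of positive pairs in row i (diagonal excluded)
        pos = [D[i][j] ** 2 for j in range(n) if j != i and labels[j] == labels[i]]
        # Squared similarities of negative pairs in row i
        neg = [D[i][j] ** 2 for j in range(n) if labels[j] != labels[i]]
        # Assumption: a missing positive or negative set contributes 0
        avg_pos = sum(pos) / len(pos) if pos else 0.0
        avg_neg = sum(neg) / len(neg) if neg else 0.0
        losses.append(max(0.0, m - avg_pos + avg_neg))
    # The batch loss is the average of the per-sample losses
    return sum(losses) / n


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    clustered = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
    print(bstriplet_loss(clustered, labels))
```

Under this assumed form, a perfectly clustered batch gives a zero loss for any $m \le 1$, while a larger $m$ demands higher within-class similarity before the loss vanishes.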
To accomplish the classification and evaluate the classification capability, the softmax classifier and the CE loss $L_{ce}$ are used. They are defined as

$$p_{i,j} = \frac{\exp(o_{i,j})}{\sum_{k=1}^{C}\exp(o_{i,k})}, \qquad L_{ce} = -\frac{1}{N_b}\sum_{i=1}^{N_b}\sum_{j=1}^{C} \mathbb{1}_{y_i = j}\,\log p_{i,j},$$

where $o_{i,j} \in o_i$ is the output of the last fully connected layer in the model $F(x_i)$, and $p_{i,j}$ is the predicted probability that sample $x_i$ belongs to the $j$-th class. The indicator $\mathbb{1}_{y_i = j}$ equals 1 when the ground truth $y_i$ is equal to $j$, and 0 otherwise. The total loss is determined by both the CE loss and the BSTriplet loss as

$$L_{total} = L_{ce} + \lambda L^{b}_{trip},$$

where $\lambda$ is a trade-off parameter. Since the value of $L^{b}_{trip}$ is close to that of $L_{ce}$, the parameter $\lambda$ is empirically set as 1.
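As a minimal sketch of how the classification term and the combined objective fit together, the snippet below implements a softmax, a batch-averaged cross entropy and a weighted sum with a clustering loss value. The function names are hypothetical, averaging the CE over the batch is an assumption, and the BSTriplet value is passed in as a plain number rather than computed here.

```python
import math


def softmax(logits):
    """Convert the outputs of the last fully connected layer into probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(o - shift) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Average negative log-probability of the true class over the batch."""
    losses = [-math.log(softmax(o)[y]) for o, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def total_loss(ce, bst, lam=1.0):
    """Weighted sum of the CE loss and the BSTriplet loss (lam is the trade-off)."""
    return ce + lam * bst


if __name__ == "__main__":
    logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
    ce = cross_entropy(logits, [0, 1])
    print(total_loss(ce, bst=0.25))
```

With `lam=1.0`, as in the text, both terms contribute equally; a different value would only be needed if the two losses had very different magnitudes.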

Data Mining Strategy
For the regular triplet loss, every triplet input must contain a sample $x_i$ and two reference samples, $x_i^{+}$ and $x_i^{-}$, to form both a positive pair and a negative pair. Therefore, data mining is necessary for constructing valid triplet inputs. In contrast with the regular triplet loss, the proposed BSTriplet loss works well even when the samples in the batch can only compose positive pairs or only negative ones. The only unavailable case for the BSTriplet loss is $N_b = 1$. Nevertheless, to obtain the objective data distribution of the training dataset, we have designed a new data mining strategy, which can be described in the following steps:

1. Classify the original images into different categories according to the ground truth;
2. Cluster the original images into several groups for each category;
3. Count the number of images in each group, and calculate the ratio for each group;
4. Randomly select samples from every group to construct an input batch according to the obtained ratios.
The number of groups for each class in the dataset can be different because it depends on the specific application and data distribution. Notice that this strategy is only used in the training process. During the test phase of a trained CNN model, the model ignores both the BSTriplet loss and the CE loss, and takes the output of the softmax function as the predicted result.
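Steps 3 and 4 of the strategy can be sketched as follows, assuming that steps 1 and 2 have already assigned every image to a group (for example a pair of class and cluster index). The function name `build_batch`, the rounding rule and the minimum of one sample per group are assumptions made for illustration, so the returned batch can differ slightly from the requested size.

```python
import random
from collections import defaultdict


def build_batch(group_of, batch_size, rng=random):
    """Sample a batch whose group proportions follow the training set.

    group_of maps an image id to its group id, e.g. (class, cluster index).
    """
    groups = defaultdict(list)
    for image_id, group_id in group_of.items():
        groups[group_id].append(image_id)
    total = len(group_of)
    batch = []
    for group_id in sorted(groups):
        members = groups[group_id]
        # Step 3: the ratio of the group decides its share of the batch
        # (assumption: rounded, and never fewer than one sample per group)
        quota = max(1, round(batch_size * len(members) / total))
        # Step 4: draw that many images from the group without replacement
        batch.extend(rng.sample(members, min(quota, len(members))))
    return batch


if __name__ == "__main__":
    assignment = {}
    for k in range(30):
        assignment["a%d" % k] = ("normal", 0)
    for k in range(10):
        assignment["b%d" % k] = ("normal", 1)
    for k in range(20):
        assignment["c%d" % k] = ("pneumonia", 0)
    print(build_batch(assignment, 6, random.Random(0)))
```

Calling the function once per iteration reproduces the behaviour described above: grouping is done once, and only the random draw is repeated during training.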

Computational Complexity
Supposing the number of training images is $N_{trn}$, it takes $N_{trn}$ floating point operations (FLOPs) for indexing in the first step of the proposed data mining scheme. As for the second step, its computational efficiency depends on the adopted clustering method. The required number of FLOPs for clustering is denoted as $O(\mathrm{clustering})$. Given the number $g$ of groups of training images, the third step takes $g$ FLOPs to obtain the ratios, and the fourth step needs $N_b$ FLOPs to construct an input batch. Notice that the first three steps of the proposed data mining strategy are carried out only once before the training process, and they need $O(\mathrm{clustering}) + N_{trn} + g$ FLOPs in total, while the fourth step is performed once for each iteration in the training process.
The calculation of the BSTriplet loss consists of basic mathematical operations. Therefore, it is computationally efficient and easy to implement. Normalization of the embedding vectors requires $lN_b$ multiplications and $(l - 1)N_b$ additions, which are $(2l - 1)N_b$ FLOPs. Meanwhile, obtaining the similarity matrix $S$ requires the product of an $N_b \times l$ matrix and an $l \times N_b$ matrix. Besides this, encoding the ground truth $y$ into a one-hot vector $\hat{y}$ takes $N_b$ operations of indexing and assignment, the calculation cost of which is considered to be $2N_b$ FLOPs in this paper for facilitating statistical analysis. The binary matrix $B$ is generated with $(C^2 + C - 1)N_b^2$ FLOPs. As for the discriminative similarity matrix $D$, it is calculated once and shared by every sample in the batch, and it can be produced by setting the diagonal of $S \otimes (2B - \mathbf{1})$ to 0. The per-sample loss in Equation (3) needs to be calculated $N_b$ times, and it costs $2N_b^2 + 2N_b$ FLOPs in total. The average operation of the BSTriplet loss can be performed at a cost of $N_b$ FLOPs. Overall, the number of FLOPs required by the BSTriplet loss in each iteration is the sum of the above terms.

Experimental Setup
In the following, experiments are performed to test the effectiveness of the proposed BSTriplet loss. We apply the proposed loss to popular light-weighted networks, namely EfficientNet-B1, MobileNet-V3-Small, ShuffleNet-V2 and PeleeNet, to test its performance. The chosen models have distinctive structures and state-of-the-art performance, and they have been widely used in a variety of image classification tasks. The reason for adopting EfficientNet-B1 rather than the other models in [17] is that EfficientNet-B1 has a similar complexity to the other light-weighted models, and its classification ability is better than that of EfficientNet-B0. The number of parameters and the computational complexity of the compared networks are shown in Table 1, where the FLOPs are produced when the size of the input images is 128 × 128 × 1. According to Table 1, PeleeNet has the fewest parameters and MobileNet-V3-Small has the smallest number of FLOPs. All the involved networks are implemented using Python 3.6.2 (downloaded from www.python.org) with Keras 2.3.1 and TensorFlow 2.0.0. In addition, all the following experiments are conducted on a computer with Ubuntu 16.04, an Intel Xeon Gold 6129 CPU and an Nvidia Tesla V100 GPU with CUDA 10.0 for acceleration. During the training process of every network, the initial learning rate is set as 0.001 so that the training loss decreases quickly. The training loss is monitored. Once there is no improvement in the training loss for 20 epochs, the learning rate is multiplied by 0.3 to help the models find the optimal solutions. The maximum number of epochs is set as 400, and the minimum learning rate is set as $10^{-8}$. Besides this, some images are chosen from the test dataset to constitute the validation dataset for observing the performance of the network. In addition, the early-stopping technique is adopted to avoid the over-fitting problem, which is realized by stopping the training process once the improvement in validation loss is less than $10^{-4}$ for 30 successive epochs.
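The learning-rate schedule and the early-stopping rule described above can be mimicked with a small framework-independent sketch. The class name `TrainingMonitor` and its interface are hypothetical, and it is a simplified re-implementation of the stated rules, not the Keras callbacks used in the experiments. Only the numeric settings (0.001, 0.3, 20 epochs, $10^{-8}$, $10^{-4}$, 30 epochs) come from the text.

```python
class TrainingMonitor:
    """Tracks losses per epoch, decays the learning rate and signals early stopping."""

    def __init__(self, lr=1e-3, factor=0.3, lr_patience=20, min_lr=1e-8,
                 stop_patience=30, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.min_lr = min_lr
        self.stop_patience = stop_patience
        self.min_delta = min_delta
        self.best_train = float("inf")
        self.best_val = float("inf")
        self.lr_wait = 0
        self.stop_wait = 0

    def update(self, train_loss, val_loss):
        """Record one epoch and return True when training should stop."""
        # Learning-rate decay is driven by the training loss
        if train_loss < self.best_train:
            self.best_train = train_loss
            self.lr_wait = 0
        else:
            self.lr_wait += 1
            if self.lr_wait >= self.lr_patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.lr_wait = 0
        # Early stopping is driven by the validation loss
        if self.best_val - val_loss > self.min_delta:
            self.best_val = val_loss
            self.stop_wait = 0
        else:
            self.stop_wait += 1
        return self.stop_wait >= self.stop_patience


if __name__ == "__main__":
    monitor = TrainingMonitor()
    for epoch in range(400):  # maximum number of epochs from the text
        train_loss = val_loss = 1.0 / (epoch + 1)  # placeholder losses
        if monitor.update(train_loss, val_loss):
            break
    print(epoch, monitor.lr)
```

In a training loop, `update` would be called once per epoch with the measured losses, and `monitor.lr` would be passed to the optimizer before the next epoch.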

The Influence of $N_b$
The number of images in one batch, $N_b$, is an important parameter in the calculation of the BSTriplet loss, because it influences the similarity matrix $S$ and further affects the estimation of the training data distribution. Besides this, it has an impact on the stability of the training process. To explore the influence of $N_b$, we have downloaded a dataset of chest X-ray images [41,42] from the Kaggle [43] website. This dataset is denoted as "Chest-1" for brevity in the following. For this dataset, the X-ray images are classified into three classes: normal, (regular) pneumonia, and COVID-19. Several examples are shown in Figure 4. Here, these images are cropped to squares for better clarity. All the images are resized to 128 × 128 to be input into the network, and the construction of the Chest-1 dataset is shown in Table 2. According to the data mining strategy described above, we cluster the images in the Chest-1 dataset into six groups by K-means [44], with two groups for each class. Based on the ratios of the groups, we have sampled each group to build input batches of different sizes, $N_b \in \{6, 12, 18, 24, 30, 36\}$. MobileNet-V3-Small is used as the base framework and is trained by CE loss combined with the proposed BSTriplet loss. Experiments have been performed with various $N_b$, while the other hyper-parameters are fixed according to the method of controlling variables. The accuracy (ACC) is adopted as the metric to evaluate the performance; it is computed from $TP_i$ and $TN_i$, the numbers of true positive and true negative cases for the $i$-th class, respectively, and $N$, the number of test images.
All the obtained ACC values for different values of $N_b$ are shown in Figure 5. From Figure 5, we can see that the accuracy (94.95%) achieved using $N_b = 36$ is the highest, and the accuracy (93.31%) achieved using $N_b = 6$ is the lowest. Overall, the accuracy has a positive correlation with $N_b$, which is in line with the supposition that a bigger $N_b$ will ensure that each input batch can represent the data distribution of the training dataset more precisely. Moreover, the curve converges quickly. The reason for this is that the CE loss provides a base accuracy, and it increases the stability of the curve. Furthermore, our data mining technique ensures the diversity of images in the batch. Therefore, a batch with a small $N_b$ has a similar data distribution to that with a big $N_b$. Considering that a bigger $N_b$ will lead to insufficient iteration times for network training in one epoch, we will use $N_b = 36$ as the default setting for the rest of the experiments.

Table 2. Construction of the Chest-1 dataset. Total: 1583, 4273 and 1036 images in the three classes; 6892 images overall.

Clustering Effect of the BSTriplet Loss
To display the effect of the BSTriplet loss in an intuitive way, principal component analysis (PCA) was applied to reduce the dimension of the embedding vectors for visualization. Figure 6 compares the distributions of the embedding vectors of the training images produced by two MobileNet-V3-Small models, each trained for 10 epochs with a batch size of N_b = 36 but with a different loss function. All the obtained embedding vectors are reduced from 1280 dimensions to 2 through PCA. Figure 6a is obtained from the model trained with only the CE loss, while Figure 6b is obtained from the model whose training also involves the BSTriplet loss. The comparison shows that the BSTriplet loss increases the inter-class distance and decreases the intra-class distance. In other words, it helps to improve the classification ability of MobileNet-V3-Small.
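The dimensionality reduction used for Figure 6 can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `pca_project` is hypothetical, the random matrix is only a stand-in for real 1280-dimensional embedding vectors, and the text does not say which PCA implementation was used.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project row-wise embedding vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in data (assumption): 300 embedding vectors of dimension 1280.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 1280))
points_2d = pca_project(embeddings)  # shape (300, 2), ready for a scatter plot
```

Plotting `points_2d` colored by class label gives the kind of scatter plot described above, where tighter and better separated clusters indicate smaller intra-class and larger inter-class distances.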

Effect of the Data Mining Strategy
To assess the effect of the data mining strategy, we trained MobileNet-V3-Small with several schemes. The training schemes use either random selection (RS) or the proposed data mining (DM) strategy to construct input batches, and use either CE or "CE+BST" as the loss function, where the latter denotes CE combined with the BSTriplet loss. The training and validation loss curves for each training scheme are given in Figure 7, and the obtained ACC values are shown in Table 3. In Figure 7, "RS" and "DM" refer to the schemes using the random selection and the proposed data mining strategy, respectively. From Figure 7a, we can see that the loss curves of the DM scheme are much smoother than those of the RS scheme. This indicates that the network is easier to train using the data mining strategy. Besides this, for both loss functions, the proposed strategy reduces the gap between the training loss and the validation loss, which means that it can alleviate the over-fitting problem.
From Table 3, it can be seen that the data mining strategy used in the CE-based model improves the accuracy by 1.21% compared with random selection. For the model based on CE+BST loss, the accuracy is improved by 0.90%. The results show that the proposed data mining strategy is helpful for training CNN models. Here, the strategy is more helpful for the CE-based model than for the CE+BST-based one. The reason is that the BSTriplet loss can itself evaluate the data distribution of the training dataset to a certain extent, while the CE function lacks this ability.
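The DM scheme compared above builds each batch from groups of similar training images. A rough sketch of that idea follows, assuming K-means grouping and an equal number of random draws per group (six groups of six give N_b = 36). The helper names `kmeans_labels` and `build_batch`, the random features and the plain Lloyd's K-means are illustrative assumptions. The exact features and steps of the proposed strategy are those of Section 2.3 and are not reproduced here.

```python
import numpy as np

def kmeans_labels(features, n_groups, n_iters=20, seed=0):
    """Plain Lloyd's K-means; returns one group index per row of `features`."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=n_groups, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every sample to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for g in range(n_groups):
            members = features[labels == g]
            if len(members) > 0:  # keep the old center if a group becomes empty
                centers[g] = members.mean(axis=0)
    return labels

def build_batch(labels, per_group, rng):
    """Draw `per_group` sample indices from every group to form one batch."""
    batch = []
    for g in np.unique(labels):
        members = np.flatnonzero(labels == g)
        # Sample with replacement only if a group is smaller than `per_group`.
        picked = rng.choice(members, size=per_group, replace=len(members) < per_group)
        batch.extend(picked.tolist())
    return np.array(batch)

# Stand-in image features (assumption): 600 samples with 16 features each.
rng = np.random.default_rng(1)
features = rng.normal(size=(600, 16))
labels = kmeans_labels(features, n_groups=6)
batch = build_batch(labels, per_group=6, rng=rng)  # indices of one input batch
```

Drawing the same number of samples from every group keeps each batch diverse, which is one plausible reading of why the DM loss curves in Figure 7a are smoother than the RS ones.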

Applicability of the BSTriplet Loss
Moreover, we have compared the performance of the mentioned models trained with and without the BSTriplet loss to examine its applicability and effect. Each compared network is trained with the CE loss and with the CE+BST loss. The training loss and validation loss are shown in Figure 8. By comparing the four networks trained with CE loss, we can see that the validation loss of EfficientNet-B1 and MobileNet-V3-Small rises as the number of iterations increases. This observation means that these two networks suffer from a more severe over-fitting problem during training than the others. When these models are trained with CE+BST loss, the over-fitting problem is suppressed. Moreover, the gap between the training loss and the validation loss is smaller for CE+BST loss than for CE loss in EfficientNet-B1, MobileNet-V3-Small and ShuffleNet-V2.
As for PeleeNet, it has a relatively small over-fitting problem, which benefits from the fact that it has the lowest number of weights. Overall, the BSTriplet loss can suppress the over-fitting problem, which suggests that it can play the role of a regularization term for the CE loss.
To verify the effectiveness of the proposed combined loss, we have compared it with CE combined with the triplet loss [45] and CE combined with the improved lifted structure loss [46]. We test all these loss functions with the four compared networks on the Chest-1 dataset. Since Chest-1 provides a multi-class classification task, the average sensitivity SEN and the average specificity SPE are used as metrics for evaluating the performance of the networks. SEN and SPE are calculated as

$$\mathrm{SEN} = \frac{1}{N_c}\sum_{i=1}^{N_c}\mathrm{SEN}_i = \frac{1}{N_c}\sum_{i=1}^{N_c}\frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{SPE} = \frac{1}{N_c}\sum_{i=1}^{N_c}\mathrm{SPE}_i = \frac{1}{N_c}\sum_{i=1}^{N_c}\frac{TN_i}{TN_i + FP_i},$$

where N_c is the number of classes; SEN_i and SPE_i respectively represent the sensitivity and specificity of the i-th class, which show the ability of the classifier to correctly find real positive cases and negative ones for the target disease; TP_i and TN_i denote the number of true positive and true negative cases for the i-th class, and FP_i and FN_i denote the number of false positive and false negative cases for the i-th class, respectively. Besides this, the ACC and the area under the curve (AUC) of the receiver operating characteristic (ROC) are also employed to assess the classification ability of the trained models.

The results are listed in Table 4, where "Triplet", "LS" and "BST" represent the regular triplet loss, the improved lifted structure loss and the proposed BSTriplet loss, respectively. For each evaluated model with different losses, the best value for every metric is indicated in bold in Table 4. Clearly, ShuffleNet-V2 provides the highest accuracy among all the compared networks for each of the four losses, which demonstrates the superiority of its structure. For each network, both the LS loss and the BSTriplet loss improve the accuracy compared with the CE loss alone. By comparison, the regular triplet loss causes ACC to decrease from 92.70% to 92.24% for MobileNet-V3-Small, and from 92.93% to 90.06% for PeleeNet. The reason is that the regular triplet loss identifies each input sample only according to one positive pair and one negative pair, which easily leads to an unstable training process and could make it difficult to search for the optimal solution. Furthermore, among all the compared losses, the CE+BST loss guides EfficientNet-B1, MobileNet-V3-Small and ShuffleNet-V2 to the highest accuracy, sensitivity, specificity and AUC values. As for PeleeNet, the proposed CE+BST loss achieves the second highest AUC value (0.9841) and specificity (96.21%). Overall, the BSTriplet loss is helpful for training light-weighted CNN models when combined with CE loss, and it is more effective than the regular triplet loss and the lifted structure loss.

To intuitively show the superiority of the proposed BSTriplet loss, the confusion matrices of the four networks trained with CE loss and CE+BST loss are given in Figure 9. It can be seen that, for all the compared models, the normal images are relatively more easily misclassified as pneumonia images or COVID-19 images. As a result, with the proposed loss, the normal class has the lowest average recognition rate (0.905) over all trained models, whereas the average recognition rates for pneumonia images and COVID-19 images are 0.949 and 0.944, respectively. In addition, EfficientNet-B1, ShuffleNet-V2 and PeleeNet achieve higher recognition rates for every class when trained with CE+BST loss than when trained with CE loss. As for the MobileNet-V3-Small model, the proposed CE+BST loss causes the recognition rate for normal images to decrease from 0.928 to 0.909, but it gives a much better ability to distinguish the images of the other categories. Moreover, the recognition rate for COVID-19 images is improved the most among the three classes when the BSTriplet loss is used, which indicates that it can alleviate the problem of data imbalance.

Figure 9. Confusion matrices of the compared networks trained with CE loss or CE+BST loss on the Chest-1 dataset. "N", "P" and "C" denote normal images, pneumonia images and COVID-19 images, respectively.
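The class-averaged sensitivity and specificity reported in Table 4 can be computed from a confusion matrix such as those in Figure 9. The sketch below assumes the standard one-vs-rest definitions of sensitivity and specificity. The function name `average_sen_spe` is hypothetical and the counts are made up for illustration. They are not values from the paper.

```python
import numpy as np

def average_sen_spe(confusion):
    """Class-averaged sensitivity and specificity.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    tp = np.diag(confusion)
    fn = confusion.sum(axis=1) - tp  # class-i samples predicted as another class
    fp = confusion.sum(axis=0) - tp  # other-class samples predicted as class i
    tn = total - tp - fn - fp
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return float(sen.mean()), float(spe.mean())

# Made-up counts for a three-class problem (rows: true class, columns: prediction).
toy = np.array([[90, 6, 4],
                [5, 190, 5],
                [2, 3, 45]])
sen, spe = average_sen_spe(toy)
```

Because every class contributes equally to the average, a small class weighs as much as a large one, which makes these two metrics informative on imbalanced datasets.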
To further verify the consistency of the proposed BSTriplet loss, we have carried out comparative experiments on another dataset of chest X-ray images, denoted as "Chest-2". This dataset is also downloaded from the Kaggle website [43]. Here, Chest-2 is used for the classification of lung images into normal lung images and pneumonia images. Some examples are shown in Figure 10. The construction of Chest-2 is listed in Table 5. Each of the four light-weighted networks is trained with the four kinds of losses on Chest-2. As in the experiments on Chest-1, all the images in Chest-2 are resized to 128 × 128. We have performed the K-means algorithm to partition the training images into six groups, and randomly selected six samples from each group to build the input batch for the proposed CE+BST loss.
The results for the classification of Chest-2 are shown in Table 6. From Table 6, we can see that ShuffleNet-V2 still has the best performance because it achieves the highest average SEN (97.14%), average SPE (80.56%), average ACC (90.91%) and average AUC (0.9569) over the four kinds of losses. Besides this, the CE+Triplet loss helps ShuffleNet-V2 to provide the highest ACC (92.47%), which is 3.21% higher than that of ShuffleNet-V2 trained with CE loss. Nevertheless, CE+Triplet loss achieves lower ACC and AUC than CE loss when they are applied to PeleeNet, which reveals that the effectiveness of CE+Triplet loss cannot be guaranteed across networks. In comparison, both CE+LS and our CE+BST loss have better consistency, which benefits from the analysis of all the pairs formed by the samples in the batch. Furthermore, the proposed CE+BST loss surpasses the CE+LS loss in terms of average values of SEN, SPE, ACC and AUC over the four evaluated networks by 0.26%, 7.27%, 2.89% and 0.012%, respectively. The reason why BSTriplet outperforms LS is twofold: the former adopts similarity instead of Euclidean distance, so that it has a clear upper bound, and the value of the BSTriplet loss is close to that of the CE loss, so that it coordinates with the CE loss better than the LS loss does. We have also provided the ROC curves of all compared methods in Figure 11. The closer a ROC curve is to the upper left corner, the better the classification ability of the corresponding classifier. For ShuffleNet-V2, all three combined losses show similar performances, and they all surpass the performance of CE loss. As for the rest of the networks, our CE+BST loss provides the most significant improvement in classification performance for each network, especially when it is used in PeleeNet. In general, the proposed BSTriplet loss is more suitable for assisting CE loss in CNN training, and it outperforms the other compared DML losses.
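The AUC values that summarize ROC curves like those in Figure 11 can be computed without tracing the curve, using the known equivalence between the AUC and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below illustrates this for a binary task. The function name `roc_auc` is hypothetical, the labels and scores are made up, and the paper does not state how its AUC values were computed.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Made-up predictions: 1 marks the positive class, scores are classifier outputs.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

A curve that hugs the upper left corner corresponds to an AUC close to 1, while a classifier that ranks at random scores about 0.5.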
To further validate the effectiveness of the proposed BSTriplet loss, some experiments have been conducted on a skin rash image dataset, which is used to distinguish the images showing Lyme disease from those showing other diseases [47]. The composition of the rash image dataset is listed in Table 7. Figure 12 shows some images in the rash image dataset. It can be seen that the images in this dataset are color optical images, which differ considerably from the above X-ray images. Besides this, the number of images in the rash image dataset is evidently smaller than that in the above datasets. To alleviate the over-fitting problem, we have augmented the training images by rotation by 90°, 180° and 270°, horizontal/vertical flipping, and horizontal/vertical translation by five pixels, so that the number of training images is enlarged seven-fold. No augmentation is applied to the test images. We have clustered the images into four groups, and built the input batches for the proposed CE+BST loss according to the steps in Section 2.3.
The results for each evaluated model are given in Table 8. It can be seen that almost all the metrics are lower than those in the above experiments. The reason for this is that the rash images contain varied backgrounds and the training images are insufficient. Nevertheless, when the loss involves the DML methods, each CNN gains higher accuracy than the same network trained with CE loss alone, except that CE+LS loss provides the same ACC (75.86%) for MobileNet-V3-Small. MobileNet-V3-Small provides the highest ACC (83.91%) and AUC (0.8755) when trained with the proposed CE+BST loss, and it produces the second best ACC (82.76%) and AUC (0.8423) when it is trained with CE+Triplet loss. Both ShuffleNet-V2 and PeleeNet gain the second highest ACC (80.46%) when they are trained with our CE+BST loss. A comparison among the four results obtained by PeleeNet shows that CE+LS loss gives the highest SEN but the lowest SPE, while CE+Triplet loss gives the opposite results. In comparison, our CE+BST loss provides both the second highest SEN (77.78%) and SPE (82.35%), as well as the highest ACC (80.46%), for PeleeNet. Besides this, if the results for the CE loss are used as the baseline for each model, the average improvement of each combined loss over the four models can be calculated. The proposed CE+BST loss gains an advantage over CE+Triplet loss and CE+LS loss by providing the highest average improvements in SEN, SPE, ACC and AUC, namely 9.03%, 5.39%, 6.90% and 0.1063, respectively.
The ROC curves for every tested model on the rash image dataset are shown in Figure 13. It is clear that these curves are not as smooth as those in Figure 11 due to the small number of test images. Almost all the combined losses surpass CE loss for each compared CNN, especially for MobileNet-V3-Small. Only CE+Triplet is inferior to CE loss when applied to PeleeNet, which reveals its instability again. In addition, the curves of the proposed CE+BST loss are much closer to the top left corner in each subfigure than the rest, which demonstrates its effectiveness for a small dataset and its adaptability to different networks.
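The augmentation applied to the rash training images (rotation by 90°, 180° and 270°, horizontal/vertical flipping, and horizontal/vertical translation by five pixels) can be sketched with NumPy as follows. The helper names `translate` and `augment`, the shift direction and the zero filling of uncovered pixels are assumptions made for illustration. The text does not specify how border pixels are handled.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift an H x W (x C) image by (dy, dx) pixels, zero-filling uncovered areas."""
    shifted = np.zeros_like(image)
    h, w = image.shape[:2]
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

def augment(image, shift=5):
    """Return the seven augmented variants of one image."""
    return [
        np.rot90(image, k=1),        # rotation by 90 degrees
        np.rot90(image, k=2),        # rotation by 180 degrees
        np.rot90(image, k=3),        # rotation by 270 degrees
        np.flip(image, axis=1),      # horizontal flip
        np.flip(image, axis=0),      # vertical flip
        translate(image, 0, shift),  # horizontal translation
        translate(image, shift, 0),  # vertical translation
    ]

# Stand-in image (assumption): a 128 x 128 RGB array.
image = np.arange(128 * 128 * 3).reshape(128, 128, 3)
variants = augment(image)
```

Each training image yields seven variants here, which matches the seven-fold enlargement mentioned in the text. The test images are left untouched.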
To further verify the applicability of the BSTriplet loss to other medical image modalities, additional experiments have been done on an osteosarcoma histology image dataset [48][49][50], which can be accessed from The Cancer Imaging Archive (TCIA) [51]. There are three kinds of images in the osteosarcoma histology image dataset, as shown in Figure 14. The histology images are color images, and their backgrounds are simpler than those of the skin rash images. This is a small-sample dataset, and its construction is listed in Table 9. The augmentation for this dataset is the same as that for the rash images. We have clustered each kind of image into two groups for CE+BST loss according to the proposed data mining strategy. Using the accuracy ACC, average sensitivity SEN, average specificity SPE and AUC as metrics, we have tested MobileNet-V3-Small trained with several different loss functions. The results are provided in Table 10 and Figure 15.

Table 9. Construction of the osteosarcoma histology image dataset.
            Non-Tumor   Necrotic Tumor   Viable Tumor   Total
Training    429         210              276            915
Testing     107         53               69             229
Total       536         263              345            1144

Table 10 shows that MobileNet-V3-Small trained with CE+BST loss provides the best performance in terms of all the metrics. In particular, the accuracy provided by CE+BST loss is 2.18% higher than the second best ACC, which is produced by CE+LS loss. Besides this, the regular triplet loss and the LS loss help the CE loss to obtain better SEN, SPE and ACC values, although their AUC values are smaller than that of CE loss alone. In contrast, the BSTriplet loss obtains the best AUC value in Table 10 and the best ROC curve in Figure 15. The results demonstrate that, compared with the LS loss and the regular triplet loss, our proposed BSTriplet loss has a better ability to assist CE loss in improving the performance of the CNNs on small sample datasets.

Conclusions
In this paper, a novel batch similarity-based triplet loss is proposed for light-weighted CNNs in medical image classification. The proposed loss takes the similarities among all the samples in the input batch into account in order to gather samples of the same class and separate those of different classes. It can be easily assembled into existing CNNs, and it assists the cross entropy loss in training the CNNs by alleviating the over-fitting problem. A data mining technique is also provided, which helps to build input batches according to the distribution of the training data. Several experiments have been carried out on medical image datasets such as chest X-ray images and rash images. The results show the superiority and consistency of the proposed loss combined with cross entropy loss compared to other combined losses in terms of sensitivity, specificity, accuracy and AUC. Our future work will focus on optimizing the computational efficiency of the proposed loss, combining it with other losses, and testing it on other image datasets, so that the training process of CNNs becomes more stable and the over-fitting problem can be addressed more effectively.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.