An Asymmetric Contrastive Loss for Handling Imbalanced Datasets

Contrastive learning is a representation learning method performed by contrasting a sample to other similar samples so that they are brought closely together, forming clusters in the feature space. The learning process is typically conducted using a two-stage training architecture, and it utilizes the contrastive loss (CL) for its feature learning. Contrastive learning has been shown to be quite successful in handling imbalanced datasets, in which some classes are overrepresented while some others are underrepresented. However, previous studies have not specifically modified CL for imbalanced datasets. In this work, we introduce an asymmetric version of CL, referred to as ACL, in order to directly address the problem of class imbalance. In addition, we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL). The results on the imbalanced FMNIST and ISIC 2018 datasets show that the AFCL is capable of outperforming the CL and FCL in terms of both weighted and unweighted classification accuracies.


Introduction
Class imbalance is a major obstacle occurring within a dataset when certain classes in the dataset are overrepresented (referred to as majority classes), while some are underrepresented (referred to as minority classes). This can be problematic for a large number of classification models. A deep learning model such as a convolutional neural network (CNN) might not be able to properly learn from the minority classes. Consequently, the model would be less likely to correctly identify minority samples as they occur. This is especially crucial in medical imaging, since a model that cannot identify rare diseases would not be effective for diagnostic purposes. For example, the ISIC 2018 dataset [1,2] is an imbalanced medical dataset that consists of images of skin lesions that appear in various frequencies during screening.
To produce a less imbalanced dataset, it is possible to resample the dataset by either increasing the number of minority samples [3][4][5][6] or decreasing the number of majority samples [7][8][9][10]. Other methods for handling class imbalance include substituting the standard cross-entropy (CE) loss for a more suitable loss, such as the focal loss (FL). Lin et al. [11] modified the CE loss into FL so that minority classes can be prioritized. This is done by ensuring that the model focuses on samples that are harder to classify during model training. Recent studies also unveiled the potential of contrastive learning as a way to combat imbalanced datasets [12][13][14][15].
Contrastive learning is performed by contrasting a sample (called an anchor) to other similar samples (called positive samples) so that they are mapped closely together in the feature space. As a consequence, dissimilar samples (called negative samples) are pushed away from the anchor, forming clusters in the feature space based on similarity. In this research, contrastive learning is done using a two-stage training architecture, which utilizes the contrastive loss (CL) formulated by Khosla et al. [16]. This formulation of CL is supervised, and it can contrast the anchor to multiple positive samples belonging to the same class. This is unlike self-supervised contrastive learning [17][18][19][20], which contrasts the anchor to only one positive sample in the mini-batch.
In this work, we propose a modification of supervised CL that is referred to as the asymmetric contrastive loss (ACL). Unlike CL, the ACL is able to directly contrast the anchor to its negative samples so that they are pushed apart in the feature space. This becomes important when a rare sample has no other positive samples in the mini-batch. To our knowledge, we are the first to modify the supervised version of CL in order to address class imbalance, effectively augmenting several studies performed previously in [12,13]. The proposed ACL is aimed at improving the effectiveness of the two-stage architecture originally presented in [12,13], especially in the feature learning aspect. In addition, the ACL is designed as a generalization of CL, and thus, it provides more flexibility and tuning opportunities as a loss function.
We also consider the asymmetric variant of the focal contrastive loss (FCL) [21], which is called the asymmetric focal contrastive loss (AFCL). Using FMNIST and ISIC 2018 as datasets, experiments were performed to test the performance of both the ACL and AFCL in binary classification tasks. It was observed that the AFCL was superior to the CL and FCL in multiple class imbalance scenarios, provided that suitable hyperparameters were used. In addition, this work provides a streamlined survey of the literature related to entropy and loss functions.

Related Work
Several studies have been conducted in recent years on the application of contrastive losses to imbalanced datasets. On Siamese networks, for example, Wang et al. [14] and Alenezi et al. [15] proposed the novel focal CL and W-shaped CL, respectively. Their methods managed to achieve state-of-the-art performance in handling the class imbalance problem, wherein Wang et al. used satellite images and Alenezi et al. used skin lesion images as datasets. Their CL functions had a different form from that of the supervised CL of Khosla et al. [16], which is the CL upon which our study is based.
Marrakchi et al. [12] and Chen et al. [13] independently adopted supervised CL to combat class imbalance in the medical domain. They both used a two-stage architecture consisting of (1) feature learning using CL, followed by (2) fine-tuning using classification loss. Their architectures were almost identical; they differed only in the type of loss function during fine-tuning (Marrakchi et al. used cross-entropy loss, while Chen et al. used focal loss). One limitation present in these studies was that CL was not modified further to deal with imbalance and was implemented as is. Therefore, our aim is to generalize CL in order to effectively learn from imbalanced datasets using the aforementioned two-stage architecture.
In this paper, we present a novel CL referred to as the ACL, and we include its focal-based variant, AFCL. Our motivation for introducing the losses comes from both the asymmetric loss due to Ben-Baruch et al. [22] and the focal contrastive loss due to Zhang et al. [21], whose explanations are provided in Section 3. Although these losses were proposed for different applications (multi-label classification and fine-tuning, respectively), it turns out that these ideas can be applied to our goal of modifying CL so as to handle imbalance.

Background on Entropy and Loss Functions
In this section, we provide a literature review on the basics of information theory and loss functions for easy reference.

Entropy, Information, and Divergence
Introduced by Shannon [23], entropy provides a measure of the amount of information contained in a random variable, usually in bits. The entropy H(X) of a random variable X is given by the formula
$$H(X) = \mathbb{E}_{P_X}\left[-\log P_X(X)\right]. \tag{1}$$
Given two random variables X and Y, their joint entropy H(X, Y) is the entropy of the joint random variable (X, Y):
$$H(X, Y) = \mathbb{E}_{P_{(X,Y)}}\left[-\log P_{(X,Y)}(X, Y)\right]. \tag{2}$$
In addition, the conditional entropy H(Y | X) is defined as
$$H(Y \mid X) = \mathbb{E}_{P_{(X,Y)}}\left[-\log P_{Y \mid X}(Y \mid X)\right]. \tag{3}$$
Conditional entropy is used to measure the average amount of information contained in Y when the value of X is given. Conditional entropy is bounded above by the original entropy; that is, H(Y | X) ≤ H(Y), with equality if and only if X and Y are independent [24]. The formulas for entropy, joint entropy, and conditional entropy can be derived via an axiomatic approach [25,26].
The mutual information I(X; Y) is a measure of dependence between random variables X and Y [27]. It provides the amount of information about one random variable provided by the other random variable, and it is defined by
$$I(X; Y) = \mathbb{E}_{P_{(X,Y)}}\left[\log \frac{P_{(X,Y)}(X, Y)}{P_X(X)\, P_Y(Y)}\right]. \tag{4}$$
Mutual information is symmetric. In other words, I(X; Y) = I(Y; X). Mutual information is also nonnegative (I(X; Y) ≥ 0), and I(X; Y) = 0 if and only if X and Y are independent [24].
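The dissimilarity between random variables $X$ and $X'$ on the same space $\mathcal{X}$ can be measured using the notion of KL-divergence:
$$D_{KL}(X \,\|\, X') = \mathbb{E}_{P_X}\left[\log \frac{P_X(X)}{P_{X'}(X)}\right]. \tag{5}$$
Similarly to mutual information, KL-divergence is nonnegative ($D_{KL}(X \,\|\, X') \geq 0$), and $D_{KL}(X \,\|\, X') = 0$ if and only if $X$ and $X'$ have the same distribution [24]. Unlike mutual information, KL-divergence is asymmetric, so $D_{KL}(X \,\|\, X')$ and $D_{KL}(X' \,\|\, X)$ are not necessarily equal.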
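As a small numerical illustration of the quantities in this subsection, the following sketch computes entropy, conditional entropy, mutual information, and KL-divergence for discrete distributions. The joint table, the two distributions passed to the KL-divergence, the helper names, and the choice of base-2 logarithms (so that results are in bits) are assumptions made for illustration only.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    return -sum(a * math.log2(a) for a in p if a > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution of two dependent binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy([p for row in joint for p in row])
h_y_given_x = h_xy - h_x          # chain rule: H(Y | X) = H(X, Y) - H(X)
mi = mutual_information(joint)

# KL-divergence is not symmetric in its arguments.
kl_pq = kl_divergence([0.9, 0.1], [0.5, 0.5])
kl_qp = kl_divergence([0.5, 0.5], [0.9, 0.1])

print(f"H(Y)={h_y:.4f}  H(Y|X)={h_y_given_x:.4f}  I(X;Y)={mi:.4f}")
print(f"D_KL(p||q)={kl_pq:.4f}  D_KL(q||p)={kl_qp:.4f}")
```

In this example, conditioning on X lowers the entropy of Y, and the reduction equals the mutual information, which matches the properties stated above.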

Cross-Entropy and Focal Loss
Given random variables $X$ and $\hat{X}$ on the same space $\mathcal{X}$, their cross-entropy $H(X; \hat{X})$ is defined as [28]:
$$H(X; \hat{X}) = \mathbb{E}_{P_X}\left[-\log P_{\hat{X}}(X)\right]. \tag{6}$$
Cross-entropy is the average number of bits needed to encode the true distribution $X$ when its estimate $\hat{X}$ is provided [29]. A small value of $H(X; \hat{X})$ implies that $\hat{X}$ is a good estimate for $X$. Cross-entropy is connected to KL-divergence via the following identity:
$$H(X; \hat{X}) = H(X) + D_{KL}(X \,\|\, \hat{X}). \tag{7}$$
When $\hat{X} = X$, the equality $H(X; \hat{X}) = H(X)$ holds. Now, the cross-entropy loss and focal loss are provided within the context of a binary classification task consisting of two classes labeled 0 and 1. Suppose that y ∈ {0, 1} denotes the ground-truth class and p ∈ [0, 1] denotes the estimated probability for the class labeled 1. The value of 1 − p is then the estimated probability for the class labeled 0. The cross-entropy (CE) loss is given by
$$L_{CE} = -y \log p - (1 - y) \log(1 - p). \tag{8}$$
If y = 1, then the loss $L_{CE}$ is zero when p = 1. On the other hand, if y = 0, then the loss is zero when 1 − p = 1. In either case, the CE loss is minimized when the estimated probability of the true class is maximized, which is the desired property of a good classification model.
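The relation between cross-entropy, entropy, and KL-divergence can be checked numerically. The sketch below does so for two made-up distributions on a three-element space; the distributions and helper names are illustrative assumptions, and base-2 logarithms are used so that all values are in bits.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)

def cross_entropy(p, q):
    """H(p; q) = E_p[-log2 q]; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(a * math.log2(b) for a, b in zip(p, q) if a > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p_true = [0.7, 0.2, 0.1]   # hypothetical true distribution
p_est = [0.5, 0.3, 0.2]    # hypothetical estimate of it

lhs = cross_entropy(p_true, p_est)
rhs = entropy(p_true) + kl_divergence(p_true, p_est)

print(f"H(X; X_hat) = {lhs:.6f}")
print(f"H(X) + D_KL = {rhs:.6f}")
```

Because the KL-divergence is nonnegative, the cross-entropy is never smaller than the entropy of the true distribution, and the two coincide when the estimate is exact.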
The focal loss (FL) [11] is a modification of the CE loss introduced to put more focus on hard-to-classify examples. It is given by the following formula:
$$L_{foc} = -y (1 - p)^{\gamma} \log p - (1 - y)\, p^{\gamma} \log(1 - p). \tag{9}$$
The parameter γ ≥ 0 in $L_{foc}$ is known as the focusing parameter; setting γ = 0 recovers the CE loss. Choosing a larger value of γ would push the model to focus on training from the misclassified examples. For instance, suppose that γ = 4 and denote the estimated probability of the true class by $p_t$. The graph in Figure 1 shows that when $p_t > 0.5$, the FL is quite small. Hence, the model would be less concerned about learning from an example when $p_t$ is already sufficiently large. FL is a useful choice when class imbalance exists, as it can help the model focus on the less represented samples within the dataset.
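The effect of the focusing parameter can be seen in a few lines of code. The sketch below writes both losses in terms of the true-class probability, which is equivalent to the two-term binary form for y ∈ {0, 1}; the function names and the probabilities 0.9 and 0.1 are illustrative assumptions, and the true-class probability is assumed to be strictly positive.

```python
import math

def true_class_probability(y, p):
    """p_t: estimated probability of the ground-truth class."""
    return p if y == 1 else 1.0 - p

def ce_loss(y, p):
    """Binary cross-entropy loss, written via p_t (requires p_t > 0)."""
    return -math.log(true_class_probability(y, p))

def focal_loss(y, p, gamma):
    """Binary focal loss with focusing parameter gamma >= 0 (requires p_t > 0)."""
    p_t = true_class_probability(y, p)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A well-classified example (p_t = 0.9) versus a misclassified one (p_t = 0.1).
easy = focal_loss(1, 0.9, gamma=4)
hard = focal_loss(1, 0.1, gamma=4)

print(f"CE  easy={ce_loss(1, 0.9):.5f}  hard={ce_loss(1, 0.1):.5f}")
print(f"FL  easy={easy:.5f}  hard={hard:.5f}")
```

With γ = 4, the well-classified example is down-weighted by a factor of (1 − 0.9)^4, so it contributes almost nothing, while the misclassified example keeps most of its CE loss.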

Asymmetric Loss
For multi-label classification with K labels, let $y_i \in \{0, 1\}$ be the ground truth for class i and let $p_i \in [0, 1]$ be its estimated probability obtained by the model. The aggregate classification loss is then
$$L = \sum_{i=1}^{K} \left(-y_i L^+_i - (1 - y_i) L^-_i\right). \tag{10}$$
If FL is the chosen type of loss, $L^+_i$ and $L^-_i$ are set as follows:
$$L^+_i = (1 - p_i)^{\gamma} \log p_i, \qquad L^-_i = p_i^{\gamma} \log(1 - p_i). \tag{11}$$
In a typical multi-label dataset, the ground truth $y_i$ has value 0 for the majority of classes i. Consequently, the negative terms $L^-_i$ dominate in the calculation of the aggregate loss L. Asymmetric loss (ASL) [22] is a proposed solution to this problem. ASL emphasizes the contribution of the positive terms by modifying the losses of (11) to
$$L^+_i = (1 - p_i)^{\gamma^+} \log p_i \tag{12}$$
and
$$L^-_i = \left(p_i^{(m)}\right)^{\gamma^-} \log\left(1 - p_i^{(m)}\right), \tag{13}$$
where $\gamma^+, \gamma^-$ are hyperparameters and $p_i^{(m)}$ is the shifted probability of $p_i$ obtained from the probability margin m ≥ 0 via the formula
$$p_i^{(m)} = \max(p_i - m, 0). \tag{14}$$
This shift helps decrease the contribution of $L^-_i$. Indeed, if we set m = 1, then $L^-_i = 0$.
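The following sketch illustrates how the probability margin reduces the negative terms of ASL. It assumes the standard ASL form with shifted probability max(p − m, 0); the function names, the label vector, the probabilities, and the hyperparameter values are hypothetical choices for illustration, and each probability is assumed to lie strictly between 0 and 1.

```python
import math

def shifted_probability(p, m):
    """Shifted probability p^(m) = max(p - m, 0) for margin m >= 0."""
    return max(p - m, 0.0)

def asl_terms(p, gamma_pos, gamma_neg, m):
    """Return (L_plus, L_minus) of ASL for one label with probability 0 < p < 1."""
    l_pos = (1.0 - p) ** gamma_pos * math.log(p)
    p_m = shifted_probability(p, m)
    l_neg = p_m ** gamma_neg * math.log(1.0 - p_m)
    return l_pos, l_neg

def asymmetric_loss(y, p, gamma_pos, gamma_neg, m):
    """Aggregate loss over K labels: sum of -y_i L+_i - (1 - y_i) L-_i."""
    total = 0.0
    for y_i, p_i in zip(y, p):
        l_pos, l_neg = asl_terms(p_i, gamma_pos, gamma_neg, m)
        total += -y_i * l_pos - (1 - y_i) * l_neg
    return total

# Hypothetical multi-label example: one positive label, three negative labels.
y = [1, 0, 0, 0]
p = [0.8, 0.3, 0.1, 0.6]

no_margin = asymmetric_loss(y, p, gamma_pos=1, gamma_neg=4, m=0.0)
with_margin = asymmetric_loss(y, p, gamma_pos=1, gamma_neg=4, m=0.2)
print(f"loss without margin: {no_margin:.5f}")
print(f"loss with m = 0.2:   {with_margin:.5f}")
```

Setting the margin to zero and using equal focusing parameters recovers the focal-loss terms, while a margin of one removes the negative terms entirely.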

Contrastive Loss
Contrastive learning is a method for learning representations from data. A supervised approach to contrastive learning was introduced by Khosla et al. [16] to learn from a set of sample-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$. The samples $x_i$ are fed through a feature encoder Enc(·) and a projection head Proj(·) in succession to obtain features $z_i = \mathrm{Proj}(\mathrm{Enc}(x_i))$. The feature encoder extracts features from $x_i$, whereas the projection head projects the features into a lower dimension and applies $\ell_2$-normalization so that $z_i$ lies on the unit hypersphere. In other words, $\|z_i\|_2 = 1$.
A pair $(z_i, z_j)$, where $i \neq j$, is referred to as a positive pair if the features share the same class label ($y_i = y_j$), and it is a negative pair if the features have different class labels ($y_i \neq y_j$). Contrastive learning aims to maximize the similarity between $z_i$ and $z_j$ whenever they form a positive pair and minimize their similarity whenever they form a negative pair. This similarity is measured with cosine similarity [29]:
$$z_i \cdot z_j = \|z_i\|_2 \, \|z_j\|_2 \cos\theta_{ij} = \cos\theta_{ij}, \tag{15}$$
where $\theta_{ij}$ denotes the angle between $z_i$ and $z_j$. From the above equation, $-1 \leq z_i \cdot z_j \leq 1$.
Fixing $z_i$ as the anchor, let $A_i = \{z_k \mid k \neq i\}$ be the set of features other than $z_i$ and let $P_i = \{z_k \in A_i \mid y_k = y_i\}$ be the set of $z_k$ such that $(z_i, z_k)$ is a positive pair. The predicted probability $p_{ij}$ that $z_i$ and $z_j$ belong to the same class is obtained by applying the softmax function to the set of similarities between $z_i$ and $z_k \in A_i$:
$$p_{ij} = \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{z_k \in A_i} \exp(z_i \cdot z_k / \tau)}, \tag{16}$$
where τ is referred to as the temperature parameter. Since our goal is to maximize $p_{ij}$ whenever $z_j \in P_i$, the contrastive loss that is to be minimized is formulated as
$$L_{con} = -\sum_{i=1}^{N} \frac{1}{|P_i|} \sum_{z_j \in P_i} \log p_{ij}. \tag{17}$$
Information-theoretical properties of $L_{con}$ are given in [21], for which we provide a summary. Let X, Y, and Z denote random variables of the samples, labels, and features, respectively. The following theorem states that $L_{con}$ is positively proportional to H(Z | Y) − H(Z) under the assumption that no class imbalance exists.
Theorem 1 (Zhang et al. [21]). Assuming that features are $\ell_2$-normalized and the dataset is balanced,
$$L_{con} \propto H(Z \mid Y) - H(Z). \tag{18}$$
Theorem 1 implies that minimizing $L_{con}$ is equivalent to minimizing the conditional entropy H(Z | Y) and maximizing the feature entropy H(Z). Since I(Z; Y) = H(Z) − H(Z | Y), minimizing $L_{con}$ is equivalent to maximizing the mutual information I(Z; Y) between features Z and class labels Y. In other words, contrastive learning aims to extract the maximum amount of information from class labels and encode it in the form of features.
After the features are extracted, a classifier Clas(·) is assigned to convert $z_i$ into a prediction $\hat{y}_i = \mathrm{Clas}(z_i)$ of the class label. The random variable of predicted class labels is denoted by $\hat{Y}$.
For the next theorem, the definition of conditional cross-entropy $H(Y; \hat{Y} \mid Z)$ is given as follows:
$$H(Y; \hat{Y} \mid Z) = \mathbb{E}_{P_{(Y,Z)}}\left[-\log P_{\hat{Y} \mid Z}(Y \mid Z)\right]. \tag{19}$$
Conditional CE measures the average amount of information needed to encode the true distribution $Y$ using its estimate $\hat{Y}$ given the value of $Z$. A small value of $H(Y; \hat{Y} \mid Z)$ implies that $\hat{Y}$ is a good estimate for $Y$ given $Z$.
Theorem 2 (Zhang et al. [21]). Assuming that features are $\ell_2$-normalized and the dataset is balanced,
$$L_{con} \propto \inf H(Y; \hat{Y} \mid Z) - H(Y), \tag{20}$$
where the infimum is taken over classifiers.
Theorem 2 implies that minimizing $L_{con}$ will minimize the infimum of conditional cross-entropy $H(Y; \hat{Y} \mid Z)$ taken over classifiers, since H(Y) does not depend on the features. As a consequence, contrastive learning is able to encode features in Z such that the best classifier can produce a good estimate of Y given the information provided by the feature encoder.
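The formula for $L_{con}$ can be modified so as to resemble the focal loss, resulting in a loss function known as the focal contrastive loss (FCL) [21]:
$$L_{FC} = -\sum_{i=1}^{N} \frac{1}{|P_i|} \sum_{z_j \in P_i} (1 - p_{ij}) \log p_{ij}. \tag{21}$$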
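The supervised contrastive loss and its focal variant described in this section can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes that the loss is summed over anchors, that an anchor without positives contributes zero, and that the focal variant multiplies each log-probability by one minus the pair probability. The four two-dimensional feature vectors and both labelings are made up.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pair_probabilities(z, i, tau):
    """Softmax over similarities between anchor i and all other features.

    Returns a dict mapping j (j != i) to p_ij.
    """
    sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
            for j in range(len(z)) if j != i}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in sims.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def contrastive_loss(z, y, tau=0.07, focal=False):
    """Supervised contrastive loss; focal=True weights each term by (1 - p_ij)."""
    loss = 0.0
    for i in range(len(z)):
        p = pair_probabilities(z, i, tau)
        positives = [j for j in p if y[j] == y[i]]
        if not positives:  # assumption: an anchor with no positives adds nothing
            continue
        terms = [(1.0 - p[j] if focal else 1.0) * math.log(p[j]) for j in positives]
        loss -= sum(terms) / len(positives)
    return loss

# Two tight groups of unit vectors pointing in opposite directions.
z = [l2_normalize(v) for v in [[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]]]

clustered = contrastive_loss(z, [0, 0, 1, 1])   # labels agree with the geometry
mixed = contrastive_loss(z, [0, 1, 0, 1])       # labels disagree with the geometry
print(f"clustered labels: {clustered:.6f}")
print(f"mixed labels:     {mixed:.6f}")
```

The loss is close to zero when same-class features are already grouped together and large when positives lie far from their anchors, which is the behavior that drives the clustering described above.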

Proposed Loss Functions and Architecture
In this section, our proposed modification of the contrastive loss, which is called the asymmetric contrastive loss, is introduced. In addition, the architecture of the model in which the contrastive losses are implemented is explained. Our proposed asymmetric loss function is novel, while the architecture is obtained from [12,13] with no changes made. Thus, our contribution lies simply in the change of the loss function.

Asymmetric Contrastive Loss
In (17), the inner summation of the contrastive loss is evaluated over $P_i$. Consequently, according to (16), each anchor $z_i$ is contrasted with vectors $z_j$ that belong to the same class. This does not present a problem when the mini-batch contains plenty of examples from each class. However, the calculated loss may not give each class a fair contribution when some classes are less represented in the mini-batch.
In Figure 2, a sampled mini-batch consists of 11 examples with a blue-colored class label and one example with a red-colored class label. When the anchor $z_i$ is the representation of the red-colored sample, $z_i$ does not directly contribute to the calculation of $L_{con}$, since $P_i$ is empty. In other words, $z_i$ cannot be contrasted to any other sample in the mini-batch. This scenario is likely to happen when the dataset is imbalanced, and it motivates us to modify CL so that each anchor $z_i$ can also be contrasted with $z_j$ not belonging to the same class. Let $N_i = A_i \setminus P_i$ be the set of vectors $z_k$ such that $(z_i, z_k)$ is a negative pair. Motivated by the $L^+_i$ and $L^-_i$ of (10), we define
$$L^+_i = \frac{1}{|P_i|} \sum_{z_j \in P_i} \log p_{ij} \tag{22}$$
and
$$L^-_i = \frac{1}{|N_i|} \sum_{z_j \in N_i} \log(1 - p_{ij}), \tag{23}$$
where $p_{ij}$ is the predicted probability of (16). The loss function $L^+_i$ contrasts $z_i$ to vectors in $P_i$, whereas $L^-_i$ contrasts $z_i$ to vectors in $N_i$. The resulting asymmetric contrastive loss (ACL) is given by the formula
$$L_{AC} = -\sum_{i=1}^{N} \left(L^+_i + \eta L^-_i\right), \tag{24}$$
where η ≥ 0 is a fixed hyperparameter. If η = 0, then $L_{AC} = L_{con}$. Hence, ACL is a generalization of CL. When the batch size is set to a large number (over 100, for example), the value $p_{ij}$ tends to be very small. This causes $L^-_i$ to be much smaller in magnitude than $L^+_i$. In order to balance their contribution to the total loss $L_{AC}$, a large value for η is usually chosen (between 60 and 300 in our experiment).
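The situation of a lone minority anchor can be reproduced on a toy mini-batch. The sketch below assumes the ACL takes the form of a sum over anchors of the positive term plus η times the negative term, with the positive term set to zero when an anchor has no positives; the five two-dimensional unit vectors (four "blue", one "red") and the function names are invented for illustration.

```python
import math

def pair_probabilities(z, i, tau):
    """Softmax over similarities between anchor i and all other features."""
    sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
            for j in range(len(z)) if j != i}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in sims.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def anchor_terms(z, y, i, tau):
    """Return (L_plus_i, L_minus_i); an empty set contributes 0 (assumption)."""
    p = pair_probabilities(z, i, tau)
    pos = [j for j in p if y[j] == y[i]]
    neg = [j for j in p if y[j] != y[i]]
    l_pos = sum(math.log(p[j]) for j in pos) / len(pos) if pos else 0.0
    # log1p(-p) evaluates log(1 - p) accurately when p is tiny
    l_neg = sum(math.log1p(-p[j]) for j in neg) / len(neg) if neg else 0.0
    return l_pos, l_neg

def asymmetric_contrastive_loss(z, y, eta, tau=0.07):
    """ACL sketch: -(sum over anchors of L_plus_i + eta * L_minus_i)."""
    loss = 0.0
    for i in range(len(z)):
        l_pos, l_neg = anchor_terms(z, y, i, tau)
        loss -= l_pos + eta * l_neg
    return loss

def unit(degrees):
    """Unit vector in the plane at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

# Hypothetical mini-batch: four majority samples and a single minority sample.
z = [unit(0), unit(10), unit(-10), unit(20), unit(180)]
y = ["blue", "blue", "blue", "blue", "red"]

red_pos, red_neg = anchor_terms(z, y, 4, tau=0.07)
print(f"red anchor: L+ = {red_pos}, L- = {red_neg:.4f}")
print(f"eta = 0  -> {asymmetric_contrastive_loss(z, y, eta=0):.4f}")
print(f"eta = 60 -> {asymmetric_contrastive_loss(z, y, eta=60):.4f}")
```

With η = 0 the red anchor adds nothing to the loss, whereas any η > 0 lets it contribute through its negative pairs, which is the motivation given above for the asymmetric term.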
In summary, we propose ACL in order to (1) generalize the CL via the addition of a summation over negative samples and (2) specifically address the problem of class imbalance. ACL is intended to be both more flexible and robust to imbalances than the vanilla CL.

Asymmetric Focal Contrastive Loss
Following the formulation of $L_{FC}$ in (21), $L^+_i$ can be modified to have the following formula:
$$L^+_i = \frac{1}{|P_i|} \sum_{z_j \in P_i} (1 - p_{ij})^{\gamma} \log p_{ij}. \tag{25}$$
Using this loss, the asymmetric focal contrastive loss (AFCL) is then given by
$$L_{AFC} = -\sum_{i=1}^{N} \left(L^+_i + \eta L^-_i\right), \tag{26}$$
where $L^-_i = \frac{1}{|N_i|} \sum_{z_j \in N_i} \log(1 - p_{ij})$. We do not modify $L^-_i$ by adding the multiplicative term $(p_{ij})^{\gamma}$, since $p_{ij}$ is usually too small and would make $L^-_i$ vanish if the term is added. We have $L_{AFC} = L_{FC}$ when γ = 1 and η = 0. Thus, AFCL generalizes the FCL. Unlike with the FCL, we add the hyperparameter γ ≥ 0 to the loss function so as to provide some flexibility to the loss function.
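A compact sketch of the AFCL under the same assumptions as before (sum over anchors, empty positive or negative sets contribute zero) is given below. The toy mini-batch, the function names, and the specific values of η and γ are illustrative only and are not the settings or code used in the experiments.

```python
import math

def pair_probabilities(z, i, tau):
    """Softmax over similarities between anchor i and all other features."""
    sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
            for j in range(len(z)) if j != i}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in sims.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def afcl(z, y, eta, gamma, tau=0.07):
    """AFCL sketch: focal weighting on positives, eta-weighted negative term."""
    loss = 0.0
    for i in range(len(z)):
        p = pair_probabilities(z, i, tau)
        pos = [j for j in p if y[j] == y[i]]
        neg = [j for j in p if y[j] != y[i]]
        l_pos = (sum((1.0 - p[j]) ** gamma * math.log(p[j]) for j in pos) / len(pos)
                 if pos else 0.0)
        l_neg = sum(math.log1p(-p[j]) for j in neg) / len(neg) if neg else 0.0
        loss -= l_pos + eta * l_neg
    return loss

def unit(degrees):
    """Unit vector in the plane at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

# Hypothetical mini-batch: four majority samples and a single minority sample.
z = [unit(0), unit(10), unit(-10), unit(20), unit(180)]
y = [0, 0, 0, 0, 1]

cl_value = afcl(z, y, eta=0, gamma=0)     # reduces to the plain contrastive loss
fcl_value = afcl(z, y, eta=0, gamma=1)    # reduces to the focal contrastive loss
afcl_value = afcl(z, y, eta=120, gamma=2)
print(f"CL={cl_value:.4f}  FCL={fcl_value:.4f}  AFCL={afcl_value:.4f}")
```

Raising γ shrinks the contribution of positives that are already assigned a high probability, while η controls how strongly the negative pairs, including those of anchors without positives, enter the loss.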

Model Architecture
This section explains the inner workings of the classification model used for the implementation of the contrastive losses. The architecture of the model is taken from [12,13]. The training strategy for the model, as shown in Figure 3, comprises two stages: the feature-learning stage and the fine-tuning stage. In the first stage, each mini-batch is fed through a feature encoder. We consider either ResNet-18 or ResNet-50 [30] for the architecture of the feature encoder. The output of the feature encoder is projected by the projection head to generate a vector z of length 128. If ResNet-18 is used for the feature encoder, then the projection head consists of two layers of lengths 512 and 128. If ResNet-50 is used, then the two layers are of lengths 2048 and 128. Afterwards, z is $\ell_2$-normalized, and the model parameters are updated using some version of the contrastive loss (either CL, FCL, ACL, or AFCL).
After the first stage is complete, the feature encoder is frozen and the projection head is removed. In its place, we have a one-layer classification head that generates the estimated probability that the training sample belongs to a certain class. The parameters of the classification head are updated using either the FL or CE loss. The final classification model is the feature encoder trained during the first stage, together with the classification head trained during the second stage. Since the classification head is a significantly smaller architecture than the feature encoder, the training is mostly focused on the first stage. As a consequence, we typically need a larger number of epochs for the feature-learning stage compared to the fine-tuning stage.

Experiments
The datasets and settings of our experiments are outlined in this section. We provide and discuss the results of the experiments on the FMNIST and ISIC 2018 datasets. The PyTorch implementation is available on GitHub (https://github.com/valentinovito/Asymmetric-CL, accessed on 8 September 2022).

Datasets
In our experiments, the training strategy outlined in Section 4.3 was applied to two imbalanced datasets. The first was a modified version of the Fashion-MNIST (FMNIST) dataset [31], and the second was the International Skin Imaging Collaboration (ISIC) 2018 medical dataset [1,2].
The FMNIST dataset consisted of low-resolution (28 × 28 pixels) grayscale images of ten classes of clothing. In this study, we took only two classes to form a binary classification task: the T-shirt and shirt classes. The samples were taken such that the proportion between the T-shirt and shirt images could be imbalanced, depending on the scenario. On the other hand, the ISIC 2018 dataset consisted of high-resolution RGB images of seven classes of skin lesions. As with FMNIST, we used only two classes for the experiments: the melanoma and dermatofibroma classes. Illustrations of the sample images of both datasets are provided in Figure 4.
FMNIST was chosen as our dataset, since, although simple, it is a benchmark dataset for testing deep learning models for computer vision. On the other hand, ISIC 2018 was chosen since it is a domain-appropriate imbalanced dataset for our model. We first applied the model (using AFCL as the loss function) to the more lightweight FMNIST dataset under various class imbalance scenarios. This was conducted to check the appropriate values of the η and γ parameters of the AFCL under different imbalance conditions. Afterwards, the model was applied to the ISIC 2018 dataset using the optimal parameter values obtained during the FMNIST experiments.

Experimental Details
The experiments were conducted using the NVIDIA Tesla P100-PCIE GPU allocated by the Google Colaboratory Pro platform. The models and loss functions were implemented using PyTorch. To process the FMNIST dataset, we used the simpler ResNet-18 architecture as the feature encoder and trained it for 20 epochs. On the other hand, to process the ISIC 2018 dataset, we used the deeper ResNet-50 as the feature encoder and trained it for 40 epochs. For both the FMNIST and ISIC 2018 datasets, the learning rate and batch size were set to $10^{-2}$ and 128, respectively. In addition, the classification head was trained for 10 epochs. The encoder and the classification head were both trained using the Adam optimizer. Finally, the temperature parameter τ of the contrastive loss was set to its default value of 0.07.
The evaluation metrics utilized in the experiment were (weighted) accuracy and unweighted accuracy (UWA), both of which could be calculated from the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP) using the formulas
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \tag{27}$$
and
$$\text{UWA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right), \tag{28}$$
respectively. Unlike accuracy, the UWA provided the average of the individual class accuracies regardless of the number of samples in the test set of each class. UWA is an appropriate metric when a dataset is significantly imbalanced [32]. For heavily imbalanced datasets, a high accuracy and low UWA may mean that the model is biased towards classifying samples as part of the majority class. This indicates that the model does not properly learn from the minority samples. In contrast, a lower accuracy with a high UWA indicates that the model takes significant risks to classify some samples as part of the minority class. Our aim was to construct a model that maximized both metrics simultaneously; that is, a model that could learn without bias from both the majority and minority samples with minimal misclassification error.
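The difference between the two metrics is easiest to see on a degenerate classifier. The sketch below computes both from confusion-matrix counts, taking the UWA as the mean of the two per-class recalls; the counts describe a hypothetical test set of 90 majority and 10 minority samples and are not results from the experiments.

```python
def accuracy(tp, tn, fn, fp):
    """Fraction of all test samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fn + fp)

def unweighted_accuracy(tp, tn, fn, fp):
    """Mean of the per-class accuracies (recalls) of the two classes."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Hypothetical classifier that labels every sample as the majority (negative) class.
majority_only = dict(tp=0, tn=90, fn=10, fp=0)

# Hypothetical classifier that recovers most minority samples at some cost.
risk_taking = dict(tp=8, tn=78, fn=2, fp=12)

for name, counts in [("majority-only", majority_only), ("risk-taking", risk_taking)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f}, "
          f"UWA={unweighted_accuracy(**counts):.2f}")
```

The majority-only classifier reaches 90% accuracy but only 50% UWA, while the risk-taking classifier trades some accuracy for a much higher UWA, mirroring the two cases discussed above.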

Experiments Using FMNIST
The data used in the FMNIST experiment comprised 1000 images classified as either a T-shirt or a shirt. The dataset was split 70/30 for model training and testing. The images were augmented using random rotations and random flips. We deployed 11 class imbalance scenarios on the dataset, which controlled the proportion between the T-shirt class and the shirt class. For example, if the proportion was 60:40, then 600 T-shirt images and 400 shirt images were sampled to form the experimental dataset. Our proportions ranged from 50:50 to 98:2.
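The class counts implied by an imbalance scenario follow directly from the proportion. The helper names below are hypothetical, and the sketch only reproduces the arithmetic described above; it does not say how individual images were sampled or whether the 70/30 split was stratified by class.

```python
def class_counts(total, majority_percent):
    """Number of majority and minority images for a given proportion."""
    majority = round(total * majority_percent / 100)
    return majority, total - majority

def split_sizes(total, train_percent=70):
    """Sizes of the training and test sets for a percentage split."""
    train = round(total * train_percent / 100)
    return train, total - train

# The two extreme scenarios and the 60:40 example mentioned in the text.
for percent in (50, 60, 98):
    tshirt, shirt = class_counts(1000, percent)
    print(f"{percent}:{100 - percent} -> {tshirt} T-shirt, {shirt} shirt images")

print("train/test sizes:", split_sizes(1000))
```

In the most extreme scenario only 20 minority images are available in total, so a mini-batch of 128 samples will often contain very few of them.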
During the first stage, the ResNet-18 encoder was trained using the AFCL. Afterwards, the classification head was trained using the CE loss during the second stage. As AFCL contains two parameters, η and γ, our goal was to tune each of these parameters independently, keeping the other parameter fixed. First, η was tuned with γ = 0 fixed; then, γ was tuned with η = 0 fixed. Each experiment was performed four times in total. The average accuracy and UWA of these four runs are provided in Table 1 (for the tuning of η) and Table 2 (for the tuning of γ). For the tuning of η, six values were tested: η ∈ {0, 60, 120, 180, 240, 300}. When η = 0, the loss function was reduced to the ordinary CL. As observed in Table 1, the optimal value of η tended to be larger when the dataset was moderately imbalanced. As the scenario went from 60:40 to 90:10, the parameter η that maximized accuracy increased in value, from η = 0 when the proportion was 60:40 to η = 300 when the proportion was 90:10. In general, this indicated that the $L^-_i$ term of the ACL became more essential to the overall loss as the dataset got more imbalanced, confirming the reasoning contained in Section 4.1.
As seen in Table 2, we experimented on γ ∈ {0, 1, 2, 4, 7, 10}, where choosing γ = 0 meant that we were using the CL. Although the overall pattern of the optimal γ was less apparent than that of η in the previous experiment, some insights could still be obtained. When the scenario was between 70:30 and 90:10, the optimal focusing parameter γ was larger than zero. This was in direct contrast to when the proportion was perfectly balanced (50:50), where γ = 0 was the optimal choice. This suggests that a larger value of γ should be considered when class imbalance is significantly present within a dataset.
When the dataset was balanced, however, our experiments suggested that neither asymmetry nor focality was markedly helpful. Indeed, in the 50:50 scenario, CL already provided the second-best accuracy in Table 1 and the best accuracy in Table 2. In Table 1, the CL was the case where η = 0 was chosen. In Table 2, on the other hand, the CL was used when γ = 0. Therefore, our proposed loss function works best with imbalanced datasets.

Experiments Using ISIC 2018
From the ISIC 2018 dataset, a total of 1113 melanoma images and 115 dermatofibroma images were combined to create the experimental dataset. As with the previous experiment, the dataset was split 70/30 for training and testing. The images were resized to 128 × 128 pixels. The ResNet-50 encoder was trained using one of the available contrastive losses, which included the CL/FCL as baselines and the ACL/AFCL as the proposed loss functions. The classification head was trained using FL as the loss function, with its focusing parameter set to γ = 2.
The proportion between the melanoma class and the dermatofibroma class in the experimental dataset was close to 90:10. Using the results from Tables 1 and 2 as a heuristic for determining the optimal parameter values, we set η = 300 and tested γ ∈ {2, 7}. It is worth mentioning that even though γ = 2 produced the best accuracy in the FMNIST experiment, the UWA of the resulting model was quite poor. However, we decided to include this value in this experiment for completeness.
The results of this experiment are given in Table 3. As in the previous section, each experiment was conducted four times, so the table lists the average accuracy and UWA of these four runs for each contrastive loss tested. Each run, which included both model training and testing, was completed in roughly 80 min using our computational setup.
From Table 3, CL and ACL performed the worst in terms of UWA and accuracy, respectively. However, ACL gave the best UWA among all losses. This may indicate that the ACL encouraged the model to take the risky approach of classifying some samples as part of the minority class at the expense of accuracy. Overall, AFCL with η = 300 and γ = 7 emerged as the best loss in this experiment, producing the best accuracy and the second-best UWA behind the ACL. This led us to conclude that the AFCL, with optimal hyperparameters chosen, is superior to the vanilla CL and FCL.

Conclusions and Future Work
In this work, we introduced asymmetric versions of both contrastive loss (CL) and focal contrastive loss (FCL), which are referred to as ACL and AFCL, respectively. These asymmetric variants of the contrastive loss were proposed to provide more focus on the minority class. The experimental model used was a two-stage architecture consisting of a feature-learning stage and a classifier fine-tuning stage. This model was applied to the imbalanced FMNIST and ISIC 2018 datasets using various contrastive losses. Our results show that the AFCL was able to outperform the CL and FCL in terms of both weighted and unweighted accuracies. On the ISIC 2018 binary classification task, AFCL, with η = 300 and γ = 7 as hyperparameters, achieved an accuracy of 93.75% and an unweighted accuracy of 74.62%. This is in contrast to the FCL, which achieved 93.07% and 74.34% on the same metrics, respectively.
The experiments in this research were conducted using datasets consisting of approximately 1000 images in total. In the future, the experimental model may be applied to larger-scale datasets in order to test its scalability. In addition, other models based on the ACL and AFCL can also be developed for specific datasets, ideally within the realm of multi-class classification.