Article

Rebalancing in Supervised Contrastive Learning for Long-Tailed Visual Recognition

Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(8), 204; https://doi.org/10.3390/bdcc9080204
Submission received: 22 June 2025 / Revised: 5 August 2025 / Accepted: 5 August 2025 / Published: 11 August 2025

Abstract

In real-world visual recognition tasks, the long-tailed distribution is a pervasive challenge: extreme class imbalance severely limits the representation learning capability of deep models. Although supervised learning has shown promise in long-tailed visual recognition, gradient updates dominated by head classes often leave tail classes insufficiently represented, resulting in ambiguous decision boundaries. While existing Supervised Contrastive Learning variants mitigate class bias through instance-level similarity comparison, they remain limited by biased negative sample selection and insufficient modeling of the feature space structure. To address this, we propose Rebalancing Supervised Contrastive Learning (Reb-SupCon), which constructs a balanced and discriminative feature space during training to alleviate performance deviation. Our method consists of two key components: (1) a dynamic rebalancing factor that automatically adjusts sample contributions through differentiable weighting, thereby establishing class-balanced feature representations; and (2) a prototype-aware enhancement module that further improves feature discriminability by explicitly constraining the geometric structure of the feature space through introduced feature prototypes, enabling locally discriminative feature reconstruction. This breaks through the limitations of conventional instance-level contrastive learning and helps the model identify more reasonable decision boundaries. Experimental results show that the method achieves superior performance on mainstream long-tailed benchmark datasets, with ablation studies and feature visualizations validating the modules' synergistic effects.

1. Introduction

The explosive development of deep learning has enabled Convolutional Neural Networks (CNNs) to achieve groundbreaking advances across multiple computer vision domains. Notably, CNNs have demonstrated superhuman-level performance in fundamental vision tasks including image classification [1], object detection [2], and semantic segmentation [3]. Progress in neural architecture search (NAS) techniques [4,5] has further enhanced CNN capabilities by automating architectural optimization, particularly improving their effectiveness on large-scale datasets.
These achievements largely rely on the powerful computational resources and high-quality large-scale datasets enabled by the Internet era, such as ImageNet [6] and MS COCO [7]. These datasets are carefully designed to ensure sufficient and balanced training samples for each category. However, real-world data distributions often deviate from this ideal scenario and instead exhibit long-tailed characteristics [8], as shown in Figure 1a. In a long-tailed distribution, a small number of categories dominate the majority of the data, while a large number of tail classes contain only limited samples. This imbalanced distribution pattern is prevalent in many practical applications, such as rare object recognition in image classification and niche topic detection in text classification.
In long-tailed recognition tasks, the imbalanced data distribution substantially influences both the training process and final performance. The imbalance yields strong recognition performance for head classes while severely limiting learning for tail classes due to their scarcity of samples, ultimately degrading overall classification accuracy [9,10,11,12]. Our in-depth analysis identifies the root cause as a systematic bias in classifier weight norms [13], which critically influences the model's feature response capability. Specifically, larger weight norms maintain strong sensitivity to head class features, whereas smaller norms result in inadequate responses to tail class features, leading to misclassification. As illustrated by the blue lines in Figure 2, our visualization of classifier weight norms in the baseline Supervised Contrastive Learning (SCL) [14] method shows that tail classes exhibit significantly smaller weight norms than head classes. This observation provides crucial evidence from a parameter-space perspective for understanding the performance bottleneck in long-tailed recognition.
To address this challenge, conventional approaches primarily include sampling methods [15,16], cost-sensitive learning techniques [9,12,17], and classifier optimization strategies [18,19,20]. These methodologies aim to mitigate classification bias toward head classes while improving recognition performance for tail classes. However, these solutions typically operate under the assumption of well-separated inter-class differences, focusing solely on optimizing sample distribution while paying insufficient attention to the latent representations learned from imbalanced data. Crucially, they fail to adequately account for inter-class feature correlations and intrinsic relationships. Moreover, their excessive emphasis on tail classes often comes at the expense of head class performance, resulting in an overall imbalance in classification effectiveness.
The emergence of self-supervised learning [21,22] has inspired new approaches to learning without label dependency, offering novel perspectives for addressing long-tailed problems. As a fundamental paradigm in self-supervised learning, contrastive learning [23,24] effectively models instance-level similarity relationships, enabling the extraction of highly discriminative feature representations without requiring labeled supervision. Khosla et al. [14] generalized the conventional self-supervised contrastive loss to its supervised counterpart (SCL) by integrating label information, yielding significant performance improvements. While SCL has achieved notable success on balanced datasets [25], its effectiveness on imbalanced distributions remains limited. The standard contrastive learning framework, which critically depends on the selection of positive–negative sample pairs, is inherently biased toward majority classes [13,26], consequently amplifying the challenges induced by data imbalance.
Based on the above analysis, this paper proposes a Rebalancing Supervised Contrastive Learning (Reb-SupCon) method for long-tailed visual recognition. The approach mitigates the learning difficulties associated with long-tailed data by breaking the dominance of head classes in model optimization within the feature space and establishing balanced category representations. The main contributions are as follows:
  • We propose an adaptive focal gradient-weighted rebalancing factor that dynamically integrates class frequency statistics and gradient magnitude information. It automatically adjusts the importance weight of each sample during training, simultaneously enhancing gradient focus on tail classes while preventing representation degradation of head classes, thereby constructing a balanced feature space.
  • Leveraging the balanced feature representations enabled by the rebalancing factor, we introduce a prototype-aware discriminative enhancement module. This module constrains the geometric structure of prototypes through an aggregation–separation loss and incorporates swapped prediction to achieve dual-branch supervision alignment between features and prototypes. This approach reconstructs the feature space from global balance to local discriminability, alleviating the limitation of traditional contrastive learning that relies solely on inter-sample similarity.
  • Extensive experiments on multiple benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.

2. Related Work

2.1. Long-Tailed Visual Recognition

Early approaches to addressing the long-tailed problem have mainly revolved around two core strategies: re-sampling and re-weighting. Re-sampling [27,28] either oversamples low-frequency classes or undersamples high-frequency classes. Re-weighting [12,17,29] assigns different loss values to different training samples based on their class or instance characteristics. Furthermore, module enhancement methods, such as BBN [30], introduce dual-branch networks combining conventional learning and rebalancing to address long-tailed recognition. Similarly, Hybrid-SC and Hybrid-PSC [31] adopt this dual-branch architecture to decouple feature learning from classification tasks, thereby mitigating inter-task interference. Notably, Hybrid-PSC further introduces a class prototype contrastive mechanism, achieving computational efficiency without compromising discriminative power on tail classes. Metric learning has also been widely applied in long-tailed recognition. Its core objective is to design task-specific distance metrics that better capture similarities and differences between objects, thereby promoting the learning of more discriminative feature spaces. For instance, KCL [32] proposes a k-positive contrastive loss to alleviate class imbalance by learning a balanced feature space, improving model generalization. In decoupled training approaches, He et al. [33] proposed an adaptive calibration method that operates in two key phases. In the first phase, feature space decoupling constructs a uniformly distributed hyperspherical prototype space. In the second phase, adaptive fine-tuning applies a dynamic temperature scaling algorithm based on the prototype distance metrics derived from the first phase, enabling automatic calibration of decision boundaries according to each class's specific characteristics. Recent Vision Transformer (ViT)-based approaches have also demonstrated unique advantages in long-tailed recognition. Lin et al. [34] proposed rebalanced modal gradient modulation (ReGrad) for the modality imbalance problem in multimodal face anti-spoofing (FAS), adaptively adjusting the gradient magnitude of each modality to balance convergence speed. Through interpretable reliability modeling and gradient control mechanisms, this line of work opens new possibilities for multimodal long-tailed learning.

2.2. Contrastive Learning

Self-supervised Contrastive Learning (SSCL) and SCL, the two major branches of contrastive learning, learn effective feature representations through contrastive loss functions in unlabeled and labeled settings, respectively. In recent years, SSCL has achieved remarkable progress. SimCLR [21] pioneered the training paradigm of optimizing feature representations through positive and negative sample pairs, demonstrating that contrastive learning can match supervised performance even without labels. Subsequently, BYOL [35] further refined the contrastive learning framework by proposing a method that relies solely on positive sample pairs, eliminating the challenges associated with negative sample selection. Other related models, such as MoCo [23,24] and SwAV [43], further improved performance through momentum encoding and online clustering, respectively. Collectively, these advances have driven unsupervised representation learning to approach the performance of supervised learning.
In contrast to SSCL, SCL relies on labeled data to pull samples of the same class closer while pushing apart those from different classes [14]. In long-tailed recognition tasks, numerous variants have been proposed. KCL [32] adopts a two-stage learning paradigm, ensuring an equal number of positive examples per class within each batch. Wang et al. [31] introduced a hybrid network architecture combining contrastive loss for feature learning and cross-entropy loss for classifier learning, which both improves representation quality and mitigates classifier bias toward head classes. Subsequently, several methods [10,13,25] incorporating class-complementary mechanisms for constructing positive and negative pairs have been developed to ensure adequate participation of tail classes during the learning process. ProCo [26] introduces a probabilistic contrastive learning framework based on von Mises–Fisher distribution modeling, which generates infinite contrastive pairs by estimating class-conditional feature distributions online and optimizes the expected loss in closed form, effectively addressing the inherent limitations of batch constraints. DSCL [36] proposes a decoupled Supervised Contrastive Learning framework, which alleviates intra-class distance bias and underrepresentation of tail class features by separating optimization objectives for two types of positive samples and introducing a patch-based self-distillation mechanism. Regarding the design of distribution-aware mechanisms, DaSC [37] innovatively consolidates distribution-aware weighting strategies and mixed-sample augmentation techniques into a contrastive learning framework. By dynamically fusing global data distribution information with local sample confidence, it effectively liberates the model from traditional methods’ dependence on high-purity samples.
Our work differs from previous approaches in several key aspects. First, existing methods typically employ an equal-treatment strategy for both types of positive samples, which has been shown to cause cross-category optimization bias [13]; to address this, we propose an adaptive focal gradient weighting mechanism that constructs category-aware rebalancing factors, enabling precise modeling of long-tailed data. Second, our prototype-aware discriminative enhancement module integrates an aggregation–separation loss with a swapped prediction mechanism, achieving explicit optimization of the feature space geometry. Most importantly, unlike existing methods that require complex two-stage tuning, our end-to-end framework accomplishes both feature balancing and discriminative enhancement in a single training stage.

2.3. Prototype Learning for Long-Tailed Recognition

Prototype learning methods provide an effective solution for long-tailed recognition tasks by constructing category-specific feature prototypes. The evolution of these methods demonstrates a clear progression from static prototype modeling to dynamic adaptive learning. The groundbreaking development of Open Long-Tail Recognition (OLTR) [8] introduced a novel framework for long-tailed recognition in open environments. By maintaining a visual memory bank that stores discriminative feature prototypes and utilizing retrieved memory features to enhance original representations, this approach not only strengthens the discriminative power of the feature space, but also effectively pushes novel-class samples away from the memory bank region. This design enables simultaneous recognition of known classes and detection of open-set classes.
Building upon OLTR, subsequent research has introduced various extensions and improvements. Expanded Episodic Memory (EEM) [38] implements a meta-embedding memory with dynamic update mechanisms, assigning each class a distinct memory block to record its most discriminative prototypes. This dynamic memory design significantly mitigates class imbalance effects by selectively retaining highly discriminative prototypes. Recent advancements have driven prototype learning toward more sophisticated implementations. PCL [39] innovatively employs category prototypes to represent semantic information, constructing a learnable prototype classifier through dual calibration mechanisms at both prototype and instance levels, effectively addressing data imbalance issues. Meanwhile, PADE [40] breaks the conventional assumption about test distributions by implementing momentum prototypes for balanced compactness learning, combined with instance-level domain detection and multi-expert parameter customization, achieving dynamic adaptation to arbitrary test distributions.

3. Method

In this section, we present Reb-SupCon, a novel framework designed to address feature bias in SCL for long-tailed visual recognition. Our approach consists of four key components: (1) fundamental concepts of Supervised Contrastive Learning (Section 3.1), (2) systematic analysis of classification bias in long-tailed scenarios (Section 3.2), (3) implementation details with proposed loss functions (Section 3.3), and (4) model training methodology (Section 3.4).

3.1. Preliminaries

In the visual recognition task, our objective is to establish a mapping from the input space $\mathcal{X}$ to the target space $\mathcal{Y}$. This mapping consists of two fundamental components: (1) an encoder $f_{\theta}(\cdot)$ that extracts features from input data $x \in \mathcal{X}$, and (2) a linear classifier $g_{\phi}(\cdot)$ that processes the extracted features $z = f_{\theta}(x) \in \mathbb{R}^d$ to produce the final classification output $\hat{y} = g_{\phi}(z) \in \mathcal{Y}$. The quality of feature extraction plays a pivotal role in determining the model's overall performance. Consequently, enhancing the encoder's capability and learning discriminative feature representations are primary objectives for optimizing classification performance. To facilitate our subsequent analysis, we first formalize the following key definitions:
Supervised Contrastive Learning (SCL). The goal of SCL is to optimize recognition performance by contrasting similarities between samples from the same class and those from different classes. Given a sample $x_i$ with embedding representation $z_i$, the similarity between two embedding vectors is measured by cosine similarity, defined as
$$\mathrm{sim}(z_i, z_j) = \frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|},$$
where $z_i \cdot z_j$ denotes the dot product of vectors $z_i$ and $z_j$, and $\|z\|$ denotes the L2 norm of vector $z$. The similarity is normalized to $[-1, 1]$, with higher values indicating greater similarity. This metric serves as the foundation for contrastive loss functions that pull together embeddings from the same class while pushing apart those from different classes.
The supervised contrastive loss is formally defined as
$$\mathcal{L}_{\mathrm{sup}}(\theta) = \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(s(z_i, z_p)/\tau\right)}{\sum_{j=1, j \neq i}^{N} \exp\left(s(z_i, z_j)/\tau\right)},$$
where $z_i$ denotes the embedding vector of sample $x_i$, $P(i)$ represents the set of positive samples sharing the same label as $x_i$, $\tau$ is the temperature parameter, and $\exp(\cdot)$ is the exponential function.
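To make the loss concrete, here is a minimal PyTorch sketch (our illustration, not the authors' released code); it normalizes the embeddings so that dot products equal the cosine similarity defined above:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.2):
    """Supervised contrastive loss over one batch.

    z:      (N, d) embeddings; normalized below so dot products are cosine similarities.
    labels: (N,) integer class labels.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                                   # (N, N) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))       # drop the j = i term
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # Mean log-probability over each anchor's positives (anchors with no
    # positives in the batch contribute zero).
    mean_log_prob = log_prob.masked_fill(~pos_mask, 0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob.mean()
```

The `logsumexp` form keeps the denominator numerically stable, and the `-inf` self-mask removes the $j = i$ term exactly as in the formula.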

3.2. Analysis

When applying SCL to long-tailed visual recognition tasks, frequent classes typically achieve lower loss values and dominate the model training process [13]. This leads to excessive attention being focused on features from frequent classes while those from rare classes are neglected, thereby exacerbating the data imbalance problem. In SCL, the selection of negative samples is determined by their label difference from the current anchor sample. Specifically, only samples sharing the same class label with the anchor are considered positive samples, while all other samples are treated as negatives. However, in class-imbalanced datasets, the number of head class samples significantly exceeds that of tail class samples. This imbalance is particularly pronounced in the MoCo [23,24] framework, where it causes two critical issues during the negative sample queue update process:
  • The negative sample queue exhibits a skewed class distribution. Let $N_h$, $N_m$, and $N_t$ denote the number of head, medium, and tail class samples in the dataset, respectively. The head class proportion in the dataset is
    $$p_h = \frac{N_h}{N_h + N_m + N_t}.$$
    Within the queue of size $Q$, the empirical head class proportion is $Q_h / Q$. When $N_h \gg N_t$ in the dataset, the queue proportion $Q_h / Q$ approaches 1, causing the negative sampling process to become heavily biased toward head classes while suppressing gradient contributions from tail classes (see the numeric sketch after this list). This imbalance ultimately degrades model performance.
  • The class imbalance in negative sampling critically distorts gradient updates during long-tailed learning. Following BCL [25], we reformulate the supervised contrastive loss at the class level for theoretical analysis:
    $$\mathcal{L}_{\mathrm{sup}} = -\sum_{i} \log \frac{\exp\left(s(z_i, z_{y_i})/\tau\right)}{\sum_{j \neq i} \exp\left(s(z_i, z_{y_j})/\tau\right)},$$
    where $z_{y_i}$ denotes anchor features from the same class as $z_i$. During backpropagation, the gradient update for each sample $i$ can be derived via the chain rule:
    $$\frac{\partial \mathcal{L}_{\mathrm{sup}}(i)}{\partial \theta} = \frac{\partial \mathcal{L}_{\mathrm{sup}}(i)}{\partial z_i} \cdot \frac{\partial z_i}{\partial \theta}.$$
    To account for the long-tailed distribution, the impact on gradient updates can be approximated as
    $$\frac{\partial \mathcal{L}_{\mathrm{sup}}(i)}{\partial z_i} \approx -\frac{1}{\tau} \frac{\partial \tilde{s}(i)}{\partial z_i} \left( 1 - \frac{\exp(\tilde{s}(i)/\tau)}{\sum_{j \in \mathrm{head}} \exp(\tilde{s}(j)/\tau) + \sum_{j \in \mathrm{tail}} \exp(\tilde{s}(j)/\tau)} \right),$$
    where $\tilde{s}(i) = s(z_i, z_{y_i})$ denotes the positive-pair similarity and $\tilde{s}(j) = s(z_i, z_{y_j})$ the negative-pair similarity. For analytical simplicity, the formulation above retains only the approximated gradient contributions of head and tail classes. When head-class negatives far outnumber tail-class negatives, gradient updates concentrate predominantly on head class samples while updates from tail classes are suppressed, impairing the model's ability to learn discriminative features for tail classes. Furthermore, the temperature parameter $\tau$ governs the concentration of the similarity distribution; in long-tailed scenarios, an excessively small $\tau$ exacerbates class imbalance by further strengthening the dominance of head class samples.
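To make the first issue tangible, here is a small numeric sketch; the sample counts below are invented for illustration and do not correspond to any dataset used in the paper:

```python
# Illustrative only: the sample counts are made up to show the skew.
N_h, N_m, N_t = 50_000, 8_000, 500        # head / medium / tail sample counts
p_h = N_h / (N_h + N_m + N_t)
print(f"head-class share of the data: {p_h:.3f}")            # ~0.855

# Under uniform sampling, a MoCo-style queue of size Q inherits this share in
# expectation, so roughly p_h * Q of the negatives are head-class samples.
Q = 8192
print(f"expected head-class negatives in the queue: {p_h * Q:.0f} / {Q}")
```

Even with a moderately skewed dataset, roughly six out of every seven negatives an anchor sees would come from head classes.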

3.3. Rebalancing Supervised Contrastive Learning

Based on the analysis in the previous section, to achieve rebalancing for Supervised Contrastive Learning in long-tailed recognition tasks, we need to address two key issues: (1) the excessive proportion of head class samples in the negative sample feature queue causes the gradient update process to skew toward head classes, thereby weakening the model’s learning capability for tail classes; (2) existing methods rely on label differentiation for local feature comparison, lacking further optimization of the feature space, which amplifies the negative impact of imbalanced data distribution.
To address this, we incorporate two additional terms into the original loss function: an adaptive focal gradient-weighted rebalancing factor to establish balanced feature representations, and a prototype-aware discriminative enhancement module to refine prototype representations in order to construct a more discriminative feature space. As shown in Figure 3, our framework is built on the two-branch momentum contrast (MoCo) architecture [23], integrating the collaborative optimization interaction between the rebalancing module and the prototype-aware constraints. The prototype alignment dynamically adjusts the sample reweighting strategy, and the resulting balanced gradient iteratively optimizes the prototype position, forming a coupled optimization cycle.

3.3.1. Rebalancing Factor

Traditional contrastive learning methods typically assume that negative samples are relatively easy to distinguish, thus focusing primarily on maximizing the similarity of positive samples. Under this assumption, existing studies often employ simple class-frequency-based weighting strategies to balance the training process [41]. However, such approaches fail to adequately account for the complex relationship between class distribution and sample difficulty. A common practice is to treat samples from head classes as inherently easy and those from tail classes as uniformly hard—a simplistic frequency-based division with clear limitations. In reality, accurately assessing sample difficulty should incorporate multiple factors, including their distribution characteristics in feature space and gradient information, rather than relying solely on class frequency [42].
Motivated by this observation, we propose an adaptive focal gradient-weighted rebalancing factor, which introduces a dynamic gradient-aware mechanism into the supervised contrastive loss to enable adaptive focus across different classes. Formally, given a batch of samples $\{x_i, y_i\}$, the enhanced loss function is defined as
$$\mathcal{L}_i = -\sum_{c \in C} \omega(c) \cdot \mathbb{I}_{c = c_y} \cdot \log \frac{\exp\left(z_c \cdot G(x_i)/\tau\right)}{\sum_{z_j \in R(i)} \exp\left(z_j \cdot G(x_i)/\tau\right)},$$
where $C$ denotes the original set of categories, $R(i) = S_i^{+} \cup S_i^{-}$ represents the positive and negative sample sets for instance $i$ (excluding its own feature $z_i$), and $G(\cdot)$ denotes a multilayer perceptron composed of two fully connected layers. The rebalancing factor $\omega(c)$ is defined as
$$\omega(c) = \frac{1}{1 + \alpha \cdot \nabla_c \cdot \omega_f(c)},$$
where $\alpha$ is a hyperparameter controlling the rebalancing intensity, and the gradient term $\nabla_c$ represents the average gradient magnitude for class $c$, reflecting the model's learning difficulty on this category. It is calculated as follows:
$$\nabla_c = \frac{1}{|B_c|} \sum_{i \in B_c} \left\| \nabla_{\theta} \mathcal{L}_i \right\|_2,$$
where $\nabla_{\theta} \mathcal{L}_i$ denotes the loss gradient of sample $i$ with respect to the model parameters $\theta$ (computed while freezing $\omega(c)$ to avoid second-order gradients), and $B_c$ represents the batch samples from class $c$. We allocate more attention to classes with higher learning difficulty to prevent them from being "forgotten" or inadequately learned.

To further enhance focus on hard samples, we introduce a focal weight $\omega_f(c)$, which depends on the class frequency $f(c)$, along with hyperparameters $\beta$ and $\varepsilon$ to control the adjustment intensity:
$$\omega_f(c) = \frac{\beta}{f(c) + \varepsilon},$$
where $\beta$ controls the adjustment intensity of the focal weight, and $\varepsilon = 1 \times 10^{-5}$ is added to prevent division-by-zero errors. Furthermore, to ensure that only positive sample pairs of the current instance are computed, we introduce an indicator function $\mathbb{I}_{c = c_y}$, which equals 1 when $c$ matches the label of sample $i$ and 0 otherwise.

In summary, the rebalanced supervised contrastive loss can be expressed as
$$\mathcal{L}_{\mathrm{reb}} = \sum_{i \in N} \frac{1}{|S_i^{+}|} \mathcal{L}_i.$$
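A minimal sketch of how the rebalancing factor could be assembled from batch statistics follows. The per-sample gradient norms $\|\nabla_{\theta} \mathcal{L}_i\|_2$ are treated as given, since how they are materialized is implementation-specific; all names are ours rather than from a released implementation:

```python
import torch

def rebalancing_factor(grad_norms, labels, class_freq, alpha=0.5, beta=0.1, eps=1e-5):
    """Per-class rebalancing factor omega(c) = 1 / (1 + alpha * nabla_c * omega_f(c)).

    grad_norms: (N,) per-sample gradient magnitudes ||grad_theta L_i||_2,
                assumed to be computed elsewhere (with omega frozen).
    labels:     (N,) class labels of the batch samples.
    class_freq: (C,) training-set frequency f(c) of each class.
    """
    C = class_freq.numel()
    # nabla_c: average gradient magnitude over the batch samples of class c.
    grad_sum = torch.zeros(C).scatter_add_(0, labels, grad_norms)
    counts = torch.zeros(C).scatter_add_(0, labels, torch.ones_like(grad_norms))
    nabla_c = grad_sum / counts.clamp(min=1)
    # Focal weight omega_f(c) = beta / (f(c) + eps): larger for rarer classes.
    omega_f = beta / (class_freq + eps)
    return 1.0 / (1.0 + alpha * nabla_c * omega_f)
```

The returned per-class weights would then multiply the positive-pair terms of $\mathcal{L}_i$ as in the equation above.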

3.3.2. Prototype-Aware Discriminative Enhancement Module

Building upon the balanced feature representations established by the aforementioned rebalancing factor, we introduce feature prototypes to further optimize the discriminative capability at the feature space level. By jointly employing aggregation and separation losses, we explicitly constrain the geometric structure of the prototype space. Additionally, we incorporate a swapped prediction loss to enforce bidirectional alignment between features and dynamic prototypes, thereby achieving refined feature space reconstruction on top of the balanced distribution.
As illustrated in Figure 3, for input features $z_q$ and $z_k$, we first introduce a set of $K$ trainable global prototype vectors $C = [c_1, c_2, \ldots, c_K] \in \mathbb{R}^{K \times D}$ stored in the network through the dual-branch architecture. Each prototype $c_i$ dynamically encodes the global data distribution via cross-batch gradient updates, while supporting multiple prototypes per class to capture fine-grained subclass structures. To address the common prototype collapse issue in long-tailed learning, we abandon traditional clustering methods and instead employ optimal transport theory [43], efficiently computing soft assignment matrices $Q_q, Q_k \in \mathbb{R}^{K \times B}$ through the Sinkhorn algorithm. This ensures proper prototype distribution in the feature space while preventing prototype collapse and achieving balanced class representation. The final optimization objective can be formally expressed as
$$\max_{Q \in \mathcal{Q}} \; \mathrm{Tr}\left(Q^{\top} C Z\right) + \psi H(Q),$$
where $\mathrm{Tr}(Q^{\top} C Z)$ maximizes the trace between the prototype assignment matrix $Q$ and the feature–prototype similarity matrix $CZ$, directly optimizing the matching degree between prototypes and features. The term $\psi H(Q)$ imposes entropy regularization to constrain the uniformity of the assignment matrix, where $H$ denotes the entropy function, $H(Q) = -\sum_{ij} Q_{ij} \log Q_{ij}$. An approximate optimal solution is obtained efficiently with three Sinkhorn iterations while computational efficiency is maintained. Furthermore, to enable online updates within each batch, the constraint set $\mathcal{Q}$ for the optimal transport problem is defined as
$$\mathcal{Q} = \left\{ Q \in \mathbb{R}_{+}^{K \times B} \;\middle|\; Q \mathbf{1}_B = \tfrac{1}{K} \mathbf{1}_K, \; Q^{\top} \mathbf{1}_K = \tfrac{1}{B} \mathbf{1}_B \right\},$$
where $\mathbf{1}_K$ denotes a $K$-dimensional all-ones vector and $\mathbf{1}_B$ a $B$-dimensional all-ones column vector. These constraints enforce that each prototype is selected $\frac{B}{K}$ times on average within the batch, preventing all samples from clustering around a few prototypes and ensuring uniform prototype coverage across samples.
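The Sinkhorn step can be sketched as follows, in the style of SwAV's published algorithm [43]; the entropy coefficient and iteration count mirror the settings reported later in Section 4.2, while the tensor layout is our assumption:

```python
import torch

@torch.no_grad()
def sinkhorn_assign(scores, psi=0.05, n_iters=3):
    """Approximate the entropy-regularized optimal-transport assignment.

    scores: (K, B) prototype-feature similarity matrix C Z (K prototypes, B samples).
    Returns Q in R_+^{K x B}; after scaling, each column is a distribution
    over prototypes for one sample.
    """
    Q = torch.exp(scores / psi)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)   # rows: each prototype holds 1/K of the mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)   # columns: each sample holds 1/B of the mass
        Q /= B
    return Q * B
```

Alternating the row and column normalizations drives $Q$ toward the constraint set $\mathcal{Q}$ above without ever solving the transport problem exactly.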
To facilitate cross-branch knowledge transfer, we implement a symmetric swapped prediction mechanism that enforces mutual consistency between the prototype assignment distributions predicted by the two branches. The computation proceeds as follows: first, we calculate the normalized similarity matrices between the features and prototypes for each branch:
$$S_q = Z_q C^{\top} / \tau, \qquad S_k = Z_k C^{\top} / \tau.$$
These similarity scores are then transformed into probabilistic assignment distributions $P_q$ and $P_k$ through softmax normalization, which subsequently enter the loss computation:
$$\mathcal{L}_1 = \frac{1}{2B} \sum_{i=1}^{B} \left[ \ell\left(Q_k^i, P_q^i\right) + \ell\left(Q_q^i, P_k^i\right) \right],$$
where $\ell(\cdot, \cdot)$ is a divergence measure between probability distributions, implemented as the cross-entropy loss.
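Under these definitions, the swapped prediction loss might look as follows (a sketch reusing `sinkhorn_assign` from above; the assignments are detached, anticipating the stop-gradient discussed next):

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z_q, z_k, prototypes, tau=0.2):
    """L1: each branch predicts the other branch's Sinkhorn assignment.

    z_q, z_k:   (B, D) L2-normalized features from the two branches.
    prototypes: (K, D) trainable prototype matrix C.
    """
    s_q = z_q @ prototypes.T / tau        # (B, K) similarity logits
    s_k = z_k @ prototypes.T / tau
    # Assignments are computed without gradients (stop-gradient through Q).
    q_q = sinkhorn_assign((prototypes @ z_q.T).detach()).T   # (B, K)
    q_k = sinkhorn_assign((prototypes @ z_k.T).detach()).T
    log_p_q = F.log_softmax(s_q, dim=1)
    log_p_k = F.log_softmax(s_k, dim=1)
    # Cross-entropy of the swapped pairs, averaged over both directions.
    return -0.5 * ((q_k * log_p_q).sum(1).mean() + (q_q * log_p_k).sum(1).mean())
```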
During gradient computation, we prevent backpropagation through $Q_q$ and $Q_k$ to avoid interference with the prototype updating process. Subsequently, to enhance the discriminative power of the feature space, we design an aggregation loss that compels samples from the same class to cluster around their corresponding prototypes:
$$\mathcal{L}_2 = \frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K} Q_{qk}^{i} \left\| z_q^{i} - c_k \right\|_2^2,$$
where the soft assignment probability $Q_{qk}^{i}$ serves as a weighting coefficient, giving higher-confidence samples greater importance.

Building upon the rapid prototype–sample association established by the aggregation loss, we further design a separation loss to maximize the distances between different prototypes:
$$\mathcal{L}_3 = \sum_{k=1}^{K} \sum_{k' \neq k} \max\left( 0, \; \delta - \left\| c_k - c_{k'} \right\|_2^2 \right),$$
where $\delta$ is a hyperparameter controlling the minimum separation threshold between prototypes. The complete objective function combines these components with weighting hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ to balance their contributions:
$$\mathcal{L}_{\mathrm{proto}} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \lambda_3 \mathcal{L}_3.$$
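The two geometric losses admit a direct implementation; the following sketch (names are illustrative) computes the assignment-weighted attraction and the pairwise hinge repulsion:

```python
import torch

def aggregation_loss(z_q, prototypes, q_assign):
    """L2: pull samples toward prototypes, weighted by their soft assignments.

    z_q: (B, D) features; prototypes: (K, D); q_assign: (B, K).
    """
    # Squared distances between every sample and every prototype: (B, K).
    d2 = torch.cdist(z_q, prototypes).pow(2)
    return (q_assign * d2).sum(dim=1).mean()

def separation_loss(prototypes, delta=1.0):
    """L3: hinge penalty pushing every pair of prototypes at least delta apart."""
    d2 = torch.cdist(prototypes, prototypes).pow(2)       # (K, K)
    K = prototypes.size(0)
    off_diag = ~torch.eye(K, dtype=torch.bool, device=prototypes.device)
    return torch.clamp(delta - d2[off_diag], min=0).sum()
```

The hinge in the separation term only penalizes prototype pairs closer than $\delta$, so well-separated prototypes receive no gradient.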

3.4. Model Training

Our proposed feature rebalancing framework comprises two key components: (1) an adaptive focal gradient-weighted loss function with rebalancing factors, and (2) a discriminative enhancement loss based on feature prototypes. The complete objective function can be expressed as
$$\mathcal{L} = \mathcal{L}_{\mathrm{proto}} + \mathcal{L}_{\mathrm{reb}}.$$
These components work synergistically to address the class imbalance problem in long-tailed datasets by ensuring sufficient learning signals for tail classes while simultaneously enhancing the separation and discriminability between head and tail classes, thereby improving overall recognition performance.

4. Experiments

4.1. Datasets

We evaluated our method on three widely used long-tailed recognition benchmarks: CIFAR-100-LT [44], ImageNet-LT [8], and iNaturalist 2018 [45], as summarized in Table 1. Unlike their balanced counterparts (CIFAR-100 [46] and ImageNet-2012 [1]), CIFAR-100-LT and ImageNet-LT are synthetically constructed to follow imbalanced class distributions. In accordance with the standard evaluation protocol in long-tailed recognition [8], we trained the models on the original long-tailed training data and evaluated them on uniformly sampled test sets, strictly adhering to the official train–test splits of each dataset.
CIFAR-100-LT. CIFAR-100 contains 60,000 images (50,000 for training and 10,000 for validation) across 100 classes. For fair comparison, we adopted the long-tailed version of CIFAR with experimental settings identical to those specified in [9,10]. These studies employ an imbalance factor $IF$ to control the degree of data imbalance, defined as $IF = N_{\max} / N_{\min}$, where $N_{\max}$ and $N_{\min}$ denote the number of training samples in the most and least frequent classes, respectively. In our experiments, we configured $IF \in \{100, 50, 10\}$ to conduct comprehensive evaluations. Classes were categorized into head, medium, and tail groups based on sample counts, with the classification thresholds set at 100 and 20.
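For reference, the exponential per-class profile commonly used to build CIFAR-100-LT from an imbalance factor (a sketch of the standard construction of [9,44]; variable names are ours):

```python
def long_tailed_counts(n_max=500, num_classes=100, imbalance_factor=100):
    """Per-class training-set sizes n_c = n_max * IF^(-c / (num_classes - 1)).

    Class 0 keeps all n_max samples; the last class keeps n_max / IF.
    """
    return [round(n_max * imbalance_factor ** (-c / (num_classes - 1)))
            for c in range(num_classes)]

counts = long_tailed_counts()
print(counts[0], counts[-1])   # 500 5  ->  IF = N_max / N_min = 100
```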
ImageNet-LT. As a long-tailed version of the ImageNet dataset [8], ImageNet-LT was constructed by sampling a subset following the Pareto distribution with a power value α = 6 [1]. It comprises 115.8 K images across 1000 categories, with the number of samples per class ranging from 5 to 1280.
iNaturalist 2018. iNaturalist 2018 [45] is a large-scale species classification dataset with extreme label imbalance, comprising 437.5 K images across 8142 categories. Beyond its severe class imbalance, the dataset presents fine-grained recognition challenges—significant visual variations may exist among different individuals or variants of the same species, substantially complicating the classification task. We selected iNaturalist 2018 as our experimental testbed to better simulate real-world species classification scenarios.

4.2. Experimental Setup

Evaluation Metrics. We employ the Top-1 accuracy across the entire dataset as our primary performance metric. Additionally, we report the class-average accuracy to provide a comprehensive evaluation of model performance, along with the accuracy breakdown for three category subsets partitioned by training sample size: head classes (including categories with >100 training samples), medium classes (including categories with sample sizes between 20 and 100), and tail classes (consisting of categories with fewer than 20 samples).
Experimental Basic Setup. For fair comparison with prior works, all experiments followed the same basic configurations as described in [8]. Additional experimental details are provided in Table 2.
We constructed the feature encoder based on the ResNet series backbone networks. For the CIFAR-LT dataset, ResNet-32 [46] was used as the backbone, with the downsampling layers removed from the original design to accommodate the 32 × 32 input resolution. For the ImageNet-LT dataset, we used standard ResNet-50 [46] and ResNeXt-50 [47] as the encoder backbones. For the iNaturalist 2018 dataset, standard ResNet-50 was used as the encoder backbone. Following MoCo [23,24], the projection head was a two-layer MLP (input dimension 128 → hidden layer 256 → output 128), with ReLU activation applied to the hidden layer; the output layer normalized the features onto the unit hypersphere using L2 normalization. The prototype vector parameter matrix had dimensions $K \times 128$, aligned with the feature dimension to ensure the validity of similarity calculations.
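A sketch of this projector (a MoCo-style two-layer MLP; the dimensions default to those stated above and can be adapted to the backbone's output width):

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Two-layer MLP projector with L2-normalized output (MoCo-style)."""

    def __init__(self, in_dim=128, hidden_dim=256, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # Project, then normalize onto the unit hypersphere.
        return F.normalize(self.net(x), dim=1)
```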
We trained all Reb-SupCon models using the SGD optimizer, with a momentum coefficient of 0.9 and weight decay set to 5 × 10−4. The learning rate was decayed from 0.02 to 0 using a cosine scheduler. Contrastive learning typically requires a long training time to converge: MoCo [23,24] and SwAV [43] train for 800 epochs to reach convergence, and Supervised Contrastive Learning [14] trains for 350 epochs of feature learning plus an additional 350 epochs of classifier learning. We trained Reb-SupCon for 400 epochs with a batch size of 128 on 4 GPUs. The temperature parameter τ was set to 0.2; for CIFAR-100-LT, to strictly follow the settings of [23] for fair comparison, we used a lower temperature of 0.05.
For prototype-related parameters, the Sinkhorn algorithm iterated 3 times, the entropy regularization coefficient was set to ψ = 0.05 , and the threshold hyperparameter was set to δ = 1 . To balance the weights of the loss components, we set the coefficients as λ 1 = 0.5 , λ 2 = 0.3 , and λ 3 = 0.2 , with a gradient rebalancing strength of α = 0.5 and a focal weight of β = 0.1 . Furthermore, following [13], we adopted a differentiated augmentation strategy: the online branch used high-disturbance augmentation with RandAugment to uncover potential patterns in hard samples, while the momentum branch used low-disturbance augmentation with SimAugment to maintain the historical consistency of the negative sample queue. All images were normalized, with the mean and standard deviation parameters consistent with the original settings of the corresponding datasets.

4.3. Comparison with State-of-the-Art Methods

We evaluated Reb-SupCon against several state-of-the-art long-tailed recognition approaches on three benchmark datasets.
Results on CIFAR-100-LT. Table 3 presents the experimental results of Reb-SupCon on CIFAR-100-LT with $IF$ values of 100, 50, and 10. Reb-SupCon demonstrates consistent improvements across all imbalance scenarios, achieving significantly higher Top-1 accuracy than the CE baseline. Compared to the original SupCon method, Reb-SupCon shows accuracy gains of 6.9% and 4.5% under $IF = 100$ and $IF = 50$, respectively, validating the effectiveness of integrating rebalancing strategies with contrastive learning. In the near-balanced scenario, however, the improvement shrinks to 0.4%, suggesting that while our method specializes in extreme imbalance, conventional contrastive learning remains sufficiently effective under mild imbalance, with diminishing marginal returns from rebalancing. Compared to BCL [25], Reb-SupCon still achieves competitive results. Furthermore, the improvement brought by Balanced Softmax (BS) is limited, indicating that a static balancing strategy lacks adaptability to changes in the data distribution. In contrast, the dynamic gradient adjustment mechanism proposed in this paper adapts to varying degrees of class distribution skew, validating its effectiveness.
Results on ImageNet-LT. As shown in Table 4, the results on this large-scale dataset align with the observations from CIFAR-100-LT. Under RandAugment [48] with 400-epoch training, Reb-SupCon achieves 56.9% Top-1 accuracy, surpassing the baseline and remaining highly competitive with state-of-the-art methods. Crucially, it exhibits a 1.5% accuracy gain on tail classes and a 1.0% improvement on medium classes over BS, while preserving comparable head class performance. This demonstrates that Reb-SupCon significantly mitigates the optimization suppression of tail class samples while maintaining the representational ability of head classes, achieving globally balanced optimization. Additionally, as shown in Figure 4, Reb-SupCon further increases the overall accuracy to 59.1% with the ResNeXt-50 backbone, validating the method's generalization to more complex backbone networks. The consistent improvements across architectures indicate that our core innovations, dynamic gradient awareness and prototype semantic clustering, are architecture-agnostic and universally improve feature discriminability for long-tailed data.
Table 3. Top-1 accuracy on CIFAR-100-LT.

| Method | IF = 100 | IF = 50 | IF = 10 |
|---|---|---|---|
| Cross Entropy (CE) | 38.6 | 44.0 | 56.4 |
| CE-DRW | 41.1 | 45.6 | 57.9 |
| LDAM-DRW [9] | 41.7 | 47.9 | 57.3 |
| BBN [30] | 42.6 | 47.1 | 59.2 |
| CMO [27] | 43.9 | 48.3 | 59.5 |
| MoCo v2 [24] | 44.6 | 50.2 | 63.1 |
| SupCon [14] | 45.8 | 52.0 | 64.4 |
| Hybrid-SC [31] | 46.7 | 51.8 | 63.0 |
| ResLT [49] | 48.2 | 52.7 | 62.0 |
| BCL [25] | 51.9 | 56.4 | 64.6 |
| SBCL [50] | 44.9 | 48.7 | 57.9 |
| CC-SAM [51] | 49.2 | 51.9 | 62.0 |
| GLC-E [52] | 47.9 | 52.4 | 62.2 |
| BS † [53] | 50.8 | 54.2 | 63.0 |
| ProCo † [26] | 52.6 | 57.0 | 65.0 |
| Reb-SupCon (Ours) † | 52.7 | 56.5 | 64.8 |

† indicates that the model was trained with RandAugment [48] for 400 epochs.
Table 4. Top-1 accuracy on ImageNet-LT.

| Method | All | Many | Medium | Few |
|---|---|---|---|---|
| CE | 41.6 | 64.0 | 33.8 | 5.8 |
| Focal Loss [54] | 43.7 | 64.3 | 37.1 | 8.2 |
| LWS [55] | 49.9 | 60.2 | 47.2 | 30.3 |
| LADE [56] | 51.9 | 62.3 | 49.3 | 31.2 |
| SBCL [50] | 52.5 | – | – | – |
| TSC [10] | 52.4 | 63.5 | 49.7 | 30.4 |
| CC-SAM [51] | 54.4 | – | – | – |
| GLC-E [52] | 53.6 | – | – | – |
| DSCL [36] | 57.5 | 68.3 | 54.9 | 35.2 |
| BS † [53] | 55.4 | 65.8 | 53.2 | 34.1 |
| ProCo † [26] | 57.5 | – | – | – |
| Reb-SupCon (Ours) † | 56.9 | 64.9 | 54.2 | 35.6 |

† indicates that the model was trained with RandAugment [48] for 400 epochs.
Results on iNaturalist 2018. As presented in Table 5, using the ResNet-50 backbone without knowledge distillation, Reb-SupCon demonstrates remarkable advantages in extreme long-tailed scenarios. Compared to the baseline method BS, Reb-SupCon achieves a 3.0% improvement in overall performance, while outperforming the recently proposed DSCL [36] by 1.0%. Fine-grained analysis reveals that Reb-SupCon attains 73.2% and 72.5% accuracy for medium and tail classes, respectively, representing improvements of 0.3% and 2.2% over DSCL—this particularly highlights its superior capability in learning tail class features. Notably, although SHIKE [41] shows marginally higher aggregate accuracy, it relies on complex multi-expert architectures and knowledge distillation strategies. In contrast, Reb-SupCon achieves lightweight yet high performance solely through enhanced supervised contrastive loss, further validating our method’s applicability and efficiency advantages for large-scale open datasets.

4.4. Ablation Study

To validate the effectiveness of the rebalancing factor and the aggregation–separation loss for feature space optimization in Reb-SupCon, we conducted systematic ablation experiments on the CIFAR-100-LT dataset. The experimental results in Table 6 demonstrate that both components contribute significantly to performance improvement. Quantitative analysis shows that with an imbalance factor of 100, introducing the rebalancing factor alone improves Top-1 accuracy by 4.4%, confirming its effectiveness in mitigating model optimization bias. Further incorporating the aggregation–separation module leads to additional performance gains, with the approach ultimately achieving a 6.9% Top-1 accuracy improvement over the baseline, which fully validates the importance of explicit global class distribution modeling for long-tailed learning.
To better visualize the improvements, we employed multi-dimensional analysis methods. For gradient distributions, Figure 2 compares the average gradient magnitude distributions on ImageNet-LT before and after applying the rebalancing factor, clearly showing that our gradient balancing strategy significantly reduces the disparity between head and tail classes. For feature space analysis, Figure 5 presents t-SNE visualizations on CIFAR-100-LT, revealing deeper insights into feature distributions. The visualization experiment selected 10 representative subclasses (evenly sampled by frequency across head, medium, and tail classes) and projected their 1024-dimensional feature vectors into 3-dimensional space. Comparative analysis shows the following: (1) the baseline model exhibits obvious feature overlap at head–tail class boundaries, indicating insufficient feature learning for tail classes; (2) our method demonstrates superior feature distribution characteristics, achieving local manifold compactness through prototype clustering while maintaining global equilibrium via optimal transport planning, effectively preserving head class discriminability while recovering the underlying manifold structures of tail classes.

4.5. Hyperparameter Sensitivity Analysis

Rebalancing Strength α and Focal Weight β. We conducted systematic experiments on CIFAR-100-LT with $IF \in \{100, 50, 10\}$ using a ResNet-32 backbone while fixing all other parameters. Table 7 presents the Top-1 and Class-Avg accuracy for combinations of $\alpha \in \{0.1, 0.5, 1.0\}$ and $\beta \in \{0.05, 0.1, 0.5\}$. The results show that the combination of $\alpha = 0.5$ and $\beta = 0.1$ achieves optimal performance. An excessively high $\alpha = 1.0$ leads to accuracy degradation, while increasing $\beta$ to 0.5 significantly deteriorates tail class performance, indicating that over-reliance on class-frequency weighting may compromise gradient stability.
Prototype Count K and Loss Weights λ. As shown in Table 8, we evaluated the performance under different combinations of $K$ and $\lambda$, using the ResNet-50 backbone with the other parameters fixed. The experimental results show that the model achieves optimal performance at $K = 3000$, with overall accuracy improving by 1.8% compared to $K = 800$ and tail class accuracy increasing by 1.7%. Under this setting, sufficient prototypes effectively cover all 1000 classes in ImageNet-LT, avoiding the prototype assignment conflicts observed at $K = 800$, while $K = 5000$ leads to performance saturation and increased computational cost, indicating that an excessive $K$ introduces redundant prototypes and reduces optimization efficiency.

For the loss weights $\lambda$, the model achieves the best balance between inter-class separation and intra-class compactness when the aggregation loss weight $\lambda_1 = 0.5$, the separation loss weight $\lambda_2 = 0.3$, and the alignment loss weight $\lambda_3 = 0.2$. Increasing $\lambda_1$ to 0.6 (emphasizing intra-class aggregation) weakens tail class discriminability due to over-constrained intra-class variance, leading to decreased tail class performance. Conversely, raising $\lambda_2$ to 0.4 (strengthening inter-class separation) reduces the head class accuracy by 0.4%, suggesting that excessive inter-class loss may compromise the representation stability of head classes. Furthermore, increasing the alignment loss weight $\lambda_3$ to 0.3 causes slight performance degradation, indicating that excessive alignment constraints may suppress the adaptive optimization capability of prototype clustering.

In summary, the combination of $K = 3000$ and $\lambda = (0.5, 0.3, 0.2)$ achieves coordinated optimization of global and local features in long-tailed scenarios through balanced prototype granularity and loss weight allocation, providing a robust and efficient learning paradigm for large-scale imbalanced data.

5. Conclusions

To address the optimization bias in Supervised Contrastive Learning for long-tailed image recognition, this paper proposes a Rebalancing Supervised Contrastive Learning (Reb-SupCon) method that integrates gradient reweighting and prototype calibration. Through theoretical analysis, we first reveal the fundamental causes of performance degradation in conventional methods under long-tailed data. Subsequently, we design a dynamic gradient adjustment mechanism and global feature distribution constraints to achieve coordinated optimization of head and tail class representations. Visualization results of features align with experimental expectations, while comprehensive evaluations on multiple benchmark datasets demonstrate that our method achieves state-of-the-art performance.
Limitations. Although our method shows progress, it comes with some limitations. The rebalancing factor enhances tail class discriminability, but slightly reduces head class performance, reflecting long-tail learning’s inherent trade-off. At the same time, while the prototype calibration mechanism is effective, it introduces additional computational overhead and increases training time, limiting its applicability in resource-constrained environments. Future work will focus on exploring lightweight prototype allocation algorithms and adaptive prototype prediction mechanisms to further optimize feature learning for long-tailed data.

Author Contributions

Conceptualization, Methodology, J.L. (Jiahui Lv) and S.L.; Data Collection, J.L. (Jiahui Lv); Model Building, J.L. (Jiahui Lv) and J.L. (Jun Lei); Experiment, Data Analysis, and Writing—Original Draft Preparation, J.L. (Jiahui Lv) and C.C.; Resources, J.Z.; Writing—Review and Editing, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Laboratory of Big Data and Decision Making of the National University of Defense Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and dataset will be finalized and made publicly available online upon acceptance of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  2. Zhang, J.; Zhang, R.; Xu, L.; Lu, X.; Yu, Y.; Xu, M.; Zhao, H. Fastersal: Robust and real-time single-stream architecture for rgb-d salient object detection. IEEE Trans. Multimed. 2024, 27, 2477–2488. [Google Scholar] [CrossRef]
  3. Khadraoui, A.; Zemmouri, E. Pyramid scene parsing network for driver distraction classification. In Proceedings of the International Conference on Artificial Intelligence and Smart Environment, Errachidia, Morocco, 23–25 November 2023; Springer Nature: Cham, Switzerland, 2023; pp. 189–194. [Google Scholar]
  4. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
  5. Dong, X.; Liu, L.; Musial, K.; Gabrys, B. Nats-bench: Benchmarking nas algorithms for architecture topology and size. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3634–3646. [Google Scholar] [CrossRef] [PubMed]
  6. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  7. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part v 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  8. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2537–2546. [Google Scholar]
  9. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 2019, 32, 1567–1578. [Google Scholar]
  10. Li, T.; Cao, P.; Yuan, Y.; Fan, L.; Yang, Y.; Feris, R.S.; Indyk, P.; Katabi, D. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6918–6928. [Google Scholar]
  11. Mojahed, A.J.; Moattar, M.H.; Ghaffari, H. Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems. Big Data Cogn. Comput. 2024, 8, 109. [Google Scholar] [CrossRef]
  12. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11662–11671. [Google Scholar]
  13. Cui, J.; Zhong, Z.; Liu, S.; Yu, B.; Jia, J. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 715–724. [Google Scholar]
  14. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  15. Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 143, 110415. [Google Scholar] [CrossRef]
  16. Miao, W.; Pang, G.; Bai, X.; Li, T.; Zheng, J. Out-of-distribution detection in long-tailed recognition with calibrated outlier class learning. AAAI Conf. Artif. Intell. 2024, 38, 4216–4224. [Google Scholar] [CrossRef]
  17. Lin, D. Probability guided loss for long-tailed multi-label image classification. AAAI Conf. Artif. Intell. 2023, 37, 1577–1585. [Google Scholar] [CrossRef]
  18. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. arXiv 2019, arXiv:1910.09217. [Google Scholar]
  19. Zhang, H.; Zhu, L.; Wang, X.; Yang, Y. Divide and retain: A dual-phase modeling for long-tailed visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13538–13549. [Google Scholar] [CrossRef]
  20. Ye, C.; Tsuchida, R.; Petersson, L.; Barnes, N. Label shift estimation for class-imbalance problem: A bayesian approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1073–1082. [Google Scholar]
  21. Rani, V.; Nabi, S.T.; Kumar, M.; Mittal, A.; Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng. 2023, 30, 2761–2775. [Google Scholar] [CrossRef]
  22. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071. [Google Scholar] [CrossRef]
  23. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
  24. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  25. Zhu, J.; Wang, Z.; Chen, J.; Chen, Y.P.P.; Jiang, Y.G. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6908–6917. [Google Scholar]
  26. Du, C.; Wang, Y.; Song, S.; Huang, G. Probabilistic contrastive learning for long-tailed visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5890–5904. [Google Scholar] [CrossRef]
  27. Park, S.; Hong, Y.; Heo, B.; Yun, S.; Choi, J.Y. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6887–6896. [Google Scholar]
  28. Li, J.; Yang, Z.; Hu, L.; Liu, J.; Tao, D. CRmix: A regularization by clipping images and replacing mixed samples for imbalanced classification. Digit. Signal Process. 2023, 135, 103951. [Google Scholar] [CrossRef]
  29. Chen, Y.; Hong, Z.; Yang, X. Cost-sensitive online adaptive kernel learning for large-scale imbalanced classification. IEEE Trans. Knowl. Data Eng. 2023, 35, 10554–10568. [Google Scholar] [CrossRef]
  30. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.M. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9719–9728. [Google Scholar]
  31. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 943–952. [Google Scholar]
  32. Kang, B.; Li, Y.; Xie, S.; Yuan, Z.; Feng, J. Exploring balanced feature spaces for representation learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  33. He, X.; Fu, S.; Ding, X.; Cao, Y.; Wang, H. Uniformly distributed category prototype-guided vision-language framework for long-tail recognition. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October 2023; pp. 5027–5037. [Google Scholar]
  34. Lin, X.; Wang, S.; Cai, R.; Liu, Y.; Fu, Y.; Yu, Z.; Tang, W.; Kot, A. Suppress and rebalance: Towards generalized multi-modal face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 211–221. [Google Scholar]
  35. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  36. Xuan, S.; Zhang, S. Decoupled contrastive learning for long-tailed recognition. AAAI Conf. Artif. Intell. 2024, 38, 6396–6403. [Google Scholar] [CrossRef]
  37. Baik, J.S.; Yoon, I.Y.; Kim, K.H.; Choi, J.W. Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 160–177. [Google Scholar]
  38. Zhu, L.; Yang, Y. Inflated episodic memory with region self-attention for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4344–4353. [Google Scholar]
  39. Wei, X.-S.; Xu, S.-L.; Chen, H.; Xiao, L.; Peng, Y. Prototype-based classifier learning for long-tailed visual recognition. Sci. China Inf. Sci. 2022, 65, 160105. [Google Scholar] [CrossRef]
  40. Guo, C.; Chen, W.; Huang, A.; Zhao, T. Prototype Alignment with Dedicated Experts for Test-Agnostic Long-Tailed Recognition. IEEE Trans. Multimed. 2024, 27, 455–465. [Google Scholar] [CrossRef]
  41. Jin, Y.; Li, M.; Lu, Y.; Cheung, Y.M.; Wang, H. Long-tailed visual recognition via self-heterogeneous integration with knowledge excavation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23695–23704. [Google Scholar]
  42. Xue, W.; Zhang, L.; Mou, X.; Bovik, A.C. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Trans. Image Process. 2013, 23, 684–695. [Google Scholar] [CrossRef] [PubMed]
  43. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924. [Google Scholar]
  44. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  45. Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; Belongie, S. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8769–8778. [Google Scholar]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  47. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  48. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  49. Cui, J.; Liu, S.; Tian, Z.; Zhong, Z.; Jia, J. Reslt: Residual learning for long-tailed recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3695–3706. [Google Scholar] [CrossRef]
  50. Hou, C.; Zhang, J.; Wang, H.; Zhou, T. Subclass-balancing contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 5395–5407. [Google Scholar]
  51. Zhou, Z.; Li, L.; Zhao, P.; Heng, P.A.; Gong, W. Class-conditional sharpness-aware minimization for deep long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3499–3509. [Google Scholar]
  52. Li, M.; Cheung, Y.M.; Lu, Y.; Hu, Z.; Lan, W.; Huang, H. Adjusting logit in Gaussian form for long-tailed visual recognition. IEEE Trans. Artif. Intell. 2024, 5, 5026–5039. [Google Scholar] [CrossRef]
  53. Vu, D.Q.; Thu, M.T.H. Smooth Balance Softmax for Long-Tailed Image Classification. In Proceedings of the International Conference on Advances in Information and Communication Technology, Thai Nguyen, Vietnam, 16–17 November 2024; Springer Nature: Cham, Switzerland, 2024; pp. 323–331. [Google Scholar]
  54. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  55. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196. [Google Scholar]
  56. Hong, Y.; Han, S.; Choi, K.; Seo, S.; Kim, B.; Chang, B. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6626–6636. [Google Scholar]
  57. Chou, H.P.; Chang, S.C.; Pan, J.Y.; Wei, W.; Juan, D.C. Remix: Rebalanced mixup. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 95–110. [Google Scholar]
  58. Wang, X.; Lian, L.; Miao, Z.; Liu, Z.; Yu, S.X. Long-tailed recognition by routing diverse distribution-aware experts. arXiv 2020, arXiv:2010.01809. [Google Scholar]
  59. Sharma, S.; Xian, Y.; Yu, N.; Singh, A. Learning prototype classifiers for long-tailed recognition. arXiv 2023, arXiv:2302.00491. [Google Scholar]
Figure 1. Description of long-tailed visual recognition task: (a) model training on a long-tailed dataset and (b) evaluation on a balanced test set, simulating real-world data distribution to assess model performance.
Figure 2. Rebalancing in Supervised Contrastive Learning (Reb-SupCon). We measured the average L2-norm of weight gradients in the final classifier layer on ImageNet-LT, with classes indexed by their image counts. While conventional SCL exhibits a steep decline in gradient norms from most to least frequent classes [14]—particularly showing a sharp drop for the top 200 classes—our Reb-SupCon demonstrates significantly better-balanced gradient norms across all categories.
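For readers who wish to reproduce this measurement, the following is a minimal PyTorch-style sketch of how per-class gradient norms can be collected; `model.fc`, `loader`, and `criterion` are hypothetical names standing in for the final linear classifier, the ImageNet-LT training loader, and the training loss, not identifiers from a released codebase.

```python
import torch

def classwise_grad_norms(model, loader, criterion, device="cuda"):
    """Average per-class L2-norm of the final classifier layer's weight gradients."""
    num_classes = model.fc.weight.shape[0]
    norm_sum = torch.zeros(num_classes, device=device)
    num_batches = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        model.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # model.fc.weight.grad has shape [num_classes, feat_dim];
        # each row is the gradient for one class's classifier weights.
        norm_sum += model.fc.weight.grad.norm(dim=1)
        num_batches += 1
    return (norm_sum / num_batches).cpu()
```

For the figure, the resulting per-class averages are then plotted with classes sorted by their training-image counts.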
Figure 3. The general structure of our Reb-SupCon. The network consists of a query branch and a key branch. First, the similarity calculation is adjusted using a rebalancing factor ω ( c ) to enhance the participation of tail class samples. Then, a set of optimizable prototypes is used to strengthen the discriminability of the feature space. As in the MoCo framework, the key branch (in gray) maintains a historical queue via momentum updates, with the red dashed arrows indicating the gradient backpropagation path for the trainable components.
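To make the role of the rebalancing factor concrete, below is a minimal sketch of a class-weighted supervised contrastive loss. The inverse-frequency form of ω(c) used here is an illustrative assumption for exposition, not the exact formulation of Reb-SupCon; `class_counts` is a per-class sample-count tensor.

```python
import torch
import torch.nn.functional as F

def reb_supcon_loss(feats, labels, class_counts, alpha=0.5, tau=0.2):
    """Supervised contrastive loss with an illustrative per-class reweighting."""
    feats = F.normalize(feats, dim=1)                      # L2-normalize embeddings
    omega = (class_counts.max() / class_counts) ** alpha   # assumed weight form, shape [C]
    sim = feats @ feats.t() / tau                          # pairwise similarities, [B, B]
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    w = omega[labels].unsqueeze(0)                         # weight each candidate by its class
    log_prob = sim - torch.log((w * sim.exp()).sum(dim=1, keepdim=True))
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)       # same-class (positive) mask
    pos.fill_diagonal_(False)
    # average log-likelihood over each anchor's positives
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    return -(pos_log_prob.sum(dim=1) / pos.sum(dim=1).clamp(min=1)).mean()
```

The intuition is that up-weighting tail-class terms in the contrastive partition function counteracts the head-class domination that drives the gradient imbalance shown in Figure 2.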
Figure 4. Comparison of Top-1 accuracy results on ImageNet-LT with ResNet-50 and ResNeXt-50 backbones.
Figure 5. t-SNE visualization comparison of feature space optimization before and after Reb-SupCon implementation.
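The visualization itself is standard; a minimal scikit-learn sketch is shown below, assuming `feats` is an [N, D] array of encoder embeddings extracted from the balanced test set and `labels` the corresponding class indices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, title):
    """Project embeddings to 2-D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab20")
    plt.title(title)
    plt.show()
```

Calling this once on features from the baseline encoder and once on features from the Reb-SupCon encoder yields the before/after comparison in the figure.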
Table 1. Basic information of the datasets.

Datasets           Number of Classes   Training/Test Samples   Imbalance Factor (IF)
CIFAR-LT           100                 10.8 K/10 K             {100, 50, 10}
ImageNet-LT        1000                115.8 K/50 K            256
iNaturalist 2018   8142                437.5 K/24.4 K          500
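The imbalance factor in Table 1 follows the usual convention: the ratio between the largest and smallest per-class training-sample counts, as in the sketch below (e.g., 1280/5 = 256 for ImageNet-LT).

```python
def imbalance_factor(class_counts):
    """Imbalance factor (IF): largest class size divided by smallest class size."""
    return max(class_counts) / min(class_counts)
```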
Table 2. Training configurations for each dataset.

Datasets                CIFAR-100-LT   ImageNet-LT            iNaturalist 2018
Backbone                ResNet-32      ResNet-50/ResNeXt-50   ResNet-50
Input resolution        32 × 32        224 × 224              224 × 224
Epochs                  400            400                    400
Batch size              128            256                    512
Initial learning rate   0.02           0.1                    0.1
Temperature τ           0.05           0.2                    0.2
Table 5. Top-1 accuracy on iNaturalist 2018.

Methods               All    Many   Medium   Few
CE                    61.0   73.9   63.5     55.5
LDAM-DRW [9]          66.1   -      -        -
BS [53]               70.0   70.0   70.2     69.9
Remix [57]            70.5   -      -        -
RIDE(3E) [58]         72.2   70.2   72.2     72.7
TSC [10]              69.7   72.6   70.6     67.8
PC [59]               70.6   71.6   70.6     70.2
SHIKE [41]            74.5   -      -        -
DSCL [36]             72.0   74.2   72.9     70.3
BS † [53]             71.8   -      -        -
Reb-SupCon (Ours) †   73.0   71.9   73.2     72.5
† indicates that the model was trained with RandAugment [48] for 400 epochs.
Table 6. Ablation study results for primary components of Reb-SupCon.

Imbalance Factor       100    50     10
SupCon                 45.8   52.0   64.4
+ Rebalancing factor   50.2   55.1   64.5
+ Prototype module     48.7   54.7   64.5
Reb-SupCon             52.7   56.5   64.8
Table 7. Ablation study on the rebalancing factor in Reb-SupCon.

α     β      Top-1 Accuracy   Class-Avg Accuracy
0.1   0.1    49.3             25.8
0.5   0.1    52.7             30.1
1.0   0.1    51.1             28.5
0.5   0.05   51.7             28.9
0.5   0.5    50.4             26.7
Table 8. Ablation study results for prototype-related parameters.

K      λ                 All    Many   Medium   Few
800    (0.5, 0.3, 0.2)   55.1   64.3   52.7     33.9
5000   (0.5, 0.3, 0.2)   56.5   64.7   53.9     35.1
3000   (0.5, 0.3, 0.2)   56.9   64.9   54.2     35.6
3000   (0.6, 0.2, 0.2)   56.2   65.1   53.6     34.8
3000   (0.4, 0.4, 0.2)   55.9   64.5   53.2     34.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
