Article

Saliency-Guided Local Semantic Mixing for Long-Tailed Image Classification

Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 107; https://doi.org/10.3390/make7030107
Submission received: 2 August 2025 / Revised: 5 September 2025 / Accepted: 18 September 2025 / Published: 22 September 2025

Abstract

In real-world visual recognition tasks, long-tailed distributions pose a widespread challenge, with extreme class imbalance severely limiting the representational learning capability of deep models. In practice, due to this imbalance, deep models often exhibit poor generalization performance on tail classes. To address this issue, data augmentation through the synthesis of new tail-class samples has become an effective method. One popular approach is CutMix, which explicitly mixes images from tail and other classes, constructing labels based on the ratio of the regions cropped from both images. However, region-based labels completely ignore the inherent semantic information of the augmented samples. To overcome this problem, we propose a saliency-guided local semantic mixing (LSM) method, which uses differentiable block decoupling and semantic-aware local mixing techniques. This method integrates head-class backgrounds while preserving the key discriminative features of tail classes and dynamically assigns labels to effectively augment tail-class samples. This results in efficient balancing of long-tailed data distributions and significant improvements in classification performance. The experimental validation shows that this method demonstrates significant advantages across three long-tailed benchmark datasets, improving classification accuracy by 5.0%, 7.3%, and 6.1%, respectively. Notably, the LSM framework is highly compatible, seamlessly integrating with existing classification models and providing significant performance gains, validating its broad applicability.

1. Introduction

In the field of computer vision, deep learning methods have achieved remarkable success in image classification tasks [1,2,3,4]. However, these accomplishments rely primarily on class-balanced datasets [2,5], whereas, in real-world applications, data often exhibit typical long-tailed distribution characteristics: a few categories possess abundant samples, while the majority of categories contain only limited instances [6], as shown in Figure 1. This imbalanced data distribution poses significant challenges for training deep neural networks, as models tend to be biased toward sample-rich categories and recognize sample-scarce categories poorly, severely limiting their reliability in open-world applications [7,8,9]. For instance, in medical imaging [10], common conditions such as pneumonia or diabetes-related retinal abnormalities are abundantly represented and are thus comparatively easy to detect and identify, whereas samples of rare pathologies such as tumors are extremely scarce. Deep learning models trained on such imbalanced medical imaging data may therefore cause serious misdiagnoses when deployed in real-world clinical settings.
To address the challenges posed by long-tailed visual data, resampling strategies have emerged as primary solutions at the data level [11,12]. Traditional methods, such as Random Over-Sampling (ROS) [13], exhibit notable limitations: simplistic replication operations fail to enhance semantic diversity while simultaneously inflating the training set size, increasing computational overhead, and potentially amplifying noise and labeling errors, thereby compromising model robustness. Conversely, Random Under-Sampling (RUS) [14] risks discarding critical discriminative features from high-frequency classes, undermining the model’s representational capacity. These simplistic “trade-off” strategies fail to resolve the core dilemma of long-tailed learning.
Subsequently proposed refined sampling mechanisms—such as decoupled frameworks relying on prior distributions [15], meta-learning sampling constrained by balanced validation sets [16,17,18,19], and multi-level resampling that may induce hierarchical conflicts [20]—still suffer from issues like high computational complexity and insufficient adaptability. Although these data resampling techniques have achieved some success in mitigating class imbalance, their fundamental limitation lies in the semantic boundaries of existing data. Whether through sample replication or probabilistic filtering, none transcend the original feature space of the data distribution. Such “limited adjustment” strategies struggle to generate discriminative novel semantic patterns, resulting in a bottleneck for improving the model’s representational capability for tail classes.
Figure 1. Illustration of long-tailed distribution and the concept of local semantic mixing. In real-world long-tailed datasets such as iNaturalist 2018 [21], there exists a significant disparity in sample quantities between head classes and tail classes. ROS [13] repeatedly generates images from tail classes with limited background variations. To address this, we propose a novel local semantic mixing approach that preserves the critical discriminative features of tail classes while fusing them with the background contexts of head classes, thereby synthesizing diverse tail-class images.
To address these fundamental constraints in data representation, recent investigations have shifted toward reconstructing feature spaces to overcome inherent data limitations. Within this paradigm, data augmentation techniques employing semantic mixing have emerged as particularly promising solutions [22,23,24,25]. The seminal work of MixUp [22] introduced linear interpolation of input pairs to synthesize training samples, whereas CutMix [23] advanced the field through its innovative region-level sample mixing methodology. Nevertheless, the application of these conventional global augmentation approaches to long-tailed image classification reveals three principal shortcomings: (1) the uniform allocation of mixing ratios inadequately safeguards discriminative features in tail classes, leading to the suppression of critical semantic information by dominant features from head classes. (2) Fixed mixing strategies lack the adaptability required to accommodate the complex inter-class distribution variations characteristic of long-tailed datasets. (3) Existing approaches relying on whole-image or rigid grid-based mixing fail to effectively model fine-grained local semantic relationships.
A further limitation stems from the labeling mechanism employed for mixed samples. The current implementations, including CutMix’s random selection of cropping regions and positions, determine label assignments purely based on the proportional area of source and target crops, completely disregarding the semantic significance of the cropped content. This uniform labeling scheme, which treats foreground and background regions equivalently, systematically introduces label noise during training, consequently compromising optimization stability and model performance.
To address these inherent challenges of long-tailed learning, we propose local semantic mixing (LSM), a novel data augmentation framework that simultaneously addresses class imbalance while preserving crucial local semantic structures. Our solution establishes a comprehensive processing pipeline consisting of four key stages: (1) feature decoupling, (2) semantic selection, (3) adaptive reconstruction, and (4) label refinement, which collectively achieve balanced data distribution while maintaining optimal feature discriminability. The main contributions are as follows:
  • We propose a differentiable local feature decoupling framework that maps input images into N independent semantic units through a designed parameterized patch operator, enabling pixel-level feature control. This breakthrough overcomes the destructive interference of traditional global mixing on fine-grained features.
  • We develop a local semantic mixing and dynamic label fusion strategy that obtains regional saliency weights through spatial alignment and patch averaging of gradient-weighted CAMs. This allows for semantic-aware blending of discriminative tail-class regions with head-class backgrounds while dynamically generating soft labels, effectively augmenting tail-class data.
  • Extensive experiments on three standard long-tailed benchmark datasets demonstrate that LSM achieves dual breakthroughs in both classification performance and training stability.

2. Related Works

2.1. Rebalancing Strategies

The long-tailed data distribution prevalent in real-world visual recognition tasks presents persistent challenges that have motivated extensive research. The current methodologies addressing this issue can be categorized into three research directions: class rebalancing, information augmentation, and module refinement [26]. Within class rebalancing, researchers have developed both resampling and reweighting techniques to mitigate distribution imbalances. Resampling methods, including oversampling of tail-class instances [14,27,28] and undersampling of head-class examples [29], operate directly on the data distribution to adjust class proportions. Meanwhile, reweighting approaches modify the learning objective through dynamic loss adjustment [8,30,31] or logit transformation [32,33], offering more nuanced solutions to distribution skewness.
While traditional methods often employ simple inverse frequency weighting (weighted softmax), recent advances have introduced more sophisticated balancing mechanisms. Notably, class-balanced loss [31] proposes the “effective number” concept using exponential decay functions, Balanced Softmax [17] incorporates label frequency directly into logit computations, and focal loss [34] shifts the focus to prediction confidence rather than class frequency. The field has further evolved toward data-driven adaptation, with Meta-Weight-Net [35] leveraging meta-learning to derive optimal weighting functions and BCL [36] implementing gradient-level balancing through class averaging operations. These diverse yet complementary approaches collectively enhance model performance on tail classes while preserving discriminability on head classes, representing significant progress in long-tailed visual recognition research.

2.2. Augmentation-Based Methods

Data augmentation is a suite of techniques aimed at enhancing the size and quality of a model’s training dataset by applying transformations or manipulations to the existing data [37,38]. Its purpose is to increase data diversity and thereby improve the model’s generalization ability. Conventional augmentation strategies in long-tailed learning focus on adapting classical methods to the challenges of class imbalance. Among classic augmentation methods for convolutional neural networks, DeVries et al. pioneered the Cutout [39] technique, which introduces feature occlusion by randomly removing square regions from images during training. Zhang et al. proposed MixUp [22], which linearly interpolates between two training samples, with the pixel fusion weights controlled by Beta distribution parameters; this successfully alleviates the model’s hypersensitivity to outlier samples and mitigates overfitting. Subsequently, CutMix [23] fills the region removed in Cutout with a patch from another training image: a corresponding area from the target image replaces a random rectangular region in the source image, generating new training samples, and the labels are adjusted dynamically according to the area ratio between the patched and original regions. Compared to traditional region dropping or noise filling methods [40], retaining relevant information through patching significantly enhances the effectiveness of model training. Following this, data augmentation techniques that mix multiple images into a single image have gained traction, with a series of studies improving CutMix by employing more sophisticated algorithmic strategies to select the geometric parameters of the mixed regions [24,25,41].
In long-tailed learning scenarios, inspired by the region fusion concept of CutMix [23], various diversified feature enhancement strategies have rapidly been developed. For example, the context-rich minority (CMO) framework [42] integrates high-frequency category images as semantic backgrounds with low-frequency samples, effectively strengthening the feature richness of tail categories. Zhong et al. [43] revealed through systematic empirical research that data mixing has a dual effect in suppressing prediction bias and enhancing feature representation capabilities, although it may interfere with classifier optimization. Building on this, the MiSLAS framework uses MixUp as the baseline method and innovatively adopts a stage-wise mixing strategy, employing MixUp during Stage-1 feature learning. However, these methods apply MixUp without adjustments, and there has been little research exploring suitable data augmentation techniques for long-tailed datasets. Recently, Remix [44] was employed to dynamically adjust the mixing ratio to specifically enhance the data representation of tail categories. Li et al. [45] introduced a novel regularization method, cRmix, based on patch-based MixUp. This method selectively cuts and pastes minority class patches between training images to enhance data diversity while constraining the patch positions to cover more critical regions, thus effectively expanding the minority class samples. Unlike the aforementioned methods, our approach takes the long-tailed data distribution characteristics into careful consideration. We focus not only on balancing sample quantity but also on retaining more discriminative tail-class features in image mixing by utilizing a saliency mask-based object localization technique, which further improves the model’s performance on long-tailed datasets.

3. Method

In this section, we present LSM, a novel saliency-guided local semantic mixing approach designed to enhance image mixing by preserving discriminative features of tail classes while assigning more semantically meaningful target scores. As shown in Figure 1, LSM generates balanced training samples that enable more comprehensive feature learning. The section first introduces necessary preliminaries (Section 3.1), followed by our dual-branch balanced sampling mechanism (Section 3.2) that guarantees tail-class participation. We then detail the core components: (1) feature decoupling and reconstruction for localized semantic mixing (Section 3.3), and (2) saliency-guided mask generation (Section 3.4) for semantic-aware image blending and label assignment (Section 3.5). The complete LSM pipeline is illustrated in Figure 2.

3.1. Preliminaries

Data augmentation has garnered significant attention across various visual tasks [11,13,24]. CutMix [23] is a widely adopted augmentation technique, recognized as both powerful and intuitive, and has proven effective on class-imbalanced datasets. By locally cropping and pasting regions between images and blending their corresponding labels according to the region proportions, CutMix ingeniously integrates the advantages of Cutout [39] (region dropout) and MixUp [22] (interpolation regularization) while effectively mitigating their respective limitations, such as information loss and visual blurring. Specifically, given a pair of training samples $(x_a, y_a)$ and $(x_b, y_b)$, CutMix combines them with a randomly generated rectangular binary mask $M \in \{0, 1\}^{H \times W}$, resulting in a new training sample $(\tilde{x}, \tilde{y})$, as expressed below:
$\tilde{x} = M \odot x_a + (1 - M) \odot x_b, \qquad \tilde{y} = \lambda y_a + (1 - \lambda) y_b$
In the above, M is a randomly generated binary mask that determines which regions are cut and filled, 1 denotes an all-ones mask of the same size, and the operator ⊙ denotes element-wise multiplication. Similar to MixUp [22], CutMix sets the target of the generated image to a linear combination of $y_a$ and $y_b$, where the combination ratio $\lambda \in \mathbb{R}$ is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$.
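For concreteness, the following is a minimal, self-contained PyTorch sketch of the CutMix operation defined above; it is our illustration rather than the original authors' implementation, and the function name and one-hot label handling are our own choices.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cutmix(x_a, y_a, x_b, y_b, alpha=1.0, num_classes=100):
    """Minimal CutMix sketch: paste a random rectangle from x_b into x_a.

    x_a, x_b: image tensors of shape (C, H, W); y_a, y_b: integer class labels.
    The mixed label weight lambda is the area fraction kept from x_a.
    """
    _, H, W = x_a.shape
    lam = np.random.beta(alpha, alpha)
    # Rectangle whose area is roughly (1 - lam) of the image, at a random center.
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)

    x_mix = x_a.clone()
    x_mix[:, y1:y2, x1:x2] = x_b[:, y1:y2, x1:x2]
    # Recompute lambda from the actual pasted area (clipping may shrink it).
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    y_mix = lam * F.one_hot(torch.tensor(y_a), num_classes).float() \
          + (1 - lam) * F.one_hot(torch.tensor(y_b), num_classes).float()
    return x_mix, y_mix
```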
However, when applied to long-tailed scenarios where tail classes suffer from severe sample scarcity, CutMix exhibits significant limitations [42,44]. Its randomly generated mask M lacks semantic awareness, making it highly susceptible to occluding discriminative features in tail-class samples; under extreme data imbalance, these critical features are almost invariably obscured. Concurrently, the static label assignment $\tilde{y} = \lambda y_a + (1 - \lambda) y_b$ fails to account for semantic importance: when discriminative features of tail-class samples are compromised, the mixed labels become severely misaligned with the visual content. Furthermore, conventional sampling leads to insufficient representation of tail-class samples in mini-batches, substantially limiting their participation in mixing operations.

3.2. Dual-Branch Balanced Sampling Mechanism

In long-tailed image classification tasks, the training set $D = \{(x_i, y_i)\}_{i=1}^{N}$ exhibits a significantly imbalanced class distribution, where the sample size of head classes $N_{head}$ far exceeds that of tail classes $N_{tail}$ [26]. Traditional data mixing methods such as MixUp and CutMix typically sample image pairs from the original data distribution P, in which the probability of drawing class c is proportional to its sample count $N_c$; this results in an extremely low participation probability $P_{tail}$ for tail classes, which further exacerbates the model’s neglect of them. To enhance the participation of tail classes in the mixing process, we propose a dual-branch balanced sampling framework, with the following core design:
First, as shown in Figure 2, following CMO [42], we construct a joint sampling space $S = P \cup Q$, where Q represents a reweighted tail-class distribution based on inverse class frequency. The original distribution P preserves the natural long-tailed characteristics, with sampling probability
$p(c) = \frac{N_c}{\sum_{k=1}^{K} N_k}, \quad c \in \{1, \dots, K\}$
The reweighted distribution Q adjusts the sampling probability for tail classes through inverse frequency weighting, defined as
$q(c) = \frac{w_c}{\sum_{k=1}^{K} w_k}, \qquad w_c = \frac{N_{\max}}{N_c} \cdot \mathbb{I}\left(N_c < \tau\right),$
where $N_{\max} = \max\{N_1, \dots, N_K\}$ denotes the maximum class sample size. Based on the characteristics of skewed distributions in statistics, the mean serves as an effective indicator for distinguishing primary and secondary modes [46]. Therefore, we set the tail-class threshold as $\tau = \frac{1}{K}\sum_{c=1}^{K} N_c$, where $\mathbb{I}(\cdot)$ is the indicator function that applies reweighting only to classes satisfying $N_c < \tau$.
Furthermore, to ensure tail classes dominate the mixing process, we enforce the following pairing constraint in each training batch $\mathcal{B}$:
$(x_b, y_b) \sim Q, \quad (x_a, y_a) \sim P, \quad \text{s.t.} \quad x_a, x_b \in \mathcal{B},$
This guarantees that each tail-class sample $x_b$ sampled from Q must be paired with at least one head-class sample $x_a$ sampled from P.
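As a rough illustration of how the two branches can be realized in practice, the sketch below builds per-sample weights for the natural distribution P and the inverse-frequency tail distribution Q defined above and wraps them in standard PyTorch samplers; the function and variable names are ours and are not taken from the paper's code.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def build_dual_branch_samplers(labels, num_classes):
    """Sketch of the dual-branch sampling distributions P and Q.

    labels: array of integer class labels for the training set.
    Returns samplers for the natural branch P and the tail-weighted branch Q.
    """
    labels = np.asarray(labels)
    class_counts = np.bincount(labels, minlength=num_classes).astype(float)

    # Branch P: natural long-tailed distribution, p(c) = N_c / sum_k N_k.
    p_class = class_counts / class_counts.sum()

    # Branch Q: inverse-frequency reweighting, applied only below the mean count tau.
    tau = class_counts.mean()
    w = (class_counts.max() / np.maximum(class_counts, 1)) * (class_counts < tau)
    q_class = w / w.sum()

    # Convert class-level probabilities to per-sample weights for the samplers.
    p_sample = p_class[labels] / np.maximum(class_counts[labels], 1)
    q_sample = q_class[labels] / np.maximum(class_counts[labels], 1)
    sampler_p = WeightedRandomSampler(p_sample, num_samples=len(labels), replacement=True)
    sampler_q = WeightedRandomSampler(q_sample, num_samples=len(labels), replacement=True)
    return sampler_p, sampler_q
```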

3.3. Feature Decoupling

To address the limitations of conventional mixing methods, where random rectangular region replacement may compromise discriminative feature preservation for tail classes, we perform patch-aware feature decoupling on sample pairs $(x_a, y_a) \sim P$ and $(x_b, y_b) \sim Q$ drawn from the joint space, as illustrated in Figure 2.
Given an input pair, we employ a differentiable splitting operator $\mathcal{F}_{\text{split}}: \mathbb{R}^{H \times W \times D} \rightarrow \mathbb{R}^{N \times p^2 \times D}$ to decouple local semantic features by partitioning the image into $N = \frac{H}{p} \times \frac{W}{p}$ non-overlapping patches. This operation is implemented via a parameterized convolutional layer prepended to the input:
$\mathcal{F}_{\text{split}}(x) = \mathrm{Conv}_K(x;\, p, p), \qquad K \in \mathbb{R}^{p \times p \times D \times N},$
where p denotes the patch size, controlling the granularity of local regions. The convolutional kernel K adopts identical stride and kernel size (both set to p) to ensure non-overlapping sampling. The output tensor with dimensions $N \times p^2 \times D$ represents a regular grid partition, where each patch $P_i \in \mathbb{R}^{p^2 \times D}$ corresponds to a $p \times p$ pixel region in the original image. This structured spatial decomposition establishes an efficient tensor operation foundation for subsequent semantic mixing and gradient propagation.
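A minimal sketch of the splitting operator is given below. We realize the non-overlapping partition with F.unfold (kernel and stride both p), which is functionally equivalent to the strided convolutional formulation above; the function name f_split and the exact tensor layout are our assumptions.

```python
import torch
import torch.nn.functional as F

def f_split(x, p):
    """Sketch of the patch-splitting operator F_split.

    x: (B, D, H, W) image batch; p: patch size. Returns (B, N, p*p, D) with
    N = (H/p) * (W/p), i.e., one row of p*p pixel vectors per patch.
    """
    B, D, H, W = x.shape
    patches = F.unfold(x, kernel_size=p, stride=p)      # (B, D*p*p, N)
    N = patches.shape[-1]
    patches = patches.view(B, D, p * p, N)              # separate channel / pixel dims
    return patches.permute(0, 3, 2, 1).contiguous()     # (B, N, p*p, D)
```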

3.4. Patch-Level Saliency-Guided Mask Generation

Traditional random mask generation methods often fail to preserve the discriminative features of tail classes due to their lack of semantic awareness. To address this limitation, we propose a saliency mask generation strategy guided by the gradient-weighted class activation map (Grad-CAM) [47] to enhance semantic consistency in mixed samples by focusing on critical regions of tail classes.

3.4.1. Gradient-Weighted Class Activation Map Generation

Given a tail-class sample $(x_b, y_b)$, we first extract feature maps using the last convolutional layer of a pretrained model $f_{pre}$:
$A_b = f_{pre}(x_b) \in \mathbb{R}^{C \times H \times W}$
Following Grad-CAM [47], we then compute the gradient of the target class logit $\hat{y}_b$ with respect to the feature map $A_b$, i.e., $\frac{\partial \hat{y}_b}{\partial A_b}$. Through Global Average Pooling (GAP), we aggregate the spatial gradients to obtain channel importance weights:
$\alpha_k(\hat{y}_b) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial \hat{y}_b}{\partial A_k(i,j)},$
where $A_k$ denotes the feature map of channel k of $A_b$. These weights are used to perform a channel-wise weighted summation of the feature maps, followed by a ReLU activation to filter negative responses, generating the final class activation map:
$CAM_b = \mathrm{ReLU}\left(\sum_{k=1}^{C} \alpha_k(\hat{y}_b)\, A_k\right) \in \mathbb{R}^{H \times W}$
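The following is a compact Grad-CAM sketch matching the two equations above, assuming a standard PyTorch classifier whose last convolutional block is accessible; it is an illustration under our own naming, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, feature_layer, x, target_class):
    """Minimal Grad-CAM sketch.

    model: pretrained classifier; feature_layer: its last convolutional block;
    x: input batch (B, 3, H, W); target_class: (B,) class indices.
    Returns a CAM of shape (B, H', W') at feature-map resolution.
    """
    feats = {}
    handle = feature_layer.register_forward_hook(
        lambda m, i, o: feats.update(out=o))            # capture A = f_pre(x)
    logits = model(x)
    handle.remove()

    A = feats["out"]                                     # (B, C, H', W')
    score = logits.gather(1, target_class.view(-1, 1)).sum()
    grads = torch.autograd.grad(score, A)[0]             # d y_hat / d A

    alpha = grads.mean(dim=(2, 3), keepdim=True)         # GAP over spatial dims
    cam = F.relu((alpha * A).sum(dim=1))                  # weighted sum + ReLU
    return cam
```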

3.4.2. Patch-Level Mask Generation

To precisely preserve tail-class critical regions during mixing, we propose a patch-level saliency mask generation method. The low-resolution CAM is first upsampled to the input resolution $H \times W$ and then divided into $N = \frac{H}{p} \times \frac{W}{p}$ non-overlapping patches (aligned with the feature decoupling module in Section 3.3). The average activation value of each patch i is computed as
$\mathrm{CAM}_b(i) = \frac{1}{p^2} \sum_{(u,v) \in \mathrm{Patch}_i} \mathrm{CAM}_b^{\mathrm{up}}(u, v)$
The preservation probability of each patch is determined by its normalized activation intensity scaled by a dynamic coefficient, generating the final mask $M_b$:
$P(M_i = 1) = \frac{\mathrm{CAM}_b(i)}{\max_j \mathrm{CAM}_b(j)} \cdot \beta_t$
Here, the normalized activations lie in [0, 1], so high-response regions are preserved with higher probability. The dynamic coefficient $\beta_t$ follows a cosine annealing schedule to balance training stability and mixing strength:
$\beta_t = \beta_{\min} + \frac{1}{2}\left(\beta_{\max} - \beta_{\min}\right)\left(1 + \cos(\pi t / T)\right),$
where t denotes the current training step and T the total number of steps, with $\beta_{\min} = 0.1$ and $\beta_{\max} = 0.9$. This schedule ensures stable training in the early stages while gradually strengthening knowledge transfer as training progresses.
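A possible implementation of the patch-level mask, including the cosine-annealed coefficient β_t, is sketched below; the upsampling and pooling choices (bilinear interpolation followed by average pooling) and all names are our assumptions consistent with the equations above.

```python
import math
import torch
import torch.nn.functional as F

def patch_saliency_mask(cam, H, W, p, step, total_steps,
                        beta_min=0.1, beta_max=0.9):
    """Sketch of the patch-level saliency mask M_b.

    cam: low-resolution Grad-CAM (B, h, w). Returns a binary mask of shape
    (B, N) over the N = (H/p)*(W/p) patches, sampled from the preservation
    probabilities.
    """
    # Upsample the CAM to input resolution and average it within each patch.
    cam_up = F.interpolate(cam.unsqueeze(1), size=(H, W),
                           mode="bilinear", align_corners=False)
    patch_cam = F.avg_pool2d(cam_up, kernel_size=p, stride=p)   # (B, 1, H/p, W/p)
    patch_cam = patch_cam.flatten(1)                             # (B, N)

    # Cosine-annealed mixing strength beta_t.
    beta_t = beta_min + 0.5 * (beta_max - beta_min) * (1 + math.cos(math.pi * step / total_steps))

    # Preservation probability: normalized activation scaled by beta_t.
    prob = patch_cam / patch_cam.amax(dim=1, keepdim=True).clamp(min=1e-6) * beta_t
    return torch.bernoulli(prob)                                 # binary patch mask M_b
```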

3.5. Semantic-Aware Mixed Image Generation and Label Assignment

In Section 3.3, we generated $N = \frac{H}{p} \times \frac{W}{p}$ non-overlapping local patches, and, in Section 3.4, we produced the semantic-aware mask M. As illustrated in Figure 3, the patch-wise features of head- and tail-class samples are then fused through mask-weighted blending:
$x_{\text{mixed}} = \mathcal{F}_{\text{merge}}\left((1 - M) \odot \mathcal{F}_{\text{split}}(x_a) + M \odot \mathcal{F}_{\text{split}}(x_b)\right),$
where ⊙ denotes element-wise multiplication. The inverse patching operator $\mathcal{F}_{\text{merge}}: \mathbb{R}^{N \times p^2 \times D} \rightarrow \mathbb{R}^{H \times W \times D}$ reconstructs the processed patch features into a complete image:
$\mathcal{F}_{\text{merge}}\left(\{P_i\}_{i=1}^{N}\right) = \mathrm{PixelShuffle}\left(P_1 \oplus P_2 \oplus \cdots \oplus P_N\right),$
where ⊕ represents spatial concatenation. PixelShuffle [48] is a spatial reorganization operation that redistributes channel-dimension information into spatial dimensions, commonly employed in upsampling tasks such as image super-resolution. The detailed reconstruction process is as follows:
(1) Reshape the patch features $\mathcal{F}_{\text{split}}(x)$:
$\mathbb{R}^{B \times N \times p^2 \times D} \xrightarrow{\text{reshape}} \mathbb{R}^{B \times \frac{H}{p} \times \frac{W}{p} \times p \times p \times D}$
(2) Permute dimensions to group the $p^2$ values with the spatial dimensions:
$\mathbb{R}^{B \times \frac{H}{p} \times \frac{W}{p} \times p \times p \times D} \xrightarrow{\text{permute}} \mathbb{R}^{B \times \frac{H}{p} \times \frac{W}{p} \times p^2 \times D}$
(3) Apply PixelShuffle with upsampling factor $r = p$, mapping the $p^2$ channel values to spatial positions:
$Y = \mathrm{PixelShuffle}(X) \in \mathbb{R}^{B \times H \times W \times D}$
The key step arranges every group of $p^2$ channel values into a $p \times p$ local block in spatial order, reassembling the patches into a lossless complete image.
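The reconstruction steps above can be implemented directly with torch's pixel_shuffle, as in the sketch below; paired with the earlier f_split sketch, the round trip f_merge(f_split(x), H, W, p) recovers the input exactly. Names and tensor layouts are our assumptions.

```python
import torch

def f_merge(patches, H, W, p):
    """Sketch of the inverse operator F_merge using PixelShuffle.

    patches: (B, N, p*p, D) as produced by f_split; returns (B, D, H, W).
    Every group of p*p values is redistributed back into a p x p spatial block.
    """
    B, N, _, D = patches.shape
    h, w = H // p, W // p                                # patch grid size
    # (B, N, p*p, D) -> (B, h, w, p*p, D) -> (B, D*p*p, h, w)
    x = patches.view(B, h, w, p * p, D)
    x = x.permute(0, 4, 3, 1, 2).reshape(B, D * p * p, h, w)
    # PixelShuffle with upscale factor p maps the p*p channel groups to space.
    return torch.nn.functional.pixel_shuffle(x, upscale_factor=p)   # (B, D, H, W)
```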
Finally, the class activation maps are multiplied with the mask matrix to generate semantically informed soft labels for the mixed image:
$y_{\text{mixed}} = (1 - M) \odot \mathrm{CAM}_a + M \odot \mathrm{CAM}_b$
Our proposed method not only effectively incorporates category semantics but also guides the model to focus on inter-class similarities through soft labels during training, particularly enhancing feature sharing between head and tail classes.
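The label rule above mixes patch-level activation maps rather than one-hot labels directly; one plausible reading, sketched below under our own interpretation, is that the weight of each source label is proportional to the saliency mass that survives the mix. This is an assumption made for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_label(mask, patch_cam_a, patch_cam_b, y_a, y_b, num_classes):
    """Soft-label sketch: label weights follow the retained saliency mass.

    mask, patch_cam_a, patch_cam_b: (B, N) patch-level mask and CAM averages;
    y_a, y_b: (B,) integer labels of the head- and tail-class samples.
    """
    w_a = ((1 - mask) * patch_cam_a).sum(dim=1)          # saliency kept from x_a
    w_b = (mask * patch_cam_b).sum(dim=1)                # saliency kept from x_b
    lam = w_b / (w_a + w_b + 1e-6)                        # tail-class label weight

    one_hot_a = F.one_hot(y_a, num_classes).float()
    one_hot_b = F.one_hot(y_b, num_classes).float()
    return (1 - lam).unsqueeze(1) * one_hot_a + lam.unsqueeze(1) * one_hot_b
```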

4. Experiments

4.1. Datasets

We assess the performance of our method on three well-established long-tailed recognition benchmarks: CIFAR-100-LT [31], ImageNet-LT [6], and iNaturalist 2018 [21], as detailed in Table 1. Unlike the balanced CIFAR-100 and ImageNet-2012 datasets, CIFAR-100-LT and ImageNet-LT have class distributions that are artificially skewed to reflect long-tailed imbalance. Following the standard evaluation protocol for long-tailed recognition [31], models are trained on the original long-tailed training data and evaluated on uniformly sampled test sets. This evaluation strictly adheres to the official train–test splits of each dataset, ensuring consistency with previous work in the field.
CIFAR-100-LT. CIFAR-LT is a benchmark created from the original CIFAR dataset, which consists of 50,000 training images and 10,000 test images. CIFAR-10 is the coarse-grained version with 10 categories, while CIFAR-100 contains 100 fine-grained categories with correspondingly fewer samples per category: 500 training and 100 test samples per class. Building on these datasets, Cui et al. [31] created long-tailed versions by introducing an imbalance factor (IF) that controls the level of imbalance: the training set is downsampled in an exponential decay manner, while the test set remains unchanged. The imbalance factor is defined as $IF = N_{max} / N_{min}$, where $N_{max}$ and $N_{min}$ denote the number of training samples in the most and least frequent classes, respectively. In our experiments, we evaluate with $IF \in \{100, 50, 10\}$ to conduct a thorough assessment. The classes are categorized into head, medium, and tail groups based on sample counts, with classification thresholds set at 100 and 20.
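For reference, the exponential-decay construction of CIFAR-100-LT can be reproduced in a few lines; the sketch below follows the commonly used implementation of [31], and the function name is ours.

```python
def longtail_counts(n_max=500, num_classes=100, imbalance_factor=100):
    """Per-class training counts for CIFAR-100-LT under exponential decay.

    Class c keeps n_max * (1/IF)^(c / (num_classes - 1)) samples, so the first
    class keeps n_max samples and the last keeps n_max / IF.
    """
    return [int(n_max * (1.0 / imbalance_factor) ** (c / (num_classes - 1)))
            for c in range(num_classes)]

# Example: IF = 100 gives 500 samples for the head class and 5 for the rarest.
counts = longtail_counts()
assert counts[0] == 500 and counts[-1] == 5
```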
ImageNet-LT. As a long-tailed version of the ImageNet dataset [1], ImageNet-LT was created by sampling a subset of the original data according to the Pareto distribution with a power value of α = 6 [6]. This results in a dataset containing 115.8 K images distributed across 1000 categories, with the number of samples per class varying significantly, ranging from as few as 5 to as many as 1280. This construction introduces a substantial class imbalance, making it an ideal benchmark for testing algorithms in long-tailed recognition tasks.
iNaturalist 2018. iNaturalist 2018 [21] is a large-scale dataset designed for species classification, characterized by an extreme label imbalance. It contains 437.5 K images spread across 8142 categories. In addition to the significant class imbalance, the dataset poses substantial fine-grained recognition challenges as there are often considerable visual variations between individuals or different variants of the same species. These complexities make the classification task particularly difficult. We choose iNaturalist 2018 as our experimental testbed because it closely mirrors real-world scenarios in species classification, where such imbalances and visual variations are common, thereby providing a more authentic and challenging environment for evaluating recognition models.

4.2. Experimental Setup

Evaluation Metrics. For each dataset, we follow the problem definition of long-tailed classification tasks. The model is trained on long-tailed imbalanced datasets and evaluated on the original balanced validation set. The Top-1 accuracy, which is most commonly used in long-tailed classification tasks, is employed to assess the proposed method. Additionally, in consideration of the characteristics of long-tailed data, three evaluation intervals are further defined: head classes (comprising classes with more than 100 training samples), medium classes (including classes with sample counts between 20 and 100), and tail classes (comprising classes with fewer than 20 samples).
Basic Experimental Setup. To ensure a fair comparison with prior works, all experiments are conducted with the same basic configurations as those in [6]. The models are implemented in PyTorch and trained on 4 GeForce RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). For CIFAR-100-LT, ResNet-32 is chosen as the backbone network, paired with a linear classifier, and trained for 200 epochs with a cosine learning-rate schedule starting from 0.01 and gradually decaying to 0. For ImageNet-LT and iNaturalist 2018, we use standard ResNet-50 [49] as the encoder backbone. For ImageNet-LT, ResNet-50 is trained for 100 epochs with an initial learning rate of 0.1, decayed by a factor of 0.1 at the 60th and 80th epochs. All experiments use stochastic gradient descent (SGD) with a momentum of 0.9. Additionally, the block size p is set to 4 for CIFAR-100-LT and to 28 for ImageNet-LT and iNaturalist 2018. The specific settings are shown in Table 2.

4.3. Comparison with State-of-the-Art Methods

To comprehensively evaluate the performance of LSM, we compare it against several state-of-the-art long-tailed recognition approaches on three benchmark datasets.
Results on CIFAR-100-LT. As shown in Table 3, we conduct experiments on the CIFAR-100-LT dataset with different I F (10, 50, and 100) to evaluate the overall classification accuracy under various imbalance settings. The experimental results indicate that LSM combined with CE loss performs excellently across all imbalance scenarios, with performance improvements of 5.0%, 3.9%, and 2.5% compared to baseline methods, respectively. To further validate the generalization ability of LSM, our experiment compares LSM with class-reweighted loss (LDAM [7]) and two-stage algorithms (DRW [7]). The results show that LSM significantly improves the model’s performance, especially when handling long-tailed datasets, effectively enhancing the model’s recognition ability and demonstrating its broad applicability across different algorithm frameworks.
Results on ImageNet-LT. As shown in Table 4, LSM is compared with the main technical methods in the current long-tailed learning field on the ImageNet-LT dataset to evaluate its generalization ability and classification accuracy on large-scale datasets. The experimental results similarly show that LSM significantly improves performance when handling long-tailed classification problems. Specifically, LSM combined with CE loss performs excellently, with an overall performance improvement of 7.3% compared to baseline methods. This confirms that LSM can effectively address the class imbalance issue in long-tailed datasets while remaining competitive in terms of classification accuracy with existing methods.
Similarly, to further validate the generalization ability of LSM, it is compared with the class-reweighted loss (LDAM [7]) and the two-stage algorithm (DRW [7]). The experimental results indicate that LSM not only has theoretical advantages but also provides significant performance improvements in practical applications. It helps the model to more accurately recognize minority class samples without severely compromising the performance of the head classes, leading to a notable increase in classification accuracy.
Results on iNaturalist 2018. As shown in Table 5, when tested on the naturally skewed iNaturalist 2018 dataset, the application of LSM to the simple training scheme of CE-DRW outperforms most of the current state-of-the-art methods. Specifically, on this dataset, similar to the results observed on ImageNet-LT, LSM enhances the performance of the cross-entropy loss (CE) by an impressive 6.1% (improving from 61.0% to 67.1%). This improvement highlights the robustness of LSM in handling long-tailed distributions and balancing class representations effectively. Additionally, LSM demonstrates significant gains in the performance of the few-shot classes, where traditional methods often struggle to generalize. Furthermore, the integration of LSM with the RIDE framework leads to a new state-of-the-art performance, pushing the boundaries of what is achievable in long-tailed recognition tasks.

4.4. Discussion

To better validate the effectiveness of LSM, we compare the classification performance of different loss functions combined with various data augmentation strategies on two datasets. As shown in Table 6 and Table 7, for the three loss functions, the results after integrating each augmentation strategy are presented, with the change relative to the vanilla baseline given in parentheses. The results indicate that the baseline models incorporating LSM consistently achieve significant performance improvements across all settings, surpassing the other augmentation methods and further demonstrating the superiority of LSM. In contrast, traditional oversampling (ROS [14]) causes significant performance degradation in all experiments, suggesting that simply replicating tail-class samples easily leads to overfitting. Remix’s [44] balanced mixing strategy, although yielding some gains, is of limited effectiveness: it fails to adequately consider the importance of local semantics, potentially losing crucial region information. The proposed LSM method effectively preserves the discriminative regions of tail classes through Grad-CAM-based saliency masking and balances augmentation strength against training stability via the cosine annealing strategy, resulting in substantial performance improvements.

4.5. Hyperparameter Sensitivity Analysis

The block size parameter p of the local semantic units is a core hyperparameter of the LSM method, and its value directly affects the granularity of semantic decoupling and the effectiveness of the mixed augmentation. To determine an appropriate value of p and evaluate its sensitivity to different data characteristics, experiments were conducted on the CIFAR-100-LT dataset (with $IF = 100$) using $p \in \{2, 4, 6, 8, 12\}$, and on the ImageNet-LT dataset using $p \in \{14, 28, 56, 112\}$, accounting for the scale differences between the datasets.
Our method partitions the input image $I \in \mathbb{R}^{H \times W \times C}$ into $N = \frac{H}{p} \times \frac{W}{p}$ local semantic units using the parameterized block operator, where p is the block size. A smaller p yields finer semantic units but increases computational complexity, while a larger p creates coarser mixed regions that may disrupt local discriminative features. Figure 4 and Figure 5 compare the results for different values of p on CIFAR-100-LT and ImageNet-LT, respectively: LSM achieves its highest Top-1 accuracy at $p = 4$ on CIFAR-100-LT and at $p = 28$ on ImageNet-LT.
Through analysis, it can be concluded that the block size parameter p of LSM must closely align with the scale characteristics of the target dataset. For small-scale datasets like CIFAR-100-LT, a smaller value of p should be chosen to capture fine-grained features, but excessive segmentation that leads to semantic fragmentation must be avoided. In contrast, medium- to large-scale datasets like ImageNet-LT are suited to medium block sizes. This experiment reveals the adaptation rule between block size and dataset characteristics, providing a basis for parameter selection in practical applications.

5. Conclusions

To address the issue that traditional data augmentation techniques based on semantic mixing can damage the discriminative features of tail classes, further exacerbating data imbalance, we propose a novel data augmentation method based on local semantic mixing. The core idea of this method is to intelligently locate and preserve the key visual features of the tail classes, enabling more reasonable image mixing and label allocation, with the aim of effectively expanding the tail-class samples in high quality. The experimental validation shows that this method significantly improves the model’s classification performance on long-tailed data and effectively reduces the training instability caused by improper label allocation. It offers a more robust and efficient solution for long-tailed learning tasks.
Limitations. The method is effective at preserving the key visual features of tail classes, but accurately locating and extracting these features in practical applications remains a challenge. This requires precise feature extraction techniques, and any missteps in their implementation could impact the overall performance. Furthermore, the trade-off between computational complexity and feature preservation in real-time applications is another issue that needs to be addressed. Label noise also poses a significant concern, particularly in environments with high noise levels, which can degrade the model’s accuracy and robustness. In future work, we aim to explore strategies to mitigate the impact of label noise and optimize feature extraction methods to enhance the model’s overall performance.

Author Contributions

Conceptualization, methodology, J.L. (Jiahui Lv) and S.L.; data collection, J.L. (Jiahui Lv) and J.Z.; model building, J.L. (Jiahui Lv) and J.L. (Jun Lei); experiment, data analysis, and writing—original draft preparation, J.L. (Jiahui Lv) and C.C.; writing—review and editing, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Laboratory of Big Data and Decision Making of National University of Defense Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and dataset will be finalized and made publicly available online upon acceptance of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  2. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  3. Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 2007, 28, 823–870. [Google Scholar] [CrossRef]
  4. Vailaya, A.; Jain, A.; Zhang, H.J. On image classification: City images vs. landscapes. Pattern Recognit. 1998, 31, 1921–1935. [Google Scholar] [CrossRef]
  5. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014; Volume 13, pp. 740–755. [Google Scholar]
  6. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2537–2546. [Google Scholar]
  7. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 2019, 32, 1567–1578. [Google Scholar]
  8. Li, T.; Cao, P.; Yuan, Y.; Fan, L.; Yang, Y.; Feris, R.S.; Indyk, P.; Katabi, D. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6918–6928. [Google Scholar]
  9. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11662–11671. [Google Scholar]
  10. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
  11. Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 143, 110415. [Google Scholar] [CrossRef]
  12. Miao, W.; Pang, G.; Bai, X.; Li, T.; Zheng, J. Out-of-distribution detection in long-tailed recognition with calibrated outlier class learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 4216–4224. [Google Scholar]
  13. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  14. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.M. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  15. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. arXiv 2019, arXiv:1910.09217. [Google Scholar]
  16. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5149–5169. [Google Scholar] [CrossRef]
  17. Ren, J.; Yu, C.; Ma, X.; Zhao, H.; Yi, S. Balanced meta-softmax for long-tailed visual recognition. Adv. Neural Inf. Process. Syst. 2020, 33, 4175–4186. [Google Scholar]
  18. Vu, D.Q.; Thu, M.T.H. Smooth Balance Softmax for Long-Tailed Image Classification. In Proceedings of the International Conference on Advances in Information and Communication Technology, Phu Tho, Vietnam, 16–17 November 2024; Springer Nature: Cham, Switzerland, 2024; pp. 323–331. [Google Scholar]
  19. Zang, Y.; Huang, C.; Loy, C.C. Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3457–3466. [Google Scholar]
  20. Wang, T.; Li, Y.; Kang, B.; Li, J.; Liew, J.; Tang, S.; Hoi, S.; Feng, J. The devil is in classification: A simple framework for long-tail instance segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 728–744. [Google Scholar]
  21. Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; Belongie, S. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8769–8778. [Google Scholar]
  22. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  23. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 17 October–2 November 2019; pp. 6023–6032. [Google Scholar]
  24. Cao, C.; Zhou, F.; Dai, Y.; Wang, J.; Zhang, K. A survey of mix-based data augmentation: Taxonomy, methods, applications, and explainability. Acm Comput. Surv. 2024, 57, 1–38. [Google Scholar] [CrossRef]
  25. Qin, H.; Jin, X.; Zhu, H.; Liao, H.; El-Yacoubi, M.A.; Gao, X. Sumix: Mixup with semantic and uncertain information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 70–88. [Google Scholar]
  26. Zhang, Y.; Kang, B.; Hooi, B.; Yan, S.; Feng, J. Deep long-tailed learning: A survey. arXiv 2021, arXiv:2110.04596. [Google Scholar] [CrossRef]
  27. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef]
  28. Guo, H.; Wang, S. Long-tailed multi-label visual recognition by collaborative training on uniform and re-balanced samplings. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15089–15098. [Google Scholar]
  29. Drummond, C.; Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Imbalanced Datasets II; ICML: Washington, DC, USA, 2003; Volume 11, pp. 1–8. [Google Scholar]
  30. Byrd, J.; Lipton, Z. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning; PMLR: New York, NY, USA, 2019; pp. 872–881. [Google Scholar]
  31. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  32. Menon, A.K.; Jayasumana, S.; Rawat, A.S.; Jain, H.; Veit, A.; Kumar, S. Long-tail learning via logit adjustment. arXiv 2020, arXiv:2007.07314. [Google Scholar]
  33. Alexandridis, K.P.; Luo, S.; Nguyen, A.; Deng, J.; Zafeiriou, S. Inverse image frequency for long-tailed image recognition. IEEE Trans. Image Process. 2023, 32, 5721–5736. [Google Scholar] [CrossRef]
  34. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  35. Shu, J.; Xie, Q.; Yi, L.; Zhao, Q.; Zhou, S.; Xu, Z.; Meng, D. Meta-weight-net: Learning an explicit mapping for sample weighting. In Proceedings of the International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1919–1930. [Google Scholar]
  36. Zhu, J.; Wang, Z.; Chen, J.; Chen, Y.P.; Jiang, Y.G. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6908–6917. [Google Scholar]
  37. Iqbal, F.; Abbasi, A.; Javed, A.R.; Almadhor, A.; Jalil, Z.; Anwar, S.; Rida, I. Data augmentation-based novel deep learning method for deepfaked images detection. Acm Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–15. [Google Scholar] [CrossRef]
  38. Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A comprehensive survey on data augmentation. arXiv 2024, arXiv:2405.09591. [Google Scholar] [CrossRef]
  39. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  40. Salehin, I.; Kang, D.K. A review on dropout regularization approaches for deep neural networks within the scholarly domain. Electronics 2023, 12, 3106. [Google Scholar] [CrossRef]
  41. Li, L.; Li, A. A2-aug: Adaptive automated data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2267–2274. [Google Scholar]
  42. Park, S.; Hong, Y.; Heo, B.; Yun, S.; Choi, J.Y. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6887–6896. [Google Scholar]
  43. Zhong, Z.; Cui, J.; Liu, S.; Jia, J. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16489–16498. [Google Scholar]
  44. Chou, H.P.; Chang, S.C.; Pan, J.Y.; Wei, W.; Juan, D.C. Remix: Rebalanced mixup. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 95–110. [Google Scholar]
  45. Li, J.; Yang, Z.; Hu, L.; Liu, J.; Tao, D. CRmix: A regularization by clipping images and replacing mixed samples for imbalanced classification. Digit. Signal Process. 2023, 135, 103951. [Google Scholar] [CrossRef]
  46. Doane, D.P.; Seward, L.E. Measuring skewness: A forgotten statistic? J. Stat. Educ. 2011, 19, n2. [Google Scholar] [CrossRef]
  47. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  48. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. Hong, Y.; Han, S.; Choi, K.; Seo, S.; Kim, B.; Chang, B. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6626–6636. [Google Scholar]
  51. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  52. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 943–952. [Google Scholar]
  53. Hou, C.; Zhang, J.; Wang, H.; Zhou, T. Subclass-balancing contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5395–5407. [Google Scholar]
  54. Zhou, Z.; Li, L.; Zhao, P.; Heng, P.A.; Gong, W. Class-conditional sharpness-aware minimization for deep long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3499–3509. [Google Scholar]
  55. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196. [Google Scholar]
  56. Wang, X.; Lian, L.; Miao, Z.; Liu, Z.; Yu, S.X. Long-tailed recognition by routing diverse distribution-aware experts. arXiv 2020, arXiv:2010.01809. [Google Scholar]
  57. Sharma, S.; Xian, Y.; Yu, N.; Singh, A. Learning prototype classifiers for long-tailed recognition. arXiv 2023, arXiv:2302.00491. [Google Scholar] [CrossRef]
Figure 2. General structure of our LSM. In the (a) Data Preprocessing stage, first, image pairs are sampled from the original long-tailed distribution P and the tail-class weighted distribution Q. Then, the images are decoupled using the block operator F split , dividing them into N local patches. Next, a pretrained model is used to generate a saliency-guided mask to locate key regions. Finally, the block features of head- and tail-class samples are weighted and fused according to the mask, generating soft labels that integrate semantic information. In the subsequent (b) training stage, the augmented images are fed into the model to perform the classification task.
Figure 3. Mixed image generation and label assignment framework. The block features of head- and tail-class samples are weighted and fused according to the mask to form the mixed image and soft labels that integrate semantic information.
Figure 4. The impact of block size p on CIFAR-100-LT.
Figure 5. The impact of block size p on ImageNet-LT.
Table 1. Basic information of the datasets.
Datasets | Number of Classes | Training/Test Samples | IF
CIFAR-LT | 100 | 10.8 K/10 K | {100, 50, 10}
ImageNet-LT | 1000 | 115.8 K/50 K | 256
iNaturalist 2018 | 8142 | 437.5 K/24.4 K | 500
Table 2. Basic information of experimental setup.
Datasets | CIFAR-100-LT | ImageNet-LT | iNaturalist 2018
Backbone | ResNet-32 | ResNet-50 | ResNet-50
Epochs | 200 | 100 | 200
Batch size | 128 | 256 | 512
Initial learning rate | 0.01 | 0.1 | 0.1
SGD momentum | 0.9 | 0.9 | 0.9
Block size p | 4 | 28 | 28
Table 3. Top-1 accuracy on CIFAR-100-LT.
Imbalance Factor | 100 | 50 | 10
Cross Entropy (CE) | 38.6 | 44.0 | 56.4
CE-DRW | 41.1 | 45.6 | 57.9
LDAM-DRW [7] | 41.7 | 47.9 | 57.3
BBN [14] | 42.6 | 47.1 | 59.2
CMO [42] | 43.9 | 48.3 | 59.5
BalancedSoftmax [17] | 45.1 | 49.9 | 61.6
LADE [50] | 45.4 | 50.5 | 61.7
SupCon [51] | 45.8 | 52.0 | 64.4
Hybrid-SC [52] | 46.7 | 51.8 | 63.0
Remix [44] | 45.8 | 49.5 | 59.2
SBCL [53] | 44.9 | 48.7 | 57.9
CC-SAM [54] | 49.2 | 51.9 | 62.0
CE + LSM | 43.6 | 47.9 | 58.9
CE-DRW + LSM | 46.8 | 51.6 | 61.0
LDAM-DRW + LSM | 46.9 | 52.2 | 58.6
Table 4. Top-1 accuracy on ImageNet-LT.
Methods | All | Many | Medium | Few
CE | 41.6 | 64.0 | 33.8 | 5.8
Focal Loss [34] | 43.7 | 64.3 | 37.1 | 8.2
Decouple-cRT [16] | 47.3 | 58.8 | 44.0 | 26.1
Decouple-LWS [16] | 47.7 | 57.1 | 45.2 | 29.3
LWS [55] | 49.9 | 60.2 | 47.2 | 30.3
Remix [44] | 48.6 | 60.4 | 46.9 | 30.7
CMO [42] | 49.1 | 67.0 | 42.3 | 20.5
LDAM-DRW [7] | 49.8 | 60.4 | 46.9 | 30.7
CE-DRW [7] | 50.1 | 61.7 | 47.3 | 28.8
BalancedSoftmax [17] | 51.0 | 60.9 | 48.8 | 32.1
LADE [50] | 51.9 | 62.3 | 49.3 | 31.2
CE + LSM | 48.9 | 58.6 | 45.8 | 35.8
CE-DRW + LSM | 50.9 | 60.2 | 47.3 | 36.4
LDAM-DRW + LSM | 50.7 | 60.0 | 47.4 | 34.3
Table 5. Top-1 accuracy on iNaturalist 2018.
Methods | All | Many | Medium | Few
CE | 61.0 | 73.9 | 63.5 | 55.5
IB Loss | 65.4 | - | - | -
LDAM-DRW [7] | 66.1 | - | - | -
Decouple-cRT [16] | 68.2 | 73.2 | 68.8 | 66.1
Decouple-LWS [16] | 69.5 | 71.0 | 69.8 | 68.8
BBN [14] | 69.6 | - | - | -
BalancedSoftmax [17] | 70.0 | 70.0 | 70.2 | 69.9
LADE [50] | 70.0 | - | - | -
Remix [44] | 70.5 | - | - | -
MiSLAS [43] | 71.6 | 73.2 | 72.4 | 72.7
RIDE (3E) [56] | 72.2 | 70.2 | 72.2 | 72.7
TSC [9] | 69.7 | 72.6 | 70.6 | 67.8
PC [57] | 70.6 | 71.6 | 70.6 | 70.2
CE + LSM | 67.1 | 76.5 | 69.0 | 66.9
CE-DRW + LSM | 70.5 | 67.8 | 70.0 | 72.5
LDAM-DRW + LSM | 68.9 | 75.3 | 69.1 | 67.2
RIDE (3E) + LSM | 72.6 | 69.0 | 72.2 | 73.0
Table 6. Comparison of LSM with similar data augmentation strategies on CIFAR-100-LT.
Method | Vanilla | +ROS [14] | +Remix [44] | +LSM
CE | 38.6 (+0.0) | 32.3 (−5.3) | 40.0 (+1.4) | 43.6 (+5.0)
CE-DRW [7] | 41.1 (+0.0) | 35.9 (−5.2) | 45.8 (+4.7) | 46.8 (+5.7)
LDAM-DRW [7] | 41.7 (+0.0) | 32.6 (−9.1) | 45.3 (+3.6) | 46.9 (+5.2)
Table 7. Comparison of LSM with similar data augmentation strategies on ImageNet-LT.
Method | Vanilla | +Remix [44] | +LSM
CE | 41.6 (+0.0) | 41.7 (+0.1) | 48.9 (+7.3)
CE-DRW [7] | 50.1 (+0.0) | 48.6 (−1.5) | 50.9 (+0.8)
BalancedSoftmax [17] | 51.0 (+0.0) | 49.2 (−1.8) | 52.1 (+1.1)