ACVM: An Adaptive Combination Validation Mechanism for Long-Tailed Image Recognition

Sun, Tianci; He, Wanqiu; Shao, Changbin; Zheng, Shang; Yu, Hualong

doi:10.3390/info17050455


by Tianci Sun ^†, Wanqiu He ^†, Changbin Shao, Shang Zheng and Hualong Yu ^*

School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212003, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Information 2026, 17(5), 455; https://doi.org/10.3390/info17050455

Submission received: 13 March 2026 / Revised: 3 May 2026 / Accepted: 4 May 2026 / Published: 8 May 2026

(This article belongs to the Special Issue Machine Learning in Image Processing and Computer Vision)


Abstract

In real-world scenarios, large-scale datasets often exhibit a long-tailed data distribution. Training deep neural networks on such data typically leads to a bias towards head classes. Existing studies have demonstrated that the reweighting strategy is an effective means to alleviate the long-tailed issue. Recent studies suggest that incorporating class difficulty into reweighting can yield superior results. However, the method of quantifying class difficulty by an independent validation set has shown limitations in practical applications, i.e., wasting training samples and inaccurate estimations. To address this issue, this study proposes a novel model based on K-fold cross-validation, called the adaptive combination validation model, which contains two main innovations: first, both class and sample difficulty are quantified by using a more comprehensive and authentic estimation strategy, i.e., K-fold cross-validation, to obtain accurate and robust estimations; second, we extract the prediction probability distributions of samples, which reflect sample difficulty, from different model branches and design a distribution-harmonized loss to simultaneously focus on the effects of reweighted and original distributions. Extensive experiments on several popular long-tailed image recognition datasets (CIFAR10-LT and CIFAR100-LT, with several varying imbalance rates, and ImageNet-LT) demonstrate that the proposed method can effectively alleviate the long-tailed issue and achieve state-of-the-art performance on most datasets.

Keywords:

deep learning; class imbalance learning; long-tailed distribution; image recognition; neural networks

1. Introduction

In recent years, the rapid development of deep neural networks (DNNs) has driven significant advancements in the field of computer vision [1,2]. Their superior performance largely relies on the use of high-quality and balanced datasets, such as COCO [3] and ImageNet [4]. However, in real-world scenarios, datasets often exhibit a long-tailed distribution, where a small number of classes (called head classes) contain a majority of the samples, while most classes (called tail classes) have only a few samples [5,6]. This disparity in distribution tends to make DNN models trained on long-tailed datasets biased towards those head classes [7,8]. Therefore, training a classifier on those severely imbalanced data is a critical challenge [9,10,11].

To address the issue of long-tailed distributions, existing methods often aim to alleviate imbalance by encouraging DNN models to focus more on tail classes during training [12]. Current popular solutions can be roughly categorized into two groups: resampling and reweighting. Resampling is a simple and widely used approach to re-balance the training set; it alleviates imbalance by either undersampling the head classes (randomly discarding some head-class samples) or oversampling the tail classes (replicating tail-class samples) [13,14]. Reweighting alleviates imbalance by assigning higher loss weights to tail classes, encouraging the model to pay more attention to them. Early reweighting methods were designed with complex hyperparameters [15,16]. However, recent studies have shown that manually designating weights often results in suboptimal training performance [17]. As a result, research has increasingly shifted towards new perspectives, such as focusing on effective samples [18] and sample feature distribution [19], to design more flexible solutions to the long-tailed distribution issue.

Existing methods often assume that tail classes are naturally the most difficult due to their limited sample sizes. However, a recent study demonstrates that this does not always hold true [20]. That study proposed quantifying class difficulty using class accuracy, calculated on a small validation set. Although this provides an intuitive quantification of class difficulty, a small validation set often fails to accurately reflect the real difficulty of a specific class, especially for tail classes, which contain very few samples. Additionally, holding out a validation set wastes training instances.

Motivated by the aforementioned issue, we propose an adaptive combination validation model (ACVM) by introducing the K-fold cross-validation technique [21]. Specifically, ACVM comprises a single main model and multiple sub-models. On the one hand, we dynamically tune the weight of each class by evaluating the classification performance of each sample belonging to that class when it serves as a validation sample in the sub-models. On the other hand, we feed the prediction probability distributions obtained from all sub-models into the main model to harmonize its training process. This enables the main model to integrate prior probability distribution information during training. By effectively integrating the representativeness of cross-validation with the original distributions, ACVM offers a novel solution to the long-tailed distribution issue.

The main contributions of this study can be summarized as follows:

We use K-fold cross-validation to replace independent validation sets with a comprehensive and authentic difficulty estimation tool for providing difficulty estimations at both the class and sample levels.
We harmonize the learning process of the main model by using the sample probability distribution information obtained from all sub-models, which reflects the sample difficulty level. This not only improves the quality of the main model in learning tail classes, but also improves the overall performance.
Extensive experiments on several popular long-tailed image recognition datasets (CIFAR10-LT and CIFAR100-LT, with several varying imbalance rates, and ImageNet-LT) demonstrate that the proposed method can effectively alleviate the long-tailed issue and achieve state-of-the-art performance on most datasets.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on the resampling and reweighting strategies used to solve long-tailed image recognition issues. In Section 3, the proposed method is described in detail. The experimental settings, results, and discussions are presented in Section 4. Finally, Section 5 concludes this paper.

2. Related Work

2.1. Resampling

Resampling aims to approximately balance the sample distribution throughout different classes during the data preprocessing stage. The main resampling methods include random oversampling [13,22] and random undersampling [13,23].

Oversampling methods typically generate new tail-class samples by duplication or synthesis [24], such as SMOTE [25]. However, simple duplication can lead to overfitting due to the repetitive nature of the generated samples, and synthetic methods often fail to capture the unique characteristics of tail classes, resulting in suboptimal performance [26]. Undersampling methods alleviate imbalance by reducing the number of head-class samples. Although these approaches effectively alleviate the model’s bias towards head classes, they may discard critical information from head classes [19], leading to underfitting. Some balanced resampling methods, such as class-aware sampling or square-root sampling [27], have been shown to be more effective than traditional oversampling or undersampling methods in such scenarios. However, since these methods still involve repetitive sampling, they still tend to overfit on tail classes [28].

2.2. Reweighting

The other commonly used strategy is reweighting, which aims to assign different weights for the training losses of different classes/samples [29], further making the model focus more on tail classes to alleviate the impact of imbalance. Reweighting methods can be roughly divided into sample-level reweighting and class-level reweighting.

The sample-level reweighting method assigns a specific weight to each sample. For instance, focal loss [15] assigns higher weights to those hard-to-classify samples. However, sample-level reweighting often fails to alleviate the imbalance caused by the discrepancy in terms of sample quantities between head and tail classes. Because head classes always contain more samples, they may also contain more hard-to-classify samples. Therefore, these methods may ultimately assign more weights to head classes, thereby failing to effectively enhance the performance of tail classes [20,30]. EQ loss [31] addresses this issue by introducing weight factors that simultaneously consider both class imbalance and sample difficulty. Although this approach alleviates class imbalance, the difference in the assignment of weights for different samples is still large, resulting in higher weights being assigned to the head classes as well.

In contrast, the class-level reweighting method assigns different weights to different classes, enabling DNN models to focus more on those tail classes. A straightforward strategy involves reweighting based on the inverse of class sample sizes [32]. However, this strategy performs poorly on large-scale datasets [33], as only using the sample size cannot adequately reflect the difficulty of a class. CDB loss [20] addresses this problem by dynamically calculating class difficulty on a small balanced validation set. However, as this validation set only takes a small portion of the dataset, the class difficulty derived from it often fails to represent the real difficulty of each class. Additionally, the use of a validation set tends to waste training instances, further decreasing the quality of the learning model. Difficulty-Net [34] proposes using a meta-model to predict class difficulty. Nonetheless, this method still relies on a small validation set, and the introduction of a meta-model will hinder the model’s convergence to the optimal solution within a few training rounds.

Therefore, it is urgent to develop a more comprehensive and authentic strategy to provide accurate and robust difficulty estimations to enhance the quality of reweighting long-tailed learning models.

3. Methods

3.1. Description About the Long-Tail Issue

Long-tailed learning attempts to perform classification tasks by learning a high-performance classification model $F(x; θ)$ on a training set $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$ with a long-tailed distribution, where $θ$ denotes the network parameters, $x_{i}$ represents the i-th training sample, $y_{i} \in \{0, 1\}^{C}$ is the one-hot encoding label corresponding to $x_{i}$, N is the number of samples in the entire training set, and C denotes the number of classes. For a specific class j, $n_{j}$ denotes the number of samples, with $\sum_{j = 1}^{C} n_{j} = N$. The imbalance factor $IF = \max_{j} n_{j} / \min_{j} n_{j}$ is often used to measure the imbalance level of a long-tailed dataset. In general, for a long-tailed learning task, $IF \gg 1$.
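As a small illustration of the imbalance factor defined above, the following sketch computes IF from a list of integer class labels. The function name `imbalance_factor` and the toy label list are assumptions made for this example only; they do not come from the paper.

```python
from collections import Counter


def imbalance_factor(labels):
    """Return IF = max_j n_j / min_j n_j for a list of integer class labels."""
    counts = Counter(labels)  # n_j for every class j that appears in the list
    return max(counts.values()) / min(counts.values())


# Toy label list (illustrative only): 50, 10 and 5 samples for classes 0, 1, 2.
toy_labels = [0] * 50 + [1] * 10 + [2] * 5
toy_if = imbalance_factor(toy_labels)
```

A class that never appears in the label list is not counted here, so the sketch presumes every class has at least one sample.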

In long-tailed learning, $F(x; θ)$ is typically implemented as a DNN [2]. Therefore, we also adopted a DNN as our classification model. Generally speaking, the optimal model parameters $θ^{*}$ can be obtained by minimizing the following loss, calculated on the training set:

θ^{*} = \underset{θ}{\arg\min} \frac{1}{N} \sum_{i = 1}^{N} l(F(x_{i}; θ), y_{i}).

(1)

The loss function $l(F(x; θ), y)$ measures the difference between the model’s prediction $F(x; θ)$ and the real label y. The most commonly used loss functions are cross-entropy (CE) loss [35] and focal loss (FL) [15]. In this study, we use the CE loss, which is calculated as follows:

l(F(x; θ), y) = - \frac{1}{N} \sum_{i = 1}^{N} y_{i} \log(F(x_{i}; θ)).

(2)

Let ${\hat{P}}_{F}$ denote the predicted probability distribution of the model F, and let $E(\cdot, \cdot)$ denote the CE loss. Then, the CE loss can be briefly expressed as

L = - \frac{1}{N} \sum_{i = 1}^{N} y_{i} \log(F(x_{i}; θ)) = E({\hat{P}}_{F}, y).

(3)

On an approximately balanced dataset, solving Equation (1) can produce a balanced classifier, but the resulting classifier performs poorly if the distribution of the training data is biased. The reweighting method aims to improve the robustness of the model by imposing a weight $v_{i}$ on the training sample $x_{i}$, and then calculating the optimal parameters $θ^{*}$ by minimizing the following weighted loss:

θ^{*} = \underset{θ}{\arg\min} \frac{1}{N} \sum_{i = 1}^{N} v_{i} l(F(x_{i}; θ), y_{i}).

(4)
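The objective inside Equation (4) can be illustrated with a minimal sketch that evaluates the weighted CE value for a handful of predictions. It assumes integer class indices instead of one-hot labels and hand-written probability vectors; the names `cross_entropy` and `weighted_ce` are hypothetical and chosen only for this example.

```python
import math


def cross_entropy(probs, label):
    """CE for one sample: minus the log-probability of the true class."""
    return -math.log(probs[label])


def weighted_ce(prob_rows, labels, sample_weights):
    """Average of v_i * CE_i over the N samples (the sum in Equation (4))."""
    n = len(labels)
    total = sum(
        v * cross_entropy(p, y)
        for p, y, v in zip(prob_rows, labels, sample_weights)
    )
    return total / n


# Illustrative predictions for three samples over two classes.
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 0, 1]
uniform = weighted_ce(probs, labels, [1.0, 1.0, 1.0])     # plain CE average
upweighted = weighted_ce(probs, labels, [1.0, 1.0, 3.0])  # third sample counts 3x
```

With all weights equal to 1 the value reduces to the unweighted average of Equation (1); raising the weight of one sample increases that sample's share of the loss.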

3.2. The Overall Framework of ACVM

As shown in Figure 1, ACVM consists of two main components, called the “main learning branch” and the “sub-validation branch”. The main learning branch consists of a regular learning model (hereafter denoted F), which is used to perform classification tasks, while the sub-validation branch consists of K validation models (hereafter denoted f, where $f_{k}$ is the k-th validation model and $k \in \{1, 2, \dots, K\}$). Both F and $f_{k}$ use the same residual network structure, but they are trained separately, and their model weights are independent of each other. All these models use traditional random sampling to obtain training samples. The difference between them is that for F, we apply weighted CE loss, which can achieve the best learning effect, while for f, we use traditional CE loss, which can provide a more realistic probability distribution.

The details of the sub-validation branch are shown in Figure 2. Specifically, we train F on the regular training set and f on the K-fold cross-validation training sets. We adopted stratified K-fold cross-validation to divide the training set D into K subsets $D_{1}, D_{2}, \dots, D_{K}$, that is, $D = \{D_{k}\}_{k = 1}^{K}$. The division was stratified by class label so that the class distribution in each fold closely matches the original training set distribution. This is particularly important for long-tailed datasets, as it ensures that even tail classes with very few samples are represented in all folds, preventing unstable difficulty estimation. For any subset $D_{k}$, the sample size is $|D_{k}| \approx N / K$. Note that the K subsets should also meet the following conditions:

\begin{cases} D = \bigcup_{k = 1}^{K} D_{k}, \\ D_{k} \cap D_{l} = \emptyset, \quad \forall k \neq l. \end{cases}

(5)
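One simple way to obtain folds that satisfy Equation (5) is to deal the shuffled indices of each class round-robin over the K folds. The sketch below is only an illustration under that assumption; the paper does not state which splitting routine was used, and the helper name `stratified_kfold` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Split sample indices into k disjoint folds, stratified by class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal this class's indices round-robin so every fold gets its share.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


# Toy long-tailed label list (illustrative): 30, 9 and 3 samples per class.
labels = [0] * 30 + [1] * 9 + [2] * 3
folds = stratified_kfold(labels, k=3)
```

Because every index is placed in exactly one fold, the union of the folds is D and the folds are pairwise disjoint. A class with fewer than K samples cannot appear in every fold under this scheme.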

For f_k, we use D_k as the validation set, and the remaining K − 1 sub-training sets as the training set T_k for training the model. In each epoch, each f in the sub-validation branch is first independently trained. Then, for f_k, the probability distribution of D_k on f_k is calculated. Combining the probability distribution obtained from all sub-validation models, we can acquire the probability distribution of training set D from which effective information can be extracted to guide the learning of F. Since the prediction distributions from different sub-models correspond to completely non-overlapping sample sets, we aggregate the probability distributions of sub-validation models by concatenation.

3.3. Methods in ACVM

Our method consists of two main components: Adaptive Difficulty Validation Weighting (ADVW) and Distribution-Harmonized Loss (DHL).

3.3.1. ADVW

ADVW aims to reweight the loss of F based on the class difficulty calculated by the sub-validation branch. Inspired by CDB loss, we use class difficulty to quantify the weight that should be applied to each class in F. If the model’s classification accuracy for the class j is a_j, then the difficulty d_j of this class can be calculated as

d_{j} = 1 - a_{j} .

(6)

Then, the weight of class j can be calculated as follows:

w_{j} = {(d_{j})}^{τ} = {(1 - a_{j})}^{τ},

(7)

where $τ$ is a hyperparameter that controls how much weight is assigned to difficult classes.
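To make Equations (6) and (7) concrete, the sketch below turns a list of per-class accuracies into class weights for two values of τ. The accuracy values are invented for illustration, and `class_weights` is a hypothetical helper name.

```python
def class_weights(class_acc, tau):
    """w_j = (1 - a_j) ** tau, following Equations (6) and (7)."""
    return [(1.0 - a) ** tau for a in class_acc]


# Illustrative accuracies for an easy, a medium and a hard class.
acc = [0.95, 0.80, 0.40]
w_tau1 = class_weights(acc, tau=1.0)
w_tau2 = class_weights(acc, tau=2.0)

# Ratio between the hardest and the easiest class weight for each tau.
spread_tau1 = w_tau1[2] / w_tau1[0]
spread_tau2 = w_tau2[2] / w_tau2[0]
```

A larger τ widens the gap between the weights of difficult and easy classes, which is the sense in which τ controls the strength of the reweighting.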

In CDB loss [20], d_j in Equation (6) is dynamically calculated based on the classification accuracy a_j of F on a small-scale, independent and balanced validation set. Since training instances are generally precious, holding out an independent validation set tends to lower the quality of the learned model. Furthermore, a small independent validation set cannot reflect the real difficulty distribution. In addition, we note that as training progresses, a_j, which reflects the learning efficiency of F on class j, can no longer truly reflect the actual difficulty of that class. Therefore, at the end of training, it is difficult for F to focus more on tail classes. Using the classification accuracy acquired from the sub-validation branch to quantify class difficulty can effectively solve these problems.

We use $θ_{k}$ to represent the network parameters of $f_{k}$. After training on $T_{k}$, the predicted probability distribution of $f_{k}$ on the validation set $D_{k}$ is as follows:

{\hat{P}}_{f_{k}} = \bigcup_{i = 1}^{|D_{k}|} f_{k}(x_{i}; θ_{k}).

(8)

Combining all ${\hat{P}}_{f_{k}}$, we can further obtain the predicted probability distribution of each sample in the training set D in the sub-validation branch as follows:

{\hat{P}}_{f} = \bigcup_{k = 1}^{K} {\hat{P}}_{f_{k}} = \bigcup_{k = 1}^{K} \left( \bigcup_{i = 1}^{|D_{k}|} f_{k}(x_{i}; θ_{k}) \right).

(9)

Then, we can calculate the classification accuracy a_j of class j based on ${\hat{P}}_{f}$. Next, substituting a_j into Equations (6) and (7), we can further calculate the weight of each class.
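The aggregation in Equations (8) and (9) and the subsequent accuracy computation can be sketched as follows. The sketch assumes that each sub-model returns one probability vector per held-out sample, in the same order as that fold's index list, and that every class has at least one sample; `out_of_fold_accuracy` and the toy predictions are hypothetical.

```python
def out_of_fold_accuracy(folds, fold_probs, labels, num_classes):
    """Merge per-fold predictions (Eq. 9) and return them with per-class accuracy."""
    merged = [None] * len(labels)  # plays the role of \hat{P}_f, indexed like D
    for idxs, probs in zip(folds, fold_probs):
        for i, p in zip(idxs, probs):
            merged[i] = p  # each sample is predicted by exactly one sub-model
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(merged, labels):
        pred = max(range(num_classes), key=lambda c: p[c])  # arg-max class
        total[y] += 1
        correct[y] += int(pred == y)
    return merged, [c / t for c, t in zip(correct, total)]


# Toy data: six samples, two folds, two classes (values are illustrative).
labels = [0, 0, 0, 0, 1, 1]
folds = [[0, 2, 4], [1, 3, 5]]
fold_probs = [
    [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4]],  # predictions of f_1 on D_1
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],  # predictions of f_2 on D_2
]
p_f, class_acc = out_of_fold_accuracy(folds, fold_probs, labels, num_classes=2)
```

Because the folds do not overlap, merging reduces to writing each prediction back to its sample index, which corresponds to the concatenation described in Section 3.2.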

After acquiring the weights of each class, we can integrate them into the weighted CE loss of F. Combining Equations (4) and (7), the new weighted CE loss is given by

L = - \frac{1}{N} \sum_{i = 1}^{N} w_{j} y_{i} \log(F(x_{i}; θ)) = {(1 - a)}^{τ} E({\hat{P}}_{F}, y),

(10)

where j denotes the class of sample x_i.

Since the class difficulty comes from the sub-validation branch, the sample size used for calculating class difficulty is much larger than that obtained when only using a small validation set, and the difficulty calculated by our method is more accurate and robust due to the adoption of a more comprehensive and authentic estimation strategy, i.e., K-fold cross-validation. In addition, as f uses traditional CE loss, our method can better reflect the real class difficulty at the end of the training process, in comparison to previous methods.

3.3.2. DHL

However, ADVW’s limitation is also obvious: while it effectively improves tail class performance by assigning higher weights to difficult classes, it inevitably introduces artificial distribution bias. This bias can distort the model’s understanding of the intrinsic discriminative patterns of the data, leading to a degraded performance on head classes and reduced overall generalization ability. To address this issue, we propose the Distribution-Harmonized Loss (DHL). The sub-models in the sub-validation branch are trained using standard cross-entropy loss without any long-tailed-specific processing. Their prediction distributions naturally reflect the intrinsic discriminative patterns of the original data without artificial shifts, serving as a reliable, unbiased distribution anchor for the main model.

DHL aims to harmonize the biased optimization of the main model with the unbiased distribution information from the sub-models. Inspired by focal loss,

L_{FL} = {(1 - {\hat{P}}_{F})}^{γ} E({\hat{P}}_{F}, y),

(11)

DHL adopts a more conservative strategy based on the weighted CE loss: it uses the original probability distribution predicted for a sample by f to fine-tune the loss on F and preserve the original information of that sample. First, for a specific sample x_i belonging to class j, if its probability distribution on F is ${\hat{P}}_{F}(x_{i})$, then its weighted CE loss on F is:

L_{i} = w_{j} E({\hat{P}}_{F}(x_{i}), y_{i}).

(12)

This loss already accounts for the class imbalance issue, but the weighted probability distribution acquired from F using only this loss tends to destroy the original distribution information. Fortunately, the probability distribution ${\hat{P}}_{f}(x_{i})$ predicted by f can satisfactorily represent the original distribution information. Therefore, we combine the weighted CE loss calculated from ${\hat{P}}_{f}(x_{i})$ with the loss of F to make F attend to the original sample probability distribution. The integrated loss is described as follows:

L_{i} = w_{j} E({\hat{P}}_{F}(x_{i}), y_{i}) + w_{j} E({\hat{P}}_{f}(x_{i}), y_{i}).

(13)

Although the integrated loss focuses on both the class difficulty and the original probability distribution, it is not flexible enough, as both the main learning branch and the sub-validation branch contribute equally to the loss of F. In order to better tune the tradeoff between the contributions of these two branches to the loss, we introduce a smoothing factor α. The overall loss can be represented as follows:

L_{ACVM} = α w E({\hat{P}}_{F}, y) + (1 - α) w E({\hat{P}}_{f}, y)

(14)

= α \sum_{i = 1}^{N} w_{j} E ({\hat{P}}_{F} (x_{i}), y_{i}) + (1 - α) \sum_{i = 1}^{N} w_{j} E ({\hat{P}}_{f} (x_{i}), y_{i}) .

(15)

During the training of the main model, the parameters of all sub-models are fixed. Therefore, the DHL term only contributes gradients to the main model parameters $θ$, acting as a soft distribution regularizer. It guides the main model to learn feature representations that simultaneously satisfy two objectives: (1) achieving high classification accuracy on tail classes through the ADVW term; and (2) remaining as close as possible to the unbiased data distribution learned by the sub-models through the DHL term. Hyperparameter α provides a flexible way to tune the tradeoff between these two objectives.
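A minimal numeric sketch of the combined loss in Equations (14) and (15) is given below. It only evaluates the scalar value for fixed probability vectors and does not model gradient flow; integer class indices are assumed in place of one-hot labels, and `acvm_loss` together with all toy values is hypothetical.

```python
import math


def acvm_loss(main_probs, sub_probs, labels, class_weights, alpha):
    """Return (L_ACVM, L_ADVW, L_DHL) for one batch, following Eq. (15)."""
    l_advw = 0.0
    l_dhl = 0.0
    for p_main, p_sub, y in zip(main_probs, sub_probs, labels):
        w = class_weights[y]                # w_j of the sample's class j
        l_advw += w * -math.log(p_main[y])  # weighted CE on the main model F
        l_dhl += w * -math.log(p_sub[y])    # weighted CE on the sub-model output
    return alpha * l_advw + (1.0 - alpha) * l_dhl, l_advw, l_dhl


# Illustrative batch of two samples over two classes.
main_probs = [[0.7, 0.3], [0.2, 0.8]]  # \hat{P}_F
sub_probs = [[0.9, 0.1], [0.6, 0.4]]   # \hat{P}_f (out-of-fold predictions)
labels = [0, 1]
weights = [0.2, 1.0]                   # w_j, e.g. from Equation (7)
total, l_advw, l_dhl = acvm_loss(main_probs, sub_probs, labels, weights, alpha=0.5)
```

Setting α = 1 leaves only the ADVW term and α = 0 leaves only the term computed from the sub-model predictions, matching the roles of the two branches described above.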

3.4. The Procedure of ACVM Algorithm

The procedure of the proposed ACVM is summarized in Algorithm 1:

Algorithm 1: ACVM

Input: Training set $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, K (the number of cross-validation folds), $τ$ (difficulty weight exponent), α (distribution harmonization weight), MaxEpochs.
Output: Main model F with the optimized parameters $θ$.
Procedure:

Initialize the main model F and K sub-models $\{f_{k}\}_{k = 1}^{K}$;
Split D into K disjoint subsets $\{D_{k}\}_{k = 1}^{K}$;
for epoch = 1 to MaxEpochs do
    ▷ Sub-validation branch training
    for k = 1 to K do
        $T_{k} \leftarrow D \setminus D_{k}$;
        Train $f_{k}$ on $T_{k}$ with standard CE loss;
        Calculate probability distribution ${\hat{P}}_{f_{k}}$ on $D_{k}$;
    end for
    Combine $\{{\hat{P}}_{f_{k}}\}_{k = 1}^{K} \to {\hat{P}}_{f}$ for D;
    ▷ Class difficulty calculation
    for each class j = 1 to C do
        $a_{j} \leftarrow$ accuracy of class j using ${\hat{P}}_{f}$;
        $d_{j} \leftarrow 1 - a_{j}$;
        $w_{j} \leftarrow {(d_{j})}^{τ}$;
    end for
    ▷ Main branch training
    for each batch $(x_{b}, y_{b})$ in D do
        ▷ Forward pass
        ${\hat{P}}_{F} \leftarrow F(x_{b}; θ)$;
        ▷ Loss calculation
        $L_{ADVW} = \sum_{i \in batch} w_{j} E({\hat{P}}_{F}(x_{i}), y_{i})$;
        $L_{DHL} = \sum_{i \in batch} w_{j} E({\hat{P}}_{f}(x_{i}), y_{i})$;
        Total loss $L_{ACVM} = α L_{ADVW} + (1 - α) L_{DHL}$;
        ▷ Backward pass
        Update θ via SGD using $L_{ACVM}$;
    end for
end for
return the optimized model F.
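The control flow of Algorithm 1 can be sketched as a small driver that delegates all model-specific work to caller-supplied hooks. This is a structural sketch only, not the authors' implementation: `run_acvm`, the three hooks `train_sub`, `predict_sub` and `train_main`, and the stub models used in the demonstration are all hypothetical.

```python
def run_acvm(labels, folds, num_classes, tau, alpha, max_epochs,
             train_sub, predict_sub, train_main):
    """Outer loop of Algorithm 1; the three callables are hypothetical hooks."""
    n = len(labels)
    weights = [1.0] * num_classes
    for _ in range(max_epochs):
        # Sub-validation branch: train f_k on T_k, predict on the held-out D_k.
        p_f = [None] * n
        for k, val_idx in enumerate(folds):
            held_out = set(val_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            train_sub(k, train_idx)
            for i, probs in zip(val_idx, predict_sub(k, val_idx)):
                p_f[i] = probs
        # Class difficulty: w_j = (1 - a_j) ** tau from out-of-fold accuracy.
        correct = [0] * num_classes
        total = [0] * num_classes
        for i, y in enumerate(labels):
            pred = max(range(num_classes), key=lambda c: p_f[i][c])
            total[y] += 1
            correct[y] += int(pred == y)
        weights = [(1.0 - c / t) ** tau for c, t in zip(correct, total)]
        # Main branch: one pass over D using the weights and P_f.
        train_main(weights, p_f, alpha)
    return weights


# Stub hooks for demonstration: the "sub-models" always predict class 0.
calls = {"sub": [], "main": 0}

def stub_train_sub(k, train_idx):
    calls["sub"].append((k, tuple(train_idx)))

def stub_predict_sub(k, val_idx):
    return [[0.9, 0.1] for _ in val_idx]

def stub_train_main(weights, p_f, alpha):
    calls["main"] += 1

demo_labels = [0, 0, 0, 0, 1, 1]
demo_folds = [[0, 2, 4], [1, 3, 5]]
final_w = run_acvm(demo_labels, demo_folds, 2, 1.0, 0.5, 2,
                   stub_train_sub, stub_predict_sub, stub_train_main)
```

In a real setting the hooks would wrap the training and inference routines of the actual networks; the stubs only make the loop structure and the weight update observable.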

4. Experiments

4.1. Datasets

We conducted experiments on three widely used long-tailed image recognition datasets: CIFAR10/100-LT [18] and ImageNet-LT [36]. The detailed information is described in Table 1.

CIFAR10/100-LT: CIFAR10/100 is a balanced object-centric classification dataset composed of small images belonging to 10 or 100 classes [18]. Their long-tailed versions (CIFAR10/100-LT) are artificially constructed by applying an exponential decay function to reduce the number of training samples in each class. In our experiments, we used CIFAR10/100-LT with IF ranging from 10 to 200.
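The exponential decay construction can be illustrated with a short sketch. The specific profile used here, n_j = n_max · (1/IF)^(j/(C−1)), is an assumption about the decay function, since the text above only states that an exponential decay is applied; `long_tailed_counts` is a hypothetical helper, and the inputs correspond to CIFAR10 (5000 training images per class) with IF = 100.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class sample counts under an assumed exponential decay profile."""
    counts = []
    for j in range(num_classes):
        frac = j / (num_classes - 1)  # 0 for the head class, 1 for the last class
        counts.append(int(n_max * (1.0 / imbalance_factor) ** frac))
    return counts


# CIFAR10-style example: 10 classes, 5000 images in the largest class, IF = 100.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100)
achieved_if = max(counts) / min(counts)
```

The largest class keeps all of its samples, and the counts shrink geometrically towards the smallest class, so the ratio between the first and last entries is close to the requested IF.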

ImageNet-LT: ImageNet is a large-scale image classification dataset containing 1000 classes with a relatively balanced original distribution. This dataset is not inherently class-imbalanced. However, a long-tailed version was constructed in [36], which includes a long-tailed training set with a total of 115,800 images. The number of samples in each class ranges from 5 to 1280, resulting in an IF of 256 (1280/5), which indicates significant class imbalance.

4.2. Experimental Details

On the CIFAR10/100-LT datasets, we used ResNet-32 [1] as the backbone network, trained for 200 epochs on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 64. The SGD optimizer was employed with default settings: a momentum of 0.9 and a weight decay of 0.0005. To ensure an impartial comparison, we followed the LDAM [19] settings and adopted only basic data augmentation strategies, including RandomCrop and RandomHorizontalFlip, without additional enhancements. The initial learning rate was set to 0.1 and scaled down at epochs 160 and 180.

On the ImageNet-LT dataset, we adopted ResNet-50 [1] as the backbone network, trained for 120 epochs on two NVIDIA GeForce RTX 3090 GPUs with a batch size of 256. Similar to CIFAR10/100-LT, we used the SGD optimizer with a momentum of 0.9, but the weight decay was tuned to 0.0002. The initial learning rate was set to 0.1 and scaled down at epochs 60 and 80.

On all datasets, the fundamental architecture of ACVM remains consistent, as described in Section 3.2 and Section 3.3. To adapt to the differences in the number of classes on different datasets, we only adjusted the dimensions of the hidden and output layers. For the choice of $τ$, the conclusion of [20] shows that different datasets require different values of $τ$ to achieve the best performance. Therefore, we followed the conclusion in [20] and selected $τ$ individually for each dataset. For CIFAR10-LT, which was not covered in [20], we conducted a grid search and found that $τ = 1.0$ (IF = 10 and 200) and $τ = 1.5$ (IF = 50 and 100) achieve the best performance. The choice of parameter α will be discussed in Section 4.5. Unless stated otherwise, we used a default setting of K = 3 in the experiments, as we found that this setting performs well enough in most cases. The optimal hyperparameter values for all datasets are summarized in Table 2.

To ensure reproducibility, all experiments used a fixed random seed of 42 for Python 3.6, PyTorch 1.5.0 and Opencv-python 4.1.2. For CIFAR10-LT and CIFAR100-LT, training on a single RTX 3090 GPU took approximately 2 h per run (10 h total for 5 runs) for 200 epochs, with a peak memory usage of 8 GB. For ImageNet-LT, training on two RTX 3090 GPUs took approximately 18 h per run (90 h total for 5 runs) for 120 epochs, with a peak memory usage of 18 GB per GPU. Compared to a standard single-model pipeline, ACVM with K = 3 introduces approximately 2.5× total training time overhead, while ACVM with K = 5 introduces approximately 4× total training time overhead.

4.3. Compared Methods

Given the rapid advancements in long-tailed image recognition, we compared the proposed method with various SOTA approaches. Specifically, the comparison methods include: (1) baseline method, with CE loss; (2) sample-level reweighting methods, with focal loss [15], EQ loss [31], and MWN [17]; (3) class-level reweighting methods with CB loss [18], LDAM [19], CDB loss [20] and DN [34]; and (4) hybrid methods (combining class-level and sample-level reweighting) with LDAM-DRW [19] and IB [37]. In the experiments, all compared models used the default parameter settings in the corresponding references.

4.4. Main Results

CIFAR10/100-LT: From the results in Table 3, we can observe several phenomena. First, compared to the most relevant and best-performing method, CDB loss, our proposed ACVM approach achieved a superior performance on most datasets. These results demonstrate that the class difficulties calculated by using ACVM are more accurate and robust. Secondly, compared to the recently proposed DN method, ACVM achieved a better performance under the same experimental conditions, further validating its effectiveness. Finally, we observe that with the increase in IF, ACVM tends to achieve more significant performance improvements compared to existing methods, which indicates that ACVM has stronger adaptability when dealing with extreme long-tailed data.

ImageNet-LT: The conclusions drawn above also hold for other datasets with different long-tailed distributions. From Table 4, which presents the most accurate result of various algorithms on the ImageNet-LT dataset, we can observe that ACVM achieved an accuracy of 47.70%, showing an 8.82% improvement in comparison with the traditional CE loss model. This is the best result among all compared reweighting methods. In addition, compared to the most relevant and best-performing method, i.e., CDB loss, ACVM improves accuracy by 1.14%. Under the same experimental conditions, ACVM outperforms the recently proposed DN method by 3.01%. These results demonstrate that our proposed ACVM approach can effectively enhance model performance and further validate its ability to address extreme long-tailed image recognition issues.

4.5. Ablation Study

4.5.1. The Effect of α

In Equation (15), we introduce an important parameter, α, to tune the tradeoff between the contributions of the main learning branch and the sub-validation branch to the loss of F. Specifically, α indicates the contribution weight of the main learning branch, and thus 1 − α represents the contribution weight of the sub-validation branch. Next, we varied the value of α from 0.1 to 0.9 to explore its impact on the proposed model and further provide guidance for future applications in real-world long-tail learning tasks. The results are presented in Figure 3 and Table 5.

From the experimental results in Figure 3 and Table 5, we observe that on all datasets, the accuracy of the tail classes generally increases as α grows. This trend indicates that ACVM can effectively help to enhance the performance of tail classes. On CIFAR10-LT, the overall performance reaches its peak when α lies between 0.2 and 0.6, while on CIFAR100-LT, the optimal performance is achieved when α lies between 0.6 and 0.9. On the ImageNet-LT dataset, the accuracy of the tail classes persistently improves as α increases, but the optimal overall performance emerges when α = 0.5. Also, we observe that when α approaches 0, the performance of F declines significantly on most datasets. This occurs because the loss is then dominated by the sub-validation branch and largely ignores the effect of the main model. These findings show that both the main and sub-validation branches contribute to the model’s enhanced performance, and the tradeoff between them should be tuned to determine the optimal configuration. The optimal α values for all datasets are summarized in Table 2 in Section 4.2. In general, datasets with higher imbalance rates tend to benefit from larger α values, as they require stronger emphasis on tail class performance. However, an overly large α tends to make the model lean excessively toward the tail classes; reducing α to a certain extent makes the model pay more attention to the overall performance.

4.5.2. The Effect of K

Our proposed ACVM uses parameter K to control the number of sub-models f in the sub-validation branch. We evaluated the impact of K on CIFAR100-LT (IF = 100) as a representative case, and verified that the performance trend is consistent across all other datasets (CIFAR10-LT and ImageNet-LT). Specifically, performance improves significantly from K = 2 to K = 3, but the marginal gain becomes negligible for K > 3. Intuitively, a higher K helps improve the learning performance of f, as more training samples are used to train each f, resulting in a more accurate estimation. However, a larger K also means a longer training time. Next, we took the CIFAR100-LT (IF = 100) dataset as an example to examine the impact of varying K on difficulty estimations for head, middle, and tail classes, respectively (see Figure 4).

From the results in Figure 4, we can observe that as K increases, the difficulties of the head and middle classes first gradually decrease and then stabilize, while the difficulty of the tail classes remains relatively unchanged. This phenomenon means that larger K values capture the differences in difficulty among classes more effectively. In addition, a larger K helps improve the accuracy of the difficulty calculation because each training subset becomes larger. However, we also note that it is unnecessary to choose an excessively large K, as the accuracy gained in difficulty estimation might be limited while the time cost could increase significantly. To balance accuracy and time consumption, we suggest setting K to 3~5 in practical applications.
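The sub-validation procedure discussed above can be sketched in a few lines. The following is a minimal, illustrative sketch rather than the authors' implementation: the helper names (`stratified_folds`, `fit_centroids`, `predict_proba`, `out_of_fold_probs`, `class_difficulty`) are hypothetical, a toy one-dimensional nearest-centroid classifier stands in for the validation models f, and class difficulty is assumed to be one minus the per-class out-of-fold accuracy.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign every sample to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % k  # round-robin keeps per-class counts balanced
    return fold_of


def fit_centroids(xs, ys):
    """Toy stand-in for a validation model f: one centroid per class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    scores = {y: -(x - c) ** 2 for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def out_of_fold_probs(xs, labels, k, seed=0):
    """Each sample is scored only by the model that did not train on it."""
    fold_of = stratified_folds(labels, k, seed)
    probs = [None] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if fold_of[i] != fold]
        model = fit_centroids([xs[i] for i in train], [labels[i] for i in train])
        for i in range(len(xs)):
            if fold_of[i] == fold:
                probs[i] = predict_proba(model, xs[i])
    return fold_of, probs


def class_difficulty(labels, probs):
    """Assumed definition: 1 - out-of-fold accuracy of each class."""
    hits, counts = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, probs):
        counts[y] += 1
        hits[y] += int(max(p, key=p.get) == y)
    return {y: 1.0 - hits[y] / counts[y] for y in counts}


# Toy long-tailed data: 12 head, 6 medium, and 3 tail samples.
labels = [0] * 12 + [1] * 6 + [2] * 3
xs = ([0.1 * i for i in range(12)]
      + [5.0 + 0.1 * i for i in range(6)]
      + [10.0 + 0.1 * i for i in range(3)])
fold_of, probs = out_of_fold_probs(xs, labels, k=3)
difficulty = class_difficulty(labels, probs)
```

With K = 3, each validation model in this sketch trains on two thirds of every class, and the three tail samples land in three different folds. A larger K gives each model more training data per class at the cost of training more models.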

4.5.3. The Effect of Each Module

To validate the effectiveness of the two proposed modules, we also investigated their impacts on the performance of F.

Using the CIFAR100-LT (IF = 100) dataset as an example, we analyzed the effectiveness of each module with the three following conditions: (1) adjustment of the loss function of f, (2) inclusion or exclusion of ADVW, and (3) inclusion or exclusion of DHL.

From the experimental results in Table 6, we can draw the following conclusions: (1) ADVW significantly improves the performance of F, demonstrating its important role in long-tailed classification tasks. (2) Although using DHL alone yields only a limited improvement in the performance of F, integrating it with ADVW further improves classification accuracy. This integration effectively reduces the loss of raw sample distribution information that is caused by using ADVW alone. (3) F achieves the best performance when f is trained with the traditional CE loss. This suggests that, because CE loss treats all classes equally, it leads to a more accurate estimation of class difficulty. In contrast, other loss functions tend to introduce biases into the training procedure of f, making it difficult to obtain the real class difficulty information and thus degrading the performance of F. Overall, these results empirically support our design: the DHL term effectively mitigates the distribution bias introduced by ADVW, preserving the overall discriminative ability of the model while improving tail-class performance.

4.5.4. Isolating the Effect of K-Fold Cross-Validation

To clarify whether the performance gains of ACVM stem from the K-fold cross-validation mechanism itself or are merely a result of increased model capacity and implicit ensembling, we conducted a series of controlled ablation experiments on the CIFAR100-LT (IF = 100) dataset. We compared ACVM (K = 3) against three alternative configurations under matched computational or parameter constraints. (1) We trained K = 3 sub-models, but instead of using the stratified cross-validation partition, each model was trained on the entire training set D. The class difficulties were then estimated using the average training accuracy of these models. As shown in Table 7, the accuracy of this configuration (40.68%) is significantly lower than that of ACVM (43.44%). This is because sub-models trained on the full training set tend to overfit, leading to an inaccurate and over-optimistic estimation of class difficulty that fails to represent the true generalization gap. (2) We evaluated a standard ensemble baseline [39] (averaging the predictions of three independent models) and a knowledge distillation [40] baseline in which a single model was trained using the ensemble as a teacher. Although these methods improved over the CE baseline, ACVM still outperformed them by 1.59 and 1.92 percentage points, respectively. This demonstrates that the ADVW and DHL mechanisms, guided by cross-validation-based difficulty estimation, are more effective for long-tailed learning than simple feature or prediction fusion. (3) Given that ACVM (K = 3) increases training time by approximately 2.5 times, we extended the training of the single-model CE and CDB loss baselines to 500 epochs. The results (39.15% and 43.26%) show that simply increasing the number of training iterations does not resolve the fundamental bias towards head classes, confirming that ACVM’s success is due to its algorithmic design rather than extra computation.

These results provide strong causal evidence that the stratified K-fold cross-validation mechanism is the primary driver of performance improvement, as it provides a comprehensive and authentic difficulty estimation tool.

4.5.5. Quality of the Estimated Class Difficulty

The central claim of this study is that K-fold cross-validation provides a more accurate and authentic estimation of class difficulty than an independent validation set. To validate this, we quantitatively evaluated the quality of the estimated difficulty d_j from two perspectives: correlation with generalization error and estimation stability.

First, we calculated the Pearson correlation coefficient (r) between the estimated class difficulty d_j on the training set and the actual error rate (1 − a_j) on the test set for CIFAR100-LT (IF = 100). A higher correlation indicates that the estimated difficulty more accurately reflects the real generalization gap. Second, to assess stability, we calculated the average variance in the estimated difficulty for the Head, Medium, and Tail classes across different runs or folds. The results are summarized in Table 8.
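The two quality measures described above can be computed as in the following minimal sketch. All numbers are made-up toy values, not results from the paper, and the helper names `pearson_r` and `mean_variance_across_runs` are hypothetical. The sketch also assumes the population variance; the text does not state whether the population or sample variance was used.

```python
import math
from statistics import mean, pvariance


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_variance_across_runs(difficulty_runs, class_ids):
    """Average, over the given classes, of the variance of d_j across runs.

    difficulty_runs[r][j] is the difficulty of class j estimated in run r.
    Population variance is an assumption made for this sketch.
    """
    return mean(pvariance([run[j] for run in difficulty_runs]) for j in class_ids)


# Hypothetical per-class estimated difficulty d_j and test error 1 - a_j.
d_est = [0.10, 0.25, 0.40, 0.70, 0.90]
test_err = [0.12, 0.30, 0.35, 0.75, 0.95]
r = pearson_r(d_est, test_err)

# Hypothetical difficulty estimates from three runs (or folds).
runs = [
    [0.10, 0.25, 0.40, 0.70, 0.90],
    [0.12, 0.23, 0.42, 0.66, 0.94],
    [0.11, 0.27, 0.38, 0.74, 0.86],
]
tail_variance = mean_variance_across_runs(runs, class_ids=[3, 4])
```

A value of r close to 1 means that classes estimated as difficult on the training set are also the ones with high test error, while a small variance means the estimate changes little from run to run.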

As shown in Table 8, the difficulty estimated by ACVM achieves the highest correlation (0.89) with the actual test error. This confirms that the prediction probabilities extracted from the sub-validation branch via cross-validation can reliably represent the true difficulty of each class without over-optimistic bias, aligning with the perspective that well-calibrated probabilities are crucial for reflecting true generalization errors [41].

Furthermore, the variance in the estimated difficulty using ACVM remains remarkably low across all class frequencies, including the extreme tail classes. This high stability is attributed to the stratified sampling strategy utilized in our cross-validation partition, which ensures that even tail classes with very limited samples are represented consistently across all folds. Consequently, by providing high-fidelity and stable difficulty estimations, the sub-validation branch effectively ensures the robustness of the calculated ADVW weights, further preventing the main model from being misled by the noise often found in small-scale independent validation sets.

5. Concluding Remarks

This study proposes a novel reweighting method called ACVM for dealing with the long-tailed learning issue. It uses K-fold cross-validation as an authentic estimation tool to simultaneously acquire accurate and robust difficulty information at both the class and sample levels to guide the learning process. Experimental results on three widely used long-tailed image recognition datasets showed that the proposed ACVM method can significantly improve classification performance on most datasets in comparison with baseline and state-of-the-art learning algorithms.

The proposed ACVM method has two main merits, as follows: (1) More Accurate and Robust Difficulty Estimation: ADVW calculates both class and sample difficulty levels by using information acquired from the K-fold sub-validation branch, further addressing the lack of accuracy caused by using a small independent validation set; (2) Distribution-Harmonized Loss Optimization: By feeding the prediction probability distribution information obtained from the sub-validation branch into the main learning branch, the loss of the main learning branch can be dynamically tuned, further improving the overall classification performance of the learning model.

In future work, we plan to further explore how to enhance reweighting efficiency without compromising the accuracy of difficulty estimations. In addition, we will investigate how to better harmonize class imbalance and sample difficulty to improve long-tailed learning performance.

Author Contributions

Conceptualization, T.S.; methodology, T.S. and C.S.; software, T.S.; validation, C.S.; formal analysis, T.S. and W.H.; investigation, W.H.; resources, H.Y.; data curation, T.S. and W.H.; writing—original draft preparation, T.S., W.H. and C.S.; writing—review and editing, S.Z. and H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by National Natural Science Foundation of China under grant No. 62176107.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The original CIFAR10 and CIFAR100 datasets can be found at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 17 June 2025). The original ImageNet dataset is available at https://image-net.org/ (accessed on 17 June 2025). The standard long-tailed data splits and generation protocols used to construct CIFAR10/100-LT and ImageNet-LT can be accessed in the OpenLongTailRecognition benchmark repository at https://github.com/zhmiao/OpenLongTailRecognition-OLTR (accessed on 21 June 2025). The code used in this study is available at GitHub: https://github.com/sun371009757-ops/ACVM-main.git (accessed on 13 April 2026). The repository includes all training and evaluation scripts, and configuration files for reproducing the results reported in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DNN	Deep Neural Network
ACVM	Adaptive Combination Validation Mechanism
ADVW	Adaptive Difficulty Validation Weighting
DHL	Distribution-Harmonized Loss

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
3. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference of Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
4. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
5. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704.
6. Zhang, Y.; Kang, B.; Hooi, B.; Yan, S.; Feng, J. Deep long-tailed learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10795–10816.
7. Ren, J.; Yu, C.; Ma, X.; Zhao, H.; Yi, S. Balanced meta-softmax for long-tailed visual recognition. Adv. Neural Inf. Process. Syst. 2020, 33, 4175–4186.
8. Malisiewicz, T.; Gupta, A.; Efros, A.A. Ensemble of exemplar-svms for object detection and beyond. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 89–96.
9. Alshammari, S.; Wang, Y.-X.; Ramanan, D.; Kong, S. Long-tailed recognition via weight balancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6897–6907.
10. Jamal, M.A.; Brown, M.; Yang, M.-H.; Wang, L.; Gong, B. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–24 June 2020; pp. 7610–7619.
11. Li, M.; Cheung, Y.-M.; Lu, Y. Long-tailed visual recognition via gaussian clouded logit adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6929–6938.
12. Zhou, B.; Cui, Q.; Wei, X.-S.; Chen, Z.-M. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9719–9728.
13. Buda, M.; Maki, A.; Mazurowski, M.A. A Systematic Study of the Class Imbalance Problem in Convolutional Neural Networks. Neural Netw. 2018, 106, 249–259.
14. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2980–2988.
16. Zhang, Z.; Pfister, T. Learning Fast Sample Re-Weighting Without Reward Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 725–734.
17. Shu, J.; Xie, Q.; Yi, L.; Zhao, Q.; Zhou, S.; Xu, Z.; Meng, D. Meta-Weight-Net: Learning an Explicit Mapping for Sample Weighting. Adv. Neural Inf. Process. Syst. 2019, 32, 1917–1928.
18. Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019; pp. 9268–9277.
19. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss. Adv. Neural Inf. Process. Syst. 2019, 32, 1567–1578.
20. Sinha, S.; Ohashi, H.; Nakamura, K. Class-Difficulty Based Methods for Long-Tailed Visual Recognition. Int. J. Comput. Vis. 2022, 130, 2517–2531.
21. Anguita, D.; Ghelardoni, L.; Ghio, A.; Oneto, L.; Ridella, S. The ’K’ in K-Fold Cross Validation. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 25–27 April 2012; pp. 441–446. Available online: https://www.esann.org/sites/default/files/proceedings/legacy/es2012-62.pdf (accessed on 9 April 2025).
22. Shen, L.; Lin, Z.; Huang, Q. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In Proceedings of the European Conference of Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–482.
23. Japkowicz, N.; Stephen, S. The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 2002, 6, 429–449.
24. Byrd, J.; Lipton, Z. What Is the Effect of Importance Weighting in Deep Learning? In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 872–881.
25. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
26. Wu, T.; Huang, Q.; Liu, Z.; Wang, Y.; Lin, D. Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets. In Proceedings of the European Conference of Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; pp. 162–178.
27. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling Representation and Classifier for Long-Tailed Recognition. arXiv 2019, arXiv:1910.09217.
28. Chen, X.; Zhou, Y.; Wu, D.; Zhang, W.; Zhou, Y.; Li, B.; Wang, W. Imagine by Reasoning: A Reasoning-Based Implicit Semantic Data Augmentation for Long-Tailed Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 356–364.
29. Wang, T.; Zhu, Y.; Zhao, C.; Zeng, W.; Wang, J.; Tang, M. Adaptive Class Suppression Loss for Long-Tail Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3103–3112.
30. Li, B.; Yao, Y.; Tan, J.; Zhang, G.; Yu, F.; Lu, J.; Luo, Y. Equalized Focal Loss for Dense Long-Tailed Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6990–6999.
31. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization Loss for Long-Tailed Object Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–24 June 2020; pp. 11662–11671.
32. Huang, C.; Li, Y.; Loy, C.C.; Tang, X. Learning Deep Representation for Imbalanced Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5375–5384.
33. Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; Van Der Maaten, L. Exploring the Limits of Weakly Supervised Pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 181–196.
34. Sinha, S.; Ohashi, H. Difficulty-Net: Learning to Predict Difficulty for Long-Tailed Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 28 February–4 March 2023; pp. 6444–6453.
35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
36. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-Scale Long-Tailed Recognition in an Open World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019; pp. 2537–2546.
37. Park, S.; Lim, J.; Jeon, Y.; Choi, J.Y. Influence-Balanced Loss for Imbalanced Visual Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 735–744.
38. Zhang, X.; Fang, Z.; Wen, Y.; Li, Z.; Qiao, Y. Range Loss for Deep Face Recognition with Long-Tailed Training Data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5409–5418.
39. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. 2017, 30, 3069–3079.
40. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
41. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017; pp. 1321–1330.

Figure 1. The model structure of ACVM. It consists of the following two branches: the main learning branch and the sub-validation branch. (1) The main learning branch consists of a learning model F. The whole training set is used for model training, from which the sample probability distribution $\hat{P}_F$ can be obtained. (2) The sub-validation branch consists of K validation models f_k. The K-fold cross-validation method is used to calculate the prediction probability distribution of each sample when it serves as part of the validation set, and the results are integrated into $\hat{P}_f$.

Figure 2. Specific details of sub-validation branches. Here, K = 3 is taken as an example, and the structure of other K values is similar to this. The training set is divided into three subsets, and then three validation models with the same network structure are initialized. Each model uses two of the subsets for training, and the remaining subset is used as the validation set to calculate the probability distribution of each sample. Combining the probability distributions obtained by each validation model, the probability distribution of each sample in the training set on the sub-validation branch can be obtained.

Figure 3. The impact of parameter α on the accuracy of All, Head, Medium and Tail classes on CIFAR10/100-LT datasets with different IFs. (a) The result on CIFAR10-LT; (b) the result on CIFAR100-LT. We divide each dataset evenly into Head, Medium, and Tail classes based on the number of samples in the classes.

Figure 4. The difficulty of the Head, Middle and Tail classes, calculated by the sub-validation branch under different K values on the CIFAR100-LT (IF = 100) dataset.

Table 1. Details of CIFAR10/100-LT, and ImageNet-LT, where IF denotes the imbalance rate.

Datasets	Number of Classes	IF	Number of Samples in the Maximum Class	Number of Samples in the Minimum Class
CIFAR10-LT	10	10, 50, 100, 200	5000	25~500
CIFAR100-LT	100	10, 50, 100, 200	500	2~50
ImageNet-LT	1000	256	1280	5
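
The minimum class sizes in Table 1 can be checked from the maximum class size and the imbalance rate. The short sketch below assumes that IF is defined as the ratio between the largest and smallest class sizes and that fractional counts are rounded down; the function name `min_class_size` is hypothetical.

```python
def min_class_size(n_max, imbalance_factor):
    """Smallest class size implied by n_max / n_min = IF (rounded down)."""
    return n_max // imbalance_factor


imbalance_factors = (10, 50, 100, 200)

# CIFAR10-LT: the largest class has 5000 samples.
cifar10_min = [min_class_size(5000, f) for f in imbalance_factors]

# CIFAR100-LT: the largest class has 500 samples.
cifar100_min = [min_class_size(500, f) for f in imbalance_factors]

# ImageNet-LT: largest class 1280 samples, smallest class 5 samples.
imagenet_if = 1280 / 5
```

Under these assumptions the results are [500, 100, 50, 25] for CIFAR10-LT and [50, 10, 5, 2] for CIFAR100-LT, matching the ranges 25~500 and 2~50 in the table, and the ImageNet-LT ratio is 256.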

Table 2. Optimal hyperparameter values for different datasets.

Dataset	IF	K	τ	α
CIFAR10-LT	10	3	1.0	0.4
	50	3	1.5	0.2
	100	3	1.5	0.4
	200	3	1.0	0.6
CIFAR100-LT	10	3	1.0	0.8
	50	3	1.5	0.8
	100	3	1.5	0.6
	200	3	1.0	0.6
ImageNet-LT	256	3	2.0	0.5

Table 3. Top-1 accuracy of various algorithms on CIFAR10-LT and CIFAR100-LT datasets; the best results on each dataset are highlighted in bold.

Dataset	CIFAR10-LT				CIFAR100-LT
IF	10	50	100	200	10	50	100	200
CE loss	86.39	74.81	70.36	65.68	55.65	43.85	38.21	34.84
Focal loss [15]	86.66	76.71	70.38	65.29	55.78	44.32	38.41	35.62
EQ loss [31]	-	-	-	-	58.32	-	40.54	-
MWN [17]	87.84	80.06	75.21	68.91	58.46	46.74	42.09	37.91
CB loss [18]	87.49	79.27	74.57	68.89	57.89	45.32	39.60	36.23
LDAM [19]	86.97	-	73.35	-	56.91	-	39.60	-
CDB loss [20]	88.21	81.06	77.91	73.93	59.47	47.09	42.70	37.99
DN [34]	87.97	80.65	77.93	74.17	55.50	44.89	40.93	36.87
LDAM-DRW [19]	88.16	-	77.03	-	57.99	-	42.04	-
IB [37]	88.25	81.70	78.26	73.96	57.13	46.22	42.14	37.31
ACVM (ours)	87.76	81.71	78.79	75.15	59.08	49.03	43.44	39.58

Note: - indicates that the corresponding method did not report results for this specific imbalance factor in its original publication.

Table 4. Top-1 accuracy of various algorithms on ImageNet-LT dataset; the best results are highlighted in bold.

Method	Accuracy
CE loss	38.88
Focal loss [15]	30.50
Range loss [38]	30.70
EQ loss [31]	36.40
BS [7]	41.80
CB loss [18]	40.85
LDAM [19]	41.86
LDAM-DRW [19]	45.74
CDB loss [20]	46.56
DN [34]	44.69
ACVM (ours)	47.70

Table 5. The impact of parameter α on the accuracy of All, Head, Medium and Tail classes on the ImageNet-LT dataset. We divide the classes into three groups based on the number of samples in these classes: head-shot (>100), medium-shot (20–100), and tail-shot (<20). The best results are highlighted in bold.

α	0.1	0.3	0.5	0.7	0.9
All	44.70	46.01	47.70	46.97	47.11
Head	62.90	64.18	65.99	65.18	64.87
Medium	38.87	40.17	41.96	41.23	41.45
Tail	13.68	15.15	16.18	16.33	16.76
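
The best α for each class group can be read off Table 5 programmatically. The snippet below is only an illustrative sketch: it copies the values of Table 5 into a dictionary, and the names `table5` and `best_alpha` are hypothetical.

```python
alphas = [0.1, 0.3, 0.5, 0.7, 0.9]

# Accuracy values copied from Table 5 (ImageNet-LT).
table5 = {
    "All": [44.70, 46.01, 47.70, 46.97, 47.11],
    "Head": [62.90, 64.18, 65.99, 65.18, 64.87],
    "Medium": [38.87, 40.17, 41.96, 41.23, 41.45],
    "Tail": [13.68, 15.15, 16.18, 16.33, 16.76],
}


def best_alpha(accuracies):
    """Return the alpha whose column has the highest accuracy."""
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    return alphas[best_index]


best = {group: best_alpha(accs) for group, accs in table5.items()}

# True when tail accuracy strictly increases from one alpha to the next.
tail_monotone = all(a < b for a, b in zip(table5["Tail"], table5["Tail"][1:]))
```

On these values, the All, Head, and Medium groups peak at α = 0.5, whereas the Tail group keeps improving up to α = 0.9, which illustrates the tradeoff between tail-class and overall accuracy.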

Table 6. Top-1 accuracy of F obtained using different combinations of modules on CIFAR100-LT (IF = 100). ✔ means the corresponding module is used; ✘ means the corresponding module is not used. Using neither ADVW nor DHL means that only CE loss is used to train a single classification model. Bold denotes the best result.

#	Sub-Model Loss	ADVW	DHL	Accuracy
1	Focal loss	✔	✔	42.47
2	CB loss	✔	✔	41.87
3	EQ loss	✔	✔	41.95
4	CDB loss	✔	✔	41.56
5	CE loss	✘	✘	38.21
6	CE loss	✘	✔	40.65
7	CE loss	✔	✘	43.04
8	CE loss	✔	✔	43.44

Table 7. Ablation study isolating the effect of K-fold cross-validation on CIFAR100-LT (IF = 100). All multi-model baselines use three sub-models to match the computational cost of ACVM (K = 3). The best result is highlighted in bold.

Method	Configuration Details	Accuracy
CE Baseline	Single model, 200 epochs	38.21
CE Baseline	Single model, 500 epochs	39.15
CDB loss	Single model, 500 epochs	43.26
Independent Models	3 models trained on D for difficulty estimation	40.68
Standard Ensemble	Averaged accuracy of 3 independent CE models	41.85
Teacher-Student	Single model distilled from a 3-model ensemble	41.52
ACVM (K = 3)	Proposed K-fold sub-validation + ADVW + DHL	43.44
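
The margins between ACVM and each alternative configuration follow directly from Table 7. The sketch below recomputes them in percentage points; the dictionary keys are shortened labels chosen for this example, and the values are copied from the table.

```python
# Accuracy values copied from Table 7 (CIFAR100-LT, IF = 100).
table7 = {
    "CE, 200 epochs": 38.21,
    "CE, 500 epochs": 39.15,
    "CDB loss, 500 epochs": 43.26,
    "Independent Models": 40.68,
    "Standard Ensemble": 41.85,
    "Teacher-Student": 41.52,
}
acvm_accuracy = 43.44

# Gap in percentage points between ACVM (K = 3) and each configuration.
gaps = {name: round(acvm_accuracy - acc, 2) for name, acc in table7.items()}
```

The resulting gaps are 5.23, 4.29, 0.18, 2.76, 1.59, and 1.92 percentage points, in the order listed above.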

Table 8. Quantitative evaluation of the estimated class difficulty on CIFAR100-LT (IF = 100). Correlation (r) denotes the Pearson correlation between the estimated difficulty and actual test error. “Variance” (scaled by 10⁻³) denotes the average variance in the estimated difficulty across three independent runs or folds. The best results are highlighted in bold.

Method	r	Variance (Head)	Variance (Medium)	Variance (Tail)
CE Loss	0.52	1.4	2.1	3.5
CDB Loss	0.78	1.1	2.9	6.7
ACVM	0.89	0.9	1.8	3.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
