Sensors
  • Article
  • Open Access

26 July 2023

A Long-Tailed Image Classification Method Based on Enhanced Contrastive Visual Language

1 Beijing Key Laboratory of Internet Culture and Digital Dissemination, Beijing Information Science and Technology University, Beijing 100101, China
2 Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing Information Science and Technology University, Beijing 100101, China
3 Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI-Driven Sensing for Image Processing and Recognition

Abstract

Common long-tailed classification methods do not exploit the semantic features of the original label text of images, and the gap in classification accuracy between majority (head) and minority (tail) classes remains large. To address these problems, the proposed long-tailed image classification method based on enhanced contrastive visual language trains head-class and tail-class samples separately, uses text–image pairs for pre-training, and applies an enhanced momentum contrastive loss function together with RandAugment to strengthen the learning of tail-class samples. On the ImageNet-LT long-tailed dataset, the method improves all-class accuracy, tail-class accuracy, middle-class accuracy, and the F1 value by 3.4%, 7.6%, 3.5%, and 11.2%, respectively, compared with the BALLAD method, and reduces the accuracy gap between head and tail classes by 1.6%. The results of three comparative experiments indicate that the long-tailed image classification method based on enhanced contrastive visual language improves tail-class performance and reduces the accuracy difference between majority and minority classes.

1. Introduction

Image classification [1] is the earliest application of machine learning in computer vision and the foundation of other visual tasks such as object detection and instance segmentation. Because images carry rich semantic information (such as multiple targets, scenes, and behaviors), are the modality closest to human perception and expression, and because the performance and cost of visual sensors (mainly cameras) keep improving, image classification and the detection, segmentation, and other visual algorithms derived from it are increasingly applied in fields such as healthcare, transportation, and signal processing [2]. However, in practical applications, the particular nature of real environments gradually exposes problems that are difficult to solve.
In image classification tasks, input data are manually collected and annotated, and through human intervention the amount of data in each category is balanced as much as possible, so that there is no significant difference in sample size among categories. Such manually balanced datasets simplify the requirements for algorithm robustness, but as the number of categories of interest grows, maintaining balance among them drives acquisition costs up sharply. For example, to build an animal classification dataset, it is easy to collect millions of pictures of common categories such as cats and dogs; to keep the dataset balanced, however, the same number of samples must also be collected for rare animals such as snow leopards. As the rarity of a category increases, the collection effort tends to grow exponentially, as shown in Figure 1.
Figure 1. Schematic diagram of the long-tailed distribution of natural animal species.
In practical applications such as facial recognition, species classification, autonomous driving, medical diagnosis, and drone detection, the problem of long-tailed class imbalance arises [3]. For example, in autonomous driving, data on normal driving account for the vast majority, while there are very few data on actual abnormal situations or accident risks. In medical diagnosis, the number of people with a specific disease is likewise extremely small compared with the normal population. This type of imbalance often makes the training of deep neural networks very difficult. Classification and recognition systems trained directly on long-tailed data tend to lean toward the head classes, making them insensitive to tail-class features during prediction and affecting the correctness of the system [3]. Traditional methods mitigate the performance degradation caused by long-tailed data mainly through class re-balancing strategies, including re-sampling the training data and re-weighting the loss function [3]. These methods can effectively reduce the model's bias toward the head classes during training and thus produce more accurate classification decision boundaries. However, because the original data distribution is imbalanced and over-parameterized deep networks easily fit this distribution, they often face the risk of overfitting the tail classes and under-fitting the head classes.
Given that class imbalance in long-tailed datasets is widespread in practical tasks, it is crucial to train high-performance network models from large numbers of images that follow a long-tailed distribution. Moreover, differences in class distribution between training and test data greatly limit the practical application of neural networks. This research topic therefore has important practical significance and is an important step toward deploying deep neural networks in real systems. How to effectively use long-tailed data to train a balanced classifier is the key issue. From a practical perspective, this line of work can speed up data collection and reduce its cost. This article explores effective contrastive learning strategies for learning better image representations from imbalanced data, so that they can be better applied to long-tailed image classification, and aims to provide useful directions for applying image classification as imaging technology continues to develop. The ECVL method proposed in this article uses the text label information of images for pre-training to assist image classification, transforming the image recognition problem into a visual–language matching problem. After pre-training, the head and tail classes are trained separately, which improves the performance of the tail classes without sacrificing the performance of the head classes.

3. Method

Real data often follow a long-tailed distribution, with the head class dominating the training and the tail class having only a few samples, which is a major challenge in the field of image classification. The existing methods either use manually balanced datasets (such as ImageNet) or develop more robust algorithms to process data, such as class re-balancing strategies and network module improvements.
Although the above methods are effective on long-tailed datasets, they sacrifice the performance of the head classes to different degrees. To address these limitations, researchers have turned to new training paradigms for network architectures. Long-tailed classification models typically include two key parts: a feature extractor and a classifier. For each component there are corresponding methods, either designing better classifiers [37,54] or learning more reliable representations [55,56]. In terms of new training frameworks, existing work attempts to divide training into two stages; for example, the decoupled training method [57] separates the learning process into representation learning and classifier training. In addition, ensemble learning schemes [51,53] first learn multiple experts on different data subsets and then combine them to handle long-tailed image classification. However, these methods all train the model with a limited set of predefined labels and ignore the semantic feature information available in the original label text of the images. Our study found that previous work on imbalanced datasets was mostly limited to such predefined label sets, relying entirely on visual models and ignoring the rich semantic features of the images' original label text. Exploiting this text is a promising way to impose additional supervision on classes with insufficient data.
Large-scale visual-language pre-training models provide a new approach to image classification. Through open-vocabulary supervision, pre-trained visual-language models can learn powerful multi-modal representations (the input information can be expressed in multiple modalities). By exploiting the semantic similarity between visual and textual inputs, visual recognition can be transformed into a visual-language matching problem. Contrastive visual-language models such as CLIP [58] and ALIGN [59] provide new ideas for long-tailed classification tasks. The feature extractors of these models integrate the image and text modalities and focus on learning feature matching between the two; they are highly robust, but lack the ability to model complex interactions between images and text.
Because commonly used long-tailed classification algorithms show a significant difference in classification accuracy between majority and minority classes, fail to utilize the semantic features of the original image label text, and because existing contrastive visual-language models cannot model complex interactions between images and text, this paper proposes an enhanced contrastive visual-language long-tailed image classification algorithm (ECVL). The algorithm uses a two-stage training scheme, designs loss functions for text retrieval and image retrieval, respectively, uses an enhanced momentum contrastive loss function to measure how well each sample has been learned, and applies random augmentation to insufficiently learned categories to further strengthen the model's learning of minority samples.

3.1. The Overall Framework

Similar to common contrastive visual-language models, the ECVL long-tailed image classification algorithm uses a two-stage training approach that transforms visual recognition into a visual-language matching problem through the similarity between visual and text inputs. The first stage mainly uses the visual features of the images and the semantic features of the original label text to train on the majority categories. The second stage first applies class-balanced sampling to the minority categories and then uses linear adapters for differentiated training. Finally, the enhanced momentum contrastive loss function measures how well the model has memorized each sample, and samples with insufficient memorization are augmented with methods selected randomly by RandAugment [60]; broadening the augmentation further enriches the feature representation.

3.2. Contrasting Visual-Language Pre-Training Model

Contrastive visual-language models adopt a dual-encoder architecture consisting of a language encoder $L_{\mathrm{enc}}$ and a visual encoder $V_{\mathrm{enc}}$. Given an input image $I$, $V_{\mathrm{enc}}$ extracts its visual features as shown in Equation (1). Similarly, $L_{\mathrm{enc}}$ encodes the input text sequence $T$ into the corresponding text feature, as shown in Equation (2).
$f_v = V_{\mathrm{enc}}(I) \in \mathbb{R}^{d_v}$ (1)
$f_l = L_{\mathrm{enc}}(T) \in \mathbb{R}^{d_l}$ (2)
After extracting the features of each modality, two transformation matrices $W_v \in \mathbb{R}^{d_v \times d}$ and $W_l \in \mathbb{R}^{d_l \times d}$ project the original visual and textual features into a shared embedding space, where $v$ and $u$ are the resulting $d$-dimensional normalized vectors, as shown in Equation (3).
$v = \dfrac{W_v f_v}{\lVert W_v f_v \rVert}, \quad u = \dfrac{W_l f_l}{\lVert W_l f_l \rVert}$ (3)
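For concreteness, the following is a minimal PyTorch sketch of the projection and normalization in Equations (1)–(3); the encoders $V_{\mathrm{enc}}$ and $L_{\mathrm{enc}}$ are assumed to be any CLIP-style backbones that already output $f_v$ and $f_l$, and the function name is illustrative, not from the released code.

import torch.nn.functional as F

def project_features(f_v, f_l, W_v, W_l):
    """Project visual features f_v (batch, d_v) and text features f_l (batch, d_l)
    into the shared d-dimensional space and L2-normalize them (Eqs. (1)-(3))."""
    v = F.normalize(f_v @ W_v, dim=-1)  # (batch, d) normalized visual embeddings
    u = F.normalize(f_l @ W_l, dim=-1)  # (batch, d) normalized text embeddings
    return v, u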
In the pre-training stage, for the text–image pairs in a batch, the training goal is to pull matched pairs of the same category together and push different categories apart. $L_{vl}$ is used for text retrieval and $L_{lv}$ for image retrieval, where $\tau$ is the temperature hyperparameter and $N$ is the number of text–image pairs in a batch. $L_{vl}$ and $L_{lv}$ are given in Equations (4) and (5).
$L_{vl} = -\dfrac{1}{N}\sum_{i=1}^{N}\log\dfrac{\exp(v_i^{\top} u_i/\tau)}{\sum_{j=1}^{N}\exp(v_i^{\top} u_j/\tau)}$ (4)
$L_{lv} = -\dfrac{1}{N}\sum_{i=1}^{N}\log\dfrac{\exp(u_i^{\top} v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^{\top} v_j/\tau)}$ (5)
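Equations (4) and (5) are symmetric InfoNCE-style retrieval losses over a batch. A minimal PyTorch sketch is given below; matched image–text pairs are assumed to share the same index in the batch, and the default temperature is an assumed CLIP-style value, not a setting from the paper.

import torch
import torch.nn.functional as F

def contrastive_losses(v, u, tau=0.07):
    """Symmetric retrieval losses of Eqs. (4) and (5) for a batch of N matched
    text-image pairs; v, u are (N, d) normalized embeddings."""
    logits = v @ u.t() / tau                        # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_vl = F.cross_entropy(logits, targets)      # Eq. (4): text retrieval
    loss_lv = F.cross_entropy(logits.t(), targets)  # Eq. (5): image retrieval
    return loss_vl, loss_lv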
By converting the category label of an image into the text sequence "A photo of a {Class}", the matching scores between the target image and the text sequences of all categories can be obtained, and the category with the highest score is selected as the final prediction. Let the normalized test image feature be $v$ and the normalized text features be $\{u_1, \dots, u_K\}$. The category probability of the test image is then given by Equation (6), where $p_i$ is the probability of class $i$ and $K$ is the total number of candidate classes. The text label with the highest probability is selected as the prediction result.
$p_i = \dfrac{\exp(v^{\top} u_i/\tau)}{\sum_{j=1}^{K}\exp(v^{\top} u_j/\tau)}$ (6)
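As a concrete illustration of Equation (6), the sketch below scores a test image against the encoded class prompts; the prompt construction and encoding are assumed to happen elsewhere, and the temperature value is an assumed default rather than a value reported in the paper.

import torch

@torch.no_grad()
def zero_shot_predict(v, text_embeds, tau=0.07):
    """Eq. (6): class probabilities of a test image from its matching scores
    with the K prompts "A photo of a {Class}".
    v: (d,) normalized image embedding; text_embeds: (K, d) normalized prompts."""
    probs = torch.softmax(text_embeds @ v / tau, dim=0)  # p_1 ... p_K
    return int(probs.argmax()), probs                    # predicted class index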

3.3. Balanced Linear Adapters

The performance of contrastive visual-language models is balanced between the head and tail classes, whereas traditional contrastive learning methods such as PaCo [61] perform worse on the tail classes due to the lack of training samples. Inspired by the zero-shot classification ability of contrastive visual-language models, we build on CLIP and divide the training on long-tailed data into two stages. The first stage fully utilizes the existing training data and preserves the performance of the majority categories, while the second stage focuses on improving the learning of the minority categories. The two stages are trained on the long-tailed samples and on class-balanced samples, respectively, and refine the contrastive loss function.
According to the results of Gururangan et al. [62], pre-training a model with domain adaptation and task adaptation can greatly improve performance on the target NLP task, and the same applies to image classification. In stage one, pre-training the contrastive visual-language backbone on the long-tailed target dataset is therefore beneficial for learning the majority-class samples and makes full use of the available training data. Since the model input in Phase I processes the image category labels into text sequences, the contrastive loss function used in pre-training is Equation (4). The parameters of the text encoder and the image encoder are both updated during training. After stage-one training, most classes usually achieve good results, while the minority-class samples require the balanced training of stage two. The processing flow of the Phase I model is shown in Figure 2.
Figure 2. The model processing flow chart of Phase I.
Due to the insufficient sample size and limited data of the tail categories, directly training the backbone in Phase II would result in overfitting. Therefore, in this stage the backbone is not further pre-trained; instead, linear adapters and an enhanced momentum contrastive loss are used to optimize the visual-language representation of the minority-class samples. As shown in Figure 3, the semantic features of the original label text are processed in the same way as in Stage I. Let the original image feature be $f$, the weight matrix of the linear adapter be $W \in \mathbb{R}^{d \times d}$, and the bias be $b \in \mathbb{R}^{d}$; the processed image feature can then be expressed as Equation (7).
$f' = \lambda\,\mathrm{ReLU}(Wf + b) + (1 - \lambda)f$ (7)
where λ , the residual factor, is used to dynamically combine the image features after fine-tuning in the second stage with the original image features in the first stage.
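For concreteness, a minimal PyTorch sketch of the adapter in Equation (7) is shown below; the class name and the default value of the residual factor are illustrative assumptions, not settings reported in the paper.

import torch
import torch.nn as nn

class BalancedLinearAdapter(nn.Module):
    """Residual linear adapter of Eq. (7): f' = lam * ReLU(W f + b) + (1 - lam) * f.
    Only W and b are trained in Phase II; the backbone stays frozen.
    The default residual factor below is illustrative, not taken from the paper."""
    def __init__(self, dim, lam=0.5):
        super().__init__()
        self.fc = nn.Linear(dim, dim)  # W and b
        self.lam = lam                 # residual factor lambda

    def forward(self, f):
        return self.lam * torch.relu(self.fc(f)) + (1.0 - self.lam) * f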
Figure 3. The model processing flow chart of Phase II.
The enhanced momentum contrastive loss function is used to measure how well the model has learned each sample. Let $x_i$ be a training sample in the long-tailed dataset and let $L_i$ denote its contrastive loss; $\{L_{i,0}, \dots, L_{i,t}, \dots, L_{i,T}\}$ tracks the value of $L_i$ over $T$ epochs. Based on this, the moving-average momentum loss is defined as shown in Equation (8).
$L_{i,0}^{m} = L_{i,0}, \qquad L_{i,t}^{m} = \beta L_{i,t-1}^{m} + (1 - \beta) L_{i,t}$ (8)
where $\beta$ is a hyperparameter that controls the smoothness of the loss. After training for $T$ epochs with the above moving-average momentum loss, the set of momentum losses of all samples is $\{L_{0,t}^{m}, \dots, L_{i,t}^{m}, \dots, L_{N,t}^{m}\}$, where $N$ is the number of training samples in the dataset. Finally, the momentum loss is normalized as shown in Equation (9):
$M_{i,t} = \dfrac{1}{2}\,\dfrac{L_{i,t}^{m} - \bar{L}_{t}^{m}}{\max_{i}\{\lvert L_{i,t}^{m} - \bar{L}_{t}^{m}\rvert\}} + \dfrac{1}{2}, \qquad i = 0, \dots, N$ (9)
where $\bar{L}_{t}^{m}$ is the average momentum loss at epoch $t$. The normalized values $M_i$ lie in $[0, 1]$ with an average of 0.5 and reflect how well the model has memorized each sample. To promote learning, $M_i$ controls both the occurrence and the intensity of augmentation. Following RandAugment [60], $k$ augmentation types are selected at random, and each is applied with probability $M_i$ and with intensity in $[0, M_i]$. Let the augmentation set defined by RandAugment be $A = \{A_1, \dots, A_j, \dots, A_K\}$, where $K$ is the number of available augmentations and $k$ of them are applied at each step. On this basis, the memory enhancement function is defined as shown in Equation (10).
$\Psi(x_i; A, M_i) = a_1(x_i) \circ \cdots \circ a_k(x_i), \qquad a_j(x_i) = \begin{cases} A_j(x_i; M_i\zeta) & u \sim U(0,1),\ u < M_i \\ x_i & \text{otherwise} \end{cases}$ (10)
where $\zeta$ is sampled from the uniform distribution $U(0,1)$, and $A_j(x_i; M_i\zeta)$ denotes applying the $j$-th augmentation to $x_i$ with strength $M_i\zeta$. The $k$ selected augmentations from $A$ are applied in sequence. For simplicity, $\Psi(x_i)$ is written for $\Psi(x_i; A, M_i)$. The enhanced momentum contrastive loss function used in this paper is shown in Equation (11).
$L_{DCVL} = -\dfrac{1}{N}\sum_{i=1}^{N}\log\dfrac{\exp\big(f(\Psi(x_i))^{\top} f(\Psi(x_i^{+}))/\tau\big)}{\sum_{x_i^{-} \in X^{-}}\exp\big(f(\Psi(x_i))^{\top} f(\Psi(x_i^{-}))/\tau\big)}$ (11)
where $X^{-}$ denotes $X \setminus \{x_i, x_i^{+}\}$, $x_i$ and $x_i^{+}$ are two views of the same sample, and $x_i^{-} \in X^{-}$ is a view of another sample. Intuitively, the enhanced momentum contrastive loss function measures how well the model has memorized each sample and adaptively allocates an appropriate augmentation strength to samples with insufficient memorization.
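To show how Equations (8)–(11) fit together, the following is a minimal PyTorch-style sketch of the momentum-loss tracker, the normalization to $M_i$, the memory-guided augmentation, and an in-batch form of the enhanced momentum contrastive loss. The hyperparameter values ($\beta$, $\tau$, $k$), the augmentation-operation interface, and the in-batch choice of negatives are illustrative assumptions rather than settings from the paper.

import random
import torch
import torch.nn.functional as F

class MomentumLossTracker:
    """Eq. (8): L^m_{i,0} = L_{i,0};  L^m_{i,t} = beta*L^m_{i,t-1} + (1-beta)*L_{i,t}."""
    def __init__(self, num_samples, beta=0.9):   # beta is an assumed value
        self.beta = beta
        self.loss_m = torch.zeros(num_samples)
        self.seen = torch.zeros(num_samples, dtype=torch.bool)

    def update(self, indices, losses):
        """indices: (B,) sample indices; losses: (B,) per-sample contrastive losses."""
        old = self.loss_m[indices]
        new = torch.where(self.seen[indices],
                          self.beta * old + (1 - self.beta) * losses, losses)
        self.loss_m[indices] = new
        self.seen[indices] = True

def normalized_momentum(loss_m):
    """Eq. (9): map the momentum losses to M_i in [0, 1] with mean 0.5."""
    centered = loss_m - loss_m.mean()
    return 0.5 * centered / centered.abs().max().clamp(min=1e-12) + 0.5

def memory_augment(x, m_i, ops, k=2):
    """Eq. (10): apply k randomly chosen RandAugment ops; each fires with
    probability M_i and uses strength M_i * zeta, zeta ~ U(0, 1).
    Each op is assumed to take (image, magnitude) and return an image."""
    for op in random.sample(ops, k):
        if random.random() < m_i:            # u ~ U(0,1), u < M_i
            x = op(x, m_i * random.random()) # strength M_i * zeta
    return x

def enhanced_momentum_contrastive_loss(z, z_pos, tau=0.07):
    """In-batch form of Eq. (11): z, z_pos are (N, d) normalized embeddings of
    the two memory-augmented views; the other samples' views act as negatives.
    As in standard InfoNCE, the denominator here also includes the positive pair."""
    logits = z @ z_pos.t() / tau             # diagonal entries are the positive pairs
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)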
In the Phase II training process, to avoid the model being biased toward the head classes, a class-balanced sampling strategy [8] is still used to construct a balanced training sample set. Assume the target dataset has $K$ classes and $N$ training samples in total, and let $n_j$ be the number of training samples of class $j$; then $N$ is given by Equation (12).
$N = \sum_{j=1}^{K} n_j$ (12)
Assuming the classes are sorted in descending order of size, the long-tailed distribution means $n_i \geq n_j$ for $i < j$, with $n_1 \gg n_K$. For class-balanced sampling, the probability of sampling a data point from class $j$ is defined as $q_j = 1/K$. In other words, to construct a balanced training sample set, a class is first selected uniformly from the $K$ candidates and then a data point is sampled from the selected class. Finally, in stage two, $L_{vl}$ is used to fine-tune on the balanced training data.
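A minimal sketch of this class-balanced sampling is shown below; the names are illustrative, and in practice this logic is usually wrapped in a dataset sampler rather than called directly.

import random

def class_balanced_batch(samples_by_class, batch_size):
    """Class-balanced sampling of Phase II: each class is selected with equal
    probability q_j = 1/K, then one data point is drawn from that class."""
    classes = list(samples_by_class)  # K candidate classes
    return [random.choice(samples_by_class[random.choice(classes)])
            for _ in range(batch_size)]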

3.4. Algorithm Description

Based on the introduction of the ECVL long-tailed image classification algorithm in the previous text, this section mainly introduces the training process of the long-tailed image classification algorithm based on enhanced contrastive visual language in two different stages: stage one and stage two, as shown in Algorithms 1 and 2.
Algorithm 1 describes the training process of stage one, which trains the visual and language branches of the visual-language model simultaneously. In each epoch, the images and the corresponding category text are first fed into the model; the visual features of the images and the semantic features of the original label text are then extracted using Equations (1) and (2), respectively. Next, $L_{vl}$ is used for text retrieval and $L_{lv}$ for image retrieval to associate the image and text information. Finally, Equation (6) is used to predict the image category, and the predictions are evaluated with the evaluation metrics after classification.
Algorithm 1: Phase I
input: Iinput = {images, labels}, Tinput = {texts, labels}
output: modelweight
 1: for epoch = 1 to max_epoch do
 2:  T = Encode(labels, text)
 3:  I = Encode(labels, images)
 4:   train(model, I)
 5:   Eval(model, images, labels)
 6:   Logits(I, T)
 7:   pthepoch = {weight}
 8: end for
Algorithm 2 describes the training process of stage two. The model first applies class-balanced sampling to the minority classes and then fine-tunes the linear adapter. After fine-tuning, it uses the enhanced momentum contrastive loss function of Equation (11) to evaluate how well each sample has been learned, and applies RandAugment-based random augmentation to samples whose feature representations are insufficiently learned. Finally, the features learned in the two stages are dynamically fused and output.
Algorithm 2: Phase II
input: Iinput = {images, labels}, Tinput = {texts, labels}, modelstage1
output: weight
 1: model = load(best_model)
 2: for epoch = 1 to max_epoch do
 3:  if epoch >= 2 then
 4:     I = Rebalance(Momentum)
 5:  end if
 6:  Momentum = model(I, labels, epoch)
 7:  train(model, I)
 8:  eval(model, images, labels)
 9:  Logit(model, I, T)
 10:   pthepoch = {weight}
 11: end for

4. Experiments

The ECVL algorithm takes 229 s to infer 100 images on a single NVIDIA A100 40 GB GPU. To verify the performance of the proposed ECVL long-tailed image classification algorithm, experiments were carried out on three common long-tailed distribution datasets, CIFAR100-LT, Places-LT, and ImageNet-LT, and ablation experiments were conducted to demonstrate the roles of the enhanced momentum contrastive loss function and random augmentation.

4.1. Long-Tailed Image Datasets

The datasets used in the comparative experiments of this paper are common benchmarks in the field of long-tailed image classification, including CIFAR100-LT [27], Places-LT [63], and ImageNet-LT [63]. The composition of each dataset is introduced below.

4.1.1. CIFAR100-LT

CIFAR100-LT [27] is a long-tailed version of the CIFAR100 dataset. It is created by reducing the number of training samples of each class according to an exponential function, while the test set remains unchanged.
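A sketch of the usual exponential construction is shown below; the default values follow the common CIFAR-LT protocol (500 samples for the largest class, imbalance ratio 100) and are assumptions rather than settings reported in this paper.

def cifar_lt_class_counts(n_max=500, num_classes=100, imbalance_ratio=100):
    """Exponentially decayed per-class training-set sizes for CIFAR100-LT:
    class c keeps n_max * imbalance_ratio**(-c / (num_classes - 1)) samples,
    so counts fall from n_max to n_max / imbalance_ratio; the test set stays balanced."""
    return [int(n_max * imbalance_ratio ** (-c / (num_classes - 1)))
            for c in range(num_classes)]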

4.1.2. Places-LT

Places-LT [63] is a long-tailed dataset derived from the Places [55] dataset. Places contains 10 million images labeled by scene and is currently the largest scene dataset available. The Places-LT training set has 365 categories and a long-tailed ratio of 996: it contains 62,500 samples in total, and the test set contains 7300 samples. The largest training category has 4980 samples and the smallest has 5, giving a maximum-to-minimum ratio of 996, which makes Places-LT the dataset with the largest long-tailed ratio used in this article.

4.1.3. ImageNet-LT

ImageNet-LT [63] was obtained by applying a long-tailed transformation to the ImageNet dataset and contains 1000 categories. The dataset has more than 186 K samples in total, with 116 K training samples, 20 K validation samples, and 50 K test samples. In ImageNet-LT, the long-tailed ratio of the training set is 256, the largest class has 1280 samples, and the smallest class has 5. This dataset simulates the long-tailed data distribution commonly found in real life. The training data are divided into three parts: the head categories contain more than 100 samples each, the middle categories contain between 20 and 100 samples, and the tail categories contain fewer than 20 samples.
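The head/middle/tail split can be computed directly from the per-class training counts; a short sketch is given below, with boundary handling following the common ImageNet-LT convention.

def split_head_middle_tail(class_counts):
    """Split classes by training-set frequency: head > 100 samples,
    middle 20-100 samples, tail < 20 samples."""
    head = [c for c, n in class_counts.items() if n > 100]
    middle = [c for c, n in class_counts.items() if 20 <= n <= 100]
    tail = [c for c, n in class_counts.items() if n < 20]
    return head, middle, tail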

4.2. Experimental Design and Validation

All experiments in this article are implemented in Python with PyTorch 1.7.1. The server runs Ubuntu 20.04 with CUDA 10.1, and the models are trained with the AdamW optimizer for 300 epochs. The experiments were run on a server with 8 NVIDIA A100 40 GB GPUs. The configuration details of the experimental environment are shown in Table 2.
Table 2. Experimental environment.

4.2.1. Experimental Results and Analysis of CIFAR100-LT

In this experiment, the backbone network used by ECVL was ResNet-50, and the experiment was conducted on the long-tailed distribution dataset CIFAR100-LT. The experimental results are shown in Table 3. The proposed enhanced contrastive visual-language long-tailed classification algorithm achieves tail-class accuracy 20.5% and 17.2% higher than RIDE [46] and TADE [53], respectively, all-class accuracy 6.7% and 6.0% higher, and F1 values 14.3% and 11.8% higher. ECVL narrows the accuracy gap between majority and minority classes, improving not only the performance of the majority classes but also the recognition accuracy of the minority classes. This also shows that using the semantic features of the original label text as supplementary information for classification helps improve model performance.
Table 3. Experimental results of ECVL on CIFAR100-LT.

4.2.2. Experimental Results and Analysis of ImageNet-LT

In this experiment, the comparative results are shown in Table 4. Compared with long-tailed image classification algorithms that only use contrastive learning, the proposed ECVL method achieves tail-class accuracy 29.2% higher than PaCo [61], all-class accuracy 13.6% higher, and an F1 value 14.9% higher. Compared with BALLAD [64], ECVL achieves tail-class accuracy 7.9% higher, all-class accuracy 3.4% higher, and an F1 value 11.2% higher. This not only demonstrates that the proposed enhanced momentum contrastive loss function is more effective than using the contrastive loss alone, but also that pre-training with text–image pairs helps improve model performance.
Table 4. Experimental results of ECVL on ImageNet-LT.

4.2.3. Experimental Results and Analysis of Places-LT

In this experiment, the ECVL algorithm uses ResNet-50 as the backbone network and is evaluated on the long-tailed distribution dataset Places-LT. The comparative results are shown in Table 5. Compared with long-tailed image classification algorithms that only use contrastive learning, ECVL achieves tail-class accuracy 10.1% higher than PaCo [61], all-class accuracy 6.0% higher, and an F1 value 7.3% higher. Compared with the contrastive visual-language model BALLAD [64], the tail-class accuracy is improved by 1.3%. The experiments show that the enhanced momentum contrastive loss function in ECVL is more effective than using the contrastive loss alone, and that randomly augmenting the insufficiently learned samples identified by the enhanced momentum contrastive loss helps train the model.
Table 5. Experimental results of ECVL on Places-LT.

4.3. Ablation Experiments and Analysis

The ECVL long-tailed image classification algorithm proposed in this paper uses the visual features of the images, the semantic features of the original label text, the enhanced momentum contrastive loss function, and RandAugment to perform long-tailed classification, and it performs well on public long-tailed datasets. To verify the effectiveness of the enhanced momentum contrastive loss function and random augmentation, this section conducts ablation experiments on different public long-tailed distribution datasets; the results are shown in Table 6, Table 7 and Table 8.
Table 6. Ablation experiment of ECVL on CIFAR100-LT.
Table 7. Ablation experiment of ECVL on ImageNet-LT.
Table 8. Ablation experiment of ECVL on Places-LT.
On CIFAR100-LT, using only the enhanced momentum contrastive loss function reduces the difference in classification accuracy between majority and minority categories by 1.8% compared with using neither the enhanced momentum contrastive loss function nor random augmentation. With both the enhanced momentum contrastive loss function and the random augmentation module, the classification accuracy of majority and minority categories increases by 2.5% and 3.4%, respectively, compared with omitting the random augmentation module. On ImageNet-LT, using only the enhanced momentum contrastive loss function reduces the accuracy gap between majority and minority classes by 0.7% compared with using neither module, and adding the random augmentation module further increases majority- and minority-class accuracy by 0.7% and 1.2%, respectively. The analysis shows that, although all-class accuracy improves even without the enhanced momentum contrastive loss function or the random augmentation module, a large accuracy gap between majority and minority classes remains after the final fine-tuning. Adding the enhanced momentum contrastive loss function narrows this gap, although in some cases (such as on the Places-LT dataset) there is degradation. Combining the enhanced momentum contrastive loss function with the random augmentation module improves the overall accuracy and reduces the accuracy gap between majority and minority classes.

5. Conclusions

This article first analyzes the advantages and disadvantages of existing long-tailed image classification methods, proposes a long-tailed classification algorithm based on enhanced contrastive visual-language, and then elaborates on the algorithm framework, algorithm design details, algorithm design process, and comparative experimental analysis. In addition, this article conducts comparative experiments and ablation research analysis on three long-tailed datasets: CIFAR100-LT, ImageNet-LT, and Places-LT.
Compared with the BALLAD method, ECVL reduces the difference in classification accuracy between majority and minority classes on CIFAR100-LT by 5.7% and increases F1 by 8.5%; on ImageNet-LT it reduces the gap by 1.7% and increases F1 by 11.2%; and on Places-LT it increases F1 by 5.8%. On Places-LT, using only the enhanced momentum contrastive loss function reduces the accuracy gap between majority and minority classes by 1.8% compared with using neither the enhanced momentum contrastive loss function nor the random augmentation module, and adding the random augmentation module further increases minority-class accuracy and F1 by 0.7% and 1.3%, respectively. The classification accuracy, the accuracy gap between majority and minority categories, the F1 value, and the convergence of the model across categories of different sizes all demonstrate the effectiveness of the algorithm proposed in this paper.
The ECVL method can effectively improve the classification accuracy of tail classes and reduce the difference in classification accuracy between head and tail classes. However, due to the use of pre-training mechanisms, the computational complexity of this method is slightly higher. In the future, improvements can be made in accuracy and complexity to further improve the performance of long-tailed classification models.

Author Contributions

Conceptualization, Y.S., M.L. and B.W.; methodology, M.L.; formal analysis, M.L.; investigation, M.L.; writing—original draft preparation, M.L.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (Grant No. 61872043) and the State Key Laboratory of Computer Architecture (ICT, CAS) under Grant No. CARCHA202103.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tas, S.; Sari, O.; Dalveren, Y.; Pazar, S.; Kara, A.; Derawi, M. Deep learning-based vehicle classification for low quality images. Sensors 2022, 22, 4740. [Google Scholar] [CrossRef] [PubMed]
  2. Berwo, M.A.; Khan, A.; Fang, Y.; Fahim, H.; Javaid, S.; Mahmood, J.; Abideen, Z.U.; M.S., S. Deep Learning Techniques for Vehicle Detection and Classification from Images/Videos: A Survey. Sensors 2023, 23, 4832. [Google Scholar] [CrossRef] [PubMed]
  3. Wang, Z.; Shen, H.; Xiong, W.; Zhang, X.; Hou, J. Method for Diagnosing Bearing Faults in Electromechanical Equipment Based on Improved Prototypical Networks. Sensors 2023, 23, 4485. [Google Scholar] [CrossRef] [PubMed]
  4. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  5. Wang, T.; Li, Y.; Kang, B.; Li, J.; Liew, J.; Tang, S.; Hoi, S.; Feng, J. The devil is in classification: A simple framework for long-tail instance segmentation. In Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 728–744. [Google Scholar]
  6. Park, M.; Song, H.J.; Kang, D.O. Imbalanced Classification via Feature Dictionary-Based Minority Oversampling. IEEE Access 2022, 10, 34236–34245. [Google Scholar] [CrossRef]
  7. Li, T.; Wang, Y.; Liu, L.; Chen, L.; Chen, C.L.P. Subspace-based minority oversampling for imbalance classification. Inf. Sci. 2023, 621, 371–388. [Google Scholar] [CrossRef]
  8. Lee, Y.S.; Bang, C.C. Framework for the Classification of Imbalanced Structured Data Using Under-Sampling and Convolutional Neural Network. Inf. Syst. Front. 2021, 24, 1795–1809. [Google Scholar] [CrossRef]
  9. Lehmann, D.; Ebner, M. Subclass-Based Undersampling for Class-Imbalanced Image Classification. In Proceedings of the 17th International Conference on Computer Vision Theory and Applications, Online, 6–8 February 2022; pp. 493–500. [Google Scholar]
  10. Farshidvard, A.; Hooshmand, F.; MirHassani, S.A. A novel two-phase clustering-based under-sampling method for imbalanced classification problems. Expert Syst. Appl. 2023, 213, 119003. [Google Scholar] [CrossRef]
  11. Ding, H.; Wei, B.; Gu, Z.; Zheng, H.; Zheng, B. KA-Ensemble: Towards imbalanced image classification ensembling under-sampling and over-sampling. Multimed. Tools Appl. 2020, 79, 14871–14888. [Google Scholar] [CrossRef]
  12. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246. [Google Scholar] [CrossRef] [PubMed]
  13. Gupta, A.; Dollar, P.; Girshick, R. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364. [Google Scholar]
  14. Peng, J.; Bu, X.; Sun, M.; Zhang, Z.; Tan, T.; Yan, J. Large-scale object detection in the wild from imbalanced multi-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9709–9718. [Google Scholar]
  15. Hu, X.; Jiang, Y.; Tang, K.; Chen, J.; Miao, C.; Zhang, H. Learning to segment the tail. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14045–14054. [Google Scholar]
  16. Wu, J.; Song, L.; Wang, T.; Zhang, Q.; Yuan, J. Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1570–1578. [Google Scholar]
  17. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.-M. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  18. Zang, Y.; Huang, C.; Loy, C.C. FASA: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3457–3466. [Google Scholar]
  19. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  20. Hermans, A.; Beyer, L.; Leibe, B. In Defense of the Triplet Loss for Person Re-Identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  21. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  22. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; p. 32. [Google Scholar]
  23. Wu, T.; Huang, Q.; Liu, Z.; Wang, Y.; Lin, D. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 162–178. [Google Scholar]
  24. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11662–11671. [Google Scholar]
  25. Tan, J.; Lu, X.; Zhang, G.; Yin, C.; Li, Q. Equalization loss v2: A new gradient balance approach for long-tailed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1685–1694. [Google Scholar]
  26. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704. [Google Scholar]
  27. Hong, Y.; Han, S.; Choi, K.; Seo, S.; Kim, B.; Chang, B. Disentangling label distribution for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6626–6636. [Google Scholar]
  28. Ren, J.; Yu, C.; Ma, X.; Ma, X.; Zhao, H.; Yi, S.; Li, H. Balanced meta-softmax for long-tailed visual recognition. Adv. Neural Inf. Process. Syst. 2020, 33, 4175–4186. [Google Scholar]
  29. Deng, Z.; Liu, H.; Wang, Y.; Wang, C.; Yu, Z.; Sun, X. PML: Progressive margin loss for long-tailed age classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10503–10512. [Google Scholar]
  30. Wu, T.; Liu, Z.; Huang, Q.; Wang, Y.; Lin, D. Adversarial robustness under long-tailed distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8659–8668. [Google Scholar]
  31. Xiao, L.; Xu, J.; Zhao, D.; Shang, E.; Zhu, Q.; Dai, B. Adversarial and Random Transformations for Robust Domain Adaptation and Generalization. Sensors 2023, 23, 5273. [Google Scholar] [CrossRef] [PubMed]
  32. Park, S.; Kim, J.; Jeong, H.-Y.; Kim, T.-K.; Yoo, J. C2RL: Convolutional-Contrastive Learning for Reinforcement Learning Based on Self-Pretraining for Strong Augmentation. Sensors 2023, 23, 4946. [Google Scholar] [CrossRef] [PubMed]
  33. Zhong, Z.; Cui, J.; Liu, S.; Jia, J. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16489–16498. [Google Scholar]
  34. Li, S.; Gong, K.; Liu, C.H.; Wang, Y.; Qiao, F.; Cheng, X. Metasaug: Meta semantic augmentation for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5212–5221. [Google Scholar]
  35. Wang, Y.; Pan, X.; Song, S.; Zhang, H.; Wu, C.; Huang, G. Implicit semantic data augmentation for deep networks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; p. 32. [Google Scholar]
  36. Yin, X.; Yu, X.; Sohn, K.; Liu, X.; Chandraker, M. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5704–5713. [Google Scholar]
  37. Liu, J.; Sun, Y.; Han, C.; Dou, Z.; Li, W. Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2970–2979. [Google Scholar]
  38. Chu, P.; Bian, X.; Liu, S.; Ling, H. Feature space augmentation for long-tailed data. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 694–710. [Google Scholar]
  39. Cui, Y.; Song, Y.; Sun, C.; Howard, A.; Belongie, S. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4109–4118. [Google Scholar]
  40. Yang, Y.; Xu, Z. Rethinking the value of labels for improving class-imbalanced learning. Adv. Neural Inf. Process. Syst. 2020, 33, 19290–19301. [Google Scholar]
  41. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  42. Li, T.; Wang, L.; Wu, G. Self-supervision to distillation for long-tailed visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 630–639. [Google Scholar]
  43. Wei, H.; Tao, L.; Xie, R.; Feng, L.; An, B. Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets. In Proceedings of the International Conference on Machine Learning (PMLR), Baltimore, MA, USA, 17–23 July 2022; pp. 23615–23630. [Google Scholar]
  44. Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
  45. Xiang, L.; Ding, G.; Han, J. Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part V 16; Springer International Publishing: Cham, Switzerland, 2020; pp. 247–263. [Google Scholar]
  46. Wang, X.; Lian, L.; Miao, Z.; Liu, Z.; Yu, S.X. Long-tailed Recognition by Routing Diverse Distribution-Aware Experts. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  47. He, Y.Y.; Wu, J.; Wei, X.S. Distilling virtual examples for long-tailed recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 235–244. [Google Scholar]
  48. Wei, C.; Sohn, K.; Mellina, C.; Yuille, A.; Yang, F. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10857–10866. [Google Scholar]
  49. Zhang, C.; Pan, T.Y.; Li, Y.; Hu, H.; Xuan, D.; Changpinyo, S.; Gong, B.; Chao, W.-L. MosaicOS: A simple and effective use of object-centric images for long-tailed object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 417–427. [Google Scholar]
  50. Guo, H.; Wang, S. Long-tailed multi-label visual recognition by collaborative training on uniform and re-balanced samplings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15089–15098. [Google Scholar]
  51. Cai, J.; Wang, Y.; Hwang, J.N. Ace: Ally complementary experts for solving long-tailed recognition in one-shot. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 112–121. [Google Scholar]
  52. Cui, J.; Liu, S.; Tian, Z.; Zhong, Z.; Jia, J. Reslt: Residual learning for long-tailed recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3695–3706. [Google Scholar] [CrossRef] [PubMed]
  53. Zhang, Y.; Hooi, B.; Hong, L.; Feng, J. Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision. arXiv 2021, arXiv:2107.09249. [Google Scholar]
  54. Tang, K.; Huang, J.; Zhang, H. Long-tailed classification by keeping the good and removing the bad momentum causal effect. Adv. Neural Inf. Process. Syst. 2020, 33, 1513–1524. [Google Scholar]
  55. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  56. Zhu, L.; Yang, Y. Inflated episodic memory with region self-attention for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4344–4353. [Google Scholar]
  57. Kang, B.; Li, Y.; Xie, S.; Feng, J. Exploring balanced feature spaces for representation learning. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  58. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  59. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.V.; Sung, Y.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  60. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  61. Cui, J.; Zhong, Z.; Liu, S.; Yu, B.; Jia, J. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 715–724. [Google Scholar]
  62. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. arXiv 2020, arXiv:2004.10964. [Google Scholar]
  63. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2537–2546. [Google Scholar]
  64. Ma, T.; Geng, S.; Wang, M.; Shao, J.; Lu, J.; Li, H.; Gao, P.; Qiao, Y. A Simple Long-Tailed Recognition Baseline via Vision-Language Model. arXiv 2021, arXiv:2111.14745. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
